Parallel Community Detection For Cross-Document Coreference
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
KTH, School of Electrical Engineering (EES), Communication Networks (LCN). ORCID iD: 0000-0003-4516-7317
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS. ORCID iD: 0000-0002-6718-0144
2014 (English) Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a highly parallel solution for cross-document coreference resolution that can handle the billions of documents on the current web. At the core of our solution lies a novel algorithm for community detection in large-scale graphs. We operate on graphs that we construct by representing documents' keywords as nodes and the co-location of those keywords in a document as edges. We then exploit the particular nature of such graphs, in which coreferent words are topologically clustered and can be efficiently discovered by our community detection algorithm. The accuracy of our technique is considerably higher than that of the state of the art, while the convergence time is far shorter. In particular, we increase the accuracy for a baseline dataset by more than 15% compared to the best reported result so far. Moreover, we outperform the best reported result for a dataset provided for the Word Sense Induction task in SemEval 2010.
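The graph construction described in the abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `build_cooccurrence_graph` and `label_propagation` and the toy documents are hypothetical, and the clustering step is plain label propagation used as a stand-in, not the paper's community detection algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Keywords become nodes; two keywords share an edge if they co-occur in a document."""
    graph = defaultdict(set)
    for keywords in documents:
        unique = set(keywords)
        for word in unique:
            graph[word]  # touching the key registers isolated keywords as nodes
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def label_propagation(graph, max_rounds=20):
    """Generic label propagation (an assumed stand-in, not the paper's algorithm)."""
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue
            counts = Counter(labels[n] for n in graph[node])
            top = max(counts.values())
            # Break ties deterministically by taking the smallest label.
            best = min(label for label, c in counts.items() if c == top)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy corpus: each document is reduced to its keywords.
docs = [
    ["apple", "iphone", "cupertino"],
    ["iphone", "cupertino", "ios"],
    ["orchard", "fruit", "cider"],
    ["fruit", "cider", "harvest"],
]
graph = build_cooccurrence_graph(docs)
labels = label_propagation(graph)
```

In this toy corpus the two groups of keywords never co-occur, so they end up with different labels. Real corpora share keywords across topics, which is where the choice of community detection algorithm matters.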

Place, publisher, year, edition, pages
IEEE, 2014. 46-53 p.
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-145360
DOI: 10.1109/WI-IAT.2014.79
ISI: 000365543800007
Scopus ID: 2-s2.0-84912558916
ISBN: 978-147994143-8 (print)
OAI: oai:DiVA.org:kth-145360
DiVA: diva2:717994
Conference
2014 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2014; University of Warsaw, Warsaw, Poland; 11 August 2014 - 14 August 2014
Note

Updated from manuscript to conference paper.

QC 20150108

Available from: 2014-05-19 Created: 2014-05-19 Last updated: 2016-01-08 Bibliographically approved
In thesis
1. Gossip-based Algorithms for Information Dissemination and Graph Clustering
2014 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Decentralized algorithms are becoming ever more prevalent in almost all real-world applications that are data intensive, computation intensive, or both. This thesis presents a few decentralized solutions for large-scale (i) data dissemination, (ii) graph partitioning, and (iii) data disambiguation. All these solutions are based on gossip, a lightweight peer-to-peer data exchange protocol, and are thus appropriate for execution in a distributed environment.
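As a rough illustration of the gossip style of data exchange mentioned above, the sketch below simulates a basic push protocol in which every informed peer forwards a message to one uniformly random peer per round. The function `push_gossip` and its parameters are hypothetical, and the model assumes every peer can reach every other peer; it is not one of the protocols developed in the thesis.

```python
import random


def push_gossip(num_peers, source=0, seed=0, max_rounds=1000):
    """Simulate push gossip; return (rounds used, set of informed peers)."""
    rng = random.Random(seed)
    informed = {source}
    rounds = 0
    while len(informed) < num_peers and rounds < max_rounds:
        # Iterate over a snapshot so peers informed in this round
        # only start pushing in the next one.
        for _ in list(informed):
            informed.add(rng.randrange(num_peers))
        rounds += 1
    return rounds, informed


rounds, informed = push_gossip(64)
```

Because the informed set can at most double per round, reaching 64 peers takes at least 6 rounds. No peer needs a global view of the system to take part.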

For efficient data dissemination, we make use of the publish/subscribe communication model and provide two distributed solutions, one for topic-based and one for content-based subscriptions, named Vitis and Vinifera respectively. These systems propagate large quantities of data to interested users with a relatively low overhead. Without any central coordinator and only with the use of gossip, we build a novel topology that enables efficient routing in an unstructured overlay. We construct a hybrid system by injecting structure into an otherwise unstructured network. The resulting structure resembles a navigable small-world network that spans clusters of nodes that have similar subscriptions. The properties of such an overlay make it an ideal platform for efficient data dissemination in large-scale systems. Our solutions significantly outperform their counterparts on various subscription and churn scenarios, from both synthetic models and real-world traces.

We then investigate how gossiping protocols can be used, not for overlay construction, but for operating on fixed overlay topologies, which resemble graphs. In particular, we study the NP-complete problem of graph partitioning and present a distributed partitioning solution for very large graphs. This solution, called Ja-be-Ja, is based on local search and does not require access to the entire graph simultaneously. It is, therefore, appropriate for graphs that cannot even fit into the memory of a single computer. Once again, gossip-based algorithms prove efficient, as they enable implementing lightweight peer sampling services, which supply graph nodes with partial knowledge about other nodes in the graph. The performance of our partitioning algorithm is comparable to centralized graph partitioning algorithms, and yet it is scalable and can be executed on several machines in parallel or even in a completely distributed peer-to-peer overlay. It can be used for both edge-cut and vertex-cut partitioning of graphs and can produce partition sizes of any given distribution.
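To make the idea of partitioning by local search concrete, here is a toy sketch in which pairs of nodes swap partition labels whenever the swap lowers the number of cut edges around them. It is an assumption-laden illustration, not Ja-be-Ja itself: the helper names, the uniform random choice of a swap partner (standing in for a peer sampling service), and the example graph are all hypothetical.

```python
import random
from collections import Counter


def cut_at(graph, colors, node):
    """Count edges from `node` to neighbours in a different partition."""
    return sum(1 for n in graph[node] if colors[n] != colors[node])


def edge_cut(graph, colors):
    """Count edges whose endpoints lie in different partitions."""
    return sum(
        1 for u in graph for v in graph[u] if u < v and colors[u] != colors[v]
    )


def swap_local_search(graph, colors, rounds=50, seed=0):
    """Swap labels between node pairs when that reduces their local cut."""
    rng = random.Random(seed)
    colors = dict(colors)  # work on a copy
    nodes = sorted(graph)
    for _ in range(rounds):
        for u in nodes:
            v = rng.choice(nodes)  # assumed stand-in for peer sampling
            if colors[u] == colors[v]:
                continue
            before = cut_at(graph, colors, u) + cut_at(graph, colors, v)
            colors[u], colors[v] = colors[v], colors[u]
            after = cut_at(graph, colors, u) + cut_at(graph, colors, v)
            if after >= before:
                # Not an improvement: undo the swap.
                colors[u], colors[v] = colors[v], colors[u]
    return colors


# Two triangles joined by the single edge 2-3, with a poor initial labelling.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
initial = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}
result = swap_local_search(graph, initial)
```

Swapping labels rather than reassigning them keeps the partition sizes fixed, and each decision only inspects the neighbourhoods of the two nodes involved. A greedy rule like this one can stall in a local minimum.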

We further extend the use of gossiping protocols to find natural clusters in a graph instead of producing a given number of partitions. This problem, known as graph community detection, has extensive applications in various fields and communities. We take our community detection algorithm to the realm of linguistics and address a well-known problem of data disambiguation. In particular, we provide a parallel community detection algorithm for the cross-document coreference problem. We operate on graphs that we construct by representing documents’ keywords as nodes and the co-location of those keywords in a document as edges. We then exploit the particular nature of such graphs, namely that coreferent words are topologically clustered and can thus be efficiently discovered by our community detection algorithm.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2014. x, 22 p.
Series
TRITA-ICT-ECS AVH, ISSN 1653-6363 ; 14:09
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-145361 (URN)
978-91-7595-108-9 (ISBN)
Public defence
2014-05-22, Sal/Hall D, KTH - ICT, Isafjordsgatan 39, Kista, 13:00 (English)
Opponent
Supervisors
Note

QC 20140519

Available from: 2014-05-19 Created: 2014-05-19 Last updated: 2014-05-19 Bibliographically approved

By author/editor
Rahimian, Fatemeh; Girdzijauskas, Sarunas; Haridi, Seif
By organisation
Software and Computer systems, SCS; Communication Networks