Recovery From Random Samples in a Big Data Set
2015 (English)In: IEEE Communications Letters, ISSN 1089-7798, E-ISSN 1558-2558, Vol. 19, no 11, 1929-1932 p.Article in journal (Refereed) Published
Consider a collection of files, each of which is a sequence of letters. One of these files is randomly chosen and a random subsequence of the file is revealed. This random subsequence can be the result of a random sampling of the file. The goal is to recover the identity of the file, assuming a simple greedy matching algorithm to search the file collection. We study the fundamental limits on the maximum size of the file collection for reliable recovery in terms of the length of the random subsequence. The sequence of each file is assumed to follow a hidden Markov model (HMM), which is a common model for many data structures such as voice or DNA sequences. The connection between this problem and coding over a deletion channel with greedy decoders is discussed.
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2015. Vol. 19, no 11, 1929-1932 p.
Hidden Markov model, deletion channel, greedy match, DNA sequences
Research subject Information and Communication Technology
IdentifiersURN: urn:nbn:se:kth:diva-189314DOI: 10.1109/LCOMM.2015.2478815ISI: 000365028900018ScopusID: 2-s2.0-84947724883OAI: oai:DiVA.org:kth-189314DiVA: diva2:945520
QC 201609012016-07-012016-07-012016-09-01Bibliographically approved