TY - JOUR
T1 - SlideSort
T2 - All pairs similarity search for short reads
AU - Shimizu, Kana
AU - Tsuda, Koji
N1 - Funding Information:
Funding: Supported by a Grant-in-Aid for Young Scientists (22700319, 21680025) from JSPS and, in part, by the FIRST program of the Japan Society for the Promotion of Science.
PY - 2011/2
Y1 - 2011/2
N2 - Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge number of short reads. Searching for similar pairs in a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm, SlideSort, that finds all similar pairs in a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single-link clustering, which is useful in summarizing short reads for further processing.
UR - http://www.scopus.com/inward/record.url?scp=79951520746&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951520746&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btq677
DO - 10.1093/bioinformatics/btq677
M3 - Article
C2 - 21148542
AN - SCOPUS:79951520746
VL - 27
SP - 464
EP - 470
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 4
M1 - btq677
ER -