Slidesort

All pairs similarity search for short reads

Kana Shimizu, Koji Tsuda

Research output: Contribution to journal › Article

15 Citations (Scopus)

Abstract

Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for huge numbers of short reads. Searching for similar pairs in a string pool is a fundamental step in de novo genome assembly, genome-wide alignment and other important analyses.

Results: In this study, we designed and implemented SlideSort, an exact algorithm that finds all similar pairs in a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing.
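
The filter-then-verify idea described in the abstract can be illustrated with a much simpler scheme than SlideSort itself. The sketch below is not the authors' pattern growth algorithm and builds no chains of k-mers. It uses a plain single-block pigeonhole filter, closer to the single k-mer methods the abstract compares against, followed by exact edit distance verification and single link clustering with a union-find structure. It assumes all reads have the same length, and the function names `similar_pairs` and `single_link_clusters` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def similar_pairs(reads, d):
    """Return all index pairs (i, j), i < j, with edit distance <= d.

    Assumption for this sketch: every read has the same length n > d.
    If two reads are within d edits, at least one of d + 1 disjoint
    blocks of one read occurs unchanged in the other, shifted by at
    most d positions, so only such pairs need to be verified.
    """
    n = len(reads[0])
    cuts = [n * b // (d + 1) for b in range(d + 2)]  # block boundaries
    index = defaultdict(list)
    for i, read in enumerate(reads):
        for b in range(d + 1):
            index[(b, read[cuts[b]:cuts[b + 1]])].append(i)

    candidates = set()
    for j, read in enumerate(reads):
        for b in range(d + 1):
            length = cuts[b + 1] - cuts[b]
            for shift in range(-d, d + 1):
                start = cuts[b] + shift
                if start < 0 or start + length > n:
                    continue  # shifted window falls outside the read
                for i in index.get((b, read[start:start + length]), ()):
                    if i != j:
                        candidates.add((min(i, j), max(i, j)))

    # Verification step: keep only candidates that pass the exact test.
    return sorted(p for p in candidates
                  if edit_distance(reads[p[0]], reads[p[1]]) <= d)


def single_link_clusters(n_items, pairs):
    """Group items into connected components of the similarity graph."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for x in range(n_items):
        groups[find(x)].append(x)
    return sorted(groups.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "CGTACGTA", "TTTTTTTT", "ACGAACGT"]
    pairs = similar_pairs(reads, d=2)
    print(pairs)
    print(single_link_clusters(len(reads), pairs))
```

The filter never drops a true pair, so the result is exact, but a single shared block is a weak condition and many candidates may fail verification. Reducing that number of edit distance calculations is the improvement the abstract attributes to chains of common k-mers.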

Original language: English
Article number: btq677
Pages (from-to): 464-470
Number of pages: 7
Journal: Bioinformatics
Volume: 27
Issue number: 4
DOI: 10.1093/bioinformatics/btq677
Publication status: Published - 2011 Feb
Externally published: Yes

Fingerprint

Similarity Search
Edit Distance
Genes
Genome
Strings
DNA Sequencing
Backtracking
DNA
Exact Algorithms
DNA Sequence Analysis
Cluster Analysis
Alignment
Software
Clustering
Scaling
Processing
Technology
Evaluate
Growth

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability
  • Medicine(all)

Cite this

Slidesort: All pairs similarity search for short reads. / Shimizu, Kana; Tsuda, Koji.

In: Bioinformatics, Vol. 27, No. 4, btq677, 02.2011, p. 464-470.

@article{3cb36d865da84390ba8d8ce71f4d260f,
title = "Slidesort: All pairs similarity search for short reads",
abstract = "Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing.",
author = "Kana Shimizu and Koji Tsuda",
year = "2011",
month = "2",
doi = "10.1093/bioinformatics/btq677",
language = "English",
volume = "27",
pages = "464--470",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "4",

}

TY - JOUR

T1 - Slidesort

T2 - All pairs similarity search for short reads

AU - Shimizu, Kana

AU - Tsuda, Koji

PY - 2011/2

Y1 - 2011/2

AB - Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing.

UR - http://www.scopus.com/inward/record.url?scp=79951520746&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79951520746&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btq677

DO - 10.1093/bioinformatics/btq677

M3 - Article

VL - 27

SP - 464

EP - 470

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 4

M1 - btq677

ER -