FLASH: a fast look-up algorithm for string homology

Andrea Califano; I. Rigoutsos

CSCCVPR 1993

Conference paper

01 Dec 1993

FLASH: a fast look-up algorithm for string homology

Abstract

We are in the middle of a long-range worldwide race to map and sequence the genome of Homo sapiens and that of many other living creatures. About 108 nucleotides and aminoacids of mammals, primates, rodents, bacteria, and other life forms have already been classified and stored in publicly available database such as Genbank. By the end of the century we should be close to the estimated mark of 1 billion nucleotides. A key issue in managing such large amounts of data is the availability of efficient, accurate, and selective techniques to detect homology (similarity) between newly recovered and previously acquired sequences. Unfortunately, even today's most advanced algorithms such as Smith-Waterman, FASTA and BLAST, are designed to scan the contents of the entire database for one or more matches. This results in long search times or in a sharp tradeoff between accuracy and amount of computation. The algorithm we present here is based on a probabilistic indexing framework which requires minimal access to the database for such match. A highly redundant number of descriptive tuples from the sequences of interest are generated and used as indices in table look-up paradigm. Theoretical and experimental results on the sensitivity and accuracy of the approach are provided. This includes the probability of correct and random matches and the storage and computational requirements. An experimental system has been implemented for a database containing the complete genome of the bacteria E.Coli (approximately 2 million nucleotides). The system is being expanded to include the complete Genbank database. Search time is of a few seconds on a workstation class machine. The algorithm is shown to scale well to databases containing billions of nucleotides with performances that are orders of magnitude better even than BLAST, the fastest of the current techniques. The approach is of a very general nature and is being applied to a large number of other domains and data topologies such as speech recognition, sound and image databases.

Conference paper

Pattern discovery on character sets and real-valued data: Linear bound on irredundant motifs and an efficient polynomial time algorithm

L.P. Parida, I. Rigoutsos, et al.

SODA 2000

Paper

Functional classification of proteins by pattern discovery and top-down clustering of primary sequences

A.H. Liu, Andrea Califano

IBM Systems Journal

Conference paper

Systematic and automated discovery of patterns in PROSITE families

Reece Hart, A. Royyuru, et al.

RECOMB 2000

Paper

Building dictionaries of 1D and 3D motifs by mining the Unaligned 1D sequences of 17 archaeal and bacterial genomes.

I. Rigoutsos, Y. Gao, et al.

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology

View all publications

Abstract

Related

Pattern discovery on character sets and real-valued data: Linear bound on irredundant motifs and an efficient polynomial time algorithm

Functional classification of proteins by pattern discovery and top-down clustering of primary sequences

Systematic and automated discovery of patterns in PROSITE families

Building dictionaries of 1D and 3D motifs by mining the Unaligned 1D sequences of 17 archaeal and bacterial genomes.