Distributed Algorithms for Building and Using Big Suffix-based Fulltext Indices

    Project: Research project

    Project Details


    Indexing of data is the very first and crucial step in many data processing applications. Suffix-based indices, such as suffix tree, suffix array, BWT (Burrows-Wheeler Transform) and FM-Index, have become the indispensable fundamental index structures for full-text search of string data, especially for use in bioinformatics. The aim of this project is to explore, by utilizing some very recent results on external memory suffix sorting, new algorithms that would be able to build and use big suffix-based full-text indices to process massive string data at a high throughput, and to develop the essential system technologies for linking these algorithms to applications. The algorithms to be developed are universal for full-text retrievals of both genome and text data.
    Recently, we have successfully extended our well-known internal memory algorithms SA-IS and SA-DS to the external memory domain, resulting in two algorithms called DSA-IS and EM-SA-DS for sorting suffixes in external memory. The experimental results collected show that the induced sorting technique can work efficiently in both internal memory and external memory. With these new developments, we are now able to use a computer with a massive disk to build big suffix trees and suffix arrays with much better time and space efficiency. As the next move, we need to consider how to further develop efficient distributed algorithms to scale the problem’s size and solving speed.
    The first step in this research is to design a general-purpose distributed algorithm for building a big suffix tree using a cluster of computers in a high-speed local area network (LAN) such as Gigabit Ethernet. The second step is to further optimize the general-purpose distributed suffix tree construction algorithm for building a k-order truncated suffix tree, in which each suffix is truncated to a fixed length of at most k characters, where k is far less than n for a size-n input string when n is large. Once a big suffix tree (full-order or truncated) has been efficiently built, the third step is to explore new efficient algorithms for using big suffix trees to provide the capability of full-text search in large-scale genome and text databases. Finally, all the source code in C++ for the algorithms developed in the previous steps will be engineered as a library for project achievement demonstration and use by third-party applications.
    Effective start/end date01/01/1631/12/18