Suffix Array-based Index for Use in Bioinformatics

    Project: Research project

    Project Details

    Description

    In 1990, Manber and Myers proposed suffix arrays (SAs) — a space-efficient alternative for suffix trees. For an n-character text S, its suffix array SA(S) is an array of pointers, which requires O(nlogn)-bit space, for all the suffixes in S sorted in lexicographically ascending order. Since then, suffix arrays have been widely adopted in data indexing, retrieving, sorting and processing. As suffix arrays are used in a broad range of applications in stringology and other related fields such as compression, string matching and computational biology, constructing suffix arrays became a major part of the foundation for subsequent tasks. In the past few years, it has been observed that many applications for suffix arrays involve huge inputs of text that are measured in billions of characters (e.g. biology genome database). In order to facilitate the increasing number of tasks which have to deal with huge databases, researchers have started intensive studies on time-and-space efficient suffix array construction algorithms (SACAs). Moreover, as sequence searching has become a very common task in many applications over different areas, low-cost settings for building SAs would be essential. Hence, another direction of research for SAs is the construction of huge SAs using external memory. This would allow the construction of huge SAs with low cost general purpose computers.
    Recently, based on the recursive reduction mechanism with fixed-length substrings and variable-length substrings respectively, we introduce two SACAs — SA-DS and SA-IS. With their distinct features and sophisticated sorting techniques, SA-DS has been shown as the most space-efficient while SA-IS the most time-efficient among all the existing linear- and superlinear SACAs. In this project, we will investigate the design variants of our algorithms using external memory that makes the algorithms applicable for general purpose computers to build suffix arrays for huge databases. Also, we will advance the elegant sorting strategies that were proposed in the invention of our algorithms to further simplify the algorithms for reconstructing SAs. More specifically, we will focus on developing scalable SA-based indices that seek to meet the demands from emerging bioinformatics applications. Finally, we will build a library in C++ for the developed algorithms and a test-bed system to demonstrate how to use the advantageous algorithms in our library to build third party applications.
    StatusFinished
    Effective start/end date01/01/1331/12/15

    Keywords

    • suffix array
    • suffix tree
    • bioinformatics
    • indexing
    • linear-time algorithm