We present a new suffix array construction algorithm that aims to build, in external memory, the suffix array for an input string of length n measured in the magnitude of tens of Giga characters over a constant or integer alphabet. The core of this algorithm is adapted from the framework of the original internal memory SA-DS algorithm that samples fixed-size d-critical substrings. This new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct a suffix array in a sequential scanning manner with good data spatial locality: data is read from or written to disk sequentially. On the assumed externalmemory model with RAM capacity σ((nB)0.5), disk capacity O(n), and size of each I/O block B, allmeasured in log n-bit words, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based solution that could be further exploited to develop external-memory solutions for other suffix-array-related problems, for example, computing the longest-common-prefix array, using a modern personal computer with a typical memory configuration of 4GB RAM and a single disk. Copyright © 2014 ACM.
|Journal||ACM Transactions on Information Systems|
|Publication status||Published - Jan 2014|
CitationNong, G., Chan, W. H., Zhang, S., & Guan, X. F. (2014). Suffix array construction in external memory using d-critical substrings. ACM Transactions on Information Systems, 32(1), Article 1.
- External memory
- Sorting algorithm
- Suffix array