Parallel corpora are critical resources for building many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area involving patents by investigating the potential of cultivating large-scale parallel corpora from comparable multilingual patents. We investigate two major issues concerning multilingual patents: (1) how to build large-scale corpora of comparable patents involving many languages, and (2) how to mine high-quality parallel sentences from these comparable patents. Four parallel corpora are presented as examples, and some preliminary statistical machine translation (SMT) experiments are reported. We further investigate and show the considerable potential of cultivating large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing. Copyright © 2013 Springer-Verlag Berlin Heidelberg.
|Title of host publication|Building and using comparable corpora|
|Editors|Serge SHAROFF, Reinhard RAPP, Pierre ZWEIGENBAUM, Pascale FUNG|
|Place of Publication|Berlin|
|Publication status|Published - 2013|
Citation: Lu, B., Chow, K. P., & Tsou, B. K. (2013). Comparable multilingual patents as large-scale parallel corpora. In S. Sharoff, R. Rapp, P. Zweigenbaum, & P. Fung (Eds.), Building and using comparable corpora (pp. 167-187). Berlin: Springer.
- Sentence alignment
- Multilingual patents
- PCT patents
- Parallel corpora
- Machine translation