Comparable multilingual patents as large-scale parallel corpora

Bin LU, Ka Po CHOW, Ka Yin Benjamin TSOU

Research output: Chapter in Book/Report/Conference proceeding > Chapters


Parallel corpora are critical resources for building many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area involving patents by investigating the potential of cultivating large-scale parallel corpora from comparable multilingual patents. Two major issues are investigated on multilingual patents: (1) How to build large-scale corpora of comparable patents involving many languages? (2) How to mine high-quality parallel sentences from these comparable patents? Four parallel corpora are presented as examples, and some preliminary SMT experiments are reported. We further investigate and show the considerable potential of cultivating large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing. Copyright © 2013 Springer-Verlag Berlin Heidelberg.
Original language: English
Title of host publication: Building and using comparable corpora
Editors: Serge SHAROFF, Reinhard RAPP, Pierre ZWEIGENBAUM, Pascale FUNG
Place of Publication: Berlin
ISBN (Electronic): 9783642201288
ISBN (Print): 9783642201271
Publication status: Published - 2013


Lu, B., Chow, K. P., & Tsou, B. K. (2013). Comparable multilingual patents as large-scale parallel corpora. In S. Sharoff, R. Rapp, P. Zweigenbaum, & P. Fung (Eds.), Building and using comparable corpora (pp. 167-187). Berlin: Springer.


  • Sentence alignment
  • Multilingual patents
  • PCT patents
  • Parallel corpora
  • Machine translation

