Comparable multilingual patents as large-scale parallel corpora

Bin LU, Ka Po CHOW, Ka Yin Benjamin TSOU

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

Abstract

Parallel corpora are critical resources for building many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area involving patents by investigating the potential of cultivating large-scale parallel corpora from comparable multilingual patents. We investigate two major issues concerning multilingual patents: (1) how to build large-scale corpora of comparable patents involving many languages, and (2) how to mine high-quality parallel sentences from these comparable patents. Four parallel corpora are presented as examples, and some preliminary statistical machine translation (SMT) experiments are reported. We further show the considerable potential of cultivating large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing. Copyright © 2013 Springer-Verlag Berlin Heidelberg.
Original language: English
Title of host publication: Building and using comparable corpora
Editors: Serge SHAROFF, Reinhard RAPP, Pierre ZWEIGENBAUM, Pascale FUNG
Place of Publication: Berlin
Publisher: Springer
Pages: 167-187
ISBN (Electronic): 9783642201288
ISBN (Print): 9783642201271
Publication status: Published - 2013

Fingerprint

information processing
data acquisition
patent
resource
experiment

Citation

Lu, B., Chow, K. P., & Tsou, B. K. (2013). Comparable multilingual patents as large-scale parallel corpora. In S. Sharoff, R. Rapp, P. Zweigenbaum, & P. Fung (Eds.), Building and using comparable corpora (pp. 167-187). Berlin: Springer.

Keywords

  • Sentence alignment
  • Multilingual patents
  • PCT patents
  • Parallel corpora
  • Machine translation