A multilingual corpus: Its construction and application

Shan WANG, Francis BOND

Research output: Contribution to conference (Paper)

Abstract

Multilingual corpora with annotated information play a vital role in linguistic research and natural language processing. For example, the sense-tagged English SemCor (Landes et al. 1998) has been applied in various tasks (Kilgarriff 1998). However, such resources are rarely available for Asian languages. Based on the Nanyang Technological University Multilingual Corpus (Tan & Bond 2011, Bond et al. 2013), we are building such a corpus for four languages (English, Chinese, Japanese, and Indonesian) in four genres (story, essay, news, and tourism). The construction takes two major steps: (i) monolingual sense tagging using the wordnet of each language; (ii) linking the concepts in each subcorpus to the English corpus. Two sets of tools were created, one for each task. Many linguistic and practical issues arose during the construction process. This corpus has been used in the study of idiomatic expressions in Chinese (Ho et al. 2014). It can also be used in many other tasks, such as contrastive studies of languages, genre analysis, translation, and language learning.
Original language: English
Publication status: Published - Dec 2014

Citation

Wang, S., & Bond, F. (2014, December). A multilingual corpus: Its construction and application. Paper presented at The Linguistic Society of Hong Kong - Annual Research Forum 2014 (LSHK-ARF 2014), City University of Hong Kong, Hong Kong, China.