Multilingual corpora with annotated information play a vital role in linguistic research and natural language processing. For example, the sense-tagged English SemCor (Landes et al. 1998) has been applied to a variety of tasks (Kilgarriff 1998). However, such resources are rarely available for Asian languages. Based on the Nanyang Technological University Multilingual Corpus (Tan & Bond 2011, Bond et al. 2013), we are building such a corpus for four languages (English, Chinese, Japanese, and Indonesian) in four genres (story, essay, news, and tourism). The construction involves two major steps: (i) monolingual sense tagging using the wordnet of each language; (ii) linking the concepts in each subcorpus to the English corpus. A separate set of tools was created for each of the two tasks. Many linguistic and practical issues arose during the construction process. This corpus has been used in the study of idiomatic expressions in Chinese (Ho et al. 2014). It can also be used in many other tasks, such as contrastive studies of languages, genre analysis, translation, and language learning.
Publication status: Published, Dec 2014