Cultivation of a very large monitoring corpus of Chinese: Some methodological considerations and applications

Ka Yin Benjamin TSOU

Research output: Contribution to conference (paper)

Abstract

Advancements in computer science and related technology, as well as the Internet, have enabled the unprecedented cultivation and curation of large amounts of language data for expanding applications in many domains. We shall focus on research relating to an unusual monitoring corpus from three perspectives: (1) The methodological considerations underlying research based on Project LIVAC (Linguistic Variations in Chinese Speech Communities Synchronous Corpus) (http://livac.org). Since 1995, the project has regularly and rigorously sampled and analyzed more than 450 million Chinese characters of representative media texts from major Chinese speech communities such as Beijing, Hong Kong, Macau, Shanghai, Singapore and Taiwan. The analysis was predicated on the successful handling of POS tagging and of the problematic tokenization of Chinese texts, which are represented as continuous strings of logographic characters, and it has yielded an unusually large database of 1.6 million word types in the LIVAC corpus. This database has allowed us to compare English and other Western alphabetic languages with Chinese in terms of entropy, a measure of the efficacy and efficiency of the encoding and management of information content, and to bootstrap the parallel alignment of comparable Chinese and English texts. (2) Given the synchronous and homothematic nature of the corpus material, we have been able to monitor and analyze some salient aspects of grammatical innovations and cognitive aspects of naturalistic classification, as well as to enhance the sentiment analysis of Chinese press coverage of US presidential elections. (3) The application of the authoritative language characteristics to examine issues related to threshold literacy in Chinese, to the construction of language assessment tools, and to determining readability in Chinese. The research horizon in linguistics has increasingly gone beyond the idealized speaker(s).
It will be argued that while a large synchronous monitoring corpus of comparable size may not be easily cultivated, a similar corpus based on even a single community may be quite readily and profitably attempted for a variety of languages, using naturally occurring texts that are collected and sampled systematically and annotated appropriately for analysis.
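
As an illustration of the entropy measure mentioned above, the following is a minimal sketch of unigram Shannon entropy, H = -Σ p(x) log2 p(x), computed over the symbols of a text. It is not the procedure used in Project LIVAC, which the abstract does not specify; the function name `shannon_entropy` and the two toy strings are assumptions introduced here for illustration only, and meaningful comparisons would require large, properly tokenized samples.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the unigram Shannon entropy (bits per symbol) of a sequence.

    `symbols` may be a string (symbols are characters) or a list of
    word tokens. Hypothetical helper, not part of any LIVAC tooling.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty sequence carries no information
    # H = -sum over symbol types of p * log2(p), with p = count / total
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not LIVAC data.
    chinese_sample = "語言資料庫的語言資料"
    english_sample = "language data in a language database"

    # Entropy per Chinese character.
    print(shannon_entropy(chinese_sample))
    # Entropy per Latin letter (spaces removed).
    print(shannon_entropy(english_sample.replace(" ", "")))
    # Entropy per whitespace-delimited English word.
    print(shannon_entropy(english_sample.split()))
```

The same function accepts a list of segmented words, so entropy per word type could be estimated once Chinese text has been tokenized; the segmentation step itself is outside the scope of this sketch.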
Original language: English
Publication status: Published - 2013

Fingerprint

monitoring
language
community
linguistics
information content
entropy
presidential election
computer science
Singapore
Hong Kong
Taiwan
coverage
literacy
innovation
Internet
efficiency
management

Citation

Tsou, B. K. (2013, January). Cultivation of a very large monitoring corpus of Chinese: Some methodological considerations and applications. Paper presented at the American Association for Corpus Linguistics 2013, San Diego, California.