The corpus of mid-20th century Hong Kong Cantonese (second phase) and its applications

Chi On CHIN, Alistair Michael TWEED

Research output: Contribution to conference (paper)


The Corpus of Mid-20th Century Hong Kong Cantonese (hereafter HKCC) is one of the very few Cantonese corpora that provide interactive spoken language data for Cantonese linguistic research. The first phase of HKCC was launched in 2012 with about 200,000 character tokens. The second phase is much expanded, with data from 60 movies totaling about 800,000 character tokens.
While the primary purpose of the corpus was to support diachronic studies of Cantonese as spoken half a century ago, the dialogic and interactive nature of the corpus data is also useful for other research questions. Beyond basic information such as word lists, token frequencies and sentences, HKCC can, with further computational processing and analysis, yield richer quantitative and qualitative data. One such example is word collocation. In this talk, we will demonstrate how such information can be obtained from the second phase of HKCC and how it can be applied in Cantonese studies.
Copyright © 2019 Workshop on Cantonese (WOC).
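As an illustration of the kind of collocation information mentioned above, the following is a minimal sketch that ranks adjacent word pairs by pointwise mutual information (PMI). It is not the method used for HKCC, which the abstract does not describe. The function name `pmi_collocations`, the `min_count` threshold, and the toy utterances are all hypothetical, and the input is assumed to be already word-segmented.

```python
import math
from collections import Counter


def pmi_collocations(utterances, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    `utterances` is a list of word-segmented utterances (lists of strings).
    This input format is an assumption made for illustration only.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for words in utterances:
        word_counts.update(words)
        # Count each pair of neighbouring words within an utterance.
        pair_counts.update(zip(words, words[1:]))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (w1, w2), n in pair_counts.items():
        if n < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = n / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy utterances, not taken from HKCC.
toy = [
    ["我", "食", "飯"],
    ["你", "食", "飯"],
    ["我", "飲", "茶"],
    ["佢", "飲", "茶"],
]
for pair, score in pmi_collocations(toy):
    print(pair, round(score, 2))
```

PMI is only one of several common association measures. With a corpus of a few hundred thousand tokens, a frequency threshold such as `min_count` is usually needed because PMI tends to overrate rare pairs.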
Original language: English
Publication status: Published - Apr 2019



Chin, A., & Tweed, A. (2019, April). The corpus of mid-20th century Hong Kong Cantonese (second phase) and its applications. Paper presented at the Workshop on Cantonese (WOC): Cantonese Study: An Empirical Approach, The Hong Kong Polytechnic University, Hong Kong, China.