This work reports on the construction of a corpus of connected spoken Cantonese. The corpus aims to provide an additional resource for the study of modern (Hong Kong) Cantonese and also involves several controlled elicitation tasks that will serve different projects related to the phonology and semantics of Cantonese. The corpus offers recordings, a phonemic transcription and a Chinese character transcription, and will offer a narrow phonetic transcription layer as well. The corpus will be distributed under the CC-BY-SA 4.0 International license. Apart from the linguistic insights to be gathered from the recorded data, the construction of the corpus is also innovative in that it uses out-of-the-box software to facilitate the transcription and reduce its cost. Previous works on Cantonese corpora include Leung and Law’s (2001) The Hong Kong Cantonese Adult Language Corpus (HKCAC), Luke and Wong’s (2015) The Hong Kong Cantonese Corpus (HKCanCorp) and Chin’s (2015) Linguistics Corpus of Mid-20th Century Hong Kong Cantonese. These corpora were either phonetically transcribed with IPA, phonemically transcribed and glossed with parts of speech, or transcribed with Chinese characters. Although they are precious resources, these corpora are not annotated in the same way, and their size is not suited to some recent data-intensive developments in the field of NLP. Our corpus thus comes as an additional resource for such usages by providing 13 hours of natural connected Cantonese speech. Besides providing additional data, one aim of our corpus was to elicit the realization of specific tonal sequences and other discursive features in natural connected speech. To do so, the project replicated part of the setting used in the HCRC MapTask corpus (Anderson et al., 1991). The material in the corpus thus consists of conversations between two participants with asymmetrical roles: the Instructor and the Receiver.
Both participants were given maps with various landmarks and names indicated in Chinese characters. The maps given to the participants differed in some controlled aspects (the choice of landmarks and the names of some landmarks). The Instructor’s map showed a path that the Receiver had to replicate on their own map by following the Instructor’s verbal instructions. The participants were unable to make eye contact, but were otherwise free to communicate in any way they wanted. Each participant was recorded using two Sony PCM-D100 recorders and the data was saved in WAV format. Another original aspect of our work is how our data was transcribed. The spoken data was transcribed with the Google Cloud Speech API, which offers a robust automatic transcription software environment. Since the API was not originally designed to transcribe large amounts of data, we developed a Python script that makes this possible. The accuracy of the software was evaluated by comparing the automatically transcribed data with manually transcribed data. Preliminary results indicate that the accuracy is high and that the main errors found so far are the omission or misidentification of sentence-final particles and nonce words. Given these results, we recommend that researchers with limited funding jump-start the laborious and time-consuming task of transcription with this free tool; the transcription can then be manually corrected and fine-tuned where necessary in a subsequent step.
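The comparison between automatic and manual transcripts can be illustrated with a character error rate (CER), which suits text written in Chinese characters because it needs no word segmentation. The sketch below is only one possible way to carry out such a comparison: the abstract does not say which metric was used, and the function names and the example sentence are hypothetical, not taken from the corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example: the automatic transcript drops a sentence-final particle.
manual = "你去邊度呀"
automatic = "你去邊度"
print(character_error_rate(manual, automatic))
```

An omitted sentence-final particle counts as one deletion, so errors of the kind reported in the preliminary results would show up directly in this score.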
Publication status: Published, Dec 2016
Pulse code modulation
Application programming interfaces (API)