The computerised Cantonese corpus is the first corpus of mid-20th-century Cantonese in Hong Kong. This annotated corpus contains around 0.8 million characters of speech data from over 300 actors, drawn from the dialogue of 70 Hong Kong movies. The rigorously processed corpus data (including segmentation and part-of-speech tagging) can provide solid quantitative and qualitative evidence, such as word frequency and word association, which is essential for developing appropriate teaching and learning materials for the Chinese language in the context of Hong Kong. The corpus data can also be used to develop speech-to-text or text-to-speech applications as well as chatbot systems.
Awarded date
Jun 2019
Granting Organisations
Silicon Valley International Invention Festival (SVIIF)