CUHK
News Centre
CUHK Faculty of Engineering Develops Hong Kong’s First Automatic Colloquialism and Typo Detection System for Chinese Language
A system called “Automatic Colloquialism and Typo Detection System for Chinese Language” has recently been developed by a research team led by Prof. Wong Kam Fai, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong (CUHK). This Automatic Chinese Detection System is the first of its kind in Hong Kong and targets local students. The pilot system has already been tested among language teachers and local primary and secondary school students, and proven effective in enhancing Chinese language teaching and learning.
Colloquial expressions, abbreviations, slang, puns, and emoji have become integrated into daily internet language for teenagers, while social media continues to play an important role in online interactions. Some believe that such internet language has lowered the Chinese language writing proficiency of many Hong Kong students. In order to address this problem, the CUHK team proposed an efficient and user-friendly “Automatic Colloquialism and Typo Detection System for Chinese Language” system through algorithms based on grand-scale Cantonese language data mining, in-depth calculation, and classification.
The pilot system was applied to articles containing a hundred to a thousand words written by primary and secondary school students, and it took just a few seconds to detect almost every error with an extremely low error rate. It also offered replacement suggestions for every typo and Cantonese colloquialism, and offered explanations for many of the Cantonese colloquialisms. The team hopes that in the future the system will be promoted to more primary and secondary schools in Hong Kong as a fun, easy-to-use and useful tool in Chinese language education for students and teachers, and that it will enhance students’ language proficiency. The team envisions a brand new add-on for office software (such as Microsoft Office), which may become available for public use within this year. The system was recently showcased at the 2017 China Innovation and Entrepreneurship Fair.
More accurately identified typos through an intelligent algorithm
The system is divided into two parts: “Typo Detection” and “Cantonese Detection”. After entering a Chinese sentence or chapter, the system first uses the “Typo Detection Module”, based on “Segmentation” and “Part-of-speech tagging”, to automatically search for any words that do not fit in the meaning of the sentence. Although some research units also use similar logic for error detection, many common words, such as “的”, “地”, and “是”, are easily misjudged as typos because of the limitations of existing algorithms. Based on big data and deep learning, and in conjunction with a unique intelligent algorithm, the CUHK system is able to identify colloquial usage and inversions in Cantonese language. The team also constructed a confusion set containing over 60,000 Chinese words. With scores assigned to potential corrections, users are always offered the most suitable replacements.
Changing students’ habit of using colloquial expressions
The “Cantonese Detection Module” is a unique feature that detects whether a sentence contains Cantonese colloquial expressions based on a large Cantonese dictionary containing more than 12,000 words, which is still being expanded and optimised. For example, Cantonese language users are inclined to use “鍾意” instead of “喜歡” in the context of preferences. The Detection Module also adopts a rule-based system, the Cantonese linguistic rules of which were described by the team using Part-of-speech tagging. Accordingly, the team built a number of rules as a start for basic Chinese sentence structures. Furthermore, the system identifies the usage of quantifiers, such as “一條魚/一尾魚”, simplified Chinese, and inversions in sentences, such as “緊要/要緊”.
The research group of Professor Wong specialises in the research areas of Natural Language Processing, Web Mining, Rumor Detection, etc. “We chose Cantonese because of its many special and sophisticated properties, such as a unique grammar system and many colloquial terms, all of which makes the task of detecting errors extremely challenging. We hope that our work will ultimately promote and facilitate Chinese language learning,” said Professor Wong.
“It is difficult to develop a universal language typographical detection system as the use of language evolves over time and space, but we aim at a system with improved detection and performance. Deep learning and artificial intelligence are well adopted in the system such that we keep upgrading the word bank and grammar rules based on the changing needs and special requirements of the users and language teachers,” said Dr. Gabriel Fung, Research Fellow, Department of Systems Engineering and Engineering Management, CUHK.