2025 UAAT TAIWAN-CANADA AFFINITY PROGRAM
Revitalising Taiwanese Hokkien through Neural Language Technologies
A Global Collaborative Approach to Educational Learning and Multimodal AI
🇹🇼 Taiwan
🇨🇦 Canada
🤖 GenAI
Mission Statement
This multi-year project is co-led by Prof. Richard Tzong-Han Tsai (Taiwan) and Prof. Annie En-Shiun Lee (Canada) to address the scarcity of digital resources for low-resource languages. By uniting institutions across both nations, we aim to bridge the generational linguistic gap through high-quality machine translation resources and AI-powered multimodal learning tools such as ATAIGI.
Core Research Thrusts
1. Improving Translation Quality
Goal: Overcoming data scarcity and "translationese" errors in Sinitic languages (Hokkien, Cantonese, Wu).
- Hokkien MT Error Dataset: Creating the first span-level error annotation dataset (MQM-derived) for fine-grained evaluation.
- Dual Translation Models: Developing models optimized for Hokkien (HAN/POJ) ↔ Mandarin/English using TAIDE-7B and LLaMA.
- Corpus Standardization: Leveraging orthographic similarities to standardize writing systems (HAN, POJ, HL).
- Tools: Deploying the TRANSLATIONCORRECT framework for rigorous annotation.
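To make "span-level error annotation" concrete, here is a minimal sketch of how one annotated translation might be represented and scored. It is not the project's actual schema: the `ErrorSpan` fields, the category labels, and the severity weights (minor = 1, major = 5, a convention commonly used in MQM-style scoring) are all assumptions for illustration, and the example sentence is an English placeholder.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity weights, following a common MQM-style convention.
# The weights used by the project's dataset are not stated in the source.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}


@dataclass(frozen=True)
class ErrorSpan:
    start: int     # inclusive character offset into the translation
    end: int       # exclusive character offset
    category: str  # illustrative label, e.g. "accuracy/mistranslation"
    severity: str  # must be a key of SEVERITY_WEIGHTS


def validate(span: ErrorSpan, translation: str) -> None:
    """Reject spans that fall outside the text or use an unknown severity."""
    if not (0 <= span.start < span.end <= len(translation)):
        raise ValueError(f"span {span.start}:{span.end} is out of range")
    if span.severity not in SEVERITY_WEIGHTS:
        raise ValueError(f"unknown severity: {span.severity!r}")


def marked_text(span: ErrorSpan, translation: str) -> str:
    """Return the substring of the translation that the annotator marked."""
    return translation[span.start:span.end]


def penalty(spans: List[ErrorSpan], translation: str) -> float:
    """Sum the severity weights of all valid spans for one translation."""
    for span in spans:
        validate(span, translation)
    return sum(SEVERITY_WEIGHTS[span.severity] for span in spans)


# Placeholder example: one major and one minor error in a single output.
example_translation = "The cat sat on the mat."
example_spans = [
    ErrorSpan(4, 7, "accuracy/mistranslation", "major"),
    ErrorSpan(22, 23, "fluency/punctuation", "minor"),
]
example_penalty = penalty(example_spans, example_translation)
```

Storing character offsets rather than only a sentence-level score is what allows fine-grained evaluation: each error can be located, categorised, and weighted separately.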
2. AI-Powered Language Learning
Goal: Creating an immersive, multimodal educational platform for language revitalization.
- ATAIGI App: A unified open-source app integrating translation, context generation, and transliteration.
- Multimodal GenAI: Generating contextual images (DALL-E 3) and audio to enhance vocabulary acquisition.
- 3D Avatar Chatbot: Integrating ASR (Whisper) and TTS for scenario-based roleplay and pronunciation practice.
- Pedagogical Validation: Conducting rigorous user studies with psycholinguistic measures.
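The chatbot item above describes a speech-in, speech-out loop. The sketch below shows only the shape of one roleplay turn (ASR, then reply generation, then TTS) and does not reflect ATAIGI's actual code. `RoleplayTurn` and its three callables are hypothetical names, and the stand-in functions simply pass text through so the example runs without any speech models installed.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RoleplayTurn:
    # Each stage is injected, so a real ASR model (the page names Whisper)
    # or a real TTS engine could replace the stand-ins below.
    transcribe: Callable[[bytes], str]   # learner audio -> text
    respond: Callable[[str, str], str]   # (scenario, learner text) -> reply
    synthesize: Callable[[str], bytes]   # reply text -> audio

    def run(self, scenario: str, audio: bytes) -> Tuple[str, str, bytes]:
        """Process one learner utterance and return (heard, reply, speech)."""
        heard = self.transcribe(audio)
        reply = self.respond(scenario, heard)
        speech = self.synthesize(reply)
        return heard, reply, speech


# Stand-in stages for illustration only: "audio" is just UTF-8 text here.
demo_turn = RoleplayTurn(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda scenario, text: f"[{scenario}] You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)

heard, reply, speech = demo_turn.run("market", b"li ho")
```

Keeping the three stages behind plain callables means each can be swapped or evaluated on its own, for example comparing ASR models on learner pronunciation without touching the dialogue logic.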
The Collaborative Team
Impact & Milestones
🚀 Key Achievements
- NAACL 2025 Demo Paper: "ATAIGI: An AI-Powered Multimodal Learning App Leveraging Generative Models for Low-Resource Taiwanese Hokkien".
- LREC-COLING 2024: "Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems".
- SINITIC MT ERROR Dataset: Publicly released for the Sinitic NLP research community.
Supported by the UAAT Taiwan Affinity Program and the National Science and Technology Council (NSTC).