2025 UAAT TAIWAN-CANADA AFFINITY PROGRAM

Revitalising Taiwanese Hokkien through Neural Language Technologies

A Global Collaborative Approach to Educational Learning and Multimodal AI

🇹🇼 Taiwan 🇨🇦 Canada 🤖 GenAI

Mission Statement

This multi-year project, co-led by Prof. Richard Tzong-Han Tsai (Taiwan) and Prof. Annie En-Shiun Lee (Canada), addresses the digital scarcity of low-resource languages. By uniting leading institutions in both countries, we aim to bridge the generational linguistic gap with high-quality machine translation resources and AI-powered multimodal learning tools such as ATAIGI.

Core Research Thrusts

1. Improving Translation Quality

Goal: Overcome data scarcity and "translationese" errors in machine translation for Sinitic languages (Hokkien, Cantonese, Wu).

  • Hokkien MT Error Dataset: Creating the first span-level error annotation dataset (MQM-derived) for fine-grained evaluation.
  • Dual Translation Models: Developing models optimized for Hokkien (HAN/POJ) ↔ Mandarin/English using TAIDE-7B and LLaMA.
  • Corpus Standardization: Leveraging orthographic similarities to standardize writing systems (HAN, POJ, HL).
  • Tools: Deploying the TRANSLATIONCORRECT framework for rigorous annotation.
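
To illustrate what span-level error annotation means in practice, here is a minimal Python sketch of how one annotated translation might be represented and scored. It is not the project's actual schema: the class names, field names, category label, and severity weights are all hypothetical, chosen only to show the general MQM-style idea of marking character spans in the system output and assigning each a category and severity.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical severity weights in the spirit of MQM-style scoring.
# The weighting actually used by the project is not stated in this page.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5}


@dataclass(frozen=True)
class ErrorSpan:
    start: int      # inclusive character offset into the MT output
    end: int        # exclusive character offset into the MT output
    category: str   # illustrative label, e.g. "accuracy/mistranslation"
    severity: str   # must be a key of SEVERITY_WEIGHTS


@dataclass
class AnnotatedTranslation:
    source: str
    hypothesis: str
    spans: List[ErrorSpan] = field(default_factory=list)

    def validate(self) -> None:
        """Reject spans that fall outside the hypothesis or use unknown severities."""
        for span in self.spans:
            if not 0 <= span.start < span.end <= len(self.hypothesis):
                raise ValueError(f"span out of range: {span}")
            if span.severity not in SEVERITY_WEIGHTS:
                raise ValueError(f"unknown severity: {span.severity}")

    def flagged_text(self) -> List[str]:
        """Return the hypothesis substrings marked as errors."""
        return [self.hypothesis[s.start:s.end] for s in self.spans]

    def penalty(self) -> int:
        """Sum the severity weights over all annotated spans."""
        return sum(SEVERITY_WEIGHTS[s.severity] for s in self.spans)


# Illustrative record: a Mandarin source ("I am very happy today.")
# whose English output wrongly says "yesterday".
hypothesis = "I am very happy yesterday."
start = hypothesis.index("yesterday")
example = AnnotatedTranslation(
    source="我今天很高興。",
    hypothesis=hypothesis,
    spans=[ErrorSpan(start, start + len("yesterday"),
                     "accuracy/mistranslation", "major")],
)
example.validate()
```

Storing offsets rather than copied substrings keeps each annotation tied to an exact location in the output, so two identical words in one sentence can be judged separately.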

2. AI-Powered Language Learning

Goal: Create an immersive, multimodal educational platform for language revitalization.

  • ATAIGI App: A unified open-source app integrating translation, context generation, and transliteration.
  • Multimodal GenAI: Generating contextual images (DALL-E 3) and audio to enhance vocabulary acquisition.
  • 3D Avatar Chatbot: Integrating ASR (Whisper) and TTS for scenario-based roleplay and pronunciation practice.
  • Pedagogical Validation: Rigorous user studies with psycholinguistic measures.

Impact & Milestones

🚀 Key Achievements

  • NAACL 2025 Demo Paper: "ATAIGI: An AI-Powered Multimodal Learning App Leveraging Generative Models for Low-Resource Taiwanese Hokkien".
  • LREC-COLING 2024: "Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems".
  • SINITIC MT ERROR Dataset: Publicly released for the Sinitic NLP research community.

Supported by the UAAT Taiwan Affinity Program and the National Science and Technology Council (NSTC).