Extended fork of NVIDIA’s NeMo text normalization framework, adding a specialized Emirati Arabic (ar_ae) module for production TTS frontends.
What’s new
A complete WFST-based normalization pipeline for the Gulf Arabic dialect, handling:
- Numbers — cardinal, ordinal, Arabic-Indic numerals
- Dates — Hijri and Gregorian calendar formats
- Currencies — 20+ Arab world currencies (AED, SAR, KWD, QAR, BHD, OMR…)
- Fractions — Arabic fraction expressions
- Phone numbers — UAE and GCC formats
- Electronic entities — URLs, emails in mixed Arabic/Latin text
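To make the categories above concrete, here is a minimal plain-Python sketch of the step that precedes verbalization: mapping Arabic-Indic digits to ASCII and guessing which category a span belongs to. It is not the fork's implementation, which is WFST-based. Every name here (`to_ascii_digits`, `classify`, `PATTERNS`) is hypothetical, and the regular expressions are simplified assumptions, for example treating a UAE number as `+971` followed by 8 or 9 digits.

```python
import re

# Arabic-Indic digits occupy U+0660..U+0669; map each one to its ASCII digit.
ARABIC_INDIC_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}

# Only the currency codes named in the list above; the real module covers more.
CURRENCY_CODES = ("AED", "SAR", "KWD", "QAR", "BHD", "OMR")

# Hypothetical, simplified patterns. They are tried in order; first match wins.
PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("url", re.compile(r"^(https?://|www\.)\S+$")),
    ("phone", re.compile(r"^\+971\d{8,9}$")),
    ("currency", re.compile(r"^\d+(\.\d+)?\s?(%s)$" % "|".join(CURRENCY_CODES))),
    ("fraction", re.compile(r"^\d+/\d+$")),
    ("cardinal", re.compile(r"^\d+$")),
]


def to_ascii_digits(text: str) -> str:
    """Replace Arabic-Indic digits with ASCII digits; leave other characters alone."""
    return text.translate(ARABIC_INDIC_TO_ASCII)


def classify(span: str) -> str:
    """Return the first matching category name, or 'plain' if nothing matches."""
    span = to_ascii_digits(span.strip())
    for name, pattern in PATTERNS:
        if pattern.match(span):
            return name
    return "plain"
```

A WFST grammar does this tagging and the subsequent verbalization in one composed transducer. The regex version only illustrates which inputs fall into which category.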
Coverage
Retains the 16+ languages supported by the base NeMo package and adds Emirati Arabic as a first-class language. The ar_ae module covers the linguistic specifics of the Gulf dialect and is separate from Modern Standard Arabic normalization.
Use case
Used as the text normalization frontend for the Emirati Arabic TTS pipeline (FastPitch + HiFi-GAN + VITS), ensuring proper verbalization before phoneme conversion.
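The ordering described above, normalization first and phoneme conversion second, can be sketched as a small composition helper. This is an illustrative assumption about how such a frontend might be wired, not the pipeline's actual code. `make_frontend` and its two parameters are hypothetical names.

```python
from typing import Callable, List


def make_frontend(
    normalize: Callable[[str], str],
    phonemize: Callable[[str], List[str]],
) -> Callable[[str], List[str]]:
    """Build a TTS frontend that always verbalizes text before phonemizing it."""

    def frontend(text: str) -> List[str]:
        spoken = normalize(text)   # written form -> spoken form
        return phonemize(spoken)   # spoken form -> phoneme sequence

    return frontend
```

In a real setup the `normalize` callable would likely be the `normalize` method of the `Normalizer` class from `nemo_text_processing.text_normalization.normalize`. Whether this fork accepts `lang="ar_ae"` in that constructor is an assumption based on the module name.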
Python · WFST · Apache 2.0