
📝  NeMo Text Processing — Arabic Emirati

Fork of NVIDIA NeMo text normalization adding Emirati Arabic (ar_ae) dialect support. WFST-based normalization for numbers, dates, currencies, and Gulf Arabic-specific entities.

NLP
Text Normalization
Arabic
Emirati
WFST
NeMo

Extended fork of NVIDIA’s NeMo text normalization framework, adding a specialized Emirati Arabic (ar_ae) module for production TTS frontends.

What’s new

A complete WFST-based normalization pipeline for the Gulf Arabic dialect, handling:

  • Numbers — cardinal, ordinal, Arabic-Indic numerals
  • Dates — Hijri and Gregorian calendar formats
  • Currencies — 20+ Arab world currencies (AED, SAR, KWD, QAR, BHD, OMR…)
  • Fractions — Arabic fraction expressions
  • Phone numbers — UAE and GCC formats
  • Electronic entities — URLs, emails in mixed Arabic/Latin text
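
To make the categories above concrete, here is a minimal plain-Python sketch of the first step a normalizer typically performs: mapping Arabic-Indic digits to ASCII and deciding which semiotic class a token belongs to. It is not the fork's WFST grammar. The names `to_ascii_digits`, `classify`, and `PATTERNS` are hypothetical, and the regular expressions (including the UAE phone pattern) are deliberately simplified assumptions.

```python
import re

# Arabic-Indic digits (U+0660..U+0669) mapped to their ASCII equivalents.
ARABIC_INDIC = "٠١٢٣٤٥٦٧٨٩"
TO_ASCII = str.maketrans(ARABIC_INDIC, "0123456789")


def to_ascii_digits(text: str) -> str:
    """Replace Arabic-Indic digits with ASCII digits; leave everything else alone."""
    return text.translate(TO_ASCII)


# Hypothetical, simplified patterns for illustration only.
# Order matters: more specific classes are checked before plain cardinals.
PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("url", re.compile(r"^(https?://|www\.)\S+$")),
    # Assumption: UAE numbers written as +971 followed by 8 or 9 digits.
    ("phone", re.compile(r"^\+971\d{8,9}$")),
    ("cardinal", re.compile(r"^\d+$")),
]


def classify(token: str) -> str:
    """Return the first matching class label, or "word" if nothing matches."""
    token = to_ascii_digits(token)
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "word"
```

In a WFST pipeline, the same decision is made by tagger grammars rather than regular expressions, and each class is then passed to a matching verbalizer; the sketch only shows the classification idea.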

Coverage

The fork keeps the 16+ languages supported by the base NeMo package and adds Emirati Arabic as a first-class language. The ar_ae module covers the linguistic specifics of the Gulf dialect and is separate from Modern Standard Arabic normalization.

Use case

Used as the text normalization frontend for the Emirati Arabic TTS pipeline (FastPitch + HiFi-GAN + VITS), ensuring proper verbalization before phoneme conversion.
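
The ordering described above, with normalization running before phoneme conversion, can be sketched as a simple chain of text-to-text stages. Everything in this block is a hypothetical stand-in: `run_frontend`, `normalize`, and `phonemize` are illustrative names, and the placeholder bodies do not reproduce the fork's grammars or the TTS pipeline's phonemizer.

```python
from typing import Callable, List

# A frontend stage takes text and returns transformed text.
Stage = Callable[[str], str]


def run_frontend(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        text = stage(text)
    return text


def normalize(text: str) -> str:
    # Placeholder: verbalize a single digit so the next stage sees only words.
    return text.replace("5", "خمسة")


def phonemize(text: str) -> str:
    # Placeholder: wrap each word in markers instead of producing real phonemes.
    return " ".join(f"<{word}>" for word in text.split())


# Normalization must come first; otherwise the phonemizer would receive raw digits.
result = run_frontend("5 AED", [normalize, phonemize])
```

Keeping each stage as a plain function makes the dependency explicit: the phonemizer never has to handle digits, dates, or currency symbols, because the normalizer has already replaced them with words.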

Python · WFST · Apache 2.0