ODB: Introducing Eliza
Eliza — AI Spanish Accent Coach
Eliza is a web app that gives phoneme-level pronunciation feedback to Spanish learners. It runs as a Hugging Face Space and uses a pipeline of acoustic speech models and a large language model to identify specific articulation errors and explain how to fix them.
The Problem
Most language-learning tools grade pronunciation by transcribing your speech to text and comparing it to the expected text. This approach has a fundamental flaw: it routes through orthography, silently laundering accent errors.
If a learner says perro with an English retroflex /ɹ/ instead of the Spanish trill /r/, Whisper still transcribes it as “perro” and the canonical IPA gets handed back unchanged. The exact error — the one the learner needs to know about — is invisible to the pipeline.
The fix is to never go through text at all: extract the phoneme sequence directly from the audio signal, then compare it against a reference.
System Design
The Core Pipeline
Microphone audio
│
▼
wav2vec2-lv-60-espeak-cv-ft (audio → IPA phoneme sequence)
│
▼
Needleman-Wunsch alignment (user IPA ↔ reference IPA)
│
▼
Error classifier (7 phonological rules)
│
▼
Claude Haiku (streaming coaching feedback)
1. Acoustic Phoneme Extraction — facebook/wav2vec2-lv-60-espeak-cv-ft
This wav2vec2 model is fine-tuned to output IPA phoneme sequences directly from raw audio using CTC decoding. It was chosen because:
- It never sees word identity. Because there is no text step, it cannot paper over mispronunciations. If you produce an English /ɹ/, that is what comes out.
- IPA output. The model outputs espeak-style IPA, which can be compared directly to a hardcoded reference transcription.
- Open weights. The model runs entirely on CPU in a free Hugging Face Space — no paid inference API needed.
The tradeoff: the model was trained on CommonVoice, which skews toward native and near-native speakers, and its accuracy on heavily accented learner speech is not formally benchmarked. In my testing it works well enough to surface real errors; fine-tuning on a learner corpus like L2-ARCTIC is the most interesting future direction.
2. Sequence Alignment — Needleman-Wunsch
Comparing phoneme sequences requires alignment because insertions and deletions shift the positions of everything downstream. This is especially important for the vowel exercise, where the user may insert a diphthong in place of a single vowel.
Needleman-Wunsch global alignment operates on lists of IPA characters (stress marks and syllable dots are stripped before alignment). It returns (reference_phoneme, hypothesis_phoneme) pairs, where None indicates a gap. This gives a precise, position-aware diff showing exactly which sounds were substituted, inserted, or deleted.
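A minimal sketch of the alignment (function name and scoring weights are illustrative, not the app's exact implementation):

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Globally align two phoneme lists; return (ref, hyp) pairs, None marking a gap."""
    n, m = len(ref), len(hyp)
    # Dynamic-programming score matrix, initialized with cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if ref[i-1] == hyp[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
            match if ref[i-1] == hyp[j-1] else mismatch
        ):
            pairs.append((ref[i-1], hyp[j-1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((ref[i-1], None)); i -= 1   # deletion
        else:
            pairs.append((None, hyp[j-1])); j -= 1   # insertion
    return pairs[::-1]
```

Aligning reference `peɾo` against a hypothesis `peɹoʊ` yields the pair `('ɾ', 'ɹ')` (a substitution) plus `(None, 'ʊ')` (an inserted off-glide), the position-aware diff the classifier consumes.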
3. Error Classification
Seven phonological rules map alignment patterns to named error types:
| Error | Expected | Produced | What it catches |
|---|---|---|---|
| trill_substitution | r | ɾ ʁ ɹ ɻ | Tap or approximant where trill required |
| tap_substitution | ɾ | r ɹ d ɻ | Trill, English r, or stop where tap required |
| vowel_reduction | a e i o u | ə ɪ ʊ ɐ | Pure vowel reduced to schwa or lax vowel |
| vowel_tensing | e | ɛ æ | Lax /ɛ/ or /æ/ where tense /e/ required |
| diphthong_insertion | o | oʊ | English off-glide added to /o/ |
| v_fricative | b β | v | Labiodental /v/ where bilabial required |
| b_hardening | β | b | Hard stop where bilabial approximant required |
The classifier also identifies which specific words each error occurred in, so the LLM can give word-specific feedback rather than vague generalities.
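The substitution rules above reduce to a lookup over aligned pairs. A sketch with an illustrative subset of the table (insertion-based rules like diphthong_insertion, which inspect gap pairs, are omitted for brevity):

```python
# Illustrative subset of the rule table: (error_name, expected set, produced set).
RULES = [
    ("trill_substitution", {"r"}, {"ɾ", "ʁ", "ɹ", "ɻ"}),
    ("tap_substitution",   {"ɾ"}, {"r", "ɹ", "d", "ɻ"}),
    ("vowel_reduction",    set("aeiou"), {"ə", "ɪ", "ʊ", "ɐ"}),
    ("v_fricative",        {"b", "β"}, {"v"}),
    ("b_hardening",        {"β"}, {"b"}),
]

def classify(pairs):
    """Map aligned (reference, hypothesis) phoneme pairs to (position, error) tuples."""
    errors = []
    for i, (ref, hyp) in enumerate(pairs):
        if ref == hyp:
            continue  # correct phoneme, nothing to flag
        for name, expected, produced in RULES:
            if ref in expected and hyp in produced:
                errors.append((i, name))
                break  # first matching rule wins
    return errors
```

The positions returned here are what lets the app map each error back to the word it occurred in before prompting the LLM.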
4. LLM Coaching — Claude Haiku
Detected errors and their word context are passed to Claude Haiku, which writes plain-English coaching feedback — no IPA symbols, no linguistics jargon. Each error gets a short explanation and a concrete exercise.
Claude Haiku was chosen over larger models because:
- The task is well-specified: the prompt is structured data, and the output format is consistent.
- Latency matters on free CPU hardware. Haiku is fast.
- Cost per call is low, which matters for a public demo.
Feedback streams in real time, so users see words appear rather than waiting for the full response.
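A sketch of the prompt construction and streaming call, assuming the Anthropic Python SDK's `messages.stream` interface (the function names, prompt wording, and model id are illustrative, not the app's exact code):

```python
import anthropic

def build_prompt(errors):
    """errors: list of (error_name, words) tuples from the classifier."""
    lines = [
        "You are a Spanish pronunciation coach. For each error below, give a",
        "plain-English explanation and one concrete exercise. No IPA, no jargon.",
    ]
    for name, words in errors:
        lines.append(f"- {name.replace('_', ' ')} in: {', '.join(words)}")
    return "\n".join(lines)

def stream_feedback(errors):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    with client.messages.stream(
        model="claude-3-5-haiku-latest",  # model id is an assumption
        max_tokens=512,
        messages=[{"role": "user", "content": build_prompt(errors)}],
    ) as stream:
        for text in stream.text_stream:
            yield text  # chunks can be accumulated and re-rendered by the UI
</```

Yielding chunks rather than returning a full string is what lets a Gradio generator function update the textbox as tokens arrive.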
5. Target Phrases
Three phrases are presented sequentially, each targeting a distinct phonological challenge for English speakers:
| Phrase | Focus |
|---|---|
| Pero el perro raro no quiere correr | Tap /ɾ/ vs. trill /r/ — the rhotic distinction |
| Sé que tomó café con leche | Pure vowels — no English diphthong glides on /e/ or /o/ |
| Bebo vino bueno en la bodega | Bilabial B/V — Spanish b/v are never labiodental |
Reference IPA is hardcoded, never generated at runtime, because the reference must be exact and stable.
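Concretely, the references can live in a plain dict keyed by phrase (these transcriptions are illustrative approximations, not the app's actual hardcoded values; the character-level split also simplifies multi-character symbols like /tʃ/):

```python
# Illustrative reference table; stress marks already stripped for alignment.
REFERENCE_IPA = {
    "Pero el perro raro no quiere correr": "peɾo el pero raɾo no kjeɾe korer",
    "Sé que tomó café con leche": "se ke tomo kafe kon letʃe",
    "Bebo vino bueno en la bodega": "beβo βino bweno en la βoðeɣa",
}

def reference_for(phrase: str) -> list[str]:
    """Return the reference as a list of IPA characters, word spaces removed."""
    return [c for c in REFERENCE_IPA[phrase] if c != " "]
```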
What the App Does Not Do
- Accept free-form input: phrases are fixed. The pipeline is tuned to structured comparison against a known reference.
- Analyze prosody, rhythm, or stress.
- Store audio or user data: audio is processed in memory and discarded immediately.
- Claim formal accuracy on learner speech: it surfaces real errors in testing, but the underlying model was not evaluated on L2 data.
Tech Stack
| Component | Choice |
|---|---|
| UI framework | Gradio (Hugging Face Spaces native) |
| Acoustic model | wav2vec2-lv-60-espeak-cv-ft (Hugging Face Transformers) |
| Audio loading | torchaudio |
| LLM | Claude Haiku (Anthropic API) |
| Hosting | Hugging Face Spaces (free CPU tier) |
Repository
accent-coach/
├── app.py # Gradio UI, pipeline orchestration, abuse controls
├── phonology.py # Audio-to-IPA, alignment, error classification, diff display
├── feedback.py # Claude API call (streaming) and prompt construction
├── requirements.txt
└── packages.txt