ODB: Introducing Eliza
Eliza — AI Spanish Accent Coach
Eliza is a web app that gives phoneme-level pronunciation feedback to Spanish learners. It runs as a Hugging Face Space and uses a pipeline of acoustic speech models and a large language model to identify specific articulation errors and explain how to fix them.
The Problem
Most language-learning tools grade pronunciation by transcribing your speech to text and comparing it to the expected text. This approach has a fundamental flaw: it routes through orthography, silently laundering accent errors.
If a learner says perro with an English retroflex /ɹ/ instead of the Spanish trill /r/, Whisper still transcribes it as “perro” and the canonical IPA gets handed back unchanged. The exact error — the one the learner needs to know about — is invisible to the pipeline.
The fix is to never go through text at all: extract the phoneme sequence directly from the audio signal, then compare it against a reference.
System Design
The Core Pipeline
Microphone audio
│
▼
wav2vec2-lv-60-espeak-cv-ft (audio → IPA phoneme sequence)
│
▼
Needleman-Wunsch alignment (user IPA ↔ reference IPA)
│
▼
Error classifier (7 phonological rules)
│
▼
Claude Haiku (streaming coaching feedback)
1. Acoustic Phoneme Extraction — facebook/wav2vec2-lv-60-espeak-cv-ft
This wav2vec2 model is fine-tuned to output IPA phoneme sequences directly from raw audio using CTC decoding. It was chosen because:
- It never sees word identity. Because there is no text step, it cannot paper over mispronunciations. If you produce an English /ɹ/, that is what comes out.
- IPA output. The model outputs espeak-style IPA, which can be compared directly to a hardcoded reference transcription.
- Open weights. The model runs entirely on CPU in a free Hugging Face Space — no paid inference API needed.
The tradeoff: the model was trained on CommonVoice, which skews toward native and near-native speakers, and its accuracy on heavily accented learner speech is not formally benchmarked. In my testing it works well enough to surface real errors; fine-tuning on a learner corpus like L2-ARCTIC is the most interesting future direction.
2. Sequence Alignment — Needleman-Wunsch
Comparing phoneme sequences requires alignment because insertions and deletions shift the positions of everything downstream. This is especially important for the vowel exercise, where the user may insert a diphthong in place of a single vowel.
Needleman-Wunsch global alignment operates on lists of IPA characters (stress marks and syllable dots are stripped before alignment). It returns (reference_phoneme, hypothesis_phoneme) pairs, where None indicates a gap. This gives a precise, position-aware diff showing exactly which sounds were substituted, inserted, or deleted.
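A minimal sketch of the alignment (function name and scoring weights are illustrative, not the app's exact implementation):

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Globally align two phoneme lists; return (ref, hyp) pairs, None marking a gap."""
    n, m = len(ref), len(hyp)
    # Dynamic-programming score matrix, initialized with cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if ref[i-1] == hyp[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
            match if ref[i-1] == hyp[j-1] else mismatch
        ):
            pairs.append((ref[i-1], hyp[j-1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((ref[i-1], None)); i -= 1   # deletion
        else:
            pairs.append((None, hyp[j-1])); j -= 1   # insertion
    return pairs[::-1]
```

Aligning reference `peɾo` against a hypothesis `peɹoʊ` yields the pair `('ɾ', 'ɹ')` (a substitution) plus `(None, 'ʊ')` (an inserted off-glide), the position-aware diff the classifier consumes.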
3. Error Classification
Seven phonological rules map alignment patterns to named error types:
| Error | Expected | Produced | What it catches |
|---|---|---|---|
| trill_substitution | r | ɾ ʁ ɹ ɻ | Tap or approximant where trill required |
| tap_substitution | ɾ | r ɹ d ɻ | Trill, English r, or stop where tap required |
| vowel_reduction | a e i o u | ə ɪ ʊ ɐ | Pure vowel reduced to schwa or lax vowel |
| vowel_tensing | e | ɛ æ | Lax /ɛ/ or /æ/ where tense /e/ required |
| diphthong_insertion | o | oʊ | English off-glide added to /o/ |
| v_fricative | b β | v | Labiodental /v/ where bilabial required |
| b_hardening | β | b | Hard stop where bilabial approximant required |
The classifier also identifies which specific words each error occurred in, so the LLM can give word-specific feedback rather than vague generalities.
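The substitution rules above reduce to a lookup over aligned pairs. A sketch with an illustrative subset of the table (insertion-based rules like diphthong_insertion, which inspect gap pairs, are omitted for brevity):

```python
# Illustrative subset of the rule table: (error_name, expected set, produced set).
RULES = [
    ("trill_substitution", {"r"}, {"ɾ", "ʁ", "ɹ", "ɻ"}),
    ("tap_substitution",   {"ɾ"}, {"r", "ɹ", "d", "ɻ"}),
    ("vowel_reduction",    set("aeiou"), {"ə", "ɪ", "ʊ", "ɐ"}),
    ("v_fricative",        {"b", "β"}, {"v"}),
    ("b_hardening",        {"β"}, {"b"}),
]

def classify(pairs):
    """Map aligned (reference, hypothesis) phoneme pairs to (position, error) tuples."""
    errors = []
    for i, (ref, hyp) in enumerate(pairs):
        if ref == hyp:
            continue  # correct phoneme, nothing to flag
        for name, expected, produced in RULES:
            if ref in expected and hyp in produced:
                errors.append((i, name))
                break  # first matching rule wins
    return errors
```

The positions returned here are what lets the app map each error back to the word it occurred in before prompting the LLM.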
4. LLM Coaching — Claude Haiku
Detected errors and their word context are passed to Claude Haiku, which writes plain-English coaching feedback — no IPA symbols, no linguistics jargon. Each error gets a short explanation and a concrete exercise.
Claude Haiku was chosen over larger models because:
- The task is well-specified: the prompt is structured data, and the output format is consistent.
- Latency matters on free CPU hardware. Haiku is fast.
- Cost per call is low, which matters for a public demo.
Feedback streams in real time, so users see words appear rather than waiting for the full response.
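A sketch of the prompt construction and streaming call, assuming the Anthropic Python SDK's `messages.stream` interface (the function names, prompt wording, and model id are illustrative, not the app's exact code):

```python
import anthropic

def build_prompt(errors):
    """errors: list of (error_name, words) tuples from the classifier."""
    lines = [
        "You are a Spanish pronunciation coach. For each error below, give a",
        "plain-English explanation and one concrete exercise. No IPA, no jargon.",
    ]
    for name, words in errors:
        lines.append(f"- {name.replace('_', ' ')} in: {', '.join(words)}")
    return "\n".join(lines)

def stream_feedback(errors):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    with client.messages.stream(
        model="claude-3-5-haiku-latest",  # model id is an assumption
        max_tokens=512,
        messages=[{"role": "user", "content": build_prompt(errors)}],
    ) as stream:
        for text in stream.text_stream:
            yield text  # chunks can be accumulated and re-rendered by the UI
</```

Yielding chunks rather than returning a full string is what lets a Gradio generator function update the textbox as tokens arrive.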
5. Target Phrases
Three phrases are presented sequentially, each targeting a distinct phonological challenge for English speakers:
| Phrase | Focus |
|---|---|
| Pero el perro raro no quiere correr | Tap /ɾ/ vs. trill /r/ — the rhotic distinction |
| Sé que tomó café con leche | Pure vowels — no English diphthong glides on /e/ or /o/ |
| Bebo vino bueno en la bodega | Bilabial B/V — Spanish b/v are never labiodental |
Reference IPA is hardcoded, never generated at runtime, because the reference must be exact and stable.
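Concretely, the references can live in a plain dict keyed by phrase (these transcriptions are illustrative approximations, not the app's actual hardcoded values; the character-level split also simplifies multi-character symbols like /tʃ/):

```python
# Illustrative reference table; stress marks already stripped for alignment.
REFERENCE_IPA = {
    "Pero el perro raro no quiere correr": "peɾo el pero raɾo no kjeɾe korer",
    "Sé que tomó café con leche": "se ke tomo kafe kon letʃe",
    "Bebo vino bueno en la bodega": "beβo βino bweno en la βoðeɣa",
}

def reference_for(phrase: str) -> list[str]:
    """Return the reference as a list of IPA characters, word spaces removed."""
    return [c for c in REFERENCE_IPA[phrase] if c != " "]
```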
What the App Does Not Do
- Accept free-form input: phrases are fixed. The pipeline is tuned to structured comparison against a known reference.
- Analyze prosody, rhythm, or stress.
- Store audio or user data: audio is processed in memory and discarded immediately.
- Claim formal accuracy on learner speech: it surfaces real errors in testing, but the underlying model was not evaluated on L2 data.
Tech Stack
| Component | Choice |
|---|---|
| UI framework | Gradio (Hugging Face Spaces native) |
| Acoustic model | wav2vec2-lv-60-espeak-cv-ft (Hugging Face Transformers) |
| Audio loading | torchaudio |
| LLM | Claude Haiku (Anthropic API) |
| Hosting | Hugging Face Spaces (free CPU tier) |
Repository
accent-coach/
├── app.py # Gradio UI, pipeline orchestration, abuse controls
├── phonology.py # Audio-to-IPA, alignment, error classification, diff display
├── feedback.py # Claude API call (streaming) and prompt construction
├── requirements.txt
└── packages.txt