7. Multilingual Phonetic Toponym Matching Model¶
A memory-efficient, Student-Teacher neural network designed to learn phonetic similarity between place names across languages.
Note: This is not a string similarity model (like Levenshtein). It is a phonetic embedding model whose outputs are dense vectors compared using cosine similarity.
7.1. 1. Architecture Overview¶
This system bypasses the Grapheme-to-Phoneme (G2P) bottleneck by anchoring learning in a phonetic space and distilling that knowledge into a text-based model. It employs a Student-Teacher architecture:
7.1.1. The Teacher (Phonetic Encoder)¶
Input: International Phonetic Alphabet (IPA) sequences generated by Epitran.
Features: Converts IPA into articulatory feature vectors using PanPhon (e.g., voiced, nasal, rounded).
Model: BiLSTM.
Role: Learns a phonetically grounded reference space where names with similar pronunciations (e.g., London /lʌndən/ and Londen /lɔndən/) cluster tightly.
Limitation: Can only be trained on languages supported by Epitran.
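The idea of a phonetically grounded space can be illustrated with a toy, self-contained sketch: each phone maps to a small binary feature vector (a hypothetical subset of PanPhon's articulatory features, not the real 24-dimensional vectors), and names whose IPA differs only slightly end up at a small feature distance.

```python
# Toy PanPhon-style articulatory features: (syllabic, voiced, nasal, rounded).
# These values are illustrative, not real PanPhon output.
FEATS = {
    "l": (0, 1, 0, 0),
    "n": (0, 1, 1, 0),
    "d": (0, 1, 0, 0),
    "ə": (1, 1, 0, 0),
    "ʌ": (1, 1, 0, 0),
    "ɔ": (1, 1, 0, 1),
}

def phone_dist(a, b):
    # Hamming distance between two phones' feature vectors
    return sum(x != y for x, y in zip(FEATS[a], FEATS[b]))

def name_dist(ipa1, ipa2):
    # Position-wise distance for equal-length IPA strings (sketch only;
    # the real model uses an alignment-based edit distance)
    return sum(phone_dist(a, b) for a, b in zip(ipa1, ipa2))

# London /lʌndən/ vs. Londen /lɔndən/ differ only in vowel rounding
d = name_dist("lʌndən", "lɔndən")
```

A small distance like this is what makes the two names cluster tightly in the Teacher's space.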
7.1.2. The Student (Language-Conditioned Character Encoder)¶
Input: Romanized text (via anyascii) + Language ID.
Model: BiLSTM.
Mechanism: At every timestep, the character embedding is concatenated with a Language Embedding. This allows the model to learn context-specific pronunciation rules (e.g., ‘j’ is /dʒ/ in English but /x/ in Spanish) without explicit G2P rules at runtime.
Role: Approximates the Teacher’s reference space using only text inputs.
Advantage: Functions as a universal fallback for any language.
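The per-timestep concatenation of character and language embeddings can be sketched in PyTorch as below. All dimensions and layer names are illustrative assumptions; only the mechanism (tile the language vector across timesteps, concatenate, feed a BiLSTM) follows the description above.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """Language-conditioned character encoder (illustrative sketch)."""

    def __init__(self, vocab_size=10_000, n_langs=100,
                 char_dim=32, lang_dim=8, hidden=64, out_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        self.lang_emb = nn.Embedding(n_langs, lang_dim)
        # BiLSTM over [character ; language] vectors at every timestep
        self.lstm = nn.LSTM(char_dim + lang_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, chars, lang):              # chars: (B, T), lang: (B,)
        c = self.char_emb(chars)                 # (B, T, char_dim)
        l = self.lang_emb(lang)                  # (B, lang_dim)
        l = l.unsqueeze(1).expand(-1, c.size(1), -1)  # tile across timesteps
        h, _ = self.lstm(torch.cat([c, l], dim=-1))
        return self.proj(h.mean(dim=1))          # (B, out_dim)
```

Because the language vector is present at every timestep, the same character ID can be encoded differently per language, which is how "j"/English and "j"/Spanish can land in different regions of the space.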
7.1.3. The Hybrid Inference Model¶
During inference, the system employs a gated fusion mechanism:
If IPA is available: It uses a learned gate to dynamically weigh and fuse the Phonetic and Character embeddings.
If IPA is unavailable: The gate closes, and the model relies 100% on the Student (character) embedding.
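One plausible reading of this gating logic, sketched in PyTorch (the layer shape and the convex-combination form are assumptions, not the script's exact implementation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse phonetic and character embeddings with a learned gate (sketch)."""

    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, char_vec, phon_vec=None):
        if phon_vec is None:
            # No IPA available: the gate closes, student embedding only
            return char_vec
        g = torch.sigmoid(self.gate(torch.cat([char_vec, phon_vec], dim=-1)))
        return g * phon_vec + (1 - g) * char_vec
```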
7.2. 2. Key Features¶
Phonetic Gatekeeper (Exonym Filter):
The model does not blindly trust Elasticsearch clusters. During extraction, it computes the articulatory edit distance between pairs.
Logic: If the phonetic similarity is below Config.SIMILARITY_THRESHOLD (default 0.5), the pair is discarded.
Result: The model learns that London ≈ Londres, but it is not forced to learn that Germany ≈ Deutschland, preserving the integrity of the phonetic space.
Streaming Data Pipeline (HDF5):
Streams training data directly from disk via HDF5.
Memory Footprint: Constant ~100MB RAM usage for the data pipeline, regardless of dataset size.
Language Agnostic: Uses anyascii as a universal Romanization fallback, allowing the model to handle unseen scripts (e.g., matching Kanji to Katakana via Romanization).
7.3. 3. Installation¶
pip install torch h5py anyascii
# Optional but recommended for Phase 0-1 (The Teacher):
pip install epitran panphon elasticsearch
7.4. 4. The Training Pipeline¶
Training requires running the script in four distinct sequential phases. This separation ensures the Student learns a stable phonetic space before generalizing.
7.4.1. Phase 0: Data Extraction & Filtering¶
Streams data from an Elasticsearch index into an optimized HDF5 file.
The Guardrail: This phase applies the SIMILARITY_THRESHOLD. It calculates IPA features for all pairs and filters out semantic matches that are phonetically distinct (exonyms).
python phonetic_similarity_model.py --phase 0 --es-host localhost:9200 --index places --output data.h5
7.4.2. Phase 1: Teacher Training (Phonetic Grounding)¶
Trains the Phonetic Encoder using Triplet Loss.
Input: Only pairs where both sides have valid IPA generation.
Goal: Create a high-quality phonetic reference space.
python phonetic_similarity_model.py --phase 1 --data data.h5 --output phase1.pt
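Triplet Loss pulls an anchor toward a phonetically similar positive and pushes it away from a negative. A minimal PyTorch illustration (batch size, margin, and the example names are assumptions):

```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=0.2)

# Stand-in embeddings; in Phase 1 these come from the Phonetic Encoder
anchor   = torch.randn(8, 64)  # e.g. "London"
positive = torch.randn(8, 64)  # e.g. "Londres"
negative = torch.randn(8, 64)  # e.g. "Lisboa"

loss = loss_fn(anchor, positive, negative)
loss.backward() if anchor.requires_grad else None  # no-op here; shown for shape
```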
7.4.3. Phase 2: Alignment (Knowledge Distillation)¶
Trains the Student (Character Encoder) to mimic the frozen Teacher.
Loss: MSE (position) + Cosine (orientation).
Goal: Distill phonetic knowledge into the character model (e.g., learning that the “J” + “Español” vector aligns with the Teacher’s /x/ vector).
python phonetic_similarity_model.py --phase 2 --data data.h5 --phase1-model phase1.pt --output phase2.pt
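The combined distillation objective described above (MSE for position plus cosine for orientation) can be written as a single loss function. The exact weighting in the script may differ; this sketch sums the two terms equally:

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_vec, teacher_vec):
    # MSE pulls the student to the teacher's position in the space;
    # (1 - cosine similarity) matches the direction of the vectors.
    mse = F.mse_loss(student_vec, teacher_vec)
    cos = 1.0 - F.cosine_similarity(student_vec, teacher_vec, dim=-1).mean()
    return mse + cos
```

A perfectly aligned student (identical vectors) yields a loss of zero, which is the distillation target.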
7.4.4. Phase 3: Generalization (Fine-Tuning)¶
Fine-tunes the Student on all data (including languages the Teacher didn’t know).
Mechanism: Freezes the Phonetic Encoder and the Fusion Gate. Only the Student updates.
Goal: Improve separation of hard negatives and generalize to non-Epitran languages.
python phonetic_similarity_model.py --phase 3 --data data.h5 --phase2-model phase2.pt --output final_model.pt
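The freezing described in Phase 3 amounts to turning off gradients for the Teacher and the gate while leaving the Student trainable. A sketch with placeholder submodules (the module names and layer shapes here are hypothetical, not the script's):

```python
import torch.nn as nn

model = nn.ModuleDict({
    "phonetic_encoder": nn.Linear(24, 64),   # placeholder for the Teacher
    "fusion_gate": nn.Linear(128, 64),       # placeholder for the gate
    "student": nn.LSTM(40, 64),              # placeholder for the Student
})

# Phase 3: freeze the Phonetic Encoder and the Fusion Gate; only the
# Student continues to receive gradient updates.
for name in ("phonetic_encoder", "fusion_gate"):
    for p in model[name].parameters():
        p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```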
7.5. 5. Python Inference API¶
The PhoneticSimilarityModel class wraps the complexity of the hybrid architecture and the gated fusion logic.
from phonetic_similarity_model import PhoneticSimilarityModel
# Load the trained model
model = PhoneticSimilarityModel('final_model.pt', device='cpu')
# 1. Compare two specific toponyms
score = model.similarity(
toponym1="London", lang1="en",
toponym2="Londres", lang2="fr"
)
print(f"Similarity Score: {score:.4f}")
# 2. Get vector embedding (64-dim)
vector = model.embed("München", "de")
# 3. Batch processing (Faster)
candidates = [("Rome", "en"), ("Roma", "it"), ("Berlin", "de")]
results = model.find_similar("Roma", "es", candidates)
7.6. 6. Configuration¶
Hyperparameters are defined in the Config class at the top of the script. Key parameters to tune:
| Parameter | Default | Description |
|---|---|---|
| | 10,000 | Size of the character vocabulary. Stable hashing is used for overflow. |
| | 24 | Dimension of the PanPhon feature vectors (articulatory features). |
| SIMILARITY_THRESHOLD | 0.5 | Critical: the PanPhon distance threshold used in Phase 0 to reject exonyms (e.g., Germany/Deutschland). |
| | 64 | Final output dimension of the embeddings. |