7. Multilingual Phonetic Toponym Matching Model¶
A memory-efficient, Student-Teacher neural network designed to learn phonetic similarity between place names across languages.
Note: This is not a string similarity model (like Levenshtein). It is a phonetic embedding model whose outputs are dense vectors compared using cosine similarity.
7.1. 1. Architecture Overview¶
This system bypasses the Grapheme-to-Phoneme (G2P) bottleneck by anchoring learning in a phonetic space and distilling that knowledge into a text-based model. It employs a Student-Teacher architecture:
7.1.1. The Teacher (Phonetic Encoder)¶
Input: International Phonetic Alphabet (IPA) sequences generated by Epitran.
Features: Converts IPA into articulatory feature vectors using PanPhon (e.g., voiced, nasal, rounded).
Model: BiLSTM.
Role: Learns a phonetically grounded reference space where names with similar pronunciations (e.g., London /lʌndən/ and Londen /lɔndən/) cluster tightly.
Limitation: Can only be trained on languages supported by Epitran.
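The idea of a phonetically grounded space can be illustrated with a toy, self-contained sketch: each phone maps to a small binary feature vector (a hypothetical subset of PanPhon's articulatory features, not the real 24-dimensional vectors), and names whose IPA differs only slightly end up at a small feature distance.

```python
# Toy PanPhon-style articulatory features: (syllabic, voiced, nasal, rounded).
# These values are illustrative, not real PanPhon output.
FEATS = {
    "l": (0, 1, 0, 0),
    "n": (0, 1, 1, 0),
    "d": (0, 1, 0, 0),
    "ə": (1, 1, 0, 0),
    "ʌ": (1, 1, 0, 0),
    "ɔ": (1, 1, 0, 1),
}

def phone_dist(a, b):
    # Hamming distance between two phones' feature vectors
    return sum(x != y for x, y in zip(FEATS[a], FEATS[b]))

def name_dist(ipa1, ipa2):
    # Position-wise distance for equal-length IPA strings (sketch only;
    # the real model uses an alignment-based edit distance)
    return sum(phone_dist(a, b) for a, b in zip(ipa1, ipa2))

# London /lʌndən/ vs. Londen /lɔndən/ differ only in vowel rounding
d = name_dist("lʌndən", "lɔndən")
```

A small distance like this is what makes the two names cluster tightly in the Teacher's space.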
7.1.2. The Student (Language-Conditioned Character Encoder)¶
Input: Romanized text (via anyascii) + Language ID.
Model: BiLSTM.
Mechanism: At every timestep, the character embedding is concatenated with a Language Embedding. This allows the model to learn context-specific pronunciation rules (e.g., ‘j’ is /dʒ/ in English but /x/ in Spanish) without explicit G2P rules at runtime.
Role: Approximates the Teacher’s reference space using only text inputs.
Advantage: Functions as a universal fallback for any language.
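The per-timestep concatenation of character and language embeddings can be sketched in PyTorch as below. All dimensions and layer names are illustrative assumptions; only the mechanism (tile the language vector across timesteps, concatenate, feed a BiLSTM) follows the description above.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """Language-conditioned character encoder (illustrative sketch)."""

    def __init__(self, vocab_size=10_000, n_langs=100,
                 char_dim=32, lang_dim=8, hidden=64, out_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        self.lang_emb = nn.Embedding(n_langs, lang_dim)
        # BiLSTM over [character ; language] vectors at every timestep
        self.lstm = nn.LSTM(char_dim + lang_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, chars, lang):              # chars: (B, T), lang: (B,)
        c = self.char_emb(chars)                 # (B, T, char_dim)
        l = self.lang_emb(lang)                  # (B, lang_dim)
        l = l.unsqueeze(1).expand(-1, c.size(1), -1)  # tile across timesteps
        h, _ = self.lstm(torch.cat([c, l], dim=-1))
        return self.proj(h.mean(dim=1))          # (B, out_dim)
```

Because the language vector is present at every timestep, the same character ID can be encoded differently per language, which is how "j"/English and "j"/Spanish can land in different regions of the space.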
7.1.3. The Hybrid Inference Model¶
During inference, the system employs a gated fusion mechanism:
If IPA is available: It uses a learned gate to dynamically weigh and fuse the Phonetic and Character embeddings.
If IPA is unavailable: The gate closes, and the model relies 100% on the Student (character) embedding.
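One plausible reading of this gating logic, sketched in PyTorch (the layer shape and the convex-combination form are assumptions, not the script's exact implementation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse phonetic and character embeddings with a learned gate (sketch)."""

    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, char_vec, phon_vec=None):
        if phon_vec is None:
            # No IPA available: the gate closes, student embedding only
            return char_vec
        g = torch.sigmoid(self.gate(torch.cat([char_vec, phon_vec], dim=-1)))
        return g * phon_vec + (1 - g) * char_vec
```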
7.2. 2. Key Features¶
Phonetic Gatekeeper (Exonym Filter):
The model does not blindly trust Elasticsearch clusters. During extraction, it computes the articulatory edit distance between pairs.
Logic: If the phonetic similarity is below Config.SIMILARITY_THRESHOLD (default 0.5), the pair is discarded.
Result: The model learns that London ≈ Londres, but it is not forced to learn that Germany ≈ Deutschland, preserving the integrity of the phonetic space.
Streaming Data Pipeline (HDF5):
Streams training data directly from disk via HDF5.
Memory Footprint: Constant ~100MB RAM usage for the data pipeline, regardless of dataset size.
Language Agnostic: Uses anyascii as a universal Romanization fallback, allowing the model to handle unseen scripts (e.g., matching Kanji to Katakana via Romanization).
7.3. 3. Installation¶
pip install torch h5py anyascii
# Optional but recommended for Phase 0-1 (The Teacher):
pip install epitran panphon elasticsearch
7.4. 4. The Training Pipeline¶
Training requires running the script in four distinct sequential phases. This separation ensures the Student learns a stable phonetic space before generalizing.
7.4.1. Phase 0: Data Extraction & Filtering¶
Streams data from an Elasticsearch index into an optimized HDF5 file.
The Guardrail: This phase applies the SIMILARITY_THRESHOLD. It calculates IPA features for all pairs and filters out semantic matches that are phonetically distinct (exonyms).
python phonetic_similarity_model.py --phase 0 --es-host localhost:9200 --index places --output data.h5
7.4.2. Phase 1: Teacher Training (Phonetic Grounding)¶
Trains the Phonetic Encoder using Triplet Loss.
Input: Only pairs where both sides have valid IPA generation.
Goal: Create a high-quality phonetic reference space.
python phonetic_similarity_model.py --phase 1 --data data.h5 --output phase1.pt
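Triplet Loss pulls an anchor toward a phonetically similar positive and pushes it away from a negative. A minimal PyTorch illustration (batch size, margin, and the example names are assumptions):

```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=0.2)

# Stand-in embeddings; in Phase 1 these come from the Phonetic Encoder
anchor   = torch.randn(8, 64)  # e.g. "London"
positive = torch.randn(8, 64)  # e.g. "Londres"
negative = torch.randn(8, 64)  # e.g. "Lisboa"

loss = loss_fn(anchor, positive, negative)
loss.backward() if anchor.requires_grad else None  # no-op here; shown for shape
```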
7.4.3. Phase 2: Alignment (Knowledge Distillation)¶
Trains the Student (Character Encoder) to mimic the frozen Teacher.
Loss: MSE (position) + Cosine (orientation).
Goal: Distill phonetic knowledge into the character model (e.g., learning that the “J” + “Español” vector aligns with the Teacher’s /x/ vector).
python phonetic_similarity_model.py --phase 2 --data data.h5 --phase1-model phase1.pt --output phase2.pt
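The combined distillation objective described above (MSE for position plus cosine for orientation) can be written as a single loss function. The exact weighting in the script may differ; this sketch sums the two terms equally:

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_vec, teacher_vec):
    # MSE pulls the student to the teacher's position in the space;
    # (1 - cosine similarity) matches the direction of the vectors.
    mse = F.mse_loss(student_vec, teacher_vec)
    cos = 1.0 - F.cosine_similarity(student_vec, teacher_vec, dim=-1).mean()
    return mse + cos
```

A perfectly aligned student (identical vectors) yields a loss of zero, which is the distillation target.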
7.4.4. Phase 3: Generalization (Fine-Tuning)¶
Fine-tunes the Student on all data (including languages the Teacher didn’t know).
Mechanism: Freezes the Phonetic Encoder and the Fusion Gate. Only the Student updates.
Goal: Improve separation of hard negatives and generalize to non-Epitran languages.
python phonetic_similarity_model.py --phase 3 --data data.h5 --phase2-model phase2.pt --output final_model.pt
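The freezing described in Phase 3 amounts to turning off gradients for the Teacher and the gate while leaving the Student trainable. A sketch with placeholder submodules (the module names and layer shapes here are hypothetical, not the script's):

```python
import torch.nn as nn

model = nn.ModuleDict({
    "phonetic_encoder": nn.Linear(24, 64),   # placeholder for the Teacher
    "fusion_gate": nn.Linear(128, 64),       # placeholder for the gate
    "student": nn.LSTM(40, 64),              # placeholder for the Student
})

# Phase 3: freeze the Phonetic Encoder and the Fusion Gate; only the
# Student continues to receive gradient updates.
for name in ("phonetic_encoder", "fusion_gate"):
    for p in model[name].parameters():
        p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```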
7.5. 5. Python Inference API¶
The PhoneticSimilarityModel class wraps the complexity of the hybrid architecture and the gated fusion logic.
from phonetic_similarity_model import PhoneticSimilarityModel
# Load the trained model
model = PhoneticSimilarityModel('final_model.pt', device='cpu')
# 1. Compare two specific toponyms
score = model.similarity(
toponym1="London", lang1="en",
toponym2="Londres", lang2="fr"
)
print(f"Similarity Score: {score:.4f}")
# 2. Get vector embedding (64-dim)
vector = model.embed("München", "de")
# 3. Batch processing (Faster)
candidates = [("Rome", "en"), ("Roma", "it"), ("Berlin", "de")]
results = model.find_similar("Roma", "es", candidates)
7.6. 6. Configuration¶
Hyperparameters are defined in the Config class at the top of the script. Key parameters to tune:
| Parameter | Default | Description |
|---|---|---|
| | 10,000 | Size of the character vocabulary. Stable hashing is used for overflow. |
| | 24 | Dimension of the PanPhon feature vectors (articulatory features). |
| SIMILARITY_THRESHOLD | 0.5 | Critical: the PanPhon distance threshold used in Phase 0 to reject exonyms (e.g., Germany/Deutschland). |
| | 64 | Final output dimension of the embeddings. |