7. Multilingual Phonetic Toponym Matching Model

A memory-efficient, Student-Teacher neural network designed to learn phonetic similarity between place names across languages.

Note: This is not a string similarity model (like Levenshtein). It is a phonetic embedding model whose outputs are dense vectors compared using cosine similarity.

7.1. Architecture Overview

This system avoids the runtime Grapheme-to-Phoneme (G2P) bottleneck by anchoring learning in a phonetic space and distilling that knowledge into a text-based model. It employs a Student-Teacher architecture:

7.1.1. The Teacher (Phonetic Encoder)

  • Input: International Phonetic Alphabet (IPA) sequences generated by Epitran.

  • Features: Converts IPA into articulatory feature vectors using PanPhon (e.g., voiced, nasal, rounded).

  • Model: BiLSTM.

  • Role: Learns a phonetically grounded reference space where names with similar pronunciations (e.g., London /lʌndən/ and Londen /lɔndən/) cluster tightly.

  • Limitation: Can only be trained on languages supported by Epitran.
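A minimal sketch of the Teacher in PyTorch, assuming a single-layer BiLSTM whose final hidden states are projected into the shared embedding space (the class name and layer sizes here are illustrative, not the script's actual ones):

import torch
import torch.nn as nn
import panphon

class TeacherEncoder(nn.Module):
    def __init__(self, feat_dim=24, hidden=128, embed_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, feats):                  # feats: (batch, seq_len, feat_dim)
        _, (h, _) = self.lstm(feats)           # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)    # join forward/backward states
        return nn.functional.normalize(self.proj(h), dim=-1)  # unit length for cosine

ft = panphon.FeatureTable()
# Each IPA segment becomes a 24-dim articulatory vector with values in {-1, 0, +1}
feats = torch.tensor([ft.word_to_vector_list("lʌndən", numeric=True)], dtype=torch.float)
embedding = TeacherEncoder()(feats)            # (1, 64) phonetic embedding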

7.1.2. The Student (Language-Conditioned Character Encoder)

  • Input: Romanized text (via anyascii) + Language ID.

  • Model: BiLSTM.

  • Mechanism: At every timestep, the character embedding is concatenated with a Language Embedding. This allows the model to learn context-specific pronunciation rules (e.g., ‘j’ is /dʒ/ in English but /x/ in Spanish) without explicit G2P rules at runtime.

  • Role: Approximates the Teacher’s reference space using only text inputs.

  • Advantage: Functions as a universal fallback for any language.
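The conditioning mechanism can be sketched as follows (again with illustrative names and sizes). Note how the language embedding is tiled across the sequence so every character is read in its linguistic context:

import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    def __init__(self, vocab_size=10_000, n_langs=300, char_dim=48,
                 lang_dim=16, hidden=128, embed_dim=64):
        super().__init__()
        self.chars = nn.Embedding(vocab_size, char_dim)
        self.langs = nn.Embedding(n_langs, lang_dim)
        self.lstm = nn.LSTM(char_dim + lang_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, char_ids, lang_id):      # (batch, seq_len), (batch,)
        c = self.chars(char_ids)
        l = self.langs(lang_id).unsqueeze(1).expand(-1, char_ids.size(1), -1)
        x = torch.cat([c, l], dim=-1)          # language tag at every timestep
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[0], h[1]], dim=-1)
        return nn.functional.normalize(self.proj(h), dim=-1)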

7.1.3. The Hybrid Inference Model

During inference, the system employs a gated fusion mechanism:

  1. If IPA is available: It uses a learned gate to dynamically weigh and fuse the Phonetic and Character embeddings.

  2. If IPA is unavailable: The gate closes, and the model relies 100% on the Student (character) embedding.
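A hedged sketch of the fusion step (the real gate is learned end to end; this only illustrates the mechanism described above):

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, char_emb, phon_emb=None):
        if phon_emb is None:                       # no IPA: rely 100% on the Student
            return char_emb
        g = self.gate(torch.cat([char_emb, phon_emb], dim=-1))
        return g * phon_emb + (1 - g) * char_emb   # per-dimension weighting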


7.2. Key Features

  • Phonetic Gatekeeper (Exonym Filter), sketched in code after this list:

    • The model does not blindly trust the candidate clusters returned by Elasticsearch. During extraction, it computes the articulatory edit distance between each pair.

    • Logic: If the phonetic similarity is below Config.SIMILARITY_THRESHOLD (default 0.5), the pair is discarded.

    • Result: The model learns that London ≈ Londres, but it is not forced to learn that Germany ≈ Deutschland, preserving the integrity of the phonetic space.

  • Streaming Data Pipeline (HDF5), sketched in code after this list:

    • Streams training data directly from disk via HDF5.

    • Memory Footprint: Constant ~100 MB of RAM for the data pipeline, regardless of dataset size.

  • Language Agnostic: Uses anyascii as a universal Romanization fallback, allowing the model to handle unseen scripts (e.g., matching Kanji to Katakana via Romanization).
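The gatekeeper logic might look like the following, assuming PanPhon's feature edit distance and a simple length normalization (the script's exact normalization may differ):

import panphon.distance

SIMILARITY_THRESHOLD = 0.5  # Config.SIMILARITY_THRESHOLD in the script
dst = panphon.distance.Distance()

def phonetic_similarity(ipa1: str, ipa2: str) -> float:
    d = dst.feature_edit_distance(ipa1, ipa2)
    norm = max(len(ipa1), len(ipa2)) or 1
    return 1.0 - min(d / norm, 1.0)

def keep_pair(ipa1: str, ipa2: str) -> bool:
    return phonetic_similarity(ipa1, ipa2) >= SIMILARITY_THRESHOLD

keep_pair("lʌndən", "lɔ̃dʁ")          # likely kept: similar articulation
keep_pair("dʒɜːməni", "dɔʏtʃlant")   # likely discarded: exonym pair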
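And a sketch of a constant-memory HDF5-backed dataset (the dataset names "anchor_ids" and "positive_ids" are hypothetical, not the script's actual schema). Each DataLoader worker lazily opens its own file handle, so only the requested rows are ever resident in RAM:

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5PairDataset(Dataset):
    def __init__(self, path):
        self.path = path
        with h5py.File(path, "r") as f:
            self.n = f["anchor_ids"].shape[0]
        self.file = None                       # opened lazily in each worker

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if self.file is None:
            self.file = h5py.File(self.path, "r")
        return (torch.from_numpy(self.file["anchor_ids"][i]),
                torch.from_numpy(self.file["positive_ids"][i]))

loader = DataLoader(H5PairDataset("data.h5"), batch_size=256, num_workers=2)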


7.3. Installation

pip install torch h5py anyascii
# Optional but recommended for Phase 0-1 (The Teacher):
pip install epitran panphon elasticsearch

7.4. The Training Pipeline

Training requires running the script in four distinct sequential phases. This separation ensures the Student learns a stable phonetic space before generalizing.

7.4.1. Phase 0: Data Extraction & Filtering

Streams data from an Elasticsearch index into an optimized HDF5 file.

  • The Guardrail: This phase applies the SIMILARITY_THRESHOLD. It calculates IPA features for all pairs and filters out semantic matches that are phonetically distinct (exonyms).

python phonetic_similarity_model.py --phase 0 --es-host localhost:9200 --index places --output data.h5

7.4.2. Phase 1: Teacher Training (Phonetic Grounding)

Trains the Phonetic Encoder using Triplet Loss.

  • Input: Only pairs where both names have valid Epitran-generated IPA.

  • Goal: Create a high-quality phonetic reference space.

python phonetic_similarity_model.py --phase 1 --data data.h5 --output phase1.pt
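The Phase 1 objective can be sketched with PyTorch's built-in triplet margin loss (the margin value and function names here are illustrative):

import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def phase1_step(teacher, anchor, positive, negative):
    a = teacher(anchor)      # IPA features of e.g. "London"
    p = teacher(positive)    # IPA features of e.g. "Londen"
    n = teacher(negative)    # a phonetically unrelated name
    return triplet(a, p, n)  # pull a toward p, push a away from n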

7.4.3. Phase 2: Alignment (Knowledge Distillation)

Trains the Student (Character Encoder) to mimic the frozen Teacher.

  • Loss: MSE (position) + Cosine (orientation).

  • Goal: Distill phonetic knowledge into the character model (e.g., learning that the “J” + “Español” vector aligns with the Teacher’s /x/ vector).

python phonetic_similarity_model.py --phase 2 --data data.h5 --phase1-model phase1.pt --output phase2.pt
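A sketch of the alignment loss as described above, with illustrative weights; the Teacher embedding comes from the frozen encoder and is detached from the graph:

import torch.nn.functional as F

def distill_loss(student_emb, teacher_emb, w_mse=1.0, w_cos=1.0):
    mse = F.mse_loss(student_emb, teacher_emb)    # match position
    cos = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()  # match orientation
    return w_mse * mse + w_cos * cos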

7.4.4. Phase 3: Generalization (Fine-Tuning)

Fine-tunes the Student on all data (including languages the Teacher didn’t know).

  • Mechanism: Freezes the Phonetic Encoder and the Fusion Gate. Only the Student updates.

  • Goal: Improve separation of hard negatives and generalize to non-Epitran languages.

python phonetic_similarity_model.py --phase 3 --data data.h5 --phase2-model phase2.pt --output final_model.pt
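The freezing step might look like this (the module attribute names are hypothetical):

def freeze_for_phase3(model):
    for p in model.teacher.parameters():       # Phonetic Encoder stays fixed
        p.requires_grad = False
    for p in model.fusion_gate.parameters():   # Fusion Gate stays fixed
        p.requires_grad = False
    # hand the optimizer only the trainable (Student) parameters
    return [p for p in model.parameters() if p.requires_grad]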

7.5. Python Inference API

The PhoneticSimilarityModel class wraps the complexity of the hybrid architecture and the gated fusion logic.

from phonetic_similarity_model import PhoneticSimilarityModel

# Load the trained model
model = PhoneticSimilarityModel('final_model.pt', device='cpu')

# 1. Compare two specific toponyms
score = model.similarity(
    toponym1="London", lang1="en", 
    toponym2="Londres", lang2="fr"
)
print(f"Similarity Score: {score:.4f}")

# 2. Get vector embedding (64-dim)
vector = model.embed("München", "de")

# 3. Batch processing (Faster)
candidates = [("Rome", "en"), ("Roma", "it"), ("Berlin", "de")]
results = model.find_similar("Roma", "es", candidates)


7.6. Configuration

Hyperparameters are defined in the Config class at the top of the script. Key parameters to tune:

  • VOCAB_SIZE (default: 10,000): Size of the character vocabulary. Stable hashing is used for overflow.

  • PHONETIC_FEAT_DIM (default: 24): Dimension of the PanPhon articulatory feature vectors.

  • SIMILARITY_THRESHOLD (default: 0.5): Critical. The PanPhon distance threshold used in Phase 0 to reject exonyms (e.g., Germany/Deutschland).

  • EMBED_DIM (default: 64): Final output dimension of the embeddings.