5. Training the Siamese BiLSTM (Pitt)

5.1. Training Data Construction

Positive pairs (phonetically similar, same referent):

  1. Same Wikidata ID across languages.

  2. Same Geonames ID + language variants.

  3. Documented aliases (e.g., “New York” ↔ “Nueva York”).

  4. Identical normalised IPA (automatic positive).

  5. Edit distance ≤2 on IPA + geographic proximity <50km.
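
A minimal sketch of rule 5, assuming each candidate record carries a phone-level IPA token list and WGS84 coordinates; the record fields and helper names are illustrative, not existing WHG code.

# Positive-pair rule 5 sketch: IPA edit distance <= 2 and geodesic proximity < 50 km.
from math import radians, sin, cos, asin, sqrt

def ipa_edit_distance(a, b):
    """Plain Levenshtein distance over phone tokens."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (pa != pb)))   # substitution
        prev = curr
    return prev[-1]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def is_rule5_positive(rec_a, rec_b, max_edits=2, max_km=50.0):
    return (ipa_edit_distance(rec_a["ipa"], rec_b["ipa"]) <= max_edits
            and haversine_km(rec_a["lat"], rec_a["lon"],
                             rec_b["lat"], rec_b["lon"]) < max_km)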

Negative pairs (phonetically dissimilar or different referents):

  1. Random sampling (easy negatives).

  2. High IPA edit distance (>5) + same language.

  3. Geographic distance >500km + same country (ambiguous names).

  4. Same IPA, different language, distant locations (homophone disambiguation).

Hard negative mining: After initial training, mine pairs that the model scores as highly similar but that are known to refer to different places, and feed them back as negatives in later training rounds.
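
A sketch of that mining pass; encode_batch and same_referent are illustrative stand-ins for the trained encoder and the gold referent labels, not existing WHG code.

# Hard-negative mining sketch: keep pairs the current model scores as similar
# but that are known to name different places.
import numpy as np

def mine_hard_negatives(encode_batch, candidate_pairs, same_referent, top_k=10000):
    # candidate_pairs: list of (ipa_a, ipa_b); same_referent(a, b) -> bool from gold IDs
    emb_a = encode_batch([a for a, _ in candidate_pairs])   # (N, D), L2-normalised
    emb_b = encode_batch([b for _, b in candidate_pairs])
    sims = np.sum(emb_a * emb_b, axis=1)                     # cosine similarity
    ranked = np.argsort(-sims)                               # most similar first
    hard = [candidate_pairs[i] for i in ranked
            if not same_referent(*candidate_pairs[i])]
    return hard[:top_k]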

Data augmentation:

  • Wikidata/Geonames skew toward Western languages. Balance with:

    • Minority language gazetteers (African, Asian, Indigenous).

    • Historical attestations from WHG partner projects.

5.2. Model Architecture

Input:

  • IPA string tokenised at phone level (individual IPA characters).

  • Optional: Concatenate PanPhon 24-dim feature vectors per phone.
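
If the PanPhon option is used, the per-phone vectors can be produced roughly as below (a sketch against PanPhon's FeatureTable interface); concatenation with the learned phone embeddings would then happen inside the encoder.

# Sketch: optional per-phone articulatory features via PanPhon.
import panphon

ft = panphon.FeatureTable()
# One 24-dimensional vector of {+1, 0, -1} values per phone in the IPA string.
feature_vectors = ft.word_to_vector_list("lizboɐ", numeric=True)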

Encoder:

Input: IPA sequence (variable length)
  ↓
Embedding Layer (phone vocab ~200 symbols)
  ↓
Bidirectional LSTM (2 layers, 128 hidden units)
  ↓
Mean pooling over time steps
  ↓
Dense projection (ReLU)
  ↓
L2 normalisation → embedding (dimension D)
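
A minimal PyTorch sketch of this encoder under the stated hyperparameters (phone vocabulary ~200 symbols, 2-layer BiLSTM with 128 hidden units); layer names and the default D=64 are illustrative, not the production model.

# Siamese encoder sketch matching the diagram above.
import torch.nn as nn
import torch.nn.functional as F

class PhoneticEncoder(nn.Module):
    def __init__(self, vocab_size=200, embed_dim=64, hidden=128, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, phone_ids):
        # phone_ids: (batch, max_len) int64 phone indices, 0 = padding
        x = self.embed(phone_ids)
        out, _ = self.bilstm(x)                        # (batch, max_len, 2 * hidden)
        mask = (phone_ids != 0).unsqueeze(-1).float()  # exclude padding from the mean
        pooled = (out * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        z = F.relu(self.proj(pooled))                  # dense projection (ReLU)
        return F.normalize(z, p=2, dim=-1)             # L2-normalised embedding of dimension D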

Loss: Contrastive loss or triplet loss with margin 0.2.
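
For reference, a sketch of the contrastive option with margin 0.2, computed on the L2-normalised embeddings; the triplet alternative is available directly as PyTorch's nn.TripletMarginLoss.

# Contrastive loss sketch: y = 1 for positive pairs, 0 for negatives.
import torch

def contrastive_loss(emb_a, emb_b, y, margin=0.2):
    d = torch.norm(emb_a - emb_b, p=2, dim=-1)               # Euclidean distance per pair
    pos = y * d.pow(2)                                        # pull positives together
    neg = (1 - y) * torch.clamp(margin - d, min=0).pow(2)     # push negatives beyond the margin
    return 0.5 * (pos + neg).mean()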

Dimension selection:

  1. Start with D=64 (baseline).

  2. Evaluate D=128, 256 on held-out test set.

  3. Metrics: accuracy@k (k=1, 5, 10), memory usage, inference time (see the evaluation sketch after this list).

  4. Document decision in models/phonetic/VERSION_NOTES.md.
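
A sketch of the accuracy@k computation, assuming a held-out set of query embeddings paired with the ID of their correct match inside an indexed set of unit-norm embeddings (so the dot product is cosine similarity):

# accuracy@k sketch: fraction of queries whose correct match appears in the top-k neighbours.
import numpy as np

def accuracy_at_k(query_embs, gold_ids, index_embs, index_ids, ks=(1, 5, 10)):
    sims = query_embs @ index_embs.T                  # (num_queries, index_size)
    ranked = np.argsort(-sims, axis=1)                # nearest neighbours first
    hits = {k: 0 for k in ks}
    for row, gold in zip(ranked, gold_ids):
        top_ids = [index_ids[i] for i in row[:max(ks)]]
        for k in ks:
            if gold in top_ids[:k]:
                hits[k] += 1
    return {k: hits[k] / len(gold_ids) for k in ks}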

Training infrastructure:

  • PyTorch on Pitt CRC GPU nodes.

  • Checkpoint every 5 epochs.

  • Early stopping on validation set.
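
Schematically, the checkpoint and early-stopping pattern could look as follows; train_one_epoch, validate, the patience value, and the checkpoint paths are placeholders.

# Training-loop sketch: checkpoint every 5 epochs, stop when validation loss
# has not improved for `patience` consecutive epochs.
import torch

def train(model, optimizer, train_one_epoch, validate,
          max_epochs=100, patience=10, ckpt_dir="checkpoints"):
    best_val, stale_epochs = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, optimizer)
        val_loss = validate(model)
        if epoch % 5 == 0:
            torch.save({"epoch": epoch, "model": model.state_dict()},
                       f"{ckpt_dir}/epoch_{epoch:03d}.pt")
        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0
            torch.save(model.state_dict(), f"{ckpt_dir}/best.pt")
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break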

5.3. Embedding Refresh Cycle

Problem: When the model is retrained, embeddings produced by the previous version occupy a different vector space and are no longer comparable to embeddings from the new one.

Solution: Versioned embeddings with blue-green deployment.

  1. Train new model v4_20251201.

  2. Re-embed all IPA entries at Pitt.

  3. Push to Elasticsearch with embedding_version: "v4_20251201".

  4. Deploy inference model to DigitalOcean:

    • Export trained Siamese encoder to ONNX format (see the export sketch after this list).

    • Quantise to INT8 for faster inference (optional).

    • Deploy to Django application server.

    • Update model version in Django settings.

  5. Django switches query filter from v3 to v4 (config flag).

  6. Gradual rollout:

    • Week 1: 10% of traffic on v4.

    • Week 2: 50%.

    • Week 3: 100%, delete v3 embeddings.
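
A sketch of the ONNX export in step 4, using the PhoneticEncoder sketch from section 5.2; the input/output names chosen here ("input", "embedding") match what the inference handler in section 5.4 expects, and the dynamic axes allow variable-length queries.

# ONNX export sketch for the trained encoder (step 4).
import torch

def export_encoder(model, out_path="model.onnx", max_len=32):
    model.eval()
    dummy = torch.zeros(1, max_len, dtype=torch.int64)        # one padded IPA sequence
    torch.onnx.export(
        model, dummy, out_path,
        input_names=["input"], output_names=["embedding"],
        dynamic_axes={"input": {0: "batch", 1: "seq_len"},     # variable batch and length
                      "embedding": {0: "batch"}},
        opset_version=17,
    )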

Monitoring: Compare search quality metrics (click-through rate, user feedback) across versions.

5.4. Real-Time Inference Model

Requirements:

  • Model must be small enough to load in Django process memory (<100MB).

  • Inference latency <10ms per query on DigitalOcean servers.

  • Compatible with CPU-only deployment (no GPU required).

Implementation:

# Django query handler
import json
import os

import numpy as np
from onnxruntime import InferenceSession

class PhoneticQueryEncoder:
    def __init__(self, model_path, version):
        self.session = InferenceSession(model_path)
        self.version = version
        # Phone-to-index vocabulary saved next to the model at export time;
        # the filename is an assumption and must match the training tokeniser.
        with open(os.path.join(os.path.dirname(model_path), "phone_vocab.json")) as f:
            self.vocab = json.load(f)

    def tokenize_ipa(self, ipa_string):
        """Map each IPA phone (character) to its vocabulary index."""
        # Index 1 is assumed to be the reserved unknown-phone token.
        return [self.vocab.get(phone, 1) for phone in ipa_string]

    def encode(self, ipa_string):
        """Generate embedding for query IPA string."""
        # Tokenize IPA at phone level
        tokens = self.tokenize_ipa(ipa_string)
        # Run inference (input/output names are fixed at ONNX export time)
        embedding = self.session.run(
            ['embedding'],
            {'input': np.array([tokens], dtype=np.int64)}
        )[0]
        return embedding.flatten()

Model optimisation:

  • Prune rare phone tokens (reduce embedding table size).

  • Distill BiLSTM to smaller student model if needed.

  • Cache embeddings for common queries (Redis).
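
A sketch of the Redis cache for common queries, wrapping the PhoneticQueryEncoder above with the redis-py client; the key scheme and TTL are placeholders.

# Redis caching sketch: embeddings for frequent queries stored as raw float32 bytes,
# keyed by model version + IPA string so a version switch invalidates old entries.
import numpy as np
import redis

class CachedQueryEncoder:
    def __init__(self, encoder, redis_url="redis://localhost:6379/0", ttl=86400):
        self.encoder = encoder                      # a PhoneticQueryEncoder instance
        self.cache = redis.Redis.from_url(redis_url)
        self.ttl = ttl

    def encode(self, ipa_string):
        key = f"phonetic:{self.encoder.version}:{ipa_string}"
        cached = self.cache.get(key)
        if cached is not None:
            return np.frombuffer(cached, dtype=np.float32)
        embedding = self.encoder.encode(ipa_string)
        self.cache.setex(key, self.ttl, embedding.astype(np.float32).tobytes())
        return embedding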

Deployment strategy:

  • Store model files in /opt/whg/models/phonetic/v{version}/.

  • Django loads model on startup.

  • Graceful fallback to previous version if new model fails to load.

  • Hot-reload capability for model updates without Django restart.
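
One way the fallback and hot-reload could look, as a sketch only: try the configured version, fall back to the previous one on failure, and swap the active encoder in place at runtime. The directory layout follows /opt/whg/models/phonetic/v{version}/; the model.onnx filename is an assumption.

# Fallback/hot-reload sketch for the Django-side encoder.
# PhoneticQueryEncoder is the handler defined earlier in this section.
import logging

MODEL_ROOT = "/opt/whg/models/phonetic"
logger = logging.getLogger(__name__)

def load_encoder(version, fallback_version=None):
    """Load the requested model version, falling back to the previous one on failure."""
    try:
        return PhoneticQueryEncoder(f"{MODEL_ROOT}/v{version}/model.onnx", version)
    except Exception:
        logger.exception("Failed to load phonetic model v%s", version)
        if fallback_version is None:
            raise
        return PhoneticQueryEncoder(f"{MODEL_ROOT}/v{fallback_version}/model.onnx",
                                    fallback_version)

class EncoderHolder:
    """Holds the active encoder; swap() supports model updates without a Django restart."""
    def __init__(self, version, fallback_version=None):
        self.encoder = load_encoder(version, fallback_version)

    def swap(self, version, fallback_version=None):
        self.encoder = load_encoder(version, fallback_version)  # single attribute swap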