## 3. Data Flow
### 3.1. Initial Migration (One-Time, at Pitt)

**Goal:** Enrich all existing WHG toponyms with IPA and embeddings without disrupting production.

1. Extract all toponyms from production PostgreSQL via a read-only replica or database dump.
2. Generate IPA using Epitran for each `(toponym, language)` pair:
   - Apply language-specific G2P models.
   - Handle unsupported languages via multi-language fallback or transliteration.
3. Normalise IPA (see IPA Normalisation Rules).
4. Deduplicate by `(normalised_ipa, language)`.
5. Assign `ipa_id`: SHA-256 hash of `(normalised_ipa, language)`.
6. Generate embeddings: initially rule-based (PanPhon feature vectors); later replaced by a Siamese BiLSTM.
7. Build mappings:
   - `toponym_id → ipa_id`
   - `ipa_id → [canonical_toponym_samples]` (for display)
8. Bulk push to Elasticsearch:
   - Populate `ipa_index` with all unique IPA forms.
   - Update `toponym_index` with `ipa_id` foreign keys.
9. Verify with checksum validation (see Resilience Strategy).
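The `ipa_id` assignment in step 5 can be sketched as follows. The exact serialisation of the `(normalised_ipa, language)` pair (separator, encoding) is an illustrative assumption, not the project's documented format:

```python
import hashlib

def make_ipa_id(normalised_ipa: str, language: str) -> str:
    """Derive a deterministic ipa_id as the SHA-256 hash of the
    (normalised_ipa, language) pair. The '|' separator and UTF-8
    encoding here are illustrative assumptions."""
    key = f"{normalised_ipa}|{language}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()

# Identical inputs always yield the same ipa_id, which is what makes
# deduplication by (normalised_ipa, language) possible.
ipa_id = make_ipa_id("spɹɪŋfiːld", "eng")
```

Because the hash is content-derived, re-running the migration is idempotent: the same normalised form always maps to the same document in `ipa_index`.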
### 3.2. Ongoing Ingestion (New Datasets)

Lightweight real-time enrichment:

1. Django ingests the dataset with toponyms only.
2. An async background task generates IPA using lightweight Epitran on DigitalOcean.
3. The task updates `toponym_index` incrementally.
4. A periodic Pitt re-embedding job (monthly) ensures consistency.
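A minimal stand-in for this enrichment flow is sketched below, using a stdlib queue in place of the project's actual task runner and a dict in place of Elasticsearch; `generate_ipa` is a stub for the Epitran call:

```python
import queue
import threading

def generate_ipa(toponym: str, language: str) -> str:
    """Stub standing in for Epitran G2P conversion."""
    return f"/{toponym.lower()}/"  # placeholder transcription

toponym_index: dict = {}     # stands in for the Elasticsearch index
tasks: queue.Queue = queue.Queue()

def enrichment_worker() -> None:
    """Consume (toponym_id, toponym, language) tasks and update
    toponym_index incrementally, as the background job would."""
    while True:
        item = tasks.get()
        if item is None:     # sentinel: shut down
            break
        toponym_id, toponym, language = item
        toponym_index[toponym_id] = {
            "toponym": toponym,
            "ipa": generate_ipa(toponym, language),
        }
        tasks.task_done()

worker = threading.Thread(target=enrichment_worker)
worker.start()

# The ingest path only enqueues; IPA generation happens off the request cycle.
tasks.put((1, "Springfield", "eng"))
tasks.put((2, "København", "dan"))
tasks.put(None)
worker.join()
```

The point of the pattern is that ingestion never blocks on G2P: datasets become visible immediately, and phonetic fields appear as the worker catches up.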
### 3.3. Linking Toponyms to IPA

For each toponym:

1. Generate IPA using Epitran (online or offline).
2. Normalise the IPA.
3. Hash to get `ipa_id`.
4. Store the `toponym_id → ipa_id` mapping in `toponym_index`.
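End to end, the four steps above can be sketched as one function. `g2p` and `normalise` are simplified stubs for Epitran and the full normaliser, and the index is an in-memory dict rather than Elasticsearch:

```python
import hashlib
import unicodedata

def g2p(toponym: str, language: str) -> str:
    """Stub standing in for Epitran grapheme-to-phoneme conversion."""
    return {"Köln": "kœln"}.get(toponym, toponym.lower())

def normalise(ipa: str) -> str:
    """Simplified normalisation: NFC plus stress-mark removal."""
    ipa = unicodedata.normalize("NFC", ipa)
    return ipa.replace("ˈ", "").replace("ˌ", "").strip()

def link_toponym(toponym_id: int, toponym: str, language: str,
                 toponym_index: dict) -> str:
    """Steps 1-4: generate IPA, normalise, hash to ipa_id, store mapping."""
    ipa = normalise(g2p(toponym, language))
    ipa_id = hashlib.sha256(f"{ipa}|{language}".encode("utf-8")).hexdigest()
    toponym_index[toponym_id] = {"toponym": toponym, "ipa_id": ipa_id}
    return ipa_id

index: dict = {}
link_toponym(42, "Köln", "deu", index)
```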
### 3.4. Linking IPA to Places (Implicit)

**Design decision:** do NOT maintain `place_ids[]` arrays in `ipa_index`.

Rationale:

- Common toponyms like "Springfield" would create massive arrays.
- Array updates create write bottlenecks and reindexing overhead.

Instead, join at query time via `toponym_index`:
ipa_index (vector search)
→ ipa_id
→ toponym_index (filter by ipa_id)
→ toponym_id
→ place_index
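The query-time join can be illustrated with in-memory stand-ins for the three indexes (the real system would issue Elasticsearch queries; the field names here are illustrative assumptions mirroring the chain above):

```python
# In-memory stand-ins for the Elasticsearch indexes.
ipa_index = {"abc123": {"ipa": "spɹɪŋfiːld"}}
toponym_index = {
    10: {"ipa_id": "abc123", "place_id": 500},
    11: {"ipa_id": "abc123", "place_id": 501},
}
place_index = {
    500: {"name": "Springfield, IL"},
    501: {"name": "Springfield, MA"},
}

def places_for_ipa(ipa_id: str) -> list:
    """Join at query time: ipa_id → toponym_index → place_index.
    No place_ids[] array is ever stored on the IPA document."""
    toponyms = [t for t in toponym_index.values() if t["ipa_id"] == ipa_id]
    return [place_index[t["place_id"]] for t in toponyms]

results = places_for_ipa("abc123")
```

The common-toponym case falls out naturally: "Springfield" resolves to many places, but that fan-out lives in `toponym_index` documents, not in a mutable array on the IPA side.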
### 3.5. IPA Generation: Epitran + PanPhon

Epitran:

- Supports G2P mappings for 90+ languages.
- Produces IPA transcriptions directly from orthographic input.
- Handles language-specific phonological rules.

PanPhon:

- Provides phonetic feature vectors (24-dimensional) for each IPA segment.
- Used for:
  - Initial rule-based embeddings (averaged feature vectors).
  - Augmenting Siamese model inputs with articulatory features.

Fallback Strategy:

- If Epitran lacks a language: attempt transliteration → IPA via the nearest supported language.
- Log unsupported-language requests for future model expansion.
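The rule-based embedding (averaged feature vectors) might look like the sketch below. The vectors here are toy placeholders; in practice each segment's vector would come from PanPhon (e.g. via `panphon.FeatureTable`), which encodes articulatory features as -1/0/+1 values:

```python
def average_feature_vectors(segment_vectors: list) -> list:
    """Rule-based embedding: element-wise mean of per-segment
    phonetic feature vectors."""
    if not segment_vectors:
        return []
    dim = len(segment_vectors[0])
    return [sum(vec[i] for vec in segment_vectors) / len(segment_vectors)
            for i in range(dim)]

# Toy 4-dimensional vectors standing in for PanPhon's 24-dimensional ones.
vectors = [
    [1, -1, 0, 1],   # e.g. a fricative segment
    [1, 1, -1, 0],   # e.g. a stop segment
    [-1, 1, 1, 1],   # e.g. an approximant segment
]
embedding = average_feature_vectors(vectors)
```

Averaging loses segment order, which is one reason the design later replaces this with a Siamese BiLSTM that reads the segment sequence.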
### 3.6. IPA Normalisation Rules

**Critical:** consistent normalisation prevents duplicate `ipa_id` values.

1. Unicode NFC normalisation (canonical decomposition + composition).
2. Remove stress marks (`ˈ` primary, `ˌ` secondary) for the base form. Optionally retain them in a separate `ipa_stressed` field if needed for disambiguation.
3. Remove syllable boundaries (the `.` character) unless linguistically significant.
4. Canonical diacritic ordering (e.g., nasalisation before length).
5. Strip leading/trailing whitespace.
6. Lowercase (IPA is case-sensitive, but we normalise for consistency).

Documentation: the full normalisation code lives in whg/phonetic/ipa_normalizer.py with extensive unit tests.
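A simplified sketch of these rules follows; it omits canonical diacritic ordering and the `ipa_stressed` side channel, and is not the actual `ipa_normalizer.py` implementation:

```python
import unicodedata

STRESS_MARKS = {"\u02c8", "\u02cc"}   # ˈ primary, ˌ secondary
SYLLABLE_BOUNDARY = "."

def normalise_ipa(ipa: str) -> str:
    """Apply the normalisation rules in order: NFC, strip stress
    marks and syllable boundaries, trim whitespace, lowercase.
    Canonical diacritic ordering is omitted from this sketch."""
    ipa = unicodedata.normalize("NFC", ipa)
    ipa = "".join(ch for ch in ipa
                  if ch not in STRESS_MARKS and ch != SYLLABLE_BOUNDARY)
    return ipa.strip().lower()

normalise_ipa(" ˈspɹɪŋ.fiːld ")  # → "spɹɪŋfiːld"
```

Applying the rules in a fixed order matters: since `ipa_id` is a hash of the normalised form, any two code paths that normalise differently would silently mint duplicate IDs for the same pronunciation.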