5. Data Flow

5.1. Authority File Ingestion

Large-scale authority datasets are processed on Slurm workers with a dedicated staging Elasticsearch instance. This isolates the indexing workload from the production VM.

5.1.1. Pipeline

1. Download authority data to /ix1/whcdh/data/authorities/{namespace}/
   ↓
2. Parse source format (JSON, CSV, N-Triples, etc.)
   ↓
3. For each place:
   - Extract toponyms with language tags
   - Build locations array with geometry
   - Create cross-references (relations)
   - Store toponym references as name@lang identifiers
   ↓
4. Collect unique toponyms across all places:
   - Deduplicate by name@lang key
   - Each unique toponym processed once
   ↓
5. Batch embedding generation on compute nodes
   - Process unique toponyms through Siamese BiLSTM
   - Generate 128-dimensional vectors
   ↓
6. Index places and toponyms to staging Elasticsearch
   ↓
7. Validate counts and sample data on staging
   ↓
8. Create snapshot to /ix1/whcdh/elastic/snapshots/
   ↓
9. Restore snapshot to production VM
   ↓
10. Switch aliases atomically on production (sketched after this list)
   ↓
11. Validate production indices
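
A minimal sketch of steps 9-10, assuming an elasticsearch-py 7.x-style client and hypothetical repository, snapshot, index, and alias names (whg_snapshots, authorities_v2, places_v2/toponyms_v2 behind places/toponyms); the real names differ per ingestion run:

from elasticsearch import Elasticsearch

prod_es = Elasticsearch('http://localhost:9200')  # production VM (hypothetical URL)

# Step 9: restore the staged indices from the shared repository on /ix1
prod_es.snapshot.restore(
    repository='whg_snapshots',
    snapshot='authorities_v2',
    body={'indices': 'places_v2,toponyms_v2', 'include_global_state': False},
    wait_for_completion=True,
)

# Step 10: repoint the serving aliases in a single atomic call
prod_es.indices.update_aliases(body={'actions': [
    {'remove': {'index': 'places_v1', 'alias': 'places'}},
    {'add': {'index': 'places_v2', 'alias': 'places'}},
    {'remove': {'index': 'toponyms_v1', 'alias': 'toponyms'}},
    {'add': {'index': 'toponyms_v2', 'alias': 'toponyms'}},
]})

# Step 11: basic sanity check that the restored alias is serving documents
assert prod_es.count(index='places')['count'] > 0

Because all alias actions in a single update_aliases request are applied atomically, queries never observe a half-switched state.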

5.1.2. Rationale for Ephemeral Staging Instance

Authority files contain tens of millions of records. Indexing directly on the production VM would:

  • Consume CPU and memory needed for query serving

  • Risk making the VM unresponsive during bulk operations

  • Prevent validation before production exposure

The staging instance on a Slurm worker handles the heavy lifting. It is ephemeral: spun up when the job starts, destroyed when the job completes. Snapshots written to /ix1 before job completion are the sole mechanism for persisting the indexed data.
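
A sketch of that persistence step, assuming an elasticsearch-py 7.x-style client and the same hypothetical names as in the sketch above; for a shared-filesystem repository, the /ix1 snapshot path must also be listed under path.repo in the staging node's elasticsearch.yml:

from elasticsearch import Elasticsearch

staging_es = Elasticsearch('http://localhost:9200')  # ephemeral instance on the Slurm worker

# Register a shared-filesystem snapshot repository on /ix1
staging_es.snapshot.create_repository(
    repository='whg_snapshots',
    body={'type': 'fs',
          'settings': {'location': '/ix1/whcdh/elastic/snapshots/'}},
)

# Write the snapshot before the Slurm job (and the staging instance) goes away
staging_es.snapshot.create(
    repository='whg_snapshots',
    snapshot='authorities_v2',
    body={'indices': 'places_v2,toponyms_v2', 'include_global_state': False},
    wait_for_completion=True,
)

The production VM registers the same repository (read-only registration is sufficient there) so that the restore in step 9 can read the snapshot from /ix1.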

5.2. WHG-Contributed Dataset Ingestion

Scholarly datasets contributed by WHG users follow a similar staged workflow, but with on-the-fly embedding generation due to smaller volumes.

5.2.1. Pipeline

1. Export dataset from WHG Django/PostgreSQL
   ↓
2. Convert to canonical JSON format
   - Same schema as authority ingestion
   - Preserve dataset provenance and attribution
   ↓
3. Extract unique toponyms from dataset (sketched after this list)
   - Deduplicate by name@lang key
   - Check which toponyms already exist in index
   ↓
4. For each new unique toponym:
   - Generate embedding via Siamese BiLSTM model
   ↓
5. Index places and new toponyms to staging or production
   ↓
6. Validate dataset integrity
   ↓
7. Create snapshot (if staging on Slurm worker)
   ↓
8. Restore to production / switch aliases
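
A minimal sketch of step 3's deduplication, assuming hypothetical canonical-JSON field names and that toponym documents are keyed by their name@lang identifier:

def toponym_key(name, lang):
    """Canonical name@lang identifier used for deduplication."""
    return f"{name.strip()}@{(lang or 'und').lower()}"

# Collect unique toponyms across all places in the contributed dataset
unique = {}
for place in dataset_places:          # canonical-JSON records (hypothetical variable)
    for n in place['names']:          # assumed shape: [{'toponym': ..., 'lang': ...}, ...]
        key = toponym_key(n['toponym'], n.get('lang'))
        unique.setdefault(key, {
            'toponym_id': key,
            'name': n['toponym'],
            'lang': n.get('lang') or 'und',
        })

dataset_toponyms = list(unique.values())   # feeds the existence check in 5.3.2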

5.2.2. Rationale for Flexible Staging

WHG-contributed datasets are much smaller (~200k places) and arrive incrementally. Depending on volume:

  • Small batches: Can be indexed directly on the VM during low-usage periods

  • Larger contributions: Use Slurm staging to avoid production impact

New toponyms from contributions are checked against the existing toponyms index; only genuinely new name@lang combinations require embedding generation.
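
The check can be batched rather than issued per toponym. A sketch using the multi-get API, assuming an elasticsearch-py 7.x-style client and a toponyms index whose document IDs are the name@lang identifiers:

def filter_new_toponyms(es_client, toponyms, index='toponyms', chunk=1000):
    """Return only the toponyms whose name@lang id is not yet indexed."""
    new = []
    for i in range(0, len(toponyms), chunk):
        batch = toponyms[i:i + chunk]
        resp = es_client.mget(index=index,
                              body={'ids': [t['toponym_id'] for t in batch]},
                              _source=False)
        found = {d['_id'] for d in resp['docs'] if d.get('found')}
        new.extend(t for t in batch if t['toponym_id'] not in found)
    return new

Only the toponyms returned here proceed to embedding generation (5.3.2).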

5.3. Embedding Generation

Siamese BiLSTM embeddings are generated for unique toponyms only, prior to snapshot transfer.

5.3.1. Batch Generation (Authority Files)

Batch generation runs on Pitt CRC compute nodes with GPU acceleration, and the results are indexed to the staging Elasticsearch instance:

# Process unique toponyms in large batches.
# scroll_toponyms_without_embeddings and bulk_update are helper functions wrapping
# the Elasticsearch scroll and bulk partial-update operations on the staging instance.
for batch in scroll_toponyms_without_embeddings(staging_es, batch_size=10000):
    # Extract names
    names = [doc['name'] for doc in batch]
    
    # Generate embeddings (GPU-accelerated batch inference)
    embeddings = siamese_bilstm_model.embed_batch(names)
    
    # Bulk update to staging
    updates = [
        {'_id': doc['toponym_id'], 'doc': {'embedding_bilstm': emb}}
        for doc, emb in zip(batch, embeddings)
    ]
    bulk_update(staging_es, updates)

5.3.2. Incremental Generation (WHG Contributions)

Incremental generation runs during ingestion, for new unique toponyms only:

# Check which toponyms are genuinely new
new_toponyms = [t for t in dataset_toponyms 
                if not toponym_exists(es_client, t['toponym_id'])]

# Generate embeddings only for new toponyms
for toponym in new_toponyms:
    embedding = siamese_bilstm_model.embed(toponym['name'])
    toponym['embedding_bilstm'] = embedding
    index_toponym(es_client, toponym)

This deduplication means a contribution of 10,000 places might only require embedding generation for a few hundred genuinely new name@lang combinations.

The same trained Siamese BiLSTM model is deployed to both staging (Slurm worker) and production (VM), ensuring consistent embeddings across the corpus.
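
One lightweight way to enforce that both hosts load the identical artifact, assuming a hypothetical shared (or mirrored) model path and a digest pinned at training time:

import hashlib

MODEL_PATH = '/ix1/whcdh/models/siamese_bilstm.pt'   # hypothetical artifact location
PINNED_SHA256 = 'digest-recorded-at-training-time'   # hypothetical pinned value

def verify_model_artifact(path=MODEL_PATH, expected=PINNED_SHA256):
    """Refuse to generate embeddings from a model file that differs from the pinned one."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    if h.hexdigest() != expected:
        raise RuntimeError(f'Model artifact mismatch: {h.hexdigest()} != {expected}')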

5.4. Incremental Updates

For new dataset ingestion after initial population:

  1. Ingest new places/toponyms to versioned indices

  2. Generate embeddings for new unique toponyms only

  3. Merge with existing data or create new index version (sketched after this list)

  4. Switch aliases when ready
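
A sketch of the versioning convention, assuming an elasticsearch-py 7.x-style client, index names of the form places_vN behind a places alias, and a hypothetical PLACE_MAPPING dict:

# Resolve the index currently behind the serving alias, e.g. 'places_v2'
current = next(iter(prod_es.indices.get_alias(name='places')))
next_index = f"places_v{int(current.rsplit('_v', 1)[1]) + 1}"

# Steps 1 and 3: create the next versioned index and ingest the new/merged data into it
prod_es.indices.create(index=next_index, body=PLACE_MAPPING)
# ... bulk-index new places/toponyms and their embeddings here ...

# Step 4: switch the alias once validation passes
prod_es.indices.update_aliases(body={'actions': [
    {'remove': {'index': current, 'alias': 'places'}},
    {'add': {'index': next_index, 'alias': 'places'}},
]})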

Quarterly full re-embedding ensures consistency across the corpus as the model improves.