4. Elasticsearch Index Design¶
4.1. IPA Index¶
{
"ipa_id": "sha256_hash", // keyword (primary key)
"ipa": "spɹɪŋfild", // text + ngram analyzer
"ipa_stressed": "ˈspɹɪŋˌfild", // text (optional, for disambiguation)
"language": "en", // keyword
"embedding": [0.23, -0.15, ...], // dense_vector (64-256 dim)
"embedding_version": "v3_20251117", // keyword (critical for model updates)
"canonical_toponyms": [ // text[] (sample representations)
"Springfield",
"Springfeild" // historical variant
],
"panphon_features": [...], // dense_vector (24-dim, optional)
"last_updated": "2025-11-17T10:30:00Z" // date
}
Analyzers:
ipafield: Custom ngram analyzer (2-4 grams) for fuzzy phonetic matching.canonical_toponyms: Standard multilingual text analyzer.
4.2. Toponym Index¶
{
"toponym_id": "uuid", // keyword
"toponym": "Springfield", // text (multilingual analyzers)
"language": "en", // keyword
"ipa_id": "sha256_hash", // keyword (foreign key to ipa_index)
"place_ids": ["place_123", ...], // keyword[] (many-to-many)
"variants": ["Springfeild"], // text[] (historical spellings)
"last_updated": "2025-11-17T10:30:00Z"
}
4.3. Place Index¶
{
"place_id": "place_123", // keyword
"title": "Springfield, MA", // text
"geometry": {...}, // geo_shape
// ... existing WHG fields ...
"toponym_ids": ["uuid1", ...], // keyword[] (foreign keys)
"primary_ipa_id": "sha256_hash" // keyword (optional denormalisation)
}
Optional optimisation: Denormalise primary_ipa_id in place_index for direct vector search on places (bypasses toponym join for common cases).