4. Components
4.1. Infrastructure
The system uses a two-instance Elasticsearch architecture on Pitt CRC infrastructure:
| Component | Location | Purpose |
|---|---|---|
| Production Elasticsearch | VM, /ix3 (flash) | Live indices, query serving |
| Staging Elasticsearch | Slurm worker, local NVMe scratch | Index building, embedding enrichment |
| Authority files | /ix1 (bulk) | Source datasets |
| Snapshots | /ix1 (bulk) | Transfer mechanism, backup |
| Processing scripts | /ix1 (bulk) | Ingestion, embedding generation |
The staging instance is ephemeral, spun up only for the duration of indexing jobs. It runs on local NVMe scratch storage ($SLURM_SCRATCH, ~870GB available), providing fast I/O for indexing operations. When the Slurm job completes, the staging instance and its local data are automatically cleaned up. Snapshots written to the shared /ix1 filesystem persist the completed indices for transfer to production.
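The snapshot-based handoff from staging to production can be sketched as Elasticsearch snapshot/restore API calls. The repository name, snapshot name, and /ix1 path below are illustrative placeholders, not the project's actual values; the functions simply build the REST request bodies:

```python
# Sketch of the staging-to-production transfer via a shared-filesystem
# snapshot repository on /ix1. Each function returns (method, path, body)
# for the corresponding Elasticsearch REST call. Names are illustrative.

def register_fs_repository(location: str) -> tuple[str, str, dict]:
    """Register an 'fs' snapshot repository (location must be listed in
    path.repo on both the staging and production nodes)."""
    return ("PUT", "/_snapshot/ix1_transfer", {
        "type": "fs",
        "settings": {"location": location, "compress": True},
    })

def snapshot_indices(indices: list[str]) -> tuple[str, str, dict]:
    """Snapshot the freshly built indices before the Slurm job ends."""
    return ("PUT", "/_snapshot/ix1_transfer/staging_build?wait_for_completion=true", {
        "indices": ",".join(indices),
        "include_global_state": False,
    })

def restore_indices(indices: list[str]) -> tuple[str, str, dict]:
    """Restore the snapshot on the production instance."""
    return ("POST", "/_snapshot/ix1_transfer/staging_build/_restore", {
        "indices": ",".join(indices),
    })
```

Because the repository lives on the shared /ix1 filesystem, the same snapshot written by the ephemeral staging instance is visible to production without any explicit copy step.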
4.2. Elasticsearch Indices
4.2.1. Places Index
Core gazetteer records with geometry and metadata. Each place references toponyms by their name@lang identifiers.
A place may have both phonetic variants of the same name and etymologically distinct endonyms/exonyms:
```json
{
  "place_id": "wd:Q183",
  "namespace": "wd",
  "label": "Germany",
  "toponyms": ["Deutschland@de", "Germany@en", "Allemagne@fr", "Германия@ru", "ドイツ@ja"],
  "ccodes": ["DE"],
  "locations": [{
    "geometry": { "type": "Point", "coordinates": [10.45, 51.17] },
    "rep_point": { "lon": 10.45, "lat": 51.17 }
  }],
  "types": [{ "identifier": "Q6256", "label": "wikidata", "sourceLabel": "country" }],
  "relations": [{ "relationType": "sameAs", "relationTo": "gn:2921044", "certainty": 1.0 }]
}
```
Here “Deutschland” is the endonym, while “Germany”, “Allemagne”, “Германия”, and “ドイツ” are exonyms; together these names derive from several distinct etymological roots, so they should not all cluster together phonetically.
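Expanding a place document's name@lang references into names and language codes is a one-liner; the helper names below are illustrative, not part of the codebase:

```python
def split_toponym_id(toponym_id: str) -> tuple[str, str]:
    """Split a name@lang identifier into (name, lang).

    rpartition on the final '@' keeps any '@' inside the name intact.
    """
    name, _, lang = toponym_id.rpartition("@")
    return name, lang

def toponym_refs(place: dict) -> list[tuple[str, str]]:
    """Expand a place document's toponym references into (name, lang) pairs."""
    return [split_toponym_id(t) for t in place.get("toponyms", [])]
```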
4.2.2. Toponyms Index
Unique name@language combinations with phonetic data. Each toponym appears once in this index, regardless of how many places share it:
- Unique name@lang identifier (e.g., “London@en”)
- Siamese BiLSTM phonetic embedding (128 dimensions)
- Completion suggester for type-ahead
```json
{
  "toponym_id": "London@en",
  "name": "London",
  "name_lower": "london",
  "lang": "en",
  "embedding_bilstm": [0.23, -0.15, ...],
  "suggest": { "input": ["London"], "contexts": { "lang": ["en"] } }
}
```
This design ensures:
- Embedding efficiency: each unique toponym is embedded once, not once per place
- Storage optimisation: ~80M unique toponyms vs. potentially hundreds of millions of place-toponym pairs
- Query simplicity: search the toponyms index, then join to places
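The two-step search pattern can be sketched as the request bodies for each step. This assumes the Elasticsearch 8.x kNN search API and the field names shown in the index examples above; vector values and index names are placeholders:

```python
def toponym_knn_query(query_vector: list[float], k: int = 10) -> dict:
    """Step 1: kNN body for the toponyms index, matching on the
    128-dimensional embedding_bilstm field."""
    return {
        "knn": {
            "field": "embedding_bilstm",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": k * 10,  # widen the ANN candidate pool
        },
        "_source": ["toponym_id"],
    }

def places_join_query(toponym_ids: list[str]) -> dict:
    """Step 2: fetch every place that references any matched toponym
    via its name@lang identifier."""
    return {"query": {"terms": {"toponyms": toponym_ids}}}
```

The join is cheap because the places index stores toponyms as exact name@lang keys, so the second step is a simple terms filter rather than another vector search.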
4.3. Processing Components
4.3.1. Phonetic Embeddings (Siamese BiLSTM)
A character-level bidirectional LSTM, trained using a Siamese architecture, generates 128-dimensional dense vectors from toponym text. The Siamese training approach uses pairs (or triplets) of toponyms with shared weights to learn phonetic similarity directly.
The model:
- Processes character sequences directly (no IPA required)
- Learns phonetic similarity from positive/negative pairs
- Generalises across scripts and languages
- Enables approximate nearest neighbour search via Elasticsearch kNN
The Siamese BiLSTM approach was chosen over rule-based alternatives (e.g., PanPhon feature vectors) because it handles scripts and languages without explicit phonological rules, and learns what “sounds similar” from real-world toponym equivalences.
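The pair/triplet objective behind the Siamese training can be illustrated independently of the BiLSTM encoder. A minimal sketch, assuming embeddings are plain float vectors and using a standard triplet margin loss with cosine distance (the margin value is illustrative):

```python
import math

def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; 0 for identical directions, up to 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor: list[float], positive: list[float],
                 negative: list[float], margin: float = 0.2) -> float:
    """Siamese/triplet objective: the shared-weight encoder is trained so the
    anchor sits closer to the positive (a known phonetic equivalent) than to
    the negative, by at least `margin`."""
    return max(0.0, cosine_distance(anchor, positive)
                    - cosine_distance(anchor, negative) + margin)
```

During training, both embeddings in a pair come from the same network (shared weights), so minimising this loss shapes a single embedding space in which kNN distance approximates phonetic similarity.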
4.3.2. Completion Suggester
The suggest field provides type-ahead autocomplete functionality, with language context for filtering suggestions by user locale.
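A context-filtered suggest request against the toponyms index might look like the following. The suggestion name and size are placeholders; the field and context names follow the toponym document example above:

```python
def suggest_query(prefix: str, lang: str, size: int = 8) -> dict:
    """Completion-suggester body filtered to one language context,
    for type-ahead against the toponyms index."""
    return {
        "suggest": {
            "toponym-suggest": {
                "prefix": prefix,
                "completion": {
                    "field": "suggest",
                    "size": size,
                    "contexts": {"lang": [lang]},
                },
            }
        }
    }
```

With the user's locale passed as the lang context, a prefix such as “Lond” returns only suggestions indexed under that language.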