4. Components

4.1. Infrastructure

The system uses a two-instance Elasticsearch architecture on Pitt CRC infrastructure:

Component                  Location                            Purpose
Production Elasticsearch   VM, /ix3 (flash)                    Live indices, query serving
Staging Elasticsearch      Slurm worker, local NVMe scratch    Index building, embedding enrichment
Authority files            /ix1 (bulk)                         Source datasets
Snapshots                  /ix1 (bulk)                         Transfer mechanism, backup
Processing scripts         /ix1 (bulk)                         Ingestion, embedding generation

The staging instance is ephemeral, spun up only for the duration of indexing jobs. It runs on local NVMe scratch storage ($SLURM_SCRATCH, ~870GB available), providing fast I/O for indexing operations. When the Slurm job completes, the staging instance and its local data are automatically cleaned up. Snapshots written to the shared /ix1 filesystem persist the completed indices for transfer to production.
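
The hand-off from staging to production at the end of a run might look like the following sketch against the snapshot REST API. The host, repository and snapshot names, index names, and the /ix1 path are illustrative placeholders, not the actual deployment values:

import requests

STAGING = "http://localhost:9200"          # staging instance on the Slurm worker (placeholder)
REPO = "ix1_transfer"                      # snapshot repository name (placeholder)
REPO_PATH = "/ix1/project/es-snapshots"    # placeholder; must appear in path.repo on both instances

# Register a shared-filesystem snapshot repository on /ix1.
requests.put(f"{STAGING}/_snapshot/{REPO}", json={
    "type": "fs",
    "settings": {"location": REPO_PATH},
}).raise_for_status()

# Snapshot the freshly built indices; production later restores from the same repository.
requests.put(
    f"{STAGING}/_snapshot/{REPO}/indexing-run-001",
    params={"wait_for_completion": "true"},
    json={"indices": "places,toponyms", "include_global_state": False},
).raise_for_status()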

4.2. Elasticsearch Indices

4.2.1. Places Index

Core gazetteer records with geometry and metadata. Each place references toponyms by their name@lang identifiers.

A place may have both phonetic variants of the same name and etymologically distinct endonyms/exonyms:

{
  "place_id": "wd:Q183",
  "namespace": "wd",
  "label": "Germany",
  "toponyms": ["Deutschland@de", "Germany@en", "Allemagne@fr", "Германия@ru", "ドイツ@ja"],
  "ccodes": ["DE"],
  "locations": [{
    "geometry": { "type": "Point", "coordinates": [10.45, 51.17] },
    "rep_point": { "lon": 10.45, "lat": 51.17 }
  }],
  "types": [{ "identifier": "Q6256", "label": "wikidata", "sourceLabel": "country" }],
  "relations": [{ "relationType": "sameAs", "relationTo": "gn:2921044", "certainty": 1.0 }]
}

Here “Deutschland” is the endonym, while “Germany”, “Allemagne”, “Германия”, and “ドイツ” are exonyms with distinct etymological roots; these names should not cluster together phonetically.
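
A mapping in roughly this shape would support the example record above. This is a sketch assuming Elasticsearch 8.x: the field types (geo_shape, geo_point, nested relations) are illustrative choices and the host is a placeholder.

import requests

PLACES_MAPPINGS = {
    "properties": {
        "place_id":  {"type": "keyword"},
        "namespace": {"type": "keyword"},
        "label":     {"type": "text"},
        # name@lang references into the toponyms index; keyword so they can be joined exactly.
        "toponyms":  {"type": "keyword"},
        "ccodes":    {"type": "keyword"},
        "locations": {
            "properties": {
                "geometry":  {"type": "geo_shape"},
                "rep_point": {"type": "geo_point"},
            }
        },
        "types": {
            "properties": {
                "identifier":  {"type": "keyword"},
                "label":       {"type": "keyword"},
                "sourceLabel": {"type": "keyword"},
            }
        },
        "relations": {
            "type": "nested",
            "properties": {
                "relationType": {"type": "keyword"},
                "relationTo":   {"type": "keyword"},
                "certainty":    {"type": "float"},
            }
        },
    }
}

requests.put("http://localhost:9200/places",       # host is illustrative
             json={"mappings": PLACES_MAPPINGS}).raise_for_status()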

4.2.2. Toponyms Index

Unique name@language combinations with phonetic data. Each toponym appears once in this index, regardless of how many places share it:

  • Unique name@lang identifier (e.g., “London@en”)

  • Siamese BiLSTM phonetic embedding (128 dimensions)

  • Completion suggester for type-ahead

{
  "toponym_id": "London@en",
  "name": "London",
  "name_lower": "london",
  "lang": "en",
  "embedding_bilstm": [0.23, -0.15, ...],
  "suggest": { "input": ["London"], "contexts": { "lang": ["en"] } }
}
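
A corresponding mapping sketch, assuming Elasticsearch 8.x dense_vector kNN support; the similarity metric, analyzer defaults, and host are illustrative assumptions.

import requests

# Field names follow the example document above.
TOPONYMS_MAPPINGS = {
    "properties": {
        "toponym_id": {"type": "keyword"},
        "name":       {"type": "text"},
        "name_lower": {"type": "keyword"},
        "lang":       {"type": "keyword"},
        # 128-dimensional Siamese BiLSTM vector, indexed for approximate kNN search.
        "embedding_bilstm": {
            "type": "dense_vector",
            "dims": 128,
            "index": True,
            "similarity": "cosine",
        },
        # Completion suggester with a language category context for locale filtering.
        "suggest": {
            "type": "completion",
            "contexts": [{"name": "lang", "type": "category"}],
        },
    }
}

requests.put("http://localhost:9200/toponyms",      # staging instance; host is illustrative
             json={"mappings": TOPONYMS_MAPPINGS}).raise_for_status()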

This design ensures:

  • Embedding efficiency: Each unique toponym is embedded once, not once per place

  • Storage optimisation: ~80M unique toponyms vs potentially hundreds of millions of place-toponym pairs

  • Query simplicity: Search the toponyms index, then join to places
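
The two-step query flow could be sketched as follows. Index and field names follow the examples above; the host, the helper name, and parameters such as k and num_candidates are illustrative assumptions.

import requests

ES = "http://localhost:9200"   # production instance; host is illustrative

def search_places_by_sound(query_vector, lang=None, k=25):
    """Approximate kNN over toponym embeddings, then join matches back to places."""
    knn = {
        "field": "embedding_bilstm",
        "query_vector": query_vector,      # 128-dim vector from the phonetic encoder
        "k": k,
        "num_candidates": 10 * k,
    }
    if lang:
        knn["filter"] = {"term": {"lang": lang}}   # optionally restrict by language

    hits = requests.post(f"{ES}/toponyms/_search", json={
        "knn": knn,
        "_source": ["toponym_id"],
        "size": k,
    }).json()["hits"]["hits"]
    toponym_ids = [h["_source"]["toponym_id"] for h in hits]

    # Places reference toponyms by name@lang identifiers, so a terms query performs the join.
    return requests.post(f"{ES}/places/_search", json={
        "query": {"terms": {"toponyms": toponym_ids}},
        "size": 100,
    }).json()["hits"]["hits"]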

4.3. Processing Components

4.3.1. Phonetic Embeddings (Siamese BiLSTM)

A character-level bidirectional LSTM, trained using a Siamese architecture, generates 128-dimensional dense vectors from toponym text. The Siamese training approach uses pairs (or triplets) of toponyms with shared weights to learn phonetic similarity directly.

The model:

  • Processes character sequences directly (no IPA required)

  • Learns phonetic similarity from positive/negative pairs

  • Generalises across scripts and languages

  • Enables approximate nearest neighbour search via Elasticsearch kNN

The Siamese BiLSTM approach was chosen over rule-based alternatives (e.g., PanPhon feature vectors) because it handles scripts and languages without explicit phonological rules, and learns what “sounds similar” from real-world toponym equivalences.
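
An encoder of this general shape can be sketched in PyTorch. The layer sizes, vocabulary size, pooling strategy, and the use of a triplet margin loss are illustrative assumptions rather than the actual training configuration.

import torch
import torch.nn as nn

class ToponymEncoder(nn.Module):
    """Character-level BiLSTM that maps a toponym to a 128-dimensional embedding."""

    def __init__(self, vocab_size, char_dim=64, hidden_dim=64, out_dim=128):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        # Bidirectional LSTM over the character sequence.
        self.lstm = nn.LSTM(char_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, char_ids):                      # char_ids: (batch, seq_len)
        x = self.char_embed(char_ids)                 # (batch, seq_len, char_dim)
        _, (h_n, _) = self.lstm(x)                    # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)       # final forward + backward states
        z = self.proj(h)                              # (batch, 128)
        return nn.functional.normalize(z, dim=-1)     # unit vectors suit cosine similarity

# Siamese training: the same encoder (shared weights) embeds anchor, positive, and negative
# toponyms, and a triplet loss pulls phonetic equivalents together.
encoder = ToponymEncoder(vocab_size=512)
triplet_loss = nn.TripletMarginLoss(margin=0.2)
# loss = triplet_loss(encoder(anchor_ids), encoder(positive_ids), encoder(negative_ids))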

4.3.2. Completion Suggester

The suggest field provides type-ahead autocomplete functionality, with language context for filtering suggestions by user locale.
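
A context-filtered suggester request might look like the following sketch; the host and the suggestion block name are illustrative.

import requests

resp = requests.post("http://localhost:9200/toponyms/_search", json={
    "suggest": {
        "toponym_suggest": {
            "prefix": "lond",                        # user's partial input
            "completion": {
                "field": "suggest",
                "size": 10,
                "contexts": {"lang": ["en"]},        # restrict suggestions to the user's locale
            },
        }
    }
})
suggestions = [opt["text"] for opt in resp.json()["suggest"]["toponym_suggest"][0]["options"]]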