1. Overview

This document outlines the architecture for multilingual phonetic search in the World Historical Gazetteer (WHG). The system uses phonetic embeddings to enable cross-lingual place name similarity search across approximately 80 million toponyms.

The infrastructure is hosted at the University of Pittsburgh Center for Research Computing (Pitt CRC), using a two-instance Elasticsearch architecture:

Production instance (VM on /ix3): Serves live queries with high availability
Staging instance (Slurm worker): Handles indexing workloads without impacting production

Authority source files and snapshots are maintained on bulk storage (/ix1), with snapshots serving as the transfer mechanism between staging and production.
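The staging-to-production transfer can be pictured as a short sequence of Elasticsearch snapshot API calls. The sketch below only builds the requests and does not send them. The repository name, the directory under /ix1, the snapshot name, and the use of `places` and `toponyms` as literal index names are all assumptions made for illustration; the actual values are not given here.

```python
import json

# Hypothetical values: the repository name and its path under /ix1 are
# assumptions, as is the use of "places" and "toponyms" as index names.
REPO = "whg_repo"
REPO_LOCATION = "/ix1/whg/es-snapshots"
INDICES = ["places", "toponyms"]


def register_repo(repo, location):
    """Request that registers a shared-filesystem snapshot repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"/_snapshot/{repo}", body)


def create_snapshot(repo, snapshot, indices):
    """Request that snapshots the given indices (run against staging)."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    path = f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return ("PUT", path, body)


def restore_snapshot(repo, snapshot, indices):
    """Request that restores the given indices (run against production)."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    path = f"/_snapshot/{repo}/{snapshot}/_restore?wait_for_completion=true"
    return ("POST", path, body)


def transfer_plan(snapshot):
    """Ordered (instance, method, path, body) steps for one transfer."""
    return [
        ("staging", *register_repo(REPO, REPO_LOCATION)),
        ("staging", *create_snapshot(REPO, snapshot, INDICES)),
        ("production", *register_repo(REPO, REPO_LOCATION)),
        ("production", *restore_snapshot(REPO, snapshot, INDICES)),
    ]


if __name__ == "__main__":
    for instance, method, path, body in transfer_plan("authority-snapshot-1"):
        print(instance, method, path, json.dumps(body))
```

Both instances must be able to read the same repository location, and a shared-filesystem repository also has to be allowed through the `path.repo` setting on each node. Restoring over an index that already exists and is open will fail, so a real workflow would need to close, delete, or rename the target first.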

1.1. Goals

Enable phonetic similarity search for historical place names
Support cross-lingual matching of phonetic variants (e.g., “München” ↔ “Munich” ↔ “Мюнхен”)
Distinguish phonetic variants from unrelated endonyms/exonyms (e.g., “Deutschland” vs “Germany” vs “Allemagne”)
Handle historical spelling variants and transcription differences
Provide robust fallback paths when phonetic matching fails

1.2. Why Not Elasticsearch’s Built-in Phonetic Analysis?

Elasticsearch provides phonetic token filters (Soundex, Metaphone, Double Metaphone, Beider-Morse, etc.), but these have significant limitations for multilingual gazetteer data:

Limitation	Impact on WHG
Mostly English-centric	Most algorithms are designed for English phonology; poor results for German, Slavic, Arabic, etc.
Limited script support	Most filters cannot process Cyrillic, Greek, Arabic, CJK, or other non-Latin scripts
No learning	Fixed rules cannot adapt to toponym-specific patterns or historical forms
Coarse matching	Binary bucket assignment produces many false positives and negatives
No cross-lingual awareness	Cannot learn that “München” and “Мюнхен” represent the same sounds
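To make the "coarse matching" and script limitations concrete, here is a minimal, simplified Soundex written from the classic rules. It is a sketch for illustration, not the code Elasticsearch runs; in particular, stripping diacritics with NFKD and silently dropping non-Latin characters are choices made here, not documented plugin behaviour.

```python
import unicodedata

# Classic Soundex consonant classes.
_CODES = {}
for _digit, _letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Simplified Soundex: first letter plus up to three digits."""
    # Assumption: decompose accents so that "ü" contributes a plain "U".
    decomposed = unicodedata.normalize("NFKD", name).upper()
    letters = [c for c in decomposed if "A" <= c <= "Z"]
    if not letters:
        return ""  # nothing left to encode, e.g. Cyrillic input
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate repeated codes
        code = _CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeat is coded again
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for word in ["Robert", "Rupert", "München", "Munich", "Мюнхен"]:
        print(word, repr(soundex(word)))
```

Two different names ("Robert" and "Rupert") fall into the same bucket, the German and English forms of Munich fall into different buckets, and the Cyrillic form yields no code at all. Each outcome is all-or-nothing: there is no score that says one pair is closer than another.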

The Siamese BiLSTM approach addresses these limitations:

Advantage	How it helps
Multilingual by construction	Trained on cross-lingual equivalences from Wikidata/GeoNames
Script-agnostic	Characters are tokens; learns patterns across Latin, Cyrillic, Greek, CJK
Domain-tuned	Trained specifically on place name equivalences, not general vocabulary
Continuous similarity	Returns distance scores enabling ranked results, not binary matches
Learnable	Can improve with more training data; adapts to historical forms if present in sources

The goal is not a perfect phonetic model, but a significant improvement over rule-based algorithms for the specific domain of historical and multilingual place names.
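The practical difference between bucket matching and continuous similarity is that embeddings can be ranked. The sketch below ranks candidates by cosine similarity to a query vector. The two-dimensional vectors are made up by hand purely to show the mechanics; they are not model output, and the candidate labels are placeholders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, candidates):
    """Return (label, score) pairs, most similar first."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hand-picked toy vectors, not real embeddings.
    query = [1.0, 0.0]
    candidates = {
        "near": [0.9, 0.1],
        "middling": [0.8, 0.3],
        "far": [0.0, 1.0],
    }
    for label, score in rank(query, candidates):
        print(f"{label}: {score:.3f}")
```

Because every candidate receives a score, a caller can apply a threshold or return the top few results, which a binary phonetic code cannot support.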

1.3. Limitations

This approach has known limitations:

Training data dependency: The model learns phonetic similarity from equivalences present in GeoNames and Wikidata. If a historical spelling variant is very different from anything in the training data, the model may not recognise it. Coverage depends on the richness of historical toponyms in the source authorities.

No explicit phonological knowledge: The model learns character-level patterns implicitly, not from linguistic rules. It has no understanding of historical sound changes (e.g., the Great Vowel Shift); it can only generalise from patterns it has seen.

Endonym/exonym clustering is imperfect: The initial PanPhon-based clustering that bootstraps training is limited by Epitran’s language coverage (~30 languages). Iterative refinement improves this, but errors in early clustering can propagate.

Novel scripts: Scripts not well-represented in training data (e.g., rare historical scripts, minority languages) will have weaker coverage.

Not a replacement for expert knowledge: For serious historical research, phonetic search is a discovery aid, not a definitive matching system. Results should be validated by domain experts.

Despite these limitations, the system should significantly outperform Elasticsearch’s built-in phonetic algorithms for multilingual and cross-script matching, which is the primary goal.

1.4. Architecture Summary

The phonetic search system extends the core WHG Elasticsearch deployment:

places index: Core place records with geometry, classifications, and cross-references
toponyms index: Unique name@language combinations with phonetic embeddings
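As a rough picture of what the toponyms index could look like, the following sketch builds an index-creation body with a `dense_vector` field. The field names (`name`, `lang`, `embedding`) and the embedding dimension are assumptions for illustration; the real mapping is not specified here.

```python
import json

# Assumption: the true embedding size is not stated in this overview.
EMBEDDING_DIMS = 128


def toponyms_index_body(dims=EMBEDDING_DIMS):
    """Index-creation body with hypothetical field names."""
    return {
        "mappings": {
            "properties": {
                "name": {"type": "keyword"},
                "lang": {"type": "keyword"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,           # build a kNN-searchable index
                    "similarity": "cosine",  # score by cosine similarity
                },
            }
        }
    }


if __name__ == "__main__":
    # This body would be sent with PUT /toponyms on an 8.x cluster.
    print(json.dumps(toponyms_index_body(), indent=2))
```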

The system uses a two-instance architecture: a persistent production Elasticsearch on the VM, and an ephemeral staging Elasticsearch spun up on Slurm workers only for the duration of indexing jobs. This protects production from the indexing workload while using compute-node resources for batch processing.

The toponyms index is designed for deduplication: each unique name@language string appears only once, regardless of how many places share that name. This optimises embedding generation (computed once per unique toponym) and storage (embeddings stored once, referenced by many places).
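The deduplication idea can be sketched in a few lines: each place contributes its names, each distinct name@language key is stored once, and places keep only references. The input record shape and the NFC normalisation step are assumptions for illustration, not the documented ingestion format.

```python
import unicodedata


def toponym_key(name, lang):
    """Build the name@language key; NFC normalisation is an assumption."""
    return f"{unicodedata.normalize('NFC', name)}@{lang}"


def build_indices(places):
    """Split place records into unique toponyms and place documents."""
    toponyms = {}    # key -> toponym document (embedding attached later)
    place_docs = []  # places carry references to toponym keys
    for place in places:
        keys = []
        for name, lang in place["names"]:
            key = toponym_key(name, lang)
            toponyms.setdefault(key, {"name": key.rsplit("@", 1)[0], "lang": lang})
            if key not in keys:
                keys.append(key)
        place_docs.append({"id": place["id"], "toponyms": keys})
    return toponyms, place_docs


if __name__ == "__main__":
    # Hypothetical records: two places share one English name.
    places = [
        {"id": "p1", "names": [("Springfield", "en")]},
        {"id": "p2", "names": [("Springfield", "en"), ("Спрингфилд", "ru")]},
    ]
    toponyms, place_docs = build_indices(places)
    print(len(toponyms), "unique toponyms for", len(place_docs), "places")
```

With this layout, an embedding only needs to be computed for each key in `toponyms`, however many places refer to it.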

Phonetic search is implemented via dense vector similarity on Siamese BiLSTM embeddings stored in the toponyms index. The system supports multiple query strategies with graceful degradation.
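One way to picture "multiple query strategies with graceful degradation" is a vector query that falls back to a fuzzy text query. This is a sketch under assumptions: the field names, the choice of a fuzzy `match` query as the fallback, and the `embed` and `run_search` callables are all hypothetical stand-ins, and the top-level `knn` search option assumes an Elasticsearch 8.x cluster.

```python
def knn_query(vector, k=10, num_candidates=100):
    """Approximate kNN search body over a hypothetical 'embedding' field."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


def text_fallback_query(name):
    """Fuzzy text search body over a hypothetical 'name' field."""
    return {"query": {"match": {"name": {"query": name, "fuzziness": "AUTO"}}}}


def search_with_fallback(name, embed, run_search):
    """Try phonetic (vector) search first, then fall back to text search.

    embed: callable mapping a name to a vector (may raise or return None).
    run_search: callable taking a search body and returning a list of hits.
    """
    try:
        vector = embed(name)
    except Exception:
        vector = None  # model unavailable or input not encodable
    if vector is not None:
        hits = run_search(knn_query(vector))
        if hits:
            return "phonetic", hits
    return "text", run_search(text_fallback_query(name))


if __name__ == "__main__":
    # Stubs standing in for the embedding model and the search client.
    def fake_embed(name):
        return [0.1, 0.2, 0.3]

    def fake_search(body):
        return ["vector-hit"] if "knn" in body else ["text-hit"]

    print(search_with_fallback("München", fake_embed, fake_search))
```

Returning the strategy name alongside the hits lets a caller report which path produced the results.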

1.5. Data Sources

The system indexes two categories of place data and derives a shared toponyms index from both:

1.5.1. Authority Files

Large-scale reference gazetteers used for reconciliation and enrichment:

GeoNames, Wikidata, Getty TGN, OpenStreetMap, Pleiades, and others
~39 million places
Indexed on staging Elasticsearch (Slurm worker)
Embeddings generated in batch on compute nodes
Transferred to production via snapshot/restore

1.5.2. WHG-Contributed Datasets

Scholarly datasets contributed by WHG users and partner projects:

~200,000 places (and growing)
The core research content of WHG
Converted to canonical JSON format for ingestion
Embeddings generated on-the-fly by the VM during indexing

1.5.3. Toponyms Index

The toponyms index contains unique name@language combinations extracted from all sources:

An estimated 80 million unique toponyms across all authorities and contributions
Each toponym stored once, regardless of how many places share it
Places reference toponyms; toponyms carry the embeddings
This design avoids redundant embedding computation and storage

Both authority and contributed places are searchable together, with phonetic matching powered by the shared toponyms index.

Reference: Technical background in WHG Place Discussion #81