15. Summary
This architecture provides a scalable, multilingual, phonetic-aware search system for the World Historical Gazetteer. The system is designed to handle approximately 40 million places and 80 million unique toponyms with sub-100ms query latency.
15.1. Key Design Decisions
15.1.1. Two-Instance Architecture
Staging and production Elasticsearch instances are separated to protect query performance:
Production (VM, /ix3): Persistent; serves live queries; receives completed indices via snapshot restore
Staging (Slurm worker): Ephemeral; spun up for indexing jobs; destroyed when job completes
The staging instance exists only for the duration of indexing operations. Snapshots written to /ix1 before job completion are the sole mechanism for persisting indexed data.
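A minimal sketch of that persistence step follows, using the snapshot REST API via `requests`; the host, repository name, snapshot name, and exact /ix1 path are illustrative assumptions, not fixed project values.

```python
# Sketch: register an fs snapshot repository on shared /ix1 and snapshot the
# freshly built indices from the staging node before the Slurm job ends.
import requests

STAGING_ES = "http://localhost:9200"   # staging instance on the Slurm worker (assumed)
REPO = "whg_snapshots"                 # hypothetical repository name
SNAPSHOT = "toponyms-build-01"         # hypothetical snapshot name

# The repository location must be listed under path.repo in elasticsearch.yml
# on both the staging and production clusters.
requests.put(
    f"{STAGING_ES}/_snapshot/{REPO}",
    json={"type": "fs", "settings": {"location": "/ix1/whg/es-snapshots"}},
    timeout=30,
).raise_for_status()

# Block until the snapshot completes so the job can exit without losing data.
requests.put(
    f"{STAGING_ES}/_snapshot/{REPO}/{SNAPSHOT}",
    params={"wait_for_completion": "true"},
    json={"indices": "placesindex,toponymsindex", "include_global_state": False},
    timeout=None,
).raise_for_status()
```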
15.1.2. Storage Allocation
Storage is allocated by I/O requirements:
Flash storage (/ix3): 750GB - 1TB for production Elasticsearch indices
Bulk storage (/ix1): 1TB for authority files and snapshots
Local NVMe scratch: ~870GB for staging ES (automatically provisioned per Slurm job)
15.1.3. Two-Index Architecture
placesindex: Core gazetteer records with geometry and metadata; references toponyms by name@lang
toponymsindex: Unique name@language combinations with phonetic embeddings
The toponyms index stores each unique name@language combination once, regardless of how many places share it. This optimises embedding generation and storage.
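A minimal sketch of the two mappings under this design; apart from the index split and the 128-dimensional dense_vector field, the field names and HNSW settings shown here are illustrative assumptions.

```python
# Sketch of the toponymsindex mapping: one document per unique name@language pair.
toponyms_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},            # toponym string
            "lang": {"type": "keyword"},            # language tag (assumed field)
            "name_lang": {"type": "keyword"},       # "name@lang" join key (assumed field)
            "embedding": {                          # 128-d Siamese BiLSTM vector
                "type": "dense_vector",
                "dims": 128,
                "index": True,
                "similarity": "cosine",
            },
            "suggest": {"type": "completion"},      # autocomplete support
        }
    }
}

# Sketch of the placesindex mapping: place records reference toponyms by key.
places_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "geometry": {"type": "geo_shape"},
            "toponym_keys": {"type": "keyword"},    # name@lang references into toponymsindex
        }
    }
}
```

Because each embedding lives once in toponymsindex, adding a place that reuses an existing toponym costs only a keyword reference, not another 128-dimensional vector.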
15.1.4. Siamese BiLSTM Embeddings
A character-level bidirectional LSTM trained with a Siamese architecture generates 128-dimensional vectors (a model sketch follows this list):
Learns phonetic similarity from positive/negative toponym pairs
Processes text directly (no IPA required at query time)
Generalises across scripts and languages
Enables efficient kNN search via Elasticsearch HNSW
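The following PyTorch sketch shows the shape of such an encoder and a contrastive-style pair loss; vocabulary size, hidden size, pooling, and the margin are illustrative assumptions rather than the trained model's actual hyperparameters.

```python
# Sketch: character-level BiLSTM encoder producing 128-d toponym embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToponymEncoder(nn.Module):
    def __init__(self, n_chars: int = 4096, char_dim: int = 64, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Bidirectional LSTM: 2 * hidden = 128-dimensional outputs.
        self.lstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(char_ids)          # (batch, seq, char_dim)
        out, _ = self.lstm(x)             # (batch, seq, 128)
        vec = out.mean(dim=1)             # mean-pool over characters
        return F.normalize(vec, dim=-1)   # unit-length 128-d embedding


def siamese_loss(a, b, label, margin: float = 0.5):
    """Contrastive loss over cosine distance; label=1 for similar toponym pairs."""
    dist = 1.0 - F.cosine_similarity(a, b)
    return (label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)).mean()
```

The same encoder is applied to both toponyms in a pair (the Siamese setup), so at query time a single forward pass over the raw text, with no IPA conversion, yields a vector ready for kNN search.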
15.1.5. Dual Data Sources
The system indexes both authority files and WHG-contributed datasets:
| Source | Places | Processing Location |
|---|---|---|
| Authority files | ~39M | Staging (Slurm worker) |
| WHG contributions | ~200K | Staging or production |
Unique toponyms: ~80M (deduplicated across all sources)
Both sources share the same indices and are searchable together. New contributions require embedding generation only for genuinely new name@language combinations; a lookup sketch follows.
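A minimal sketch of that deduplication step, using the elasticsearch-py client; the `name_lang` field, index name, and host follow the mapping sketch above and are assumptions.

```python
# Sketch: find which name@language keys from a new contribution are genuinely
# new, i.e. absent from toponymsindex, and therefore need embeddings generated.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def new_toponym_keys(candidate_keys: list[str]) -> list[str]:
    """Return the subset of 'name@lang' keys not already indexed."""
    # In practice, batch candidate_keys to stay under terms-query and size limits.
    resp = es.search(
        index="toponymsindex",
        query={"terms": {"name_lang": candidate_keys}},
        source=["name_lang"],
        size=len(candidate_keys),
    )
    existing = {hit["_source"]["name_lang"] for hit in resp["hits"]["hits"]}
    return [k for k in candidate_keys if k not in existing]
```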
15.1.6. Snapshot-Based Deployment
Zero-downtime updates through staged indexing (a deployment sketch follows the steps):
Build and validate indices on staging
Create snapshot to shared repository (/ix1)
Restore snapshot to production
Switch aliases atomically
Rollback capability via previous snapshots
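A minimal sketch of the production-side restore and alias switch via the REST API; the host, repository, snapshot, version suffix, and alias names are illustrative assumptions.

```python
# Sketch: restore the staged snapshot into versioned index names, then repoint
# aliases in one request so queries never see a gap.
import requests

PROD_ES = "http://localhost:9200"
REPO, SNAPSHOT, VERSION = "whg_snapshots", "toponyms-build-01", "v2"

# Restore into renamed indices so the currently live ones keep serving.
requests.post(
    f"{PROD_ES}/_snapshot/{REPO}/{SNAPSHOT}/_restore",
    params={"wait_for_completion": "true"},
    json={
        "indices": "placesindex,toponymsindex",
        "rename_pattern": "(.+)",
        "rename_replacement": f"$1-{VERSION}",
    },
    timeout=None,
).raise_for_status()

# Atomic alias switch; assumes the aliases already exist from a prior deployment
# (on the very first rollout they would simply be added, not moved).
requests.post(
    f"{PROD_ES}/_aliases",
    json={"actions": [
        {"remove": {"index": "*", "alias": "places"}},
        {"add": {"index": f"placesindex-{VERSION}", "alias": "places"}},
        {"remove": {"index": "*", "alias": "toponyms"}},
        {"add": {"index": f"toponymsindex-{VERSION}", "alias": "toponyms"}},
    ]},
    timeout=60,
).raise_for_status()
```

Rolling back is the same procedure run against an earlier snapshot and version suffix.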
15.1.7. Graceful Degradation
Multi-stage search with fallbacks (a query sketch follows the list):
Vector kNN search (primary)
Fuzzy text search (fallback)
Completion suggester (autocomplete)
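The sketch below shows the first two stages of that chain with the elasticsearch-py client: kNN over toponym embeddings, falling back to fuzzy text matching when vector search fails or returns nothing. The `toponyms` alias, field names, and the `embed()` callable (the exported ONNX encoder at query time) are assumptions; the completion suggester serves autocomplete separately.

```python
# Sketch: vector-first search with a fuzzy-text fallback.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def search_toponyms(text: str, embed, k: int = 10) -> list[dict]:
    try:
        resp = es.search(
            index="toponyms",
            knn={
                "field": "embedding",
                "query_vector": embed(text),   # 128-d vector from the ONNX encoder
                "k": k,
                "num_candidates": 10 * k,
            },
        )
        hits = resp["hits"]["hits"]
        if hits:
            return hits
    except Exception:
        pass  # degrade to text search rather than fail the request

    # Fuzzy fallback when vector search is unavailable or empty.
    resp = es.search(
        index="toponyms",
        query={"match": {"name": {"query": text, "fuzziness": "AUTO"}}},
        size=k,
    )
    return resp["hits"]["hits"]
```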
15.2. Technology Stack

| Component | Technology |
|---|---|
| Search engine | Elasticsearch 9.x |
| Vector search | HNSW via dense_vector |
| Training data prep | Epitran, PanPhon |
| Embedding model | PyTorch Siamese BiLSTM |
| Inference runtime | ONNX (VM), PyTorch+CUDA (compute) |
| Storage | Pitt CRC /ix1, /ix3 |
15.3. Storage Requirements

| System | Allocation | Purpose |
|---|---|---|
| /ix3 (flash) | 750GB - 1TB | Production Elasticsearch data |
| /ix1 (bulk) | 1TB | Authority files, snapshots, training data |
| Local NVMe scratch | ~870GB available | Staging ES (per Slurm job) |
15.4. Timeline

| Phase | Duration | Deliverables |
|---|---|---|
| Infrastructure | 2 weeks | Storage, Elasticsearch |
| Core indexing | 4 weeks | All authority sources indexed |
| Model training | 4 weeks | Training data, Siamese BiLSTM model |
| Embedding generation | 4 weeks | Embeddings for all toponyms |
| Query integration | 4 weeks | Search endpoints |
| Production rollout | 2 weeks | Live system |
Total: ~20 weeks from infrastructure provisioning to production
15.5. References
WHG Place Discussion #81 - Original phonetic search proposal
Epitran - Grapheme-to-phoneme library (training data preparation)
PanPhon - Phonological feature vectors for IPA segments
Elasticsearch kNN Search - Vector search documentation
Siamese Networks for One-Shot Learning - Foundational paper