15. Summary

This architecture provides a scalable, multilingual, phonetic-aware search system for the World Historical Gazetteer. The system is designed to handle approximately 40 million places and 80 million unique toponyms with sub-100ms query latency.

15.1. Key Design Decisions

15.1.1. Two-Instance Architecture

Staging and production Elasticsearch instances are separated to protect query performance:

  • Production (VM, /ix3): Persistent; serves live queries; receives completed indices via snapshot restore

  • Staging (Slurm worker): Ephemeral; spun up for indexing jobs; destroyed when job completes

The staging instance exists only for the duration of indexing operations. Snapshots written to /ix1 before job completion are the sole mechanism for persisting indexed data.
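A minimal sketch of this persistence step follows, assuming a filesystem snapshot repository rooted under /ix1 and indices named places and toponyms; the repository, snapshot, and path names are placeholders, not the project's actual configuration.

    # Persist staging indices to the shared /ix1 snapshot repository before the
    # Slurm job ends. All names and paths below are illustrative assumptions.
    import requests

    ES = "http://localhost:9200"      # staging Elasticsearch on the Slurm worker (assumed)
    REPO = "whg_snapshots"            # assumed repository name
    SNAPSHOT = "authority-2025-01"    # assumed snapshot name

    # Register a filesystem repository on the shared /ix1 allocation. The same
    # location must appear in path.repo on both staging and production nodes.
    requests.put(
        f"{ES}/_snapshot/{REPO}",
        json={"type": "fs", "settings": {"location": "/ix1/whg/snapshots"}},
    ).raise_for_status()

    # Snapshot the two indices and block until completion, so the job does not
    # exit before the data is safely on /ix1.
    requests.put(
        f"{ES}/_snapshot/{REPO}/{SNAPSHOT}",
        params={"wait_for_completion": "true"},
        json={"indices": "places,toponyms", "include_global_state": False},
    ).raise_for_status()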

15.1.2. Storage Allocation

Storage is allocated by I/O requirements:

  • Flash storage (/ix3): 750GB - 1TB for production Elasticsearch indices

  • Bulk storage (/ix1): 1TB for authority files and snapshots

  • Local NVMe scratch: ~870GB for staging ES (automatically provisioned per Slurm job)

15.1.3. Two-Index Architecture

  • places index: Core gazetteer records with geometry and metadata; references toponyms by name@language

  • toponyms index: Unique name@language combinations with phonetic embeddings

The toponyms index stores each unique name@language combination once, regardless of how many places share it. This optimises embedding generation and storage.
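A minimal mapping sketch for the two indices follows; field names are illustrative assumptions, and only the places/toponyms split and the 128-dimensional dense_vector reflect the design above.

    # Create the two indices with assumed field names. dense_vector with
    # index=true enables approximate kNN (HNSW) search in Elasticsearch.
    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    # toponyms: one document per unique name@language combination.
    requests.put(f"{ES}/toponyms", json={
        "mappings": {"properties": {
            "name":      {"type": "text"},
            "language":  {"type": "keyword"},
            "embedding": {"type": "dense_vector", "dims": 128,
                          "index": True, "similarity": "cosine"},
        }}
    }).raise_for_status()

    # places: core gazetteer records; toponyms referenced by their name@language key.
    requests.put(f"{ES}/places", json={
        "mappings": {"properties": {
            "title":    {"type": "text"},
            "toponyms": {"type": "keyword"},   # e.g. "Wien@de", "Vienna@en"
            "geometry": {"type": "geo_shape"},
        }}
    }).raise_for_status()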

15.1.4. Siamese BiLSTM Embeddings

A character-level bidirectional LSTM, trained in a Siamese configuration, generates 128-dimensional embedding vectors (see the sketch after this list):

  • Learns phonetic similarity from positive/negative toponym pairs

  • Processes text directly (no IPA required at query time)

  • Generalises across scripts and languages

  • Enables efficient kNN search via Elasticsearch HNSW
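A minimal PyTorch sketch of such an encoder in a Siamese setup follows; layer sizes, pooling, vocabulary handling, and the contrastive loss are illustrative assumptions, with only the character-level BiLSTM and the 128-dimensional output taken from the design above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToponymEncoder(nn.Module):
        """Character-level BiLSTM producing unit-length 128-d vectors (sketch)."""
        def __init__(self, vocab_size=4096, char_dim=64, hidden=128, out_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
            self.lstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, out_dim)

        def forward(self, char_ids):                        # char_ids: (batch, max_len)
            states, _ = self.lstm(self.embed(char_ids))     # (batch, max_len, 2*hidden)
            pooled = states.mean(dim=1)                     # mean-pool over characters
            return F.normalize(self.proj(pooled), dim=-1)   # 128-d, unit length

    encoder = ToponymEncoder()  # one shared encoder: the "Siamese" part

    def contrastive_loss(anchor_ids, other_ids, label, margin=0.5):
        """label: 1.0 for phonetically similar pairs, 0.0 for dissimilar pairs."""
        dist = 1.0 - F.cosine_similarity(encoder(anchor_ids), encoder(other_ids))
        return (label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)).mean()

At query time the same encoder embeds the raw query string, which is why no IPA transcription is needed outside training-data preparation.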

15.1.5. Dual Data Sources

The system indexes both authority files and WHG-contributed datasets:

  Source              Places   Processing Location
  Authority files     ~39M     Staging (Slurm worker)
  WHG contributions   ~200K    Staging or production

Unique toponyms: ~80M (deduplicated across all sources)

Both sources share the same indices and are searchable together. New contributions require embedding generation only for genuinely new name@language combinations, as sketched below.
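One way this deduplication check could be implemented is sketched here, assuming a deterministic document ID derived from the name@language string; the ID scheme and the embed() callable are hypothetical.

    # Embed a contributed toponym only if its name@language combination is not
    # already in the toponyms index. ID scheme and embed() are assumptions.
    import hashlib
    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    def toponym_id(name: str, language: str) -> str:
        """Deterministic ID: the same name@language always maps to one document."""
        return hashlib.sha1(f"{name}@{language}".encode("utf-8")).hexdigest()

    def index_if_new(name: str, language: str, embed) -> bool:
        """Index the toponym and return True only if it was genuinely new."""
        doc_id = toponym_id(name, language)
        if requests.head(f"{ES}/toponyms/_doc/{doc_id}").status_code == 200:
            return False                       # embedding already exists; nothing to do
        requests.put(f"{ES}/toponyms/_doc/{doc_id}", json={
            "name": name,
            "language": language,
            "embedding": embed(name),          # list of 128 floats from the encoder
        }).raise_for_status()
        return True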

15.1.6. Snapshot-Based Deployment

Zero-downtime updates through staged indexing (steps 3-5 are sketched after the list):

  1. Build and validate indices on staging

  2. Create snapshot to shared repository (/ix1)

  3. Restore snapshot to production

  4. Switch aliases atomically

  5. Roll back, if needed, via previous snapshots
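A minimal sketch of steps 3-5, assuming the repository from the staging sketch above and date-suffixed index names; all names are placeholders.

    import requests

    ES = "http://localhost:9200"      # production Elasticsearch on the VM (assumed)
    REPO, SNAPSHOT, SUFFIX = "whg_snapshots", "authority-2025-01", "2025_01"

    # Step 3: restore from the shared /ix1 repository, renaming the incoming
    # indices so they do not collide with the currently live ones.
    requests.post(
        f"{ES}/_snapshot/{REPO}/{SNAPSHOT}/_restore",
        params={"wait_for_completion": "true"},
        json={
            "indices": "places,toponyms",
            "rename_pattern": "(.+)",
            "rename_replacement": f"$1_{SUFFIX}",
            "include_aliases": False,
        },
    ).raise_for_status()

    # Step 4: repoint the public aliases in one atomic call, so queries never
    # see a half-updated state. On later cycles, "remove" actions for the
    # previous suffix go in the same request; rollback (step 5) is the same
    # call pointed back at an earlier suffix.
    requests.post(f"{ES}/_aliases", json={"actions": [
        {"add": {"index": f"places_{SUFFIX}",   "alias": "places"}},
        {"add": {"index": f"toponyms_{SUFFIX}", "alias": "toponyms"}},
    ]}).raise_for_status()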

15.1.7. Graceful Degradation

Multi-stage search with fallbacks (a sketch of the first two stages follows the list):

  1. Vector kNN search (primary)

  2. Fuzzy text search (fallback)

  3. Completion suggester (autocomplete)
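The first two stages might be wired together as sketched below; field names, result sizes, and the fallback condition (an empty kNN result) are illustrative assumptions.

    # Stage 1: approximate kNN over the 128-d phonetic embeddings.
    # Stage 2: fuzzy text matching when vector search returns nothing usable.
    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    def search_toponyms(query_text, query_vector, k=10):
        # query_vector is produced by the embedding model at query time.
        knn = requests.post(f"{ES}/toponyms/_search", json={
            "knn": {"field": "embedding", "query_vector": query_vector,
                    "k": k, "num_candidates": 10 * k},
            "_source": ["name", "language"],
        }).json()
        hits = knn.get("hits", {}).get("hits", [])
        if hits:
            return hits

        fuzzy = requests.post(f"{ES}/toponyms/_search", json={
            "query": {"match": {"name": {"query": query_text, "fuzziness": "AUTO"}}},
            "size": k,
        }).json()
        return fuzzy.get("hits", {}).get("hits", [])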

15.2. Technology Stack

  Component            Technology
  Search engine        Elasticsearch 9.x
  Vector search        HNSW via dense_vector
  Training data prep   Epitran, PanPhon
  Embedding model      PyTorch Siamese BiLSTM
  Inference runtime    ONNX (VM), PyTorch + CUDA (compute)
  Storage              Pitt CRC /ix1, /ix3

15.3. Storage Requirements

  System               Allocation        Purpose
  /ix3 (flash)         750GB - 1TB       Production Elasticsearch data
  /ix1 (bulk)          1TB               Authority files, snapshots, training data
  Local NVMe scratch   ~870GB available  Staging ES (per Slurm job)

15.4. Timeline

  Phase                  Duration   Deliverables
  Infrastructure         2 weeks    Storage, Elasticsearch
  Core indexing          4 weeks    All authority sources indexed
  Model training         4 weeks    Training data, Siamese BiLSTM model
  Embedding generation   4 weeks    Embeddings for all toponyms
  Query integration      4 weeks    Search endpoints
  Production rollout     2 weeks    Live system

Total: ~20 weeks from infrastructure provisioning to production

15.5. References