15. Summary

This architecture provides a scalable, multilingual, phonetic-aware search system for the World Historical Gazetteer. The system is designed to handle approximately 40 million places and 80 million unique toponyms with sub-100ms query latency.

15.1. Key Design Decisions

15.1.1. Two-Instance Architecture

Staging and production Elasticsearch instances are separated to protect query performance:

  • Production (VM, /ix3): Persistent; serves live queries; receives completed indices via snapshot restore

  • Staging (Slurm worker): Ephemeral; spun up for indexing jobs; destroyed when job completes

The staging instance exists only for the duration of indexing operations. Snapshots written to /ix1 before job completion are the sole mechanism for persisting indexed data.
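A minimal sketch of this persistence step follows, assuming a filesystem snapshot repository rooted under /ix1 and indices named places and toponyms; the repository, snapshot, and path names are placeholders, not the project's actual configuration.

    # Persist staging indices to the shared /ix1 snapshot repository before the
    # Slurm job ends. All names and paths below are illustrative assumptions.
    import requests

    ES = "http://localhost:9200"      # staging Elasticsearch on the Slurm worker (assumed)
    REPO = "whg_snapshots"            # assumed repository name
    SNAPSHOT = "authority-2025-01"    # assumed snapshot name

    # Register a filesystem repository on the shared /ix1 allocation. The same
    # location must appear in path.repo on both staging and production nodes.
    requests.put(
        f"{ES}/_snapshot/{REPO}",
        json={"type": "fs", "settings": {"location": "/ix1/whg/snapshots"}},
    ).raise_for_status()

    # Snapshot the two indices and block until completion, so the job does not
    # exit before the data is safely on /ix1.
    requests.put(
        f"{ES}/_snapshot/{REPO}/{SNAPSHOT}",
        params={"wait_for_completion": "true"},
        json={"indices": "places,toponyms", "include_global_state": False},
    ).raise_for_status()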

15.1.2. Storage Allocation

Storage is allocated by I/O requirements:

  • Flash storage (/ix3): 750GB - 1TB for production Elasticsearch indices

  • Bulk storage (/ix1): 1TB for authority files and snapshots

  • Local NVMe scratch: ~870GB for staging ES (automatically provisioned per Slurm job)

15.1.3. Two-Index Architecture

  • places index: Core gazetteer records with geometry and metadata; references toponyms by name@language

  • toponyms index: Unique name@language combinations with phonetic embeddings

The toponyms index stores each unique name@language combination once, regardless of how many places share it. This optimises embedding generation and storage.
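A minimal mapping sketch for the two indices follows; field names are illustrative assumptions, and only the places/toponyms split and the 128-dimensional dense_vector reflect the design above.

    # Create the two indices with assumed field names. dense_vector with
    # index=true enables approximate kNN (HNSW) search in Elasticsearch.
    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    # toponyms: one document per unique name@language combination.
    requests.put(f"{ES}/toponyms", json={
        "mappings": {"properties": {
            "name":      {"type": "text"},
            "language":  {"type": "keyword"},
            "embedding": {"type": "dense_vector", "dims": 128,
                          "index": True, "similarity": "cosine"},
        }}
    }).raise_for_status()

    # places: core gazetteer records; toponyms referenced by their name@language key.
    requests.put(f"{ES}/places", json={
        "mappings": {"properties": {
            "title":    {"type": "text"},
            "toponyms": {"type": "keyword"},   # e.g. "Wien@de", "Vienna@en"
            "geometry": {"type": "geo_shape"},
        }}
    }).raise_for_status()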

15.1.4. Siamese BiLSTM Embeddings

A character-level bidirectional LSTM, trained in a Siamese configuration, generates 128-dimensional embedding vectors (see the sketch after this list):

  • Learns phonetic similarity from positive/negative toponym pairs

  • Processes text directly (no IPA required at query time)

  • Generalises across scripts and languages

  • Enables efficient kNN search via Elasticsearch HNSW
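A minimal PyTorch sketch of such an encoder in a Siamese setup follows; layer sizes, pooling, vocabulary handling, and the contrastive loss are illustrative assumptions, with only the character-level BiLSTM and the 128-dimensional output taken from the design above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToponymEncoder(nn.Module):
        """Character-level BiLSTM producing unit-length 128-d vectors (sketch)."""
        def __init__(self, vocab_size=4096, char_dim=64, hidden=128, out_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
            self.lstm = nn.LSTM(char_dim, hidden, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, out_dim)

        def forward(self, char_ids):                        # char_ids: (batch, max_len)
            states, _ = self.lstm(self.embed(char_ids))     # (batch, max_len, 2*hidden)
            pooled = states.mean(dim=1)                     # mean-pool over characters
            return F.normalize(self.proj(pooled), dim=-1)   # 128-d, unit length

    encoder = ToponymEncoder()  # one shared encoder: the "Siamese" part

    def contrastive_loss(anchor_ids, other_ids, label, margin=0.5):
        """label: 1.0 for phonetically similar pairs, 0.0 for dissimilar pairs."""
        dist = 1.0 - F.cosine_similarity(encoder(anchor_ids), encoder(other_ids))
        return (label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)).mean()

At query time the same encoder embeds the raw query string, which is why no IPA transcription is needed outside training-data preparation.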

15.1.5. Dual Data Sources

The system indexes both authority files and WHG-contributed datasets:

  Source              Places   Processing Location
  Authority files     ~39M     Staging (Slurm worker)
  WHG contributions   ~200K    Staging or production

Unique toponyms: ~80M (deduplicated across all sources)

Both sources share the same indices and are searchable together. New contributions require embedding generation only for genuinely new name@language combinations, as sketched below.
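One way this deduplication check could be implemented is sketched here, assuming a deterministic document ID derived from the name@language string; the ID scheme and the embed() callable are hypothetical.

    # Embed a contributed toponym only if its name@language combination is not
    # already in the toponyms index. ID scheme and embed() are assumptions.
    import hashlib
    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    def toponym_id(name: str, language: str) -> str:
        """Deterministic ID: the same name@language always maps to one document."""
        return hashlib.sha1(f"{name}@{language}".encode("utf-8")).hexdigest()

    def index_if_new(name: str, language: str, embed) -> bool:
        """Index the toponym and return True only if it was genuinely new."""
        doc_id = toponym_id(name, language)
        if requests.head(f"{ES}/toponyms/_doc/{doc_id}").status_code == 200:
            return False                       # embedding already exists; nothing to do
        requests.put(f"{ES}/toponyms/_doc/{doc_id}", json={
            "name": name,
            "language": language,
            "embedding": embed(name),          # list of 128 floats from the encoder
        }).raise_for_status()
        return True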

15.1.6. Snapshot-Based Deployment

Zero-downtime updates through staged indexing (steps 3-5 are sketched after the list):

  1. Build and validate indices on staging

  2. Create snapshot to shared repository (/ix1)

  3. Restore snapshot to production

  4. Switch aliases atomically

  5. Roll back, if needed, via previous snapshots
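A minimal sketch of steps 3-5, assuming the repository from the staging sketch above and date-suffixed index names; all names are placeholders.

    import requests

    ES = "http://localhost:9200"      # production Elasticsearch on the VM (assumed)
    REPO, SNAPSHOT, SUFFIX = "whg_snapshots", "authority-2025-01", "2025_01"

    # Step 3: restore from the shared /ix1 repository, renaming the incoming
    # indices so they do not collide with the currently live ones.
    requests.post(
        f"{ES}/_snapshot/{REPO}/{SNAPSHOT}/_restore",
        params={"wait_for_completion": "true"},
        json={
            "indices": "places,toponyms",
            "rename_pattern": "(.+)",
            "rename_replacement": f"$1_{SUFFIX}",
            "include_aliases": False,
        },
    ).raise_for_status()

    # Step 4: repoint the public aliases in one atomic call, so queries never
    # see a half-updated state. On later cycles, "remove" actions for the
    # previous suffix go in the same request; rollback (step 5) is the same
    # call pointed back at an earlier suffix.
    requests.post(f"{ES}/_aliases", json={"actions": [
        {"add": {"index": f"places_{SUFFIX}",   "alias": "places"}},
        {"add": {"index": f"toponyms_{SUFFIX}", "alias": "toponyms"}},
    ]}).raise_for_status()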

15.1.7. Graceful Degradation

Multi-stage search with fallbacks (a sketch of the first two stages follows the list):

  1. Vector kNN search (primary)

  2. Fuzzy text search (fallback)

  3. Completion suggester (autocomplete)
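The first two stages might be wired together as sketched below; field names, result sizes, and the fallback condition (an empty kNN result) are illustrative assumptions.

    # Stage 1: approximate kNN over the 128-d phonetic embeddings.
    # Stage 2: fuzzy text matching when vector search returns nothing usable.
    import requests

    ES = "http://localhost:9200"  # assumed Elasticsearch endpoint

    def search_toponyms(query_text, query_vector, k=10):
        # query_vector is produced by the embedding model at query time.
        knn = requests.post(f"{ES}/toponyms/_search", json={
            "knn": {"field": "embedding", "query_vector": query_vector,
                    "k": k, "num_candidates": 10 * k},
            "_source": ["name", "language"],
        }).json()
        hits = knn.get("hits", {}).get("hits", [])
        if hits:
            return hits

        fuzzy = requests.post(f"{ES}/toponyms/_search", json={
            "query": {"match": {"name": {"query": query_text, "fuzziness": "AUTO"}}},
            "size": k,
        }).json()
        return fuzzy.get("hits", {}).get("hits", [])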

15.2. Technology Stack

  Component            Technology
  Search engine        Elasticsearch 9.x
  Vector search        HNSW via dense_vector
  Training data prep   Epitran, PanPhon
  Embedding model      PyTorch Siamese BiLSTM
  Inference runtime    ONNX (VM), PyTorch + CUDA (compute)
  Storage              Pitt CRC /ix1, /ix3

15.3. Storage Requirements

  System               Allocation        Purpose
  /ix3 (flash)         750GB - 1TB       Production Elasticsearch data
  /ix1 (bulk)          1TB               Authority files, snapshots, training data
  Local NVMe scratch   ~870GB available  Staging ES (per Slurm job)

15.4. Timeline

  Phase                  Duration   Deliverables
  Infrastructure         2 weeks    Storage, Elasticsearch
  Core indexing          4 weeks    All authority sources indexed
  Model training         4 weeks    Training data, Siamese BiLSTM model
  Embedding generation   4 weeks    Embeddings for all toponyms
  Query integration      4 weeks    Search endpoints
  Production rollout     2 weeks    Live system

Total: ~20 weeks from infrastructure provisioning to production

15.5. References