3. WHG Elasticsearch Infrastructure: Technical Summary¶
3.1. Overview¶
The World Historical Gazetteer (WHG) Elasticsearch deployment is designed to index and serve place data from multiple authority sources, supporting both text-based and phonetic similarity search across approximately 40 million places and 80 million unique toponyms.
This document summarises the infrastructure architecture, storage requirements, and deployment strategy.
3.2. Architecture¶
3.2.1. Two-Instance Deployment¶
The system uses two Elasticsearch instances to separate indexing from query serving:
| Instance | Location | Purpose | Lifecycle |
|---|---|---|---|
| Production | VM on /ix3 (flash) | Live query serving, alias management | Persistent |
| Staging | Slurm worker | Bulk indexing, embedding enrichment | Ephemeral (per-job) |
The staging instance is ephemeral: it is spun up at the start of a Slurm job, runs for the duration of indexing operations, and is automatically cleaned up when the job completes. Only the snapshots written to /ix1 persist.
Both instances serve the same two primary indices:
places: Core place records with geometry, type classifications, and cross-references
toponyms: Unique name@language combinations with phonetic embeddings (deduplicated across all places)
3.2.2. Rationale for Separation¶
Bulk indexing of tens of millions of records consumes significant CPU, memory, and I/O. Running this workload on the production VM would:
Degrade query latency during indexing
Risk VM unresponsiveness under heavy load
Prevent pre-production validation
The ephemeral staging instance on a Slurm worker handles the heavy lifting. It exists only for the duration of the indexing job, with snapshots providing a clean transfer mechanism to production.
3.2.3. Storage Tiers¶
Storage is distributed across systems optimised for different I/O patterns:
| System | Purpose | Characteristics |
|---|---|---|
| /ix3 (flash) | Production Elasticsearch data | High IOPS, low latency random access |
| Local NVMe scratch | Staging Elasticsearch data | ~870 GB, NVMe speeds, per-job ephemeral |
| /ix1 (bulk) | Authority files, snapshots | Sequential I/O, high capacity, shared |
Staging ES uses local NVMe scratch ($SLURM_SCRATCH) for fast indexing I/O, while snapshots are written to /ix1 for transfer to production.
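For illustration, the staging node's data and repository paths can be supplied at launch time. The sketch below assumes a tarball installation and single-node operation; the install location, heap size, and PID-file handling are assumptions, while the repository path matches the configuration in 3.9.2.
# Sketch: launch the ephemeral staging node inside a Slurm job, keeping index
# data on local scratch and pointing the snapshot repository path at /ix1.
# ES_HOME and heap size are assumptions; adjust to the actual installation.
export ES_HOME=/opt/elasticsearch          # hypothetical install location
export ES_JAVA_OPTS="-Xms16g -Xmx16g"      # heap sized to the Slurm allocation

"$ES_HOME/bin/elasticsearch" \
  -E path.data="$SLURM_SCRATCH/es-data" \
  -E path.logs="$SLURM_SCRATCH/es-logs" \
  -E path.repo=/ix1/whcdh/es/repo \
  -E discovery.type=single-node \
  -d -p "$SLURM_SCRATCH/es.pid"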
3.4. WHG-Contributed Datasets¶
In addition to authority files, the system indexes scholarly datasets contributed by WHG users and partner projects:
| Source | Est. Places |
|---|---|
| WHG contributions | ~200,000 |
These represent the core research content of WHG and are indexed alongside authority data, enabling contributed places to be matched against the full authority corpus.
Combined totals: ~39.2 million places, ~80 million unique toponyms
Note: The toponym count reflects unique name@language combinations. Many toponyms (e.g., “London@en”) are shared by multiple places across different authority sources.
3.5. Phonetic Search¶
Toponyms are enriched with phonetic embeddings to support fuzzy name matching across languages and historical spelling variants.
3.5.1. Siamese BiLSTM Embeddings¶
Dimensions: 128
Method: Character-level bidirectional LSTM with Siamese training
Index type: HNSW (Hierarchical Navigable Small World)
Similarity: Cosine
The Siamese BiLSTM model learns phonetic similarity directly from pairs of equivalent toponyms (e.g., cross-lingual equivalents from Wikidata). This approach generalises across scripts and languages where IPA transcription models are unavailable.
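For context, a phonetic lookup against these embeddings can use Elasticsearch's k-NN search. The sketch below is illustrative only: the query vector (truncated here) would be produced by running the BiLSTM encoder over the search string, and the k / num_candidates values are assumptions.
# Sketch: k-NN query over the 128-d BiLSTM embeddings. The query_vector shown
# is truncated; a real request supplies all 128 values from the encoder.
curl -X POST "localhost:9200/toponyms/_search?pretty" \
  -H 'Content-Type: application/json' -d '{
  "knn": {
    "field": "embedding_bilstm",
    "query_vector": [0.12, -0.03, 0.27],
    "k": 10,
    "num_candidates": 200
  },
  "_source": ["name", "lang"]
}'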
3.5.2. Embedding Generation¶
Embeddings are generated once per unique toponym on the staging instance during indexing:
Authority files: Batch processing on Slurm worker (GPU-accelerated if available)
WHG contributions: New unique toponyms processed on staging or production depending on volume
The trained Siamese BiLSTM encoder is deployed to both instances. This deduplication approach means new contributions only require embedding generation for genuinely new name@language combinations.
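As an illustration of the deduplication check, the _mget API can report which name@language keys already exist before any encoding is queued. The IDs below are examples; only "London@en" appears elsewhere in this document.
# Sketch: check which toponym_ids already exist; documents returned with
# "found": false are the genuinely new combinations that need embeddings.
curl -X GET "localhost:9200/toponyms/_mget?_source=false&pretty" \
  -H 'Content-Type: application/json' -d '{
  "ids": ["London@en", "Londres@fr", "Londra@it"]
}'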
3.6. Index Schemas¶
3.6.1. Places Index¶
{
"place_id": "keyword",
"namespace": "keyword",
"label": "text",
"toponyms": "keyword[]",
"ccodes": "keyword[]",
"locations": [{
"geometry": "geo_shape",
"rep_point": "geo_point",
"timespans": [{ "start": "integer", "end": "integer" }]
}],
"types": [{
"identifier": "keyword",
"label": "keyword",
"sourceLabel": "keyword"
}],
"relations": [{
"relationType": "keyword",
"relationTo": "keyword",
"certainty": "float"
}]
}
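The block above is a simplified field summary rather than a literal mapping. A hedged sketch of how the spatial fields could be mapped is shown below; treating locations as a nested type is an assumption, not something stated in this document.
# Sketch only: possible mapping for the spatial portion of the places index.
curl -X PUT "localhost:9200/places_v2" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "place_id":  { "type": "keyword" },
      "namespace": { "type": "keyword" },
      "label":     { "type": "text" },
      "toponyms":  { "type": "keyword" },
      "ccodes":    { "type": "keyword" },
      "locations": {
        "type": "nested",
        "properties": {
          "geometry":  { "type": "geo_shape" },
          "rep_point": { "type": "geo_point" },
          "timespans": {
            "properties": {
              "start": { "type": "integer" },
              "end":   { "type": "integer" }
            }
          }
        }
      }
    }
  }
}'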
3.6.2. Toponyms Index¶
Each document represents a unique name@language combination:
{
"toponym_id": "keyword",
"name": "text",
"name_lower": "keyword",
"lang": "keyword",
"embedding_bilstm": "dense_vector[128]",
"suggest": "completion"
}
The toponym_id (e.g., “London@en”) serves as both the document ID and the join key referenced by places.
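A corresponding mapping sketch for the toponyms index is shown below, assuming Elasticsearch 8.x dense_vector support; the HNSW parameters (m, ef_construction) are illustrative defaults rather than values from this deployment.
# Sketch only: possible toponyms mapping with an HNSW-indexed dense vector.
curl -X PUT "localhost:9200/toponyms_v2" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "toponym_id": { "type": "keyword" },
      "name":       { "type": "text" },
      "name_lower": { "type": "keyword" },
      "lang":       { "type": "keyword" },
      "embedding_bilstm": {
        "type": "dense_vector",
        "dims": 128,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "hnsw", "m": 16, "ef_construction": 100 }
      },
      "suggest": { "type": "completion" }
    }
  }
}'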
3.7. Deployment Strategy¶
3.7.1. Snapshot-Based Transfer¶
New indices are built on staging and transferred to production via snapshots:
1. Staging (Slurm worker):
- Create versioned indices (places_v2, toponyms_v2)
- Ingest and enrich data
- Validate document counts and sample queries
- Create snapshot to /ix1
2. Production (VM):
- Restore snapshot from /ix1
- Validate restored indices (see the validation sketch below)
- Switch aliases atomically
- Delete old indices after confirmation
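The validation step on production can be scripted before the alias switch. A hedged sketch (the spot-check query and expected counts are illustrative):
# Sketch: pre-switch validation of the restored indices.
curl -s "localhost:9200/places_v2/_count?pretty"
curl -s "localhost:9200/toponyms_v2/_count?pretty"

# Spot-check a known toponym before promoting the new indices.
curl -s "localhost:9200/toponyms_v2/_search?q=name:London&size=3&pretty"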
3.7.2. Alias-Based Zero-Downtime Switching¶
Production uses versioned indices with alias switching:
Steady state:
places (alias) → places_v1 (index)
toponyms (alias) → toponyms_v1 (index)
After restore and switch:
places (alias) → places_v2 (restored from snapshot)
toponyms (alias) → toponyms_v2 (restored from snapshot)
places_v1, toponyms_v1 retained for rollback (7 days)
Switchover:
places (alias) → places_v2 (atomic)
places_v1 (deleted after verification)
This approach:
Protects production from indexing workload
Allows validation queries against new indices before promotion
Provides rollback via previous snapshots
Requires temporary 2× index storage only on production during alias transition
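Because the _v1 indices are retained for seven days, rollback is itself an atomic alias action; a sketch mirroring the command in 3.11.3:
# Sketch: roll back to the previous generation while places_v1 is retained.
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d '{
  "actions": [
    { "remove": { "index": "places_v2", "alias": "places" }},
    { "add":    { "index": "places_v1", "alias": "places" }}
  ]
}'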
3.8. Storage Requirements¶
3.8.1. /ix3 (Flash Storage) — Production Elasticsearch Data¶
| Component | Size |
|---|---|
| Places index | 55 GB |
| Toponyms index (with BiLSTM embeddings) | 160 GB |
| Transition overhead (old + new indices) | 215 GB |
| Logs, Kibana data | 7 GB |
| Peak (during alias transition) | 437 GB |
| Steady state | 222 GB |
Recommended allocation: 750 GB – 1 TB
The additional headroom accommodates:
Future authority source additions
Embedding strategy expansion
Index optimisation overhead
Operational flexibility
3.8.2. Staging Storage — Slurm Worker (Ephemeral)¶
| Component | Size |
|---|---|
| Available per job | ~870 GB |
| Places index (building) | 55 GB |
| Toponyms index (building) | 160 GB |
| Working space | 30 GB |
| Required | ~250 GB |
Staging uses local NVMe scratch at $SLURM_SCRATCH (mapped to /scratch/slurm-$SLURM_JOB_ID):
NVMe speeds for fast indexing I/O
Automatically provisioned when Slurm job starts
Automatically destroyed when job ends (indices, logs, and all local data)
No contention with other NFS users
The staging Elasticsearch instance is ephemeral by design. Snapshots must be written to the shared /ix1 repository before the job completes to preserve the indexed data for transfer to production.
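A quick status check before the job exits confirms the snapshot is safely on /ix1; for example:
# Confirm the snapshot reached SUCCESS before scratch-resident indices are lost.
curl -s "localhost:9200/_cat/snapshots/whg_backup?v&h=id,status,duration,total_shards"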
3.9. Snapshot Strategy¶
Snapshots are stored on /ix1 using Elasticsearch’s native snapshot mechanism. The repository uses incremental snapshots to minimise storage growth between full reindexing operations.
3.9.1. Retention Policy¶
| Type | Schedule | Retention | Purpose |
|---|---|---|---|
| Daily | Automatic | 7 days rolling | Rapid recovery from corruption |
| Weekly | Automatic | 4 weeks rolling | Medium-term rollback |
| Monthly | Manual | 6 months | Long-term archive |
| Pre-deployment | Before alias switch | 2 per index | Deployment rollback point |
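The automatic daily and weekly tiers could be driven by Elasticsearch snapshot lifecycle management (SLM). A hedged sketch of the daily policy follows; the policy name, schedule time, and min/max counts are illustrative, while the 7-day retention mirrors the table above and the repository is the one registered in 3.9.2.
# Sketch: SLM policy for the automatic daily tier (policy name is illustrative).
curl -X PUT "localhost:9200/_slm/policy/whg-daily?pretty" \
  -H 'Content-Type: application/json' -d '{
  "schedule": "0 30 1 * * ?",
  "name": "<whg-daily-{now/d}>",
  "repository": "whg_backup",
  "config": { "indices": ["places", "toponyms"], "include_global_state": false },
  "retention": { "expire_after": "7d", "min_count": 1, "max_count": 7 }
}'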
3.9.2. Repository Configuration¶
# Primary snapshot repository
path.repo: /ix1/whcdh/es/repo
# Repository registration
PUT _snapshot/whg_backup
{
"type": "fs",
"settings": {
"location": "/ix1/whcdh/es/repo/backup"
}
}
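After registration, the repository can be verified to confirm the node can read and write the /ix1 path:
# Verify that the registered repository is accessible from the node
curl -X POST "localhost:9200/_snapshot/whg_backup/_verify?pretty"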
3.10. Resource Summary¶
| Resource | Allocation | System |
|---|---|---|
| Production Elasticsearch | 750 GB – 1 TB | /ix3 (flash) |
| Staging Elasticsearch | ~250 GB (of ~870 GB available) | Local NVMe scratch |
| Authority files + snapshots | 1 TB | /ix1 (bulk) |
| Total persistent storage | 1.75 – 2 TB | |
Note: Staging storage is ephemeral (per Slurm job) and does not require permanent allocation.
3.11. Operational Commands¶
3.11.1. Index Management¶
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Check index sizes
curl -X GET "localhost:9200/_cat/indices?v&h=index,docs.count,store.size"
# Check alias mappings
curl -X GET "localhost:9200/_cat/aliases?v"
3.11.2. Snapshot Operations¶
# Create manual snapshot
curl -X PUT "localhost:9200/_snapshot/whg_backup/snapshot_$(date +%Y%m%d)?wait_for_completion=true" \
-H 'Content-Type: application/json' -d '{
"indices": "places,toponyms",
"ignore_unavailable": true,
"include_global_state": false
}'
# List snapshots
curl -X GET "localhost:9200/_snapshot/whg_backup/_all?pretty"
# Restore snapshot
curl -X POST "localhost:9200/_snapshot/whg_backup/snapshot_name/_restore" \
-H 'Content-Type: application/json' -d '{
"indices": "places,toponyms",
"ignore_unavailable": true
}'
3.11.3. Alias Switching¶
# Atomic alias switch
curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d '{
"actions": [
{ "remove": { "index": "places_v1", "alias": "places" }},
{ "add": { "index": "places_v2", "alias": "places" }}
]
}'
3.12. Directory Structure¶
/ix1/whcdh/
├── elastic/ # Repository
│ ├── authorities/ # Authority ingestion scripts
│ ├── processing/ # Index management scripts
│ ├── schemas/ # Index mappings and pipelines
│ ├── scripts/ # Operational wrapper scripts
│ └── toponyms/ # Embedding generation scripts
├── data/
│ └── authorities/ # Downloaded authority files
│ ├── gn/ # GeoNames
│ ├── wd/ # Wikidata
│ ├── tgn/ # Getty TGN
│ └── ...
└── es/
└── repo/ # Snapshot repository
└── backup/
/ix3/whcdh/
└── es/
├── data/ # Elasticsearch data directory
└── logs/ # Elasticsearch logs