2. Elasticsearch Management Guide¶
This guide covers the complete lifecycle of the WHG Elasticsearch deployment at University of Pittsburgh CRC, from installation through daily operations.
2.2. Architecture Overview¶
The WHG Elasticsearch deployment uses a two-instance architecture to separate indexing from query serving:
| Component | Location | Purpose | Lifecycle |
|---|---|---|---|
| Production ES | VM (port 9200) | Live queries, persistent | Always running |
| Staging ES | Slurm compute node (port 9201) | Indexing operations, ephemeral | Per-job only |
| Kibana | VM (port 5601) | Dashboard and monitoring | Always running |
2.2.1. Key Design Principles¶
Isolation: Staging instance runs on Slurm workers to protect production VM from indexing workload spikes.
Ephemeral Staging: The staging instance exists only for the duration of indexing jobs. Data lives on fast local NVMe storage and is destroyed when the job ends. Snapshots provide the persistence mechanism.
Snapshot-Based Transfer: Completed indices are transferred from staging to production via snapshot/restore, enabling validation before production exposure and clean rollback capability.
2.2.2. Storage Tiers¶
| Mount | Type | Use | Characteristics |
|---|---|---|---|
| `/ix1` | Standard | Code, binaries, configs, snapshots, source data | Sequential I/O, high capacity |
| `/ix3` | Flash | Production ES data | High IOPS, low latency |
| `$SLURM_SCRATCH` | NVMe | Staging ES data | ~870GB, ephemeral, per-job |
2.3. Installation¶
2.3.1. Prerequisites¶
Access to Pitt CRC infrastructure:
- SSH access to `<user>@htc.crc.pitt.edu`
- Write permissions to `/ix1/ishi` and `/ix3/ishi`
2.3.2. First-Time Setup¶
# SSH to CRC
ssh <user>@htc.crc.pitt.edu
# Clone repository (only needed once)
mkdir -p /ix1/ishi
cd /ix1/ishi
git clone git@github.com:WorldHistoricalGazetteer/elastic.git
# Run installation
./elastic/scripts/es.sh -install
The installation script:
- Creates directory structure on `/ix1` and `/ix3`
- Downloads and installs Elasticsearch 9.0.0
- Downloads and installs Kibana 9.0.0
- Sets up JDK 21.0.1
- Makes the wrapper script executable
- Adds the `es` alias to `.bashrc`
2.3.3. Activate the Environment¶
# Activate the alias in current shell
source ~/.bashrc
# Verify installation
es -health
2.3.4. Updating¶
Pull latest code from the repository:
es -update
2.4. Configuration¶
All configuration is centralized in .env at the repository root:
# View current configuration
cat /ix1/ishi/elastic/.env
2.4.1. Key Variables¶
| Variable | Purpose | Default |
|---|---|---|
| | Base path for persistent storage | `/ix1/ishi` |
| | Base path for flash storage | `/ix3/ishi` |
| | Elasticsearch installation directory | `/ix1/ishi/es-bin` |
| | Kibana installation directory | `/ix1/ishi/kibana-bin` |
| | Java installation directory | `/ix1/ishi/jdk-21.0.1` |
| | Production ES endpoint | `http://localhost:9200` |
| | Staging ES port | `9201` |
| | Authority source files | `/ix1/ishi/data/authorities` |
| | Snapshot repository location | `/ix1/ishi/es/snapshots` |
| | Bulk indexing batch size | |
2.4.2. VM Resource Allocation¶
The VM has 32GB RAM allocated as follows:
| Resource | Allocation | Purpose |
|---|---|---|
| ES heap | 15g | JVM heap |
| Filesystem cache | ~15g | OS uses free RAM for caching index files |
| System/services | ~2g | OS, SSH, monitoring |
The 50/50 split between heap and filesystem cache follows standard ES guidance for optimal query performance.
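As a quick sanity check, the split can be computed. This is an illustrative helper, not something shipped in the repository (the deployment simply pins the heap at 15g):

```python
def plan_memory(total_gb: float, system_reserve_gb: float = 2.0) -> dict:
    """Split VM RAM between ES heap and filesystem cache.

    Follows common ES guidance: heap gets at most half of RAM, capped
    below ~31 GB so the JVM keeps compressed object pointers; the rest
    (minus a small system reserve) is left free for the OS page cache.
    """
    heap_gb = min(total_gb / 2, 31.0)
    return {
        "heap_gb": heap_gb,
        "fs_cache_gb": total_gb - heap_gb - system_reserve_gb,
        "system_gb": system_reserve_gb,
    }

print(plan_memory(32))  # {'heap_gb': 16.0, 'fs_cache_gb': 14.0, 'system_gb': 2.0}
```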
2.5. Storage Architecture¶
2.5.1. Directory Structure¶
/ix1/ishi/
├── es-bin/ # Elasticsearch installation
├── kibana-bin/ # Kibana installation
├── jdk-21.0.1/ # Java installation
├── elastic/ # Git repository
│ ├── .env # Environment configuration
│ ├── scripts/
│ │ └── es.sh # Management wrapper
│ ├── processing/
│ │ ├── es_staging.sbatch # Staging Slurm script
│ │ ├── settings.py # Python settings
│ │ ├── create_indices.py
│ │ ├── deploy_to_production.py
│ │ ├── fetch_authorities.py
│ │ └── ingest_all_authorities.py
│ ├── authorities/ # Authority ingestion scripts
│ ├── schemas/ # Index mappings and pipelines
│ └── toponyms/ # Embedding generation scripts
├── data/
│ └── authorities/ # Source data files
│ ├── gn/ # GeoNames
│ ├── wd/ # Wikidata
│ ├── tgn/ # Getty TGN
│ ├── pl/ # Pleiades
│ ├── gb/ # GB1900
│ ├── un/ # UN Countries
│ ├── osm/ # OpenStreetMap (optional)
│ ├── nl/ # Native Land
│ ├── dp/ # D-PLACE
│ ├── iv/ # Index Villaris
│ └── loc/ # Library of Congress
├── es/
│ ├── logs/ # Production ES logs
│ ├── es.pid # Production ES PID
│ ├── config/ # Production ES config
│ ├── staging-logs/ # Staging Slurm logs
│ └── snapshots/
│ ├── staging/ # Snapshots from staging
│ └── backup/ # Production backups
├── kibana/
│ ├── data/
│ ├── logs/
│ └── kb.pid
└── esinfo/
└── es-staging.env # Staging instance connection info
/ix3/ishi/
└── es/
└── data/ # Production ES data (flash)
$SLURM_SCRATCH/ # Per-job ephemeral
└── es-staging/
├── data/ # Staging ES data
├── logs/
└── config/
2.5.2. Storage Requirements¶
2.5.2.1. Production (/ix3 Flash Storage)¶
| Component | Size |
|---|---|
| Places index | ~55 GB |
| Toponyms index | ~160 GB |
| Working space | ~7 GB |
| Steady state | ~222 GB |
| Peak (during deployment) | ~437 GB |
Recommended allocation: 750 GB – 1 TB
Peak usage occurs during zero-downtime deployments when both old and new index versions coexist temporarily.
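The steady-state and peak figures follow directly from the index sizes; a back-of-envelope check using the numbers from the table above:

```python
# Approximate index sizes on /ix3, in GB (from the table above).
PLACES_GB, TOPONYMS_GB, WORKING_GB = 55, 160, 7

steady_gb = PLACES_GB + TOPONYMS_GB + WORKING_GB   # one live copy: ~222 GB

# During a zero-downtime deployment the restored copies coexist with the
# live ones until the alias switch, so both index sets count twice.
peak_gb = steady_gb + PLACES_GB + TOPONYMS_GB      # ~437 GB

print(steady_gb, peak_gb)  # 222 437
```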
2.5.2.2. Staging (Local NVMe Scratch)¶
| Component | Size |
|---|---|
| Available per job | ~870 GB |
| Places index (building) | ~55 GB |
| Toponyms index (building) | ~160 GB |
| Working space | ~30 GB |
| Required | ~250 GB |
Staging uses local NVMe at $SLURM_SCRATCH, automatically provisioned when Slurm jobs start and destroyed when they end.
2.6. Production Instance (VM)¶
The production Elasticsearch instance runs persistently on the VM at localhost:9200.
2.6.1. Starting Services¶
Start both Elasticsearch and Kibana:
es -start
Or start individually:
es es-start # Elasticsearch only
es kibana-start # Kibana only
2.6.2. Stopping Services¶
Stop both services:
es -stop
Or stop individually:
es es-stop # Elasticsearch only
es kibana-stop # Kibana only
2.6.3. Restarting Services¶
es -restart # Both services
es es-restart # Elasticsearch only
es kibana-restart # Kibana only
2.6.4. Service Status¶
Check if services are running:
# Check if processes are running
ps aux | grep elasticsearch
ps aux | grep kibana
# Or check PID files
cat /ix1/ishi/es/es.pid
cat /ix1/ishi/kibana/kb.pid
# Check cluster health
curl "http://localhost:9200/_cluster/health?pretty"
2.6.5. Access URLs¶
| Service | URL | Notes |
|---|---|---|
| Elasticsearch | `http://localhost:9200` | VM only |
| Kibana | `http://localhost:5601` | VM only |
For remote Kibana access, use SSH tunnel:
ssh -L 5602:localhost:5601 pitt
# Then access: http://localhost:5602
2.6.6. Production Logs¶
| Component | Location |
|---|---|
| Elasticsearch logs | `/ix1/ishi/es/logs/` |
| Kibana logs | `/ix1/ishi/kibana/logs/` |
View recent logs:
tail -f /ix1/ishi/es/logs/whg-production.log
tail -f /ix1/ishi/kibana/logs/kibana.log
2.7. Staging Instance (Slurm)¶
The staging Elasticsearch instance is ephemeral — it runs on a Slurm compute node for indexing operations only.
2.7.1. Key Characteristics¶
| Aspect | Detail |
|---|---|
| Instance count | One at a time (all jobs share it) |
| Port | 9201 (fixed) |
| Data storage | `$SLURM_SCRATCH/es-staging/data` |
| Snapshot storage | `/ix1/ishi/es/snapshots/staging/` |
| Max runtime | 48 hours |
| Lifecycle | Spun up at job start, destroyed at job end |
2.7.2. Starting Staging Instance¶
Important: run the command with `source` so it can export environment variables into your current shell.
# SSH to CRC login node
ssh <user>@htc.crc.pitt.edu
# Start staging instance
source /ix1/ishi/elastic/scripts/es.sh -staging-start
This will:
- Submit a Slurm job to a compute node
- Start Elasticsearch with data on local NVMe
- Register the staging snapshot repository
- Restore the latest snapshot (if one exists containing both indices), OR create empty indices using `create_indices.py`
- Export environment variables to your shell
2.7.3. Using Staging Instance¶
2.7.3.1. In the Current Shell¶
After starting, environment variables are automatically exported:
# Check connection
echo "ES available at: http://$ES_NODE:$ES_PORT"
# Query cluster health
curl -s "http://$ES_NODE:$ES_PORT/_cluster/health?pretty"
# Check indices
curl -s "http://$ES_NODE:$ES_PORT/_cat/indices?v"
2.7.3.2. In Other Shells or Scripts¶
Source the environment file:
source /ix1/ishi/esinfo/es-staging.env
# Now ES_NODE and ES_PORT are available
curl -s "http://$ES_NODE:$ES_PORT/_cluster/health?pretty"
Print out 10 random documents from the places index:
curl -s "http://$ES_NODE:$ES_PORT/places/_search?size=10&pretty" \
-H 'Content-Type: application/json' -d '{
"query": {
"function_score": {
"query": { "match_all": {} },
"random_score": {}
}
}
}'
Print out 10 random documents from the toponyms index:
curl -s "http://$ES_NODE:$ES_PORT/toponyms/_search?size=10&pretty" \
-H 'Content-Type: application/json' -d '{
"query": {
"function_score": {
"query": { "match_all": {} },
"random_score": {}
}
}
}'
2.7.3.3. In Slurm Batch Jobs¶
Jobs that index against staging should check that staging is running:
#!/bin/bash
#SBATCH --job-name=my-indexing-job
#SBATCH --time=4:00:00
#SBATCH --mem=16G
STAGING_ENV="/ix1/ishi/esinfo/es-staging.env"
if [ ! -f "$STAGING_ENV" ]; then
echo "ERROR: No staging ES instance running"
echo "Start one with: source /ix1/ishi/elastic/scripts/es.sh -staging-start"
exit 1
fi
source "$STAGING_ENV"
echo "Using staging ES at http://$ES_NODE:$ES_PORT"
# Your indexing commands here...
2.7.4. Checking Staging Status¶
# Full status report
source /ix1/ishi/elastic/scripts/es.sh -staging-status
# Health check only
source /ix1/ishi/elastic/scripts/es.sh -staging-health
# View recent logs
source /ix1/ishi/elastic/scripts/es.sh -staging-logs
2.7.5. Creating Snapshots¶
Critical: Snapshots must be created explicitly after completing work. They are not created automatically on shutdown.
2.7.5.1. Create a Checkpoint Snapshot¶
# Load staging connection info
source /ix1/ishi/esinfo/es-staging.env
# Create snapshot with timestamp
SNAPSHOT_NAME="checkpoint_$(date +%Y%m%d_%H%M%S)"
curl -X PUT "http://$ES_NODE:$ES_PORT/_snapshot/staging_repo/$SNAPSHOT_NAME?wait_for_completion=true" \
-H 'Content-Type: application/json' -d '{
"indices": "places,toponyms",
"ignore_unavailable": true,
"include_global_state": false
}'
2.7.5.2. List Existing Snapshots¶
source /ix1/ishi/esinfo/es-staging.env
curl -s "http://$ES_NODE:$ES_PORT/_snapshot/staging_repo/_all?pretty"
2.7.5.3. Delete Old Snapshots¶
source /ix1/ishi/esinfo/es-staging.env
curl -X DELETE "http://$ES_NODE:$ES_PORT/_snapshot/staging_repo/old_snapshot_name"
2.7.5.4. Delete All Snapshots¶
After tearing down staging, you can delete all snapshots to prevent them from being reloaded on restart:
rm -rf /ix1/ishi/es/snapshots/staging/*
2.7.6. Stopping Staging Instance¶
Warning: This destroys all data on local NVMe. Create snapshots first!
source /ix1/ishi/elastic/scripts/es.sh -staging-stop
This will:
Prompt for confirmation (data will be lost)
Cancel the Slurm job
Clean up the ephemeral data directory
Remove the environment file
2.7.7. Handling Job Timeouts¶
The staging instance has a 48-hour time limit. If your work might exceed this:
Break work into phases that complete within the time limit
Create explicit snapshots after each phase completes
Start a new staging instance — snapshots restore automatically
If the job times out mid-operation:
Uncommitted work on the local NVMe is lost
The last explicit snapshot is preserved
Start a new instance to continue from the last checkpoint
2.7.8. Staging Logs¶
Slurm job logs are stored persistently:
# List recent logs
ls -lt /ix1/ishi/es/staging-logs/*.out | head -5
# View specific job log
tail -100 /ix1/ishi/es/staging-logs/slurm-JOBID.out
2.9. Index Management¶
2.9.1. Index Schemas¶
The WHG system uses two primary indices:
2.9.1.1. Places Index¶
Core gazetteer records with geometry, types, and relations:
{
"place_id": "keyword", // e.g., "gn:2643743"
"namespace": "keyword", // e.g., "gn", "wd", "pl"
"label": "text", // Primary display name
"toponyms": "keyword[]", // References to toponyms index
"ccodes": "keyword[]", // ISO country codes
"locations": [{
"geometry": "geo_shape", // GeoJSON geometry
"rep_point": "geo_point", // Representative point
"timespans": [{
"start": "integer", // Temporal validity
"end": "integer"
}]
}],
"types": [{
"identifier": "keyword",
"label": "keyword",
"sourceLabel": "keyword"
}],
"relations": [{
"relationType": "keyword", // e.g., "sameAs", "partOf"
"relationTo": "keyword", // Target place_id
"certainty": "float",
"method": "keyword"
}]
}
2.9.1.2. Toponyms Index¶
Unique name@language combinations with phonetic embeddings:
{
"place_id": "keyword", // Parent place reference
"name": "text", // The toponym
"name_lower": "keyword", // Lowercase for exact match
"lang": "keyword", // ISO 639 language code
"embedding_bilstm": "dense_vector[128]", // Phonetic embedding
"suggest": "completion" // Autocomplete
}
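The field lists above are shorthand, not literal mappings. For orientation, documents shaped to match might look like this (all values hypothetical except the field names):

```python
place_doc = {
    "place_id": "gn:2643743",                 # namespace-prefixed identifier
    "namespace": "gn",                        # derived by the ingest pipeline
    "label": "London",
    "toponyms": ["London@en", "Londres@fr"],  # references into toponyms index
    "ccodes": ["GB"],
    "locations": [{
        "geometry": {"type": "Point", "coordinates": [-0.1276, 51.5072]},
        "rep_point": {"lat": 51.5072, "lon": -0.1276},
        "timespans": [{"start": 43, "end": 2024}],
    }],
    "types": [{"identifier": "gn:PPLC", "label": "capital city",
               "sourceLabel": "PPLC"}],
    "relations": [{"relationType": "partOf", "relationTo": "gn:6269131",
                   "certainty": 1.0, "method": "source"}],
}

toponym_doc = {
    "place_id": "gn:2643743",                 # parent place reference
    "name": "London",
    "name_lower": "london",
    "lang": "en",
    # embedding_bilstm (128-dim vector) and suggest omitted for brevity
}
```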
2.9.2. Index Settings: Staging vs Production¶
Indices are created with settings optimized for their use case:
| Setting | Staging (Indexing) | Production (Queries) |
|---|---|---|
| `refresh_interval` | `-1` | `1s` |
| `translog.durability` | `async` | `request` |
| `translog.flush_threshold_size` | `1gb` | `512mb` |
| | | |
| | | |
2.9.3. Ingest Pipelines¶
2.9.3.1. Places Pipeline (extract_namespace)¶
Extracts namespace from place_id and sets indexed_at:
{
"processors": [
{
"script": {
"source": "if (ctx.place_id != null && ctx.place_id.contains(':')) { ctx.namespace = ctx.place_id.splitOnToken(':')[0]; }"
}
},
{
"set": {
"field": "indexed_at",
"value": "{{_ingest.timestamp}}"
}
}
]
}
2.9.3.2. Toponyms Pipeline (extract_language)¶
Parses toponym@lang format into separate fields:
{
"processors": [
{
"script": {
"source": "if (ctx.name != null && ctx.name.contains('@')) { String[] parts = ctx.name.splitOnToken('@'); ctx.name = parts[0]; ctx.name_lower = parts[0].toLowerCase(); ctx.lang = parts[1]; }"
}
}
]
}
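Both Painless scripts are simple enough to mirror in Python, which is handy for checking document transforms offline before they reach an ingest pipeline. These are illustrative helpers, not repository code; `rsplit('@', 1)` assumes a single meaningful `@`, matching the `parts[1]` access above:

```python
from datetime import datetime, timezone

def extract_namespace(doc: dict) -> dict:
    """Mirror of the places pipeline: derive namespace, stamp indexed_at."""
    pid = doc.get("place_id")
    if pid and ":" in pid:
        doc["namespace"] = pid.split(":", 1)[0]
    doc["indexed_at"] = datetime.now(timezone.utc).isoformat()
    return doc

def extract_language(doc: dict) -> dict:
    """Mirror of the toponyms pipeline: split 'name@lang' into fields."""
    name = doc.get("name")
    if name and "@" in name:
        base, lang = name.rsplit("@", 1)
        doc["name"], doc["name_lower"], doc["lang"] = base, base.lower(), lang
    return doc

print(extract_namespace({"place_id": "gn:2643743"})["namespace"])  # gn
print(extract_language({"name": "Londres@fr"}))
```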
2.9.4. Viewing Index Information¶
# List all indices
curl "http://localhost:9200/_cat/indices?v"
# Get index mapping
curl "http://localhost:9200/places/_mapping?pretty"
curl "http://localhost:9200/toponyms/_mapping?pretty"
# Get index settings
curl "http://localhost:9200/places/_settings?pretty"
# Check document count by namespace
for ns in gn wd pl tgn gb un; do
count=$(curl -s "http://localhost:9200/places/_count" \
-H 'Content-Type: application/json' \
-d "{\"query\": {\"prefix\": {\"place_id\": \"$ns:\"}}}" | jq .count)
echo "$ns: $count"
done
2.10. Snapshot Management¶
Snapshots are the primary mechanism for:
Transferring indices from staging to production
Backup and disaster recovery
Index versioning and rollback
2.10.1. Snapshot Repository¶
Snapshots are stored on /ix1/ishi/es/snapshots/:
# Repository structure
/ix1/ishi/es/snapshots/
├── staging/ # Staging snapshots (transfer to production)
└── backup/ # Production backups
2.10.2. Creating Snapshots¶
2.10.2.1. From Staging¶
source /ix1/ishi/esinfo/es-staging.env
# Create snapshot
SNAPSHOT_NAME="complete_$(date +%Y%m%d)"
curl -X PUT "http://$ES_NODE:$ES_PORT/_snapshot/staging_repo/$SNAPSHOT_NAME?wait_for_completion=true" \
-H 'Content-Type: application/json' -d '{
"indices": "places,toponyms",
"ignore_unavailable": true,
"include_global_state": false
}'
2.10.2.2. From Production¶
# Create backup snapshot
SNAPSHOT_NAME="backup_$(date +%Y%m%d)"
curl -X PUT "http://localhost:9200/_snapshot/backup_repo/$SNAPSHOT_NAME?wait_for_completion=true" \
-H 'Content-Type: application/json' -d '{
"indices": "places,toponyms",
"ignore_unavailable": true,
"include_global_state": false
}'
2.10.3. Listing Snapshots¶
# List all snapshots in staging repository
curl -s "http://localhost:9200/_snapshot/staging_repo/_all?pretty"
# List all snapshots in backup repository
curl -s "http://localhost:9200/_snapshot/backup_repo/_all?pretty"
# Get details of specific snapshot
curl -s "http://localhost:9200/_snapshot/staging_repo/complete_20241216?pretty"
2.10.4. Deleting Snapshots¶
# Delete specific snapshot
curl -X DELETE "http://localhost:9200/_snapshot/staging_repo/old_snapshot_name"
# Delete snapshots older than 30 days (script this)
# List snapshots, parse dates, delete old ones
2.10.5. Retention Policy¶
| Type | Schedule | Retention | Purpose |
|---|---|---|---|
| Checkpoint | After each authority | 7 days rolling | Progress preservation |
| Daily | Automatic | 7 days rolling | Rapid recovery |
| Weekly | Automatic | 4 weeks rolling | Medium-term rollback |
| Pre-deployment | Before alias switch | 2 per index | Deployment rollback |
| Monthly | Manual | 6 months | Long-term archive |
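The rolling windows lend themselves to a small pruning helper. This sketches the policy logic only; in practice the snapshot names and creation dates would come from the `_snapshot` API:

```python
from datetime import date

# Rolling windows in days, per the retention table above. Pre-deployment
# ("2 per index") and monthly ("manual") snapshots are never auto-pruned.
RETENTION_DAYS = {"checkpoint": 7, "daily": 7, "weekly": 28}

def expired(snapshots, today):
    """Return names of snapshots older than their type's rolling window.

    `snapshots` is an iterable of (name, type, creation_date) tuples.
    """
    return [
        name
        for name, kind, created in snapshots
        if kind in RETENTION_DAYS and (today - created).days > RETENTION_DAYS[kind]
    ]

snaps = [
    ("checkpoint_20241201", "checkpoint", date(2024, 12, 1)),    # 15 days old
    ("daily_20241215", "daily", date(2024, 12, 15)),             # 1 day old
    ("backup_pre_deploy", "pre-deployment", date(2024, 11, 1)),  # always kept
]
print(expired(snaps, today=date(2024, 12, 16)))  # ['checkpoint_20241201']
```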
2.11. Production Deployment¶
After completing all indexing and creating a final snapshot on staging, deploy to production.
2.11.1. Deployment Process¶
Important: Run this on the VM, not the CRC login node.
2.11.1.1. 1. Stop Staging Instance¶
source /ix1/ishi/elastic/scripts/es.sh -staging-stop
2.11.1.2. 2. Deploy to Production¶
cd /ix1/ishi/elastic
python -m processing.deploy_to_production
This script will:
- Find the latest staging snapshot
- Restore to new timestamped indices (e.g., `places_20241216`, `toponyms_20241216`)
- Reconfigure index settings for production queries:
  - `refresh_interval`: `-1` → `1s` (enable near real-time search)
  - `translog.durability`: `async` → `request` (data safety)
  - `translog.flush_threshold_size`: `1gb` → `512mb` (bounded recovery)
- Run force merge to 1 segment per shard (~30-60 minutes, for query optimization)
- Atomically switch aliases (`places`, `toponyms`) to new indices
- Optionally clean up old indices
2.11.1.3. 3. Verify Production¶
# Check indices
curl -s "http://localhost:9200/_cat/indices?v"
# Verify document counts
curl -s "http://localhost:9200/places/_count?pretty"
curl -s "http://localhost:9200/toponyms/_count?pretty"
# Check aliases
curl -s "http://localhost:9200/_cat/aliases?v"
# Test a sample query
curl -s "http://localhost:9200/places/_search?q=label:London&size=5&pretty"
2.11.2. Zero-Downtime Deployment¶
The deployment uses versioned indices with alias switching:
Steady state:
places (alias) → places_20241201 (index)
toponyms (alias) → toponyms_20241201 (index)
During restore:
places (alias) → places_20241201 (still serving queries)
toponyms (alias) → toponyms_20241201
[places_20241216, toponyms_20241216 being restored]
After validation and alias switch:
places (alias) → places_20241216 (atomic switch)
toponyms (alias) → toponyms_20241216
[places_20241201, toponyms_20241201 retained for rollback]
After verification:
places_20241201, toponyms_20241201 deleted
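The switch is atomic because the removes and adds travel in a single `_aliases` call. A small helper that builds that request body (a hypothetical sketch; the real logic lives in `deploy_to_production.py`):

```python
import json

def alias_swap_actions(timestamp, aliases=("places", "toponyms")):
    """Build an _aliases body that retargets every alias in one atomic call.

    The wildcard remove detaches whichever versioned index currently holds
    each alias; the add attaches the freshly restored one.
    """
    actions = []
    for alias in aliases:
        actions.append({"remove": {"index": f"{alias}_*", "alias": alias}})
        actions.append({"add": {"index": f"{alias}_{timestamp}", "alias": alias}})
    return {"actions": actions}

body = alias_swap_actions("20241216")
print(json.dumps(body, indent=2))
```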
2.11.3. Manual Deployment Steps¶
If you need more control than the automated script provides:
2.11.3.1. 1. List Available Snapshots¶
curl -s "http://localhost:9200/_snapshot/staging_repo/_all?pretty" | \
python3 -c "import sys,json; [print(s['snapshot'], s['state'], s['start_time']) for s in json.load(sys.stdin)['snapshots']]"
2.11.3.2. 2. Restore Snapshot¶
SNAPSHOT_NAME="complete_20241216"
TIMESTAMP=$(date +%Y%m%d)
curl -X POST "http://localhost:9200/_snapshot/staging_repo/$SNAPSHOT_NAME/_restore?wait_for_completion=true" \
-H 'Content-Type: application/json' -d "{
\"indices\": \"places,toponyms\",
\"rename_pattern\": \"(.+)\",
\"rename_replacement\": \"\$1_${TIMESTAMP}\",
\"ignore_unavailable\": true,
\"include_global_state\": false
}"
2.11.3.3. 3. Configure for Production¶
# Update settings for query workload
curl -X PUT "http://localhost:9200/places_${TIMESTAMP}/_settings" \
-H 'Content-Type: application/json' -d '{
"index": {
"refresh_interval": "1s",
"translog": {
"durability": "request",
"flush_threshold_size": "512mb"
}
}
}'
curl -X PUT "http://localhost:9200/toponyms_${TIMESTAMP}/_settings" \
-H 'Content-Type: application/json' -d '{
"index": {
"refresh_interval": "1s",
"translog": {
"durability": "request",
"flush_threshold_size": "512mb"
}
}
}'
2.11.3.4. 4. Force Merge (Optional but Recommended)¶
# Optimize for queries (takes 30-60 minutes)
curl -X POST "http://localhost:9200/places_${TIMESTAMP}/_forcemerge?max_num_segments=1"
curl -X POST "http://localhost:9200/toponyms_${TIMESTAMP}/_forcemerge?max_num_segments=1"
2.11.3.5. 5. Validate Restored Indices¶
# Check document counts
curl -s "http://localhost:9200/places_${TIMESTAMP}/_count?pretty"
curl -s "http://localhost:9200/toponyms_${TIMESTAMP}/_count?pretty"
# Run test queries
curl -s "http://localhost:9200/places_${TIMESTAMP}/_search?q=label:London&size=5&pretty"
2.11.3.6. 6. Switch Aliases¶
# Atomic alias switch
curl -X POST "http://localhost:9200/_aliases" -H 'Content-Type: application/json' -d "{
\"actions\": [
{\"remove\": {\"index\": \"places_*\", \"alias\": \"places\"}},
{\"add\": {\"index\": \"places_${TIMESTAMP}\", \"alias\": \"places\"}},
{\"remove\": {\"index\": \"toponyms_*\", \"alias\": \"toponyms\"}},
{\"add\": {\"index\": \"toponyms_${TIMESTAMP}\", \"alias\": \"toponyms\"}}
]
}"
2.11.3.7. 7. Verify Production¶
# Queries now use new indices via aliases
curl -s "http://localhost:9200/places/_count?pretty"
curl -s "http://localhost:9200/_cat/aliases?v"
2.11.3.8. 8. Clean Up Old Indices (After Confirmation)¶
# List indices to confirm which to delete
curl "http://localhost:9200/_cat/indices?v"
# Delete old indices (wait 7 days for rollback safety)
curl -X DELETE "http://localhost:9200/places_20241201"
curl -X DELETE "http://localhost:9200/toponyms_20241201"
2.11.4. Rollback Procedure¶
If deployment fails or issues are discovered:
# Switch aliases back to previous indices
PREVIOUS_DATE="20241201"
curl -X POST "http://localhost:9200/_aliases" -H 'Content-Type: application/json' -d "{
\"actions\": [
{\"remove\": {\"index\": \"places_*\", \"alias\": \"places\"}},
{\"add\": {\"index\": \"places_${PREVIOUS_DATE}\", \"alias\": \"places\"}},
{\"remove\": {\"index\": \"toponyms_*\", \"alias\": \"toponyms\"}},
{\"add\": {\"index\": \"toponyms_${PREVIOUS_DATE}\", \"alias\": \"toponyms\"}}
]
}"
# Verify rollback
curl "http://localhost:9200/_cat/aliases?v"
curl -s "http://localhost:9200/places/_count?pretty"
2.12. Health Monitoring¶
2.12.1. Cluster Health Checks¶
2.12.1.1. Production Health¶
es -health
This shows:
Elasticsearch and Kibana running status
Cluster health (green/yellow/red)
Index summary with document counts
Disk usage (production data and snapshots)
Memory usage
2.12.1.2. Staging Health¶
es -staging-health
This shows:
Staging instance status
Slurm job status
Cluster health
Index summary
Snapshot count
2.12.2. Manual Health Checks¶
# Cluster health
curl "http://localhost:9200/_cluster/health?pretty"
# Detailed cluster state
curl "http://localhost:9200/_cluster/state?pretty"
# Node information
curl "http://localhost:9200/_nodes?pretty"
# Index health
curl "http://localhost:9200/_cat/indices?v&h=index,health,status,docs.count,store.size"
# Disk usage
curl "http://localhost:9200/_cat/allocation?v"
df -h /ix3/ishi/es/data/
du -sh /ix1/ishi/es/snapshots/
2.12.3. Key Metrics to Monitor¶
| Metric | Target | Alert Threshold |
|---|---|---|
| Cluster status | green | yellow (warn), red (critical) |
| Heap usage | <75% | >85% |
| Disk usage (/ix3) | <80% | >90% |
| Search latency (p95) | <100ms | >500ms |
| Document counts | expected ±1% | >5% deviation |
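Each numeric metric follows the same pattern: a target band, a warn band, and an alert threshold. A sketch of that evaluation (the "warn" band between target and alert is an interpretation, not something the table specifies):

```python
def check(value, target_max, alert_min):
    """Classify a metric reading against its target and alert thresholds."""
    if value > alert_min:
        return "alert"
    if value > target_max:
        return "warn"
    return "ok"

print(check(72, 75, 85))     # heap at 72%       -> ok
print(check(88, 80, 90))     # disk at 88%       -> warn
print(check(640, 100, 500))  # p95 latency 640ms -> alert
```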
2.12.4. Log Monitoring¶
# Production logs
tail -f /ix1/ishi/es/logs/whg-production.log
# Staging logs (while job running)
source /ix1/ishi/esinfo/es-staging.env
tail -f /ix1/ishi/es/staging-logs/slurm-${SLURM_JOB_ID}.out
# Search for errors
grep ERROR /ix1/ishi/es/logs/whg-production.log | tail -20
2.13. Troubleshooting¶
2.13.1. Staging Won’t Start¶
Symptoms: Slurm job submitted but no environment file created
Check:
- Slurm job status: `squeue -u $USER`
- Recent log files: `ls -lt /ix1/ishi/es/staging-logs/*.out | head -5`
- View log: `tail -100 /ix1/ishi/es/staging-logs/slurm-JOBID.out`
Common causes:
Insufficient resources available in Slurm queue
Local NVMe scratch not available
Elasticsearch binary not found
Java not found in PATH
Solutions:
# Check environment file exists
cat /ix1/ishi/elastic/.env | head -20
# Verify binaries
ls -la /ix1/ishi/es-bin/bin/elasticsearch
ls -la /ix1/ishi/jdk-21.0.1/bin/java
# Check Slurm queue
squeue -p htc
# Cancel and restart
scancel JOBID
source /ix1/ishi/elastic/scripts/es.sh -staging-start
2.13.2. Stale Staging Environment File¶
Symptoms: Environment file exists but job is not running
Solution:
# Remove stale file
rm /ix1/ishi/esinfo/es-staging.env
# Start fresh
source /ix1/ishi/elastic/scripts/es.sh -staging-start
2.13.3. Staging Out of Space¶
Symptoms: Indexing job crashes with disk full errors
Check:
source /ix1/ishi/esinfo/es-staging.env
ssh $ES_NODE "df -h $SLURM_SCRATCH"
Solutions:
Create snapshot of current state
Stop staging
Request more scratch space (if available)
Or split work into smaller batches
2.13.4. Connection Refused to Staging¶
Symptoms: Cannot connect to http://$ES_NODE:$ES_PORT
Check:
# Verify job still running
source /ix1/ishi/esinfo/es-staging.env
squeue -j $SLURM_JOB_ID
# Check if ES is listening
ssh $ES_NODE "netstat -tuln | grep $ES_PORT"
# Check ES logs
ssh $ES_NODE "tail -100 $SLURM_SCRATCH/es-staging/logs/elasticsearch.log"
Solutions:
Wait longer (ES can take 1-2 minutes to start)
Check for port conflicts
Review Slurm job logs for startup errors
Restart staging instance
2.13.5. Production Out of Memory¶
Symptoms: Queries are slow, heap usage >85%
Check:
curl "http://localhost:9200/_nodes/stats/jvm?pretty"
curl "http://localhost:9200/_cat/thread_pool?v"
Solutions:
- Check for long-running queries: `curl "http://localhost:9200/_tasks?detailed"`
- Clear field data cache: `curl -X POST "http://localhost:9200/_cache/clear?fielddata=true"`
- Review query patterns for optimization opportunities
- Consider increasing heap size in `.env` (requires restart)
2.13.6. Snapshot Restore Fails¶
Symptoms: Restore operation fails or produces incomplete indices
Check:
# Get restore status
curl "http://localhost:9200/_snapshot/staging_repo/snapshot_name/_status?pretty"
# Check snapshot integrity
curl "http://localhost:9200/_snapshot/staging_repo/snapshot_name?pretty"
Solutions:
- Verify snapshot completed successfully
- Check disk space on production
- Ensure repository is accessible: `ls -la /ix1/ishi/es/snapshots/staging/`
- Try restoring specific indices one at a time
- Check for `index.blocks` settings
2.13.7. Ingestion Scripts Crash¶
Symptoms: Python scripts fail during indexing
All ingestion scripts:
Process line-by-line from source files
Can be safely restarted
Elasticsearch handles duplicate IDs (updates existing documents)
No need to delete indices and start over
Solutions:
- Check available memory: `free -h`
- Review script logs for specific errors
- Reduce batch size in `processing/settings.py`
- Restart the script; it will resume where it left off
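The safe-restart property comes from indexing with stable document IDs: Elasticsearch treats a re-sent ID as an update, so replaying a batch after a crash cannot create duplicates. The idea in miniature (an in-memory stand-in for the index, not the real client):

```python
def bulk_index(index, docs):
    """Model ES indexing keyed by a stable _id: replays overwrite, never duplicate."""
    for doc in docs:
        index[doc["place_id"]] = doc   # same _id => document is replaced
    return index

idx = {}
batch = [{"place_id": "gn:1", "label": "A"}, {"place_id": "gn:2", "label": "B"}]
bulk_index(idx, batch)
bulk_index(idx, batch)     # simulate a crash-and-restart replay
print(len(idx))            # 2 (no duplicates)
```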
2.13.8. Force Merge Takes Too Long¶
Symptoms: Force merge runs for hours
This is normal:
Force merge consolidates segments for optimal query performance
For ~200GB of data, expect 30-60 minutes per index
Progress is not logged incrementally
Check if it’s actually running:
# Check running tasks
curl "http://localhost:9200/_cat/tasks?v"
# Check merge stats
curl "http://localhost:9200/_cat/indices?v&h=index,merges.current"
If stuck:
It’s probably not stuck, just slow
Let it finish naturally
If you must cancel: restart ES (merge will resume on next startup)
2.13.9. Disk Space Issues¶
2.13.9.1. Production (/ix3)¶
# Check usage
df -h /ix3/ishi/es/data/
# Check index sizes
curl "http://localhost:9200/_cat/indices?v&h=index,store.size"
# Delete old indices (if safe)
curl -X DELETE "http://localhost:9200/old_index_name"
2.13.9.2. Snapshots (/ix1)¶
# Check usage
du -sh /ix1/ishi/es/snapshots/*
# List snapshots
curl -s "http://localhost:9200/_snapshot/staging_repo/_all?pretty"
# Delete old snapshots
curl -X DELETE "http://localhost:9200/_snapshot/staging_repo/old_snapshot"
2.14. Quick Reference¶
2.14.1. Essential Commands¶
# Production (run on VM)
es -start # Start Elasticsearch + Kibana
es -stop # Stop both services
es -restart # Restart both services
es -health # Full health check
# Staging (run on CRC login node, use 'source')
source es.sh -staging-start # Launch staging instance
source es.sh -staging-stop # Stop staging instance
source es.sh -staging-health # Health check
source es.sh -staging-status # Status and document counts
source es.sh -staging-logs # View recent logs
# Ingestion
es -ingest # Ingest all authorities
es -ingest -n gn,wd # Specific authorities only
es -ingest --skip-existing # Skip already indexed
es -ingest --check-only # Check data availability
# Update authority files
python -m processing.fetch_authorities
python -m processing.fetch_authorities -n gn,wd --age 0
2.14.2. Common Queries¶
# Load staging connection (if needed)
source /ix1/ishi/esinfo/es-staging.env
# Cluster health
curl "http://localhost:9200/_cluster/health?pretty"
# List indices
curl "http://localhost:9200/_cat/indices?v"
# Document counts
curl "http://localhost:9200/places/_count?pretty"
curl "http://localhost:9200/toponyms/_count?pretty"
# Count by namespace
for ns in gn wd pl tgn gb un; do
count=$(curl -s "http://localhost:9200/places/_count" \
-H 'Content-Type: application/json' \
-d "{\"query\": {\"prefix\": {\"place_id\": \"$ns:\"}}}" | jq .count)
echo "$ns: $count"
done
# Check aliases
curl "http://localhost:9200/_cat/aliases?v"
# Sample search
curl -s "http://localhost:9200/places/_search?q=label:London&size=5&pretty"
2.14.3. Snapshot Operations¶
# Create staging snapshot
source /ix1/ishi/esinfo/es-staging.env
curl -X PUT "http://$ES_NODE:$ES_PORT/_snapshot/staging_repo/checkpoint_$(date +%Y%m%d)?wait_for_completion=true" \
-H 'Content-Type: application/json' -d '{"indices": "places,toponyms"}'
# List snapshots
curl -s "http://localhost:9200/_snapshot/staging_repo/_all?pretty"
# Delete snapshot
curl -X DELETE "http://localhost:9200/_snapshot/staging_repo/snapshot_name"
2.14.4. Deployment¶
# Stop staging
source es.sh -staging-stop
# Deploy to production (run on VM)
cd /ix1/ishi/elastic
python -m processing.deploy_to_production
# Verify
curl -s "http://localhost:9200/_cat/indices?v"
curl -s "http://localhost:9200/places/_count?pretty"
curl -s "http://localhost:9200/_cat/aliases?v"
2.14.5. File Locations¶
| Item | Location |
|---|---|
| ES wrapper script | `/ix1/ishi/elastic/scripts/es.sh` |
| Environment config | `/ix1/ishi/elastic/.env` |
| Production data | `/ix3/ishi/es/data/` |
| Production logs | `/ix1/ishi/es/logs/` |
| Staging logs | `/ix1/ishi/es/staging-logs/` |
| Snapshots | `/ix1/ishi/es/snapshots/` |
| Authority files | `/ix1/ishi/data/authorities/` |
| Staging connection info | `/ix1/ishi/esinfo/es-staging.env` |
2.14.6. Expected Results¶
After complete ingestion:
Places: ~25-30 million documents
Toponyms: ~80 million unique documents
Production storage: 180-270 GB
Snapshot storage: ~100 GB
2.14.7. URLs¶
| Service | URL | Access |
|---|---|---|
| Production ES | `http://localhost:9200` | VM only |
| Staging ES | `http://$ES_NODE:9201` | Compute node only |
| Kibana | `http://localhost:5601` | VM only (tunnel via SSH) |
2.14.8. Support¶
For issues or questions:
- Review logs: `/ix1/ishi/es/logs/` or `/ix1/ishi/es/staging-logs/`
- Check GitHub issues: https://github.com/WorldHistoricalGazetteer/elastic
- Contact: Karl Grossner (WHG Project)
Last Updated: December 2024
Elasticsearch Version: 9.0.0
Infrastructure: University of Pittsburgh CRC