10. Monitoring & Observability

10.1. Key Metrics

10.1.1. Elasticsearch Cluster

Metric	Target	Alert Threshold
Cluster status	green	yellow (warn), red (critical)
Heap usage	<75%	>85%
Disk usage	<80%	>90%
Search latency (p95)	<100ms	>500ms
Indexing rate	stable	>50% deviation

10.1.2. Index Health

Metric	Target	Alert Threshold
Document count	expected ±1%	>5% deviation
Index size	expected ±10%	>25% deviation
Segment count	<50 per shard	>100
kNN search recall@10	>85%	<75%
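
The recall@10 target presupposes a measurement procedure that this section does not spell out. A minimal sketch, assuming recall is measured by comparing the IDs returned by the approximate (HNSW) search with exact brute-force results for the same sample queries. The function names and the use of a fixed set of test queries are illustrative assumptions, not part of the deployed tooling:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


def mean_recall(pairs, k=10):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per test query."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in pairs]
    if not scores:
        raise ValueError("at least one query pair is required")
    return sum(scores) / len(scores)
```

Under these assumptions, a mean below 0.75 over the sample queries would correspond to the alert threshold in the table, and a mean above 0.85 would meet the target.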

10.1.3. Query Performance

Metric	Target	Alert Threshold
Vector search latency (p95)	<50ms	>200ms
Text search latency (p95)	<30ms	>100ms
Completion suggest latency (p95)	<10ms	>50ms
Query error rate	<0.1%	>1%
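
The p95 figures in this table can be derived from raw per-request timings. A minimal sketch using the nearest-rank method, which is one of several percentile definitions and is assumed here rather than taken from the monitoring stack. The function name is hypothetical:

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank definition."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

For example, the result of percentile_nearest_rank(vector_latencies_ms, 95) on a window of vector search timings could be compared against the 50ms target and the 200ms alert threshold.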

10.1.4. Embedding Generation

Metric	Target	Alert Threshold
Batch processing rate	>10k/min	<1k/min
Model inference latency	<10ms	>50ms
Embedding coverage	>99%	<95%

10.2. Health Check Endpoints

10.2.1. Cluster Health

curl -X GET "localhost:9200/_cluster/health?pretty"
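
The response to this call is a JSON object whose status field carries the green/yellow/red value used in the metrics tables. A minimal sketch of turning that response into a severity level, assuming the body has already been fetched as text. The function name and the severity labels are illustrative choices:

```python
import json

# Mapping from cluster status to the severities used in section 10.1.1.
SEVERITY_BY_STATUS = {"green": "ok", "yellow": "warning", "red": "critical"}


def classify_cluster_health(body):
    """Summarise a _cluster/health JSON body as a severity plus a few key fields."""
    health = json.loads(body)
    status = health["status"]
    return {
        "severity": SEVERITY_BY_STATUS.get(status, "unknown"),
        "status": status,
        "nodes": health.get("number_of_nodes"),
        "unassigned_shards": health.get("unassigned_shards"),
    }
```

The unassigned_shards count is included because it is usually the first thing to inspect when the status is yellow or red.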

10.2.2. Index Statistics

curl -X GET "localhost:9200/_cat/indices?v&h=index,health,docs.count,store.size"
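
The document counts from this call can be checked against the ±1% target and >5% alert threshold from section 10.1.2. A minimal sketch, assuming the same endpoint is called with format=json so that each row is a dictionary, and assuming a hypothetical expected_counts mapping maintained by whoever runs the ingestion. The index names in any usage are placeholders:

```python
def doc_count_deviation(cat_rows, expected_counts):
    """Return {index: (deviation_pct, level)} for indices with a known expected count."""
    results = {}
    for row in cat_rows:
        name = row["index"]
        raw_count = row.get("docs.count")
        # Closed indices report no document count; skip them.
        if name not in expected_counts or raw_count is None:
            continue
        expected = expected_counts[name]
        deviation = abs(int(raw_count) - expected) * 100 / expected
        if deviation > 5:
            level = "alert"
        elif deviation > 1:
            level = "outside target"
        else:
            level = "ok"
        results[name] = (round(deviation, 2), level)
    return results
```

The cat APIs return numeric columns as strings in JSON mode, hence the int() conversion.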

10.2.3. Search Performance

curl -X GET "localhost:9200/_nodes/stats/indices/search?pretty"
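
The search statistics returned here are cumulative counters, so they yield a mean latency rather than the p95 values used in the metrics tables. A minimal sketch of deriving the per-node mean from query_total and query_time_in_millis, assuming the response has been parsed into a dictionary. The function name is hypothetical:

```python
def mean_query_latency_ms(stats):
    """Return {node_name: mean query-phase latency in ms} from a _nodes/stats response."""
    latencies = {}
    for node_id, node in stats["nodes"].items():
        search = node["indices"]["search"]
        total = search["query_total"]
        if total == 0:
            # No queries served yet, so no meaningful average.
            continue
        latencies[node.get("name", node_id)] = search["query_time_in_millis"] / total
    return latencies
```

Because the counters accumulate from node start, sampling them twice and dividing the differences gives a more useful recent average than a single reading. Percentiles still require per-request timings collected elsewhere.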

10.3. Log Locations

Component	Location
Elasticsearch	/ix3/whcdh/es/logs/
Kibana	/ix1/whcdh/kibana/logs/
Ingestion scripts	stdout/stderr (capture to file)
Embedding generation	/ix1/whcdh/elastic/logs/

10.4. Dashboards

10.4.1. Recommended Kibana Visualisations

Cluster Overview: Node status, heap, disk, CPU
Index Metrics: Document counts, sizes, growth over time
Search Performance: Latency histograms, throughput, error rates
Ingestion Progress: Documents indexed per authority source

10.4.2. Sample Kibana Index Pattern

{
  "title": "places-*",
  "timeFieldName": "indexed_at"
}

10.5. Alerting

10.5.1. Critical Alerts (immediate response)

Cluster status red
Disk usage >95%
Search error rate >5%
All nodes unreachable

10.5.2. Warning Alerts (investigate within hours)

Cluster status yellow
Heap usage >85%
Search latency p95 >500ms
Document count deviation >5%
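
The critical and warning rules above can be expressed as a single evaluation over a metrics snapshot. A minimal sketch, assuming the snapshot is a dictionary whose key names (cluster_status, reachable_nodes, disk_pct, and so on) are invented here for illustration. How the values are collected and where the alerts are sent is left out:

```python
def evaluate_alerts(metrics):
    """Return (severity, message) tuples, critical first, for one metrics snapshot."""
    critical, warning = [], []
    status = metrics.get("cluster_status")

    if metrics.get("reachable_nodes") == 0:
        critical.append("All nodes unreachable")
    if status == "red":
        critical.append("Cluster status red")
    elif status == "yellow":
        warning.append("Cluster status yellow")
    if metrics.get("disk_pct", 0) > 95:
        critical.append("Disk usage >95%")
    if metrics.get("search_error_rate_pct", 0) > 5:
        critical.append("Search error rate >5%")
    if metrics.get("heap_pct", 0) > 85:
        warning.append("Heap usage >85%")
    if metrics.get("search_p95_ms", 0) > 500:
        warning.append("Search latency p95 >500ms")
    if metrics.get("doc_count_deviation_pct", 0) > 5:
        warning.append("Document count deviation >5%")

    return [("critical", m) for m in critical] + [("warning", m) for m in warning]
```

Missing keys are treated as healthy in this sketch. A production rule set would more likely treat a missing metric as a problem in its own right.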

10.5.3. Info Alerts (review daily)

New index version deployed
Snapshot completed
Embedding generation finished
Model training completed

10.6. Runbook: Common Issues

10.6.1. High Search Latency

Check cluster health and node status
Review heap usage (may need GC tuning)
Check segment counts (may need force merge)
Review slow query log for problematic patterns
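
The segment count step above can be checked against the <50 per shard target from section 10.1.2. A minimal sketch, assuming the rows come from the _cat/segments endpoint called with format=json, where each row describes one segment and carries index, shard and prirep fields. The function names are hypothetical:

```python
from collections import Counter


def segments_per_shard(cat_segment_rows):
    """Count segments per (index, shard, primary-or-replica) shard copy."""
    return Counter((row["index"], row["shard"], row["prirep"]) for row in cat_segment_rows)


def shards_over_target(cat_segment_rows, target=50):
    """Return shard copies whose segment count does not meet the '<target' goal."""
    counts = segments_per_shard(cat_segment_rows)
    return sorted(key for key, n in counts.items() if n >= target)
```

Shard copies returned by shards_over_target are candidates for a force merge, which is best run on indices that are no longer being written to.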

10.6.2. Index Size Growth

Verify that the growth is expected from new ingestion
Check for duplicate documents
Review field mappings for unexpected data
Consider increasing the refresh interval during bulk indexing to reduce the number of small segments created

10.6.3. kNN Search Quality Degradation

Verify that the embedding_bilstm field is populated
Check model version consistency
Review HNSW parameters (m, ef_construction)
Consider re-indexing with optimised settings
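
The first check above, whether embedding_bilstm is populated, can be quantified as the embedding coverage metric from section 10.1.4. A minimal sketch, assuming coverage is measured with two _count requests (one filtered by an exists query, one unfiltered) and that the exists query is supported for this field's mapping in the deployed Elasticsearch version. The function names are hypothetical:

```python
import json


def exists_count_body(field="embedding_bilstm"):
    """Build the request body for a _count call restricted to documents that have the field."""
    return json.dumps({"query": {"exists": {"field": field}}})


def embedding_coverage_pct(with_field_response, total_response):
    """Compute coverage from two parsed _count responses, each containing a 'count' key."""
    total = total_response["count"]
    if total == 0:
        raise ValueError("index contains no documents")
    return with_field_response["count"] * 100 / total
```

A result below 95 would correspond to the alert threshold for embedding coverage, and a result above 99 would meet the target.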