10. Monitoring & Observability
10.1. Key Metrics
10.1.1. Elasticsearch Cluster
| Metric | Target | Alert Threshold |
|---|---|---|
| Cluster status | green | yellow (warn), red (critical) |
| Heap usage | <75% | >85% |
| Disk usage | <80% | >90% |
| Search latency (p95) | <100ms | >500ms |
| Indexing rate | stable | >50% deviation |
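The thresholds in this table can be encoded directly in a monitoring script. The following is a minimal sketch, not a definitive implementation: the names `grade` and `classify_cluster` are hypothetical, and the intermediate `watch` level (values between the target and the alert threshold, which the table leaves unspecified) is an assumption.

```python
# Cluster status maps straight onto the severities given in the table.
STATUS_LEVELS = {"green": "ok", "yellow": "warn", "red": "critical"}


def grade(value, target_below, alert_above):
    """Grade a percentage against a target and an alert threshold.

    Returns "ok" below the target, "alert" above the alert threshold,
    and "watch" in between (the "watch" band is an assumption).
    """
    if value > alert_above:
        return "alert"
    if value < target_below:
        return "ok"
    return "watch"


def classify_cluster(status, heap_pct, disk_pct):
    """Apply the cluster-level thresholds from the table above."""
    return {
        # An unrecognised status is treated as critical to fail safe.
        "status": STATUS_LEVELS.get(status, "critical"),
        "heap": grade(heap_pct, target_below=75, alert_above=85),
        "disk": grade(disk_pct, target_below=80, alert_above=90),
    }
```

For example, `classify_cluster("yellow", 70, 92)` grades the heap as within target and the disk as past its alert threshold.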
10.1.2. Index Health
| Metric | Target | Alert Threshold |
|---|---|---|
| Document count | expected ±1% | >5% deviation |
| Index size | expected ±10% | >25% deviation |
| Segment count | <50 per shard | >100 |
| kNN search recall@10 | >85% | <75% |
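Recall@10 is usually measured by comparing the approximate (HNSW) top-10 hits for a query with the exact top-10 for the same query. The sketch below assumes both ID lists have already been retrieved; `recall_at_k` and `pct_deviation` are hypothetical helper names, and how the exact neighbours are obtained is left open.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbours found in the approximate top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        raise ValueError("exact_ids must contain at least one document ID")
    hits = sum(1 for doc_id in approx_ids[:k] if doc_id in exact)
    return hits / len(exact)


def pct_deviation(actual, expected):
    """Absolute deviation of a count or size from its expected value, in percent."""
    if expected == 0:
        raise ValueError("expected must be non-zero")
    return abs(actual - expected) * 100 / expected
```

Averaging `recall_at_k` over a fixed set of evaluation queries gives a figure comparable with the >85% target, and `pct_deviation` covers the document-count and index-size rows.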
10.1.3. Query Performance
| Metric | Target | Alert Threshold |
|---|---|---|
| Vector search latency (p95) | <50ms | >200ms |
| Text search latency (p95) | <30ms | >100ms |
| Completion suggest latency (p95) | <10ms | >50ms |
| Query error rate | <0.1% | >1% |
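The p95 figures above require per-request latency samples, for instance collected by the client application. A minimal sketch of the nearest-rank percentile follows; the function name `percentile` is hypothetical, and other percentile definitions (such as linear interpolation) would give slightly different values.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of numbers.

    `pct` is an integer percentage between 1 and 100. Integer arithmetic is
    used for the rank so that rounding errors cannot shift it by one.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n)
    return ordered[rank - 1]
```

With only a handful of samples the p95 is simply the largest value, so a window of a few hundred requests or more gives a more meaningful figure.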
10.1.4. Embedding Generation
| Metric | Target | Alert Threshold |
|---|---|---|
| Batch processing rate | >10k/min | <1k/min |
| Model inference latency | <10ms | >50ms |
| Embedding coverage | >99% | <95% |
10.2. Health Check Endpoints
10.2.1. Cluster Health
curl -X GET "localhost:9200/_cluster/health?pretty"
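The same check can be scripted so that a scheduler or monitoring agent can act on the result. This is a sketch under assumptions: the helper names are hypothetical, the cluster is assumed to be reachable without authentication at the address used in the curl command, and the exit codes follow the common monitoring-plugin convention rather than anything Elasticsearch defines.

```python
import json
from urllib.request import urlopen

# 0 = OK, 1 = warning, 2 = critical (monitoring-plugin convention, an assumption).
EXIT_CODES = {"green": 0, "yellow": 1, "red": 2}


def fetch_cluster_health(base_url="http://localhost:9200", timeout=5.0):
    """Return the parsed JSON body of GET /_cluster/health."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def health_exit_code(health):
    """Map the 'status' field to an exit code; missing or unknown values give 3."""
    return EXIT_CODES.get(health.get("status"), 3)
```

A wrapper script could then call `sys.exit(health_exit_code(fetch_cluster_health()))`.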
10.2.2. Index Statistics
curl -X GET "localhost:9200/_cat/indices?v&h=index,health,docs.count,store.size"
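For scripted checks, the `_cat` APIs can return JSON instead of a text table via `format=json`, and `bytes=b` reports sizes as plain byte counts. The sketch below assumes an unauthenticated local cluster; the helper names are hypothetical. Values in `_cat` JSON output arrive as strings, and counts can be null (for example for closed indices), so the parsing handles both.

```python
import json
from urllib.request import urlopen

COLUMNS = "index,health,docs.count,store.size"


def fetch_index_rows(base_url="http://localhost:9200", timeout=5.0):
    """Return the _cat/indices rows as a list of dicts."""
    url = f"{base_url}/_cat/indices?format=json&bytes=b&h={COLUMNS}"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarise_indices(rows):
    """Convert _cat rows into {index: {"health": ..., "docs": int or None}}."""
    summary = {}
    for row in rows:
        count = row.get("docs.count")
        summary[row["index"]] = {
            "health": row.get("health"),
            "docs": int(count) if count is not None else None,
        }
    return summary
```

The resulting document counts can be compared with expected values to drive the document-count deviation metric.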
10.2.3. Search Performance
curl -X GET "localhost:9200/_nodes/stats/indices/search?pretty"
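The node stats response reports cumulative counters per node, including `query_total` and `query_time_in_millis` under `indices.search`. Dividing one by the other gives a mean query time since node start, which is a rough indicator rather than the p95 used elsewhere in this section. A minimal sketch (the function name `mean_query_ms` is hypothetical):

```python
def mean_query_ms(stats):
    """Mean query-phase time per node, in milliseconds, from a nodes stats body.

    `stats` is the parsed JSON of GET /_nodes/stats/indices/search.
    Nodes that have served no queries report 0.0.
    """
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        search = node.get("indices", {}).get("search", {})
        total = search.get("query_total", 0)
        millis = search.get("query_time_in_millis", 0)
        # Fall back to the node ID if the node name is absent.
        result[node.get("name", node_id)] = millis / total if total else 0.0
    return result
```

Sampling the counters twice and dividing the differences gives a mean over the sampling interval instead of over the node's whole uptime.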
10.3. Log Locations
| Component | Location |
|---|---|
| Elasticsearch | /ix3/whcdh/es/logs/ |
| Kibana | /ix1/whcdh/kibana/logs/ |
| Ingestion scripts | stdout/stderr (capture to file) |
| Embedding generation | /ix1/whcdh/elastic/logs/ |
10.4. Dashboards
10.4.1. Recommended Kibana Visualisations
- Cluster Overview: Node status, heap, disk, CPU
- Index Metrics: Document counts, sizes, growth over time
- Search Performance: Latency histograms, throughput, error rates
- Ingestion Progress: Documents indexed per authority source
10.4.2. Sample Kibana Index Pattern
{
  "title": "places-*",
  "timeFieldName": "indexed_at"
}
10.5. Alerting
10.5.1. Critical Alerts (immediate response)
- Cluster status red
- Disk usage >95%
- Search error rate >5%
- All nodes unreachable
10.5.2. Warning Alerts (investigate within hours)
- Cluster status yellow
- Heap usage >85%
- Search latency p95 >500ms
- Document count deviation >5%
10.5.3. Info Alerts (review daily)
- New index version deployed
- Snapshot completed
- Embedding generation finished
- Model training completed
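The critical and warning tiers above are threshold checks and can be evaluated from a single metrics snapshot; the info tier is event-driven and is left out here. The following is a sketch only: the `Snapshot` fields and the `evaluate` function are hypothetical names, and how the metrics are collected is not covered.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One set of readings; field names are illustrative assumptions."""
    status: str
    reachable_nodes: int
    disk_pct: float
    heap_pct: float
    search_error_pct: float
    search_p95_ms: float
    doc_count_deviation_pct: float


def evaluate(s):
    """Return the (severity, message) pairs that fire, critical tier first."""
    checks = [
        ("critical", "Cluster status red", s.status == "red"),
        ("critical", "Disk usage >95%", s.disk_pct > 95),
        ("critical", "Search error rate >5%", s.search_error_pct > 5),
        ("critical", "All nodes unreachable", s.reachable_nodes == 0),
        ("warning", "Cluster status yellow", s.status == "yellow"),
        ("warning", "Heap usage >85%", s.heap_pct > 85),
        ("warning", "Search latency p95 >500ms", s.search_p95_ms > 500),
        ("warning", "Document count deviation >5%", s.doc_count_deviation_pct > 5),
    ]
    return [(severity, message) for severity, message, fired in checks if fired]
```

Keeping the checks in one ordered list makes it straightforward to route the first critical entry to a pager and batch the warnings into a ticket or channel.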
10.6. Runbook: Common Issues
10.6.1. High Search Latency
1. Check cluster health and node status.
2. Review heap usage; sustained high heap may call for GC tuning.
3. Check segment counts; shards with many segments may need a force merge.
4. Review the slow query log for problematic query patterns.
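The segment-count step can be automated from the output of `GET /_cat/segments?format=json`, which returns one row per segment with `index`, `shard`, and `prirep` columns. The sketch below counts rows per shard copy and flags those over the alert threshold of 100; the helper names are hypothetical. Force merging is generally best reserved for indices that are no longer being written to.

```python
from collections import Counter


def segments_per_shard(rows):
    """Count segments per (index, shard, primary/replica) from _cat/segments rows."""
    return Counter((row["index"], row["shard"], row["prirep"]) for row in rows)


def shards_over_limit(rows, limit=100):
    """Return the shard copies whose segment count exceeds `limit`, sorted."""
    counts = segments_per_shard(rows)
    return sorted(key for key, n in counts.items() if n > limit)


def forcemerge_path(index, max_num_segments=1):
    """Request path for a POST that force merges `index` down to N segments."""
    return f"/{index}/_forcemerge?max_num_segments={max_num_segments}"
```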
10.6.2. Index Size Growth
1. Verify that the growth is expected from new ingestion.
2. Check for duplicate documents.
3. Review field mappings for unexpected data.
4. Consider adjusting the refresh interval during bulk indexing.
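The refresh interval is a dynamic index setting, so it can be changed with `PUT /<index>/_settings` before a bulk load and restored afterwards. The sketch below only builds the request; the helper names are hypothetical, `places-v1` in the usage note is an illustrative index name, and an unauthenticated local cluster is assumed.

```python
import json
from urllib.request import Request, urlopen


def refresh_interval_request(index, interval, base_url="http://localhost:9200"):
    """Build a PUT /<index>/_settings request that sets index.refresh_interval.

    Use "-1" to disable periodic refresh, a duration such as "30s" to slow it
    down, or None (JSON null) to reset the setting to its default.
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return Request(
        f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request, timeout=10.0):
    """Send a prepared request and return the parsed JSON response."""
    with urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A bulk load would call `send(refresh_interval_request("places-v1", "-1"))` first and `send(refresh_interval_request("places-v1", None))` once indexing has finished.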
10.6.3. kNN Search Quality Degradation
1. Verify that the embedding_bilstm field is populated.
2. Check model version consistency.
3. Review the HNSW parameters.
4. Consider re-indexing with optimised settings.
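The first step can be checked by comparing two `_count` requests: one with an `exists` query on `embedding_bilstm` and one matching all documents. This is a sketch under assumptions: the helper names are hypothetical, an unauthenticated local cluster is assumed, and it presumes the `exists` query is supported for the field's mapping type in the Elasticsearch version in use.

```python
import json
from urllib.request import Request, urlopen

EXISTS_EMBEDDING = {"exists": {"field": "embedding_bilstm"}}
MATCH_ALL = {"match_all": {}}


def count_docs(index, query, base_url="http://localhost:9200", timeout=10.0):
    """Return the number of documents in `index` that match `query`."""
    req = Request(
        f"{base_url}/{index}/_count",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["count"]


def coverage_pct(with_embedding, total):
    """Percentage of documents that carry an embedding (0.0 for an empty index)."""
    if total == 0:
        return 0.0
    return with_embedding * 100 / total
```

Passing the results of `count_docs(index, EXISTS_EMBEDDING)` and `count_docs(index, MATCH_ALL)` to `coverage_pct` yields a value comparable with the embedding coverage target of >99%.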