1.10. Monitoring & Observability

1.10.1. Key Metrics

1.10.1.1. Elasticsearch Cluster

Metric               | Target | Alert Threshold
Cluster status       | green  | yellow (warn), red (critical)
Heap usage           | <75%   | >85%
Disk usage           | <80%   | >90%
Search latency (p95) | <100ms | >500ms
Indexing rate        | stable | >50% deviation
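The cluster thresholds can be encoded as a small lookup so that a monitoring script classifies every reading the same way. This is a minimal Python sketch, not the project's monitoring code. The names UPPER_BOUNDS, classify_upper_bound and evaluate_cluster are hypothetical, the "above_target" band for readings between target and alert threshold is an assumption, and an unrecognised status string is treated as critical.

```python
# Upper-bound metrics from the table: name -> (target, alert threshold).
UPPER_BOUNDS = {
    "heap_pct": (75.0, 85.0),
    "disk_pct": (80.0, 90.0),
    "search_p95_ms": (100.0, 500.0),
}

# Cluster status maps directly to a level.
STATUS_LEVELS = {"green": "ok", "yellow": "warn", "red": "critical"}


def classify_upper_bound(value, target, alert):
    """Classify a reading where lower is better."""
    if value > alert:
        return "alert"
    if value >= target:
        return "above_target"  # assumption: intermediate band, no alert yet
    return "ok"


def evaluate_cluster(status, metrics):
    """Return {metric: level} for the status and any known metrics supplied."""
    # Assumption: an unknown status string is treated as critical.
    findings = {"cluster_status": STATUS_LEVELS.get(status, "critical")}
    for name, (target, alert) in UPPER_BOUNDS.items():
        if name in metrics:
            findings[name] = classify_upper_bound(metrics[name], target, alert)
    return findings
```

Indexing rate is left out because its target ("stable") needs a baseline that this section does not define.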

1.10.1.2. Index Health

Metric                | Target        | Alert Threshold
Document count        | expected ±1%  | >5% deviation
Index size            | expected ±10% | >25% deviation
Segment count         | <50 per shard | >100
kNN search recall@10  | >85%          | <75%
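Two of the index health metrics are ratios that are easy to compute inconsistently. The sketch below shows one way to calculate them in Python. It assumes that "deviation" means relative deviation from the expected value, and that recall@10 is measured against exact nearest neighbours obtained separately (for example from an offline brute-force search). The function names are hypothetical.

```python
def relative_deviation(actual, expected):
    """Fractional deviation of actual from expected, e.g. 0.06 for 6%."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return abs(actual - expected) / expected


def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbours found in the approximate top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must not be empty")
    return len(truth.intersection(approx_ids[:k])) / len(truth)


# Hypothetical counts: 1,060,000 documents against 1,000,000 expected
# is a 6% deviation, which exceeds the 5% alert threshold.
doc_count_alert = relative_deviation(1_060_000, 1_000_000) > 0.05
```

In practice recall@10 would be averaged over a fixed set of evaluation queries so that readings are comparable between index versions.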

1.10.1.3. Query Performance

Metric                           | Target | Alert Threshold
Vector search latency (p95)      | <50ms  | >200ms
Text search latency (p95)        | <30ms  | >100ms
Completion suggest latency (p95) | <10ms  | >50ms
Query error rate                 | <0.1%  | >1%
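The p95 figures need client-side latency samples, since the p95 is the value below which 95% of observed requests fall. The sketch below uses the nearest-rank method, which is one common definition among several. The function names and sample values are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate_pct(failed, total):
    """Failed queries as a percentage of all queries (0.0 when there are none)."""
    return 100.0 * failed / total if total else 0.0


# Hypothetical latencies in milliseconds for one measurement window.
latencies_ms = [12, 18, 9, 240, 15, 22, 11, 14, 19, 16]
vector_p95_alert = percentile_nearest_rank(latencies_ms, 95) > 200
```

With only ten samples the p95 is simply the slowest request, so windows should be large enough for the percentile to be meaningful.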

1.10.1.4. Embedding Generation

Metric                  | Target   | Alert Threshold
Batch processing rate   | >10k/min | <1k/min
Model inference latency | <10ms    | >50ms
Embedding coverage      | >99%     | <95%
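Coverage and batch rate can be derived from counters that the embedding job already has. The sketch below assumes coverage means the percentage of documents that carry an embedding, and that the rate is measured in documents per minute. Both function names are hypothetical.

```python
def embedding_coverage_pct(docs_with_embedding, total_docs):
    """Percentage of documents that have an embedding."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return 100.0 * docs_with_embedding / total_docs


def docs_per_minute(docs_processed, elapsed_seconds):
    """Average throughput of a batch run in documents per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return docs_processed * 60 / elapsed_seconds


# Hypothetical run: 940 of 1000 documents embedded gives 94% coverage,
# which is below the 95% alert threshold.
coverage_alert = embedding_coverage_pct(940, 1000) < 95
```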

1.10.2. Health Check Endpoints

1.10.2.1. Cluster Health

curl -X GET "localhost:9200/_cluster/health?pretty"
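The response is a JSON object, so a script can reduce it to the few fields that matter for alerting. The sketch below reads the status, number_of_nodes and unassigned_shards fields of the cluster health response. The function name, the needs_attention rule and the abbreviated sample payload are hypothetical.

```python
import json


def summarize_cluster_health(raw_json):
    """Extract the alert-relevant fields from a _cluster/health response body."""
    health = json.loads(raw_json)
    status = health["status"]
    unassigned = health["unassigned_shards"]
    return {
        "status": status,
        "nodes": health["number_of_nodes"],
        "unassigned_shards": unassigned,
        # Assumption: anything other than green with zero unassigned shards
        # deserves a look.
        "needs_attention": status != "green" or unassigned > 0,
    }


# Abbreviated, made-up response for illustration only.
sample = '{"cluster_name": "example", "status": "yellow", "number_of_nodes": 3, "unassigned_shards": 2}'
summary = summarize_cluster_health(sample)
```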

1.10.2.2. Index Statistics

curl -X GET "localhost:9200/_cat/indices?v&h=index,health,docs.count,store.size"
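With the v flag the first line of the output is a header row, so the text can be turned into records with a whitespace split. The sketch below assumes every column is populated in every row, which may not hold for closed indices. Requesting format=json instead would avoid the text parsing entirely. The function names and the sample output are hypothetical.

```python
def parse_cat_indices(text):
    """Parse whitespace-separated _cat output that starts with a header row."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Assumption: no empty columns, so each row splits into len(header) fields.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def unhealthy_indices(rows):
    """Names of indices whose health column is not green."""
    return [row["index"] for row in rows if row.get("health") != "green"]


# Made-up output for illustration only.
sample = """index      health docs.count store.size
places-v1  green  1200000    3.4gb
places-v2  yellow 1250000    3.6gb
"""
rows = parse_cat_indices(sample)
```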

1.10.2.3. Search Performance

curl -X GET "localhost:9200/_nodes/stats/indices/search?pretty"
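The search section of the node stats reports cumulative counters rather than percentiles, so the response only yields a mean query latency per node: query_time_in_millis divided by query_total. The sketch below computes that mean. The function name and sample payload are hypothetical. Because the counters accumulate since node start, a real monitor would diff two readings, and the p95 targets still need client-side samples.

```python
def mean_query_latency_ms(nodes_stats):
    """Map node name -> mean query-phase latency in ms from a node stats response."""
    result = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        search = node["indices"]["search"]
        total = search["query_total"]
        mean = search["query_time_in_millis"] / total if total else 0.0
        result[node.get("name", node_id)] = mean
    return result


# Abbreviated, made-up response for illustration only.
sample = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "indices": {"search": {"query_total": 200, "query_time_in_millis": 5000}},
        },
        "def456": {
            "name": "node-2",
            "indices": {"search": {"query_total": 0, "query_time_in_millis": 0}},
        },
    }
}
latencies = mean_query_latency_ms(sample)
```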

1.10.3. Log Locations

Component            | Location
Elasticsearch        | /ix3/whcdh/es/logs/
Kibana               | /ix1/whcdh/kibana/logs/
Ingestion scripts    | stdout/stderr (capture to file)
Embedding generation | /ix1/whcdh/elastic/logs/

1.10.4. Dashboards

1.10.4.2. Sample Kibana Index Pattern

{
  "title": "places-*",
  "timeFieldName": "indexed_at"
}

1.10.5. Alerting

1.10.5.1. Critical Alerts (immediate response)

  • Cluster status red

  • Disk usage >95%

  • Search error rate >5%

  • All nodes unreachable

1.10.5.2. Warning Alerts (investigate within hours)

  • Cluster status yellow

  • Heap usage >85%

  • Search latency p95 >500ms

  • Document count deviation >5%

1.10.5.3. Info Alerts (review daily)

  • New index version deployed

  • Snapshot completed

  • Embedding generation finished

  • Model training completed
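The critical and warning conditions above can be evaluated from a single snapshot of readings. The following sketch mirrors the two lists as written in this section. The Snapshot class, its field names and the triggered_alerts function are hypothetical, and info alerts are omitted because they are events rather than thresholds.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    status: str                     # "green", "yellow" or "red"
    reachable_nodes: int
    disk_pct: float
    heap_pct: float
    search_error_pct: float
    search_p95_ms: float
    doc_count_deviation_pct: float


def triggered_alerts(s):
    """Return the critical and warning conditions that the snapshot meets."""
    critical_rules = [
        ("Cluster status red", s.status == "red"),
        ("Disk usage >95%", s.disk_pct > 95),
        ("Search error rate >5%", s.search_error_pct > 5),
        ("All nodes unreachable", s.reachable_nodes == 0),
    ]
    warning_rules = [
        ("Cluster status yellow", s.status == "yellow"),
        ("Heap usage >85%", s.heap_pct > 85),
        ("Search latency p95 >500ms", s.search_p95_ms > 500),
        ("Document count deviation >5%", s.doc_count_deviation_pct > 5),
    ]
    return {
        "critical": [name for name, hit in critical_rules if hit],
        "warning": [name for name, hit in warning_rules if hit],
    }
```

Keeping the rule names identical to the list entries makes it easy to check the code against this section when thresholds change.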

1.10.6. Runbook: Common Issues

1.10.6.1. High Search Latency

  1. Check cluster health and node status

  2. Review heap usage (may need GC tuning)

  3. Check segment counts (may need force merge)

  4. Review slow query log for problematic patterns
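For step 3, segment counts per shard can be derived from the _cat/segments API, which lists one row per segment. The sketch below assumes the rows were fetched with format=json, so that each row is a dict with index, shard and prirep keys. The function names are hypothetical, and the default limit of 100 follows the alert threshold in the Index Health table.

```python
from collections import Counter


def segments_per_shard(rows):
    """Count segments per (index, shard, primary-or-replica) copy."""
    return Counter((row["index"], row["shard"], row["prirep"]) for row in rows)


def shards_over_limit(rows, limit=100):
    """Shard copies whose segment count exceeds the limit, sorted for stable output."""
    counts = segments_per_shard(rows)
    return sorted(key for key, n in counts.items() if n > limit)
```

Shards returned by shards_over_limit are the candidates for a force merge. Force merging is generally advised only for indices that are no longer being written to.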

1.10.6.2. Index Size Growth

  1. Verify that the growth is expected given new ingestion

  2. Check for duplicate documents

  3. Review field mappings for unexpected data

  4. Consider adjusting refresh interval during bulk indexing
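For step 2, one way to look for duplicates is a terms aggregation with min_doc_count set to 2 on a field that should be unique per document. The sketch below only builds the request body for the _search API. The field name source_id is a placeholder, since this section does not name a unique key, and the field would need to be a keyword or numeric type. The function name is also hypothetical.

```python
import json


def duplicate_terms_body(key_field, max_buckets=100):
    """Search body returning values of key_field that occur in 2+ documents."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "duplicates": {
                "terms": {
                    "field": key_field,
                    "min_doc_count": 2,
                    "size": max_buckets,
                }
            }
        },
    }


# Placeholder field name; substitute the real unique key.
body = json.dumps(duplicate_terms_body("source_id"))
```

Terms aggregation counts can be approximate on multi-shard indices, so any hit is a lead to verify rather than proof of duplication.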

1.10.6.3. kNN Search Quality Degradation

  1. Verify that the embedding_bilstm field is populated

  2. Check model version consistency

  3. Review HNSW parameters

  4. Consider re-indexing with optimised settings
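For step 1, documents that lack the embedding can be counted by sending a bool query with a must_not exists clause to the _count API. The sketch below builds that body and reads the count field of the response. It assumes that exists queries are supported for the vector field type in the deployed Elasticsearch version, which should be confirmed. The function names and the sample response are hypothetical.

```python
def missing_field_body(field="embedding_bilstm"):
    """_count body matching documents where the field has no value."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def missing_fraction(count_response, total_docs):
    """Fraction of documents missing the field, given a _count response."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return count_response["count"] / total_docs


# Made-up numbers: 500 of 10,000 documents lack an embedding.
fraction = missing_fraction({"count": 500}, 10_000)
```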