10. Monitoring & Observability

10.1. Key Metrics

10.1.1. Elasticsearch Cluster

Metric                 Target   Alert Threshold
---------------------  -------  -----------------------------
Cluster status         green    yellow (warn), red (critical)
Heap usage             <75%     >85%
Disk usage             <80%     >90%
Search latency (p95)   <100ms   >500ms
Indexing rate          stable   >50% deviation
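The thresholds above can be checked mechanically. The following is a minimal Python sketch, not part of the existing tooling: the function name `evaluate_cluster` and the (severity, metric) output shape are assumptions made for illustration. The disk rule combines the >90% threshold from this table with the >95% critical alert in section 10.5.1, which is one possible reading of how the two relate.

```python
from typing import List, Tuple


def evaluate_cluster(status: str, heap_pct: float, disk_pct: float,
                     search_p95_ms: float) -> List[Tuple[str, str]]:
    """Compare cluster metrics with the alert thresholds in section 10.1.1.

    Returns (severity, metric) pairs; an empty list means no threshold was crossed.
    """
    alerts: List[Tuple[str, str]] = []
    if status == "red":
        alerts.append(("critical", "cluster_status"))
    elif status == "yellow":
        alerts.append(("warn", "cluster_status"))
    if heap_pct > 85:
        alerts.append(("warn", "heap_usage"))
    # Assumption: >95% escalates to critical (section 10.5.1), >90% is a warning.
    if disk_pct > 95:
        alerts.append(("critical", "disk_usage"))
    elif disk_pct > 90:
        alerts.append(("warn", "disk_usage"))
    if search_p95_ms > 500:
        alerts.append(("warn", "search_latency_p95"))
    return alerts
```

The inputs are expected to be collected separately, for example from the cluster health and node stats endpoints in section 10.2.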

10.1.2. Index Health

Metric                 Target          Alert Threshold
---------------------  --------------  ---------------
Document count         expected ±1%    >5% deviation
Index size             expected ±10%   >25% deviation
Segment count          <50 per shard   >100
kNN search recall@10   >85%            <75%
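Two of these metrics are simple formulas. The sketch below shows one way to compute them in Python; it assumes that recall@10 is measured against an exact (brute-force) nearest-neighbour result for the same query, which this document does not specify. The function names are illustrative.

```python
from typing import Sequence


def relative_deviation(observed: float, expected: float) -> float:
    """Absolute deviation from the expected value, as a fraction of the expected value."""
    if expected == 0:
        raise ValueError("expected value must be non-zero")
    return abs(observed - expected) / abs(expected)


def recall_at_k(approx_ids: Sequence[str], exact_ids: Sequence[str], k: int = 10) -> float:
    """Fraction of the exact top-k document IDs that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact result list is empty")
    return len(set(approx_ids[:k]) & exact_top) / len(exact_top)
```

For example, a document count of 1,060,000 against an expected 1,000,000 is a 6% deviation, which crosses the >5% alert threshold. In practice recall would be averaged over a fixed set of evaluation queries rather than read from a single query.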

10.1.3. Query Performance

Metric                             Target   Alert Threshold
---------------------------------  -------  ---------------
Vector search latency (p95)        <50ms    >200ms
Text search latency (p95)          <30ms    >100ms
Completion suggest latency (p95)   <10ms    >50ms
Query error rate                   <0.1%    >1%
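The p95 figures above need a percentile computed over per-request latency samples. The sketch below uses the nearest-rank method, which is an assumption: the document does not say which percentile definition or which sample source (client-side timings, proxy logs) is used.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate(failed: int, total: int) -> float:
    """Fraction of queries that failed; 0.0 when no queries were issued."""
    return failed / total if total else 0.0
```

With 20 samples, the p95 is the 19th smallest value, so a single slow outlier does not move it but two do.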

10.1.4. Embedding Generation

Metric                    Target    Alert Threshold
------------------------  --------  ---------------
Batch processing rate     >10k/min  <1k/min
Model inference latency   <10ms     >50ms
Embedding coverage        >99%      <95%
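Embedding coverage can be derived from two document counts: documents that have the vector field, and all documents. The sketch below assumes the vector field is `embedding_bilstm` (the name used in section 10.6.3) and that an `exists` query against the `_count` API matches documents with that field populated; both should be confirmed against the actual mapping. The function names are illustrative.

```python
from typing import Any, Dict


def embedding_count_query(field: str = "embedding_bilstm") -> Dict[str, Any]:
    """Request body for the _count API that matches documents where the field exists."""
    return {"query": {"exists": {"field": field}}}


def embedding_coverage(with_embedding: int, total: int) -> float:
    """Fraction of documents that carry an embedding."""
    if total <= 0:
        raise ValueError("total document count must be positive")
    return with_embedding / total
```

For example, 990,000 documents with an embedding out of 1,000,000 gives 99% coverage, which sits exactly on the target boundary.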

10.2. Health Check Endpoints

10.2.1. Cluster Health

curl -X GET "localhost:9200/_cluster/health?pretty"

10.2.2. Index Statistics

curl -X GET "localhost:9200/_cat/indices?v&h=index,health,docs.count,store.size"

10.2.3. Search Performance

curl -X GET "localhost:9200/_nodes/stats/indices/search?pretty"
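The node stats response reports cumulative counters rather than latencies. As a rough sketch, the mean query time per node can be derived from `query_time_in_millis` and `query_total` in each node's `indices.search` section. The function name is illustrative, and the result is a mean since node start, not the p95 used in section 10.1.

```python
from typing import Any, Dict


def mean_query_latency_ms(node_stats: Dict[str, Any]) -> Dict[str, float]:
    """Map node name to mean query latency (ms) from a parsed _nodes/stats response."""
    result: Dict[str, float] = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        search = node["indices"]["search"]
        total = search["query_total"]
        name = node.get("name", node_id)
        # Nodes that have served no queries report 0.0 rather than dividing by zero.
        result[name] = search["query_time_in_millis"] / total if total else 0.0
    return result
```

To track latency over a time window instead, take two snapshots and divide the difference in `query_time_in_millis` by the difference in `query_total`.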

10.3. Log Locations

Component              Location
---------------------  -------------------------------
Elasticsearch          /ix3/whcdh/es/logs/
Kibana                 /ix1/whcdh/kibana/logs/
Ingestion scripts      stdout/stderr (capture to file)
Embedding generation   /ix1/whcdh/elastic/logs/

10.4. Dashboards

10.4.2. Sample Kibana Index Pattern

{
  "title": "places-*",
  "timeFieldName": "indexed_at"
}

10.5. Alerting

10.5.1. Critical Alerts (immediate response)

  • Cluster status red

  • Disk usage >95%

  • Search error rate >5%

  • All nodes unreachable

10.5.2. Warning Alerts (investigate within hours)

  • Cluster status yellow

  • Heap usage >85%

  • Search latency p95 >500ms

  • Document count deviation >5%

10.5.3. Info Alerts (review daily)

  • New index version deployed

  • Snapshot completed

  • Embedding generation finished

  • Model training completed

10.6. Runbook: Common Issues

10.6.1. High Search Latency

  1. Check cluster health and node status

  2. Review heap usage (may need GC tuning)

  3. Check segment counts (may need force merge)

  4. Review slow query log for problematic patterns
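For step 3, segment counts per shard can be tallied from the rows returned by `_cat/segments?format=json`, where each row describes one segment and carries `index`, `shard`, and `prirep` fields. The sketch below is illustrative; the helper names are assumptions, and the default limit of 100 is taken from the alert threshold in section 10.1.2.

```python
from collections import Counter
from typing import Dict, List, Tuple

ShardKey = Tuple[str, str, str]  # (index, shard number, "p" or "r")


def segments_per_shard(rows: List[Dict[str, str]]) -> Counter:
    """Count segments per shard copy; each input row represents one segment."""
    counts: Counter = Counter()
    for row in rows:
        counts[(row["index"], row["shard"], row["prirep"])] += 1
    return counts


def shards_over_limit(rows: List[Dict[str, str]], limit: int = 100) -> List[ShardKey]:
    """Shard copies whose segment count exceeds the limit, sorted for stable output."""
    return sorted(key for key, n in segments_per_shard(rows).items() if n > limit)
```

Shards returned by `shards_over_limit` are candidates for a force merge, which is best run on indices that are no longer receiving writes.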

10.6.2. Index Size Growth

  1. Verify that the growth is expected from new ingestion

  2. Check for duplicate documents

  3. Review field mappings for unexpected data

  4. Consider adjusting refresh interval during bulk indexing
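For step 4, the refresh interval is a dynamic index setting that can be changed through the index `_settings` API. The sketch below builds the request bodies for disabling refresh before a bulk load and restoring it afterwards. The function name is illustrative, and the restore value of `1s` is the Elasticsearch default, assumed here because this document does not state the configured interval.

```python
from typing import Any, Dict


def refresh_settings(disable: bool, restore_interval: str = "1s") -> Dict[str, Any]:
    """Body for PUT /<index>/_settings; "-1" disables periodic refresh."""
    interval = "-1" if disable else restore_interval
    return {"index": {"refresh_interval": interval}}
```

Apply `refresh_settings(True)` before the bulk load and `refresh_settings(False)` once it completes. Documents indexed while refresh is disabled are not searchable until a refresh happens.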

10.6.3. kNN Search Quality Degradation

  1. Verify that the embedding_bilstm field is populated

  2. Check model version consistency

  3. Review HNSW parameters

  4. Consider re-indexing with optimised settings
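Step 2 can be checked with a terms aggregation if each document records which model produced its embedding. The sketch below assumes a hypothetical keyword field named `model_version`; no such field is described in this document, so substitute whatever the mapping actually uses. The helper names are also illustrative.

```python
from typing import Any, Dict


def model_version_query(field: str = "model_version") -> Dict[str, Any]:
    """Search body that returns only a per-version document count (no hits)."""
    return {"size": 0, "aggs": {"versions": {"terms": {"field": field}}}}


def versions_in_use(response: Dict[str, Any]) -> Dict[str, int]:
    """Extract {version: doc_count} from the parsed search response."""
    buckets = response["aggregations"]["versions"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def is_consistent(versions: Dict[str, int]) -> bool:
    """True when at most one model version produced the indexed embeddings."""
    return len(versions) <= 1
```

More than one version in the result suggests that query vectors and some indexed vectors come from different models, which on its own can explain a drop in recall.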