9. Monitoring & Observability
9.1. Key Metrics
Elasticsearch (DigitalOcean):
Query latency by stage (vector search, n-gram search, join operations).
Cache hit rate for IPA queries.
Index size and growth rate (GB/month).
kNN search recall@k (requires manual evaluation dataset).
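As a rough illustration of the recall@k metric above, the sketch below compares the IDs returned by an approximate kNN query against a hand-labelled (or brute-force) set of relevant IDs. The function names and the shape of the evaluation set are assumptions made for this example, not part of the existing codebase.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def recall_at_k(retrieved: Sequence[Hashable],
                relevant: Iterable[Hashable],
                k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must not be empty")
    hits = len(relevant_set.intersection(retrieved[:k]))
    return hits / len(relevant_set)


def mean_recall_at_k(evaluation_set: Sequence[Tuple[Sequence[Hashable],
                                                    Iterable[Hashable]]],
                     k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    if not evaluation_set:
        raise ValueError("evaluation_set must not be empty")
    scores = [recall_at_k(retrieved, relevant, k)
              for retrieved, relevant in evaluation_set]
    return sum(scores) / len(scores)
```

When the relevant set is the exact top-k from a brute-force search, this reduces to the usual approximate-kNN recall (overlap divided by k).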
Django:
Epitran conversion success rate per language.
ONNX model inference latency (p50, p95, p99).
Model memory footprint per Django worker.
Query fallback frequency (vector → n-gram → fuzzy).
User search satisfaction (requires click-through tracking).
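The latency percentiles listed above (p50, p95, p99) could be derived from raw timing samples roughly as follows. This is a minimal sketch using the nearest-rank method; the helper names are hypothetical, and a production setup would more likely rely on histogram metrics in the monitoring stack.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarise inference latencies (milliseconds) as p50/p95/p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Tail percentiles computed this way are only meaningful with enough samples; with a handful of requests, p95 and p99 collapse to the maximum.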
Pitt Pipeline:
Bulk push success rate (target: >99.9%).
Retry queue depth (alert if >1000; section 9.3 sets the warning level at >500).
Embedding generation throughput (docs/hour).
Model training convergence (validation loss).
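One possible way to compute the bulk push success rate is to inspect the per-item statuses in an Elasticsearch bulk API response, where each entry in `items` is a single-key object (for example `index` or `create`) carrying an HTTP-style `status`. The function names below are illustrative assumptions; only the 99.9% target comes from the list above.

```python
from typing import Any, Dict


def bulk_success_rate(response: Dict[str, Any]) -> float:
    """Share of bulk items whose status is 2xx; 1.0 for an empty response."""
    items = response.get("items", [])
    if not items:
        return 1.0
    succeeded = 0
    for item in items:
        # Each item looks like {"index": {"_id": "...", "status": 201}}.
        result = next(iter(item.values()))
        if 200 <= result.get("status", 0) < 300:
            succeeded += 1
    return succeeded / len(items)


def meets_target(rate: float, target: float = 0.999) -> bool:
    """True when the success rate exceeds the target (>99.9% by default)."""
    return rate > target
```

Items that fail this check are the natural candidates for the retry queue, so the same loop could also feed the queue-depth metric.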
9.2. Dashboards
Grafana (DigitalOcean):
Real-time search performance.
Elasticsearch cluster health.
IPA cache efficiency.
Local monitoring (Pitt):
Bulk operation logs.
Model training metrics (TensorBoard).
Data pipeline status.
9.3. Alerting
Critical: Elasticsearch cluster status red, bulk push failure rate >1%.
Warning: Retry queue >500, IPA conversion failures >5%.
Info: New embedding version deployed, dataset ingestion started.
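The critical and warning thresholds above can be expressed as a small rule check. The sketch below is illustrative only: the `MetricsSnapshot` fields and the `evaluate_alerts` function are hypothetical names, and in practice these rules would more likely live in the alerting configuration of the monitoring stack than in application code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetricsSnapshot:
    cluster_status: str        # Elasticsearch health: "green", "yellow" or "red"
    bulk_failure_rate: float   # fraction of failed bulk items, 0.0 to 1.0
    retry_queue_depth: int     # number of documents waiting for retry
    ipa_failure_rate: float    # fraction of failed IPA conversions, 0.0 to 1.0


def evaluate_alerts(snapshot: MetricsSnapshot) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for every threshold that is exceeded."""
    alerts: List[Tuple[str, str]] = []
    if snapshot.cluster_status == "red":
        alerts.append(("critical", "Elasticsearch cluster status is red"))
    if snapshot.bulk_failure_rate > 0.01:
        alerts.append(("critical", "Bulk push failure rate above 1%"))
    if snapshot.retry_queue_depth > 500:
        alerts.append(("warning", "Retry queue depth above 500"))
    if snapshot.ipa_failure_rate > 0.05:
        alerts.append(("warning", "IPA conversion failures above 5%"))
    return alerts
```

For example, a snapshot with a green cluster, a 2% bulk failure rate, and 600 queued retries would yield one critical and one warning alert.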