13. Risk Assessment
13.1. Technical Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Epitran fails for rare language | High | Low | Affects training data coverage only; model generalises |
| Siamese BiLSTM produces poor embeddings | Low | High | Versioned deployment; A/B evaluation; rollback capability |
| Production ES memory exhaustion | Low | Critical | Monitor heap usage; tune HNSW parameters; scale vertically |
| Staging ES runs out of space | Low | Medium | Monitor during indexing; NVMe scratch provides ~870 GB |
| Staging job terminates before snapshot | Medium | High | Implement checkpoint snapshots; monitor job time limits |
| Index corruption during reindex | Low | High | Staging isolation; snapshot before restore; validation checks |
| Storage exhaustion on /ix3 | Medium | Critical | Monitor disk usage; alert at 80%; archive old snapshots |
| Query latency exceeds targets | Medium | Medium | Cache frequent queries; tune HNSW; staging allows tuning without production impact |
| Snapshot transfer too slow | Low | Medium | Schedule during off-peak; consider incremental snapshots |

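The "alert at 80%" mitigation for /ix3 could be checked with something like the following. This is a minimal sketch, not the project's monitoring code: the function names are illustrative, and only the 80% threshold and the /ix3 path come from the table above.

```python
import shutil


def usage_fraction(used_bytes: int, total_bytes: int) -> float:
    """Return the fraction of capacity in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def should_alert(used_bytes: int, total_bytes: int, threshold: float = 0.80) -> bool:
    """True once usage has reached the alert threshold (80% by default)."""
    return usage_fraction(used_bytes, total_bytes) >= threshold


def check_path(path: str, threshold: float = 0.80) -> bool:
    """Check a mounted filesystem, for example check_path("/ix3")."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return should_alert(usage.used, usage.total, threshold)
```

Keeping the threshold comparison separate from the filesystem call makes the rule easy to test without access to the real mount.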
13.2. Operational Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Authority source unavailable | Medium | Low | Cache downloaded files; retry with backoff; multiple mirrors |
| Siamese model training fails to converge | Low | Medium | Checkpoint frequently; early stopping; hyperparameter search |
| Snapshot restore fails | Low | High | Test restores regularly; maintain multiple snapshot generations |
| /ix3 storage system failure | Low | Critical | Snapshot to /ix1; document recovery procedure |
| Slurm staging unavailable | Medium | Medium | Queue jobs during available windows; document job specifications |
| Staging/production version mismatch | Low | Medium | Document ES versions; test compatibility; version indices |

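The "retry with backoff" mitigation for unavailable authority sources might look roughly like this. It is a sketch under assumptions: the function names, the default of five attempts, and the 1 s base and 60 s cap are all illustrative and not taken from the plan.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(count: int, base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield `count` delays that double each time: base, 2*base, ... up to cap."""
    for attempt in range(count):
        yield min(cap, base * 2 ** attempt)


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on OSError wait and retry, re-raising after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = list(backoff_delays(attempts - 1, base, cap))
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets tests run without real waiting; a production caller would normally leave the default.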
13.3. Data Quality Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Duplicate places across authorities | High | Medium | Deduplication via relations; clustering algorithms |
| Training data clustering errors | Medium | Medium | Iterative refinement; manual spot-checks; hold-out validation |
| Stale authority data | Medium | Low | Scheduled refresh; track source update dates |
| Inconsistent language tagging | High | Medium | Normalise on ingest; validate against ISO 639 |
| WHG contribution data quality issues | Medium | Medium | Validation on export; schema enforcement; review workflow |

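As an illustration of "normalise on ingest; validate against ISO 639", the sketch below reduces a tag to its primary subtag and maps it to a three-letter code. The choice of ISO 639-3 as the target, the function name, and the tiny lookup tables are assumptions for the example; a real ingest step would load the full ISO 639 code tables.

```python
from typing import Optional

# Illustrative subset only, not the full ISO 639 tables.
ISO_639_1_TO_3 = {"en": "eng", "fr": "fra", "de": "deu", "ar": "ara", "zh": "zho"}
ISO_639_3 = set(ISO_639_1_TO_3.values()) | {"lat", "grc"}


def normalise_language_tag(tag: str) -> Optional[str]:
    """Return a lowercase three-letter code, or None if the tag is not recognised."""
    # Keep only the primary language subtag: "EN-gb" -> "en", "fr_CA" -> "fr".
    primary = tag.strip().lower().replace("_", "-").split("-")[0]
    if primary in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[primary]
    if primary in ISO_639_3:
        return primary
    return None  # caller decides whether to reject or flag for review
```

Returning `None` rather than raising leaves the policy for unrecognised tags (reject, flag, or default) to the ingest pipeline.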
13.4. Mitigation Strategies
13.4.1. Versioned Deployments
All index updates use versioned indices with alias switching:

1. Create `places_v{N+1}` and `toponyms_v{N+1}`
2. Populate and validate new indices
3. Switch aliases atomically
4. Retain previous version for rollback (7 days)
5. Delete old indices after confirmation
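
The atomic alias switch can be expressed as a single request to the Elasticsearch `_aliases` endpoint, where all actions in one request are applied together. The sketch below only builds the request body; it assumes the aliases are named `places` and `toponyms` (the plan names the indices, not the aliases), and the helper name is illustrative.

```python
from typing import Dict, List


def alias_swap_body(aliases: List[str], old_version: int, new_version: int) -> Dict:
    """Build a POST /_aliases body that moves each alias from v{old} to v{new}."""
    actions = []
    for alias in aliases:
        # Remove the alias from the old index and add it to the new one.
        # Both actions go in the same request, so readers never see a gap.
        actions.append({"remove": {"index": f"{alias}_v{old_version}", "alias": alias}})
        actions.append({"add": {"index": f"{alias}_v{new_version}", "alias": alias}})
    return {"actions": actions}
```

Rollback within the retention window is the same call with the version numbers swapped, which is why the previous indices are kept until the deployment is confirmed.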
13.4.2. Snapshot Strategy

| Type | Frequency | Retention | Purpose |
|---|---|---|---|
| Daily | Automatic | 7 days | Quick recovery |
| Weekly | Automatic | 4 weeks | Medium-term rollback |
| Pre-deployment | Before alias switch | 2 versions | Deployment rollback |
| Monthly | Manual | 6 months | Archive |

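The retention rules above mix age-based limits (daily, weekly, monthly) with a count-based limit (pre-deployment). A pruning pass could be sketched as follows; the tuple layout, the type labels, and the approximation of 6 months as 183 days are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Age-based retention, taken from the table; 6 months is approximated as 183 days.
RETENTION = {
    "daily": timedelta(days=7),
    "weekly": timedelta(weeks=4),
    "monthly": timedelta(days=183),
}
KEEP_PRE_DEPLOYMENT = 2  # count-based: keep the newest two versions

Snapshot = Tuple[str, str, datetime]  # (name, type, created)


def expired_snapshots(snapshots: List[Snapshot], now: datetime) -> List[str]:
    """Return the names of snapshots that fall outside the retention rules."""
    expired = []
    for name, kind, created in snapshots:
        max_age = RETENTION.get(kind)
        if max_age is not None and now - created > max_age:
            expired.append(name)
    # Pre-deployment snapshots are kept by count, newest first.
    pre = sorted(
        (s for s in snapshots if s[1] == "pre-deployment"),
        key=lambda s: s[2],
        reverse=True,
    )
    expired.extend(name for name, _, _ in pre[KEEP_PRE_DEPLOYMENT:])
    return expired
```

The function only reports candidates; deleting them would be a separate, explicitly confirmed step.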
13.4.3. Graceful Degradation
Search always returns results by working through a fallback chain, in order of decreasing quality:

1. Vector kNN search (best quality)
2. Fuzzy text search (good quality)
3. Exact text search (baseline)
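
The fallback chain can be sketched as an ordered list of search functions that are tried in turn. This is an illustration under assumptions: the function name is hypothetical, and it treats both an exception and an empty result as reasons to fall through, which the plan does not specify.

```python
from typing import Callable, List, Sequence, Tuple

SearchFn = Callable[[str], List[str]]


def search_with_fallback(
    query: str, strategies: Sequence[Tuple[str, SearchFn]]
) -> Tuple[str, List[str]]:
    """Try each strategy in order; return (strategy name, hits) for the first with hits."""
    for name, search in strategies:
        try:
            hits = search(query)
        except Exception:
            continue  # this tier is unavailable; fall through to the next one
        if hits:
            return name, hits
    return "none", []  # every tier failed or returned nothing
```

Returning the name of the tier that answered makes it possible to log how often queries degrade from kNN to the text-based tiers.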
13.4.4. Monitoring and Alerting

- Cluster health checks every 60 seconds
- Index size and document count verification daily
- Search latency monitoring with percentile alerts
- Automated snapshot verification weekly
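
A percentile alert on search latency could be computed along these lines. The sketch uses the nearest-rank percentile definition; the function names and any thresholds passed in are hypothetical, since this section does not state the latency targets.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alerts(samples_ms: List[float], thresholds_ms: Dict[int, float]) -> Dict[int, float]:
    """Return {percentile: observed value} for each percentile above its threshold."""
    alerts = {}
    for pct, limit in thresholds_ms.items():
        observed = percentile(samples_ms, pct)
        if observed > limit:
            alerts[pct] = observed
    return alerts
```

For example, `latency_alerts(samples, {50: 60, 95: 90})` reports only the percentiles whose observed latency exceeds the given limit, so a healthy median does not mask a slow tail.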