13. Risk Assessment

13.1. Technical Risks

Risk | Probability | Impact | Mitigation
Epitran fails for rare language | High | Low | Affects training data coverage only; model generalises
Siamese BiLSTM produces poor embeddings | Low | High | Versioned deployment; A/B evaluation; rollback capability
Production ES memory exhaustion | Low | Critical | Monitor heap usage; tune HNSW parameters; scale vertically
Staging ES runs out of space | Low | Medium | Monitor during indexing; NVMe scratch provides ~870 GB
Staging job terminates before snapshot | Medium | High | Implement checkpoint snapshots; monitor job time limits
Index corruption during reindex | Low | High | Staging isolation; snapshot before restore; validation checks
Storage exhaustion on /ix3 | Medium | Critical | Monitor disk usage; alert at 80%; archive old snapshots
Query latency exceeds targets | Medium | Medium | Cache frequent queries; tune HNSW; staging allows tuning without production impact
Snapshot transfer too slow | Low | Medium | Schedule during off-peak; consider incremental snapshots
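
The 80% disk-usage alert listed for /ix3 could be checked along the following lines. This is a minimal sketch, not the project's monitoring code: the function names are hypothetical, and only the 80% threshold comes from the table.

```python
import shutil

ALERT_THRESHOLD = 0.80  # alert level stated in the risk table


def usage_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the used share of a volume as a fraction between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def should_alert(total_bytes: int, used_bytes: int,
                 threshold: float = ALERT_THRESHOLD) -> bool:
    """True once usage reaches or exceeds the alert threshold."""
    return usage_fraction(total_bytes, used_bytes) >= threshold


def check_path(path: str) -> bool:
    """Apply the threshold to a mounted path, e.g. the /ix3 mount point."""
    usage = shutil.disk_usage(path)
    return should_alert(usage.total, usage.used)
```

Keeping the threshold logic separate from the filesystem call makes it testable without access to the real volume.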

13.2. Operational Risks

Risk | Probability | Impact | Mitigation
Authority source unavailable | Medium | Low | Cache downloaded files; retry with backoff; multiple mirrors
Siamese model training fails to converge | Low | Medium | Checkpoint frequently; early stopping; hyperparameter search
Snapshot restore fails | Low | High | Test restores regularly; maintain multiple snapshot generations
/ix3 storage system failure | Low | Critical | Snapshot to /ix1; document recovery procedure
Slurm staging unavailable | Medium | Medium | Queue jobs during available windows; document job specifications
Staging/production version mismatch | Low | Medium | Document ES versions; test compatibility; version indices
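
The "retry with backoff" mitigation for unavailable authority sources might look like the sketch below. The function name, the attempt count, and the doubling delay are assumptions chosen for illustration; the source does not specify them.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fetch: Callable[[], T],
                       attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Only OSError (which covers most network and file errors) is retried;
    the last failure is re-raised so the caller can fall back to a cached
    file or another mirror.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

The sleep function is injectable so the delay schedule can be verified without waiting.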

13.3. Data Quality Risks

Risk | Probability | Impact | Mitigation
Duplicate places across authorities | High | Medium | Deduplication via relations; clustering algorithms
Training data clustering errors | Medium | Medium | Iterative refinement; manual spot-checks; hold-out validation
Stale authority data | Medium | Low | Scheduled refresh; track source update dates
Inconsistent language tagging | High | Medium | Normalise on ingest; validate against ISO 639
WHG contribution data quality issues | Medium | Medium | Validation on export; schema enforcement; review workflow
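
Normalising language tags on ingest could be sketched as follows. The choice of three-letter ISO 639-3 codes as the target form is an assumption, as is the function name; the lookup table is a tiny illustrative subset, and a real ingest would load the full ISO 639 code lists.

```python
from typing import Optional

# Illustrative subset only (assumption): two-letter to three-letter codes.
ISO_639_1_TO_3 = {"en": "eng", "fr": "fra", "de": "deu", "ar": "ara", "zh": "zho"}
KNOWN_639_3 = set(ISO_639_1_TO_3.values())


def normalise_language_tag(raw: str) -> Optional[str]:
    """Map a loosely formatted tag to a three-letter code, or None if unknown.

    Case, surrounding whitespace, and any script or region subtag
    ("en-GB", "zh_Hant") are discarded before the lookup.
    """
    primary = raw.strip().lower().replace("_", "-").split("-")[0]
    if primary in ISO_639_1_TO_3:
        return ISO_639_1_TO_3[primary]
    if primary in KNOWN_639_3:
        return primary
    return None  # caller decides whether to reject or flag the record
```

Returning None rather than raising lets the ingest step count and report unrecognised tags instead of aborting.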

13.4. Mitigation Strategies

13.4.1. Versioned Deployments

All index updates use versioned indices with alias switching:

  1. Create places_v{N+1}, toponyms_v{N+1}

  2. Populate and validate new indices

  3. Switch aliases atomically

  4. Retain previous version for rollback (7 days)

  5. Delete old indices after confirmation
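
Step 3 can be illustrated by building the request body for the Elasticsearch `_aliases` endpoint, which applies all actions in one request atomically. This is a sketch under assumptions: the alias names `places` and `toponyms` and the version numbers are illustrative, and the helper name is hypothetical.

```python
import json
from typing import Dict, List


def alias_switch_actions(aliases: List[str],
                         old_version: int,
                         new_version: int) -> Dict[str, list]:
    """Build one _aliases body that moves every alias to the new index version."""
    actions = []
    for alias in aliases:
        # Detach the alias from the old index and attach it to the new one.
        actions.append({"remove": {"index": f"{alias}_v{old_version}", "alias": alias}})
        actions.append({"add": {"index": f"{alias}_v{new_version}", "alias": alias}})
    return {"actions": actions}


# Assumed alias names and versions, for illustration only.
body = alias_switch_actions(["places", "toponyms"], old_version=3, new_version=4)
print(json.dumps(body, indent=2))
```

The resulting body would be sent as `POST /_aliases`. Rolling back within the retention window is the same call with the version arguments swapped.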

13.4.2. Snapshot Strategy

Type | Frequency | Retention | Purpose
Daily | Automatic | 7 days | Quick recovery
Weekly | Automatic | 4 weeks | Medium-term rollback
Pre-deployment | Before alias switch | 2 versions | Deployment rollback
Monthly | Manual | 6 months | Archive
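
The retention rules in the table can be expressed as a small pruning function. This is a sketch, not the deployed policy: the `Snapshot` type and the function name are hypothetical, and "6 months" is approximated as 183 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Snapshot:
    name: str
    kind: str      # "daily", "weekly", "monthly" or "pre-deployment"
    created: date


# Age-based retention from the table; 183 days stands in for "6 months".
MAX_AGE = {
    "daily": timedelta(days=7),
    "weekly": timedelta(weeks=4),
    "monthly": timedelta(days=183),
}
KEEP_PRE_DEPLOYMENT = 2  # "2 versions"


def expired(snapshots: List[Snapshot], today: date) -> List[Snapshot]:
    """Return the snapshots that fall outside their retention rule."""
    out = [s for s in snapshots
           if s.kind in MAX_AGE and today - s.created > MAX_AGE[s.kind]]
    # Pre-deployment snapshots are kept by count, newest first.
    pre = sorted((s for s in snapshots if s.kind == "pre-deployment"),
                 key=lambda s: s.created, reverse=True)
    out.extend(pre[KEEP_PRE_DEPLOYMENT:])
    return out
```

Snapshots of an unrecognised kind are never reported as expired, so nothing is deleted by accident.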

13.4.3. Graceful Degradation

Search always returns results by working through a fallback chain, from best to baseline quality:

  1. Vector kNN search (best quality)

  2. Fuzzy text search (good quality)

  3. Exact text search (baseline)
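
The chain above could be driven by a loop such as the following sketch. It assumes that a strategy is skipped both when it raises an error and when it returns no hits; the source does not say which condition triggers the fallback. The function name and the stub strategies are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Search = Callable[[str], List[str]]


def search_with_fallback(query: str,
                         strategies: Sequence[Tuple[str, Search]]
                         ) -> Tuple[str, List[str]]:
    """Try each strategy in order; return the first non-empty result set.

    The name of the strategy that answered is returned with the hits so
    that degraded responses can be logged or flagged to the client.
    """
    for name, strategy in strategies:
        try:
            hits = strategy(query)
        except Exception:
            continue  # assumption: any failure moves on to the next tier
        if hits:
            return name, hits
    return "none", []


# Stub strategies standing in for the real Elasticsearch queries.
def knn_search(query: str) -> List[str]:
    raise TimeoutError("vector search unavailable")

def fuzzy_search(query: str) -> List[str]:
    return []

def exact_search(query: str) -> List[str]:
    return [query]

chain = [("knn", knn_search), ("fuzzy", fuzzy_search), ("exact", exact_search)]
```

With these stubs, the kNN tier fails, the fuzzy tier finds nothing, and the exact tier answers.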

13.4.4. Monitoring and Alerting

  • Cluster health checks every 60 seconds

  • Index size and document count verification daily

  • Search latency monitoring with percentile alerts

  • Automated snapshot verification weekly
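
A percentile-based latency alert might be computed as in this sketch. It uses the nearest-rank percentile definition; the function names are hypothetical, and any limits passed in are placeholders because the source does not state the latency targets.

```python
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def latency_alerts(samples_ms: Sequence[float],
                   limits_ms: Dict[int, float]) -> List[str]:
    """Return labels such as "p95" for every percentile above its limit."""
    return [f"p{pct}" for pct, limit in limits_ms.items()
            if percentile(samples_ms, pct) > limit]
```

Alerting on percentiles rather than the mean keeps a small number of slow queries from being hidden by many fast ones.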