7. Push-Based Synchronisation Strategy
7.1. Rationale
Pitt CRC cannot accept inbound connections. All updates must be initiated from Pitt via outbound HTTPS.
7.2. Bulk Update Workflow
1. Pitt generates IPA + embeddings (batch of 100k records)
↓
2. Prepare Elasticsearch _bulk API payload:
- NDJSON format
- Include document version numbers
- Split into chunks (5k docs per request)
↓
3. HTTP POST to DigitalOcean Elasticsearch:
- Endpoint: https://whg-es.example.com/_bulk
- Authentication: API key in header
- Timeout: 60s per request
↓
4. Handle responses (see Section 7.4)
↓
5. Verify (see Section 7.5)
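Step 2 of the workflow can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name to_bulk_chunks and the record keys "id", "version" and "source" are hypothetical, and external versioning (version plus version_type in the action line) is only one possible reading of "include document version numbers". With the default chunk size, a batch of 100k records becomes 20 requests.

```python
import json
from typing import Dict, Iterable, Iterator, List


def to_bulk_chunks(docs: Iterable[Dict], index: str,
                   chunk_size: int = 5000) -> Iterator[str]:
    """Yield NDJSON _bulk payloads holding at most chunk_size documents each."""
    lines: List[str] = []
    count = 0
    for doc in docs:
        # Action line: target index, document id and (assumed) external version
        action = {"index": {"_index": index,
                            "_id": doc["id"],
                            "version": doc["version"],
                            "version_type": "external"}}
        lines.append(json.dumps(action))
        # Source line: the document body itself
        lines.append(json.dumps(doc["source"]))
        count += 1
        if count == chunk_size:
            # The _bulk API requires a trailing newline after the last line
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"
```

Each yielded string can be sent as the body of one POST to the _bulk endpoint.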
7.3. Authentication & Security
API Key: Dedicated Elasticsearch API key with restricted permissions:
Write access to ipa_index, toponym_index only.
No delete permissions.
Rate limit: 100 req/min.
Network: Whitelist Pitt CRC outbound IP range on Elasticsearch firewall.
Audit log: All _bulk operations logged in Elasticsearch audit trail.
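As an illustration of the restricted key, the sketch below builds a request body that could be sent to Elasticsearch's create API key endpoint (POST /_security/api_key). The key name, the role name pitt_push_writer and the helper function are hypothetical. It assumes that the "index" privilege permits indexing and updating documents but not deleting them, which should be checked against the privilege list of the deployed Elasticsearch version. The rate limit is not part of the key and would need to be enforced elsewhere.

```python
import json
from typing import Dict, Iterable


def restricted_api_key_body(name: str, indices: Iterable[str]) -> Dict:
    """Build a create-API-key request body limited to the given indices."""
    return {
        "name": name,
        "role_descriptors": {
            # Hypothetical role name; any label works here
            "pitt_push_writer": {
                "indices": [
                    {
                        "names": list(indices),
                        # Assumption: "index" allows index/update but not delete
                        "privileges": ["index"],
                    }
                ]
            }
        },
    }


body = restricted_api_key_body("pitt-bulk-push", ["ipa_index", "toponym_index"])
print(json.dumps(body, indent=2))
```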
7.4. Resilience Strategy
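Problem: Network failures, rate limits, or partially applied bulk updates can leave the index incomplete or inconsistent.
Solution: Retry with exponential backoff, queue failed documents for a later attempt, and record a checksum for every batch.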
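# Sketch of the Pitt bulk push script. ES_URL, API_KEY, ndjson_format and the
# log_*, enqueue_retry and alert_admin helpers are assumed to be defined elsewhere.
import hashlib
import json
import time
from uuid import uuid4

import requests


def push_bulk_update(docs, max_retries=3):
    batch_id = str(uuid4())
    # Checksum of the whole batch, recorded on success for later verification
    checksum = hashlib.sha256(
        json.dumps(docs, sort_keys=True).encode("utf-8")
    ).hexdigest()
    for attempt in range(max_retries):
        try:
            response = requests.post(
                f"{ES_URL}/_bulk",
                headers={
                    "Authorization": f"ApiKey {API_KEY}",
                    "Content-Type": "application/x-ndjson",
                },
                data=ndjson_format(docs),
                timeout=60,
            )
            if response.status_code == 429:  # Rate limit
                time.sleep(2 ** attempt)  # Exponential backoff
                continue
            response.raise_for_status()  # Treat other HTTP errors as failures
            result = response.json()
            # Check for partial failures (each item is keyed by its action type)
            failed_docs = [
                item for item in result["items"]
                if next(iter(item.values()))["status"] >= 400
            ]
            if failed_docs:
                log_failures(batch_id, failed_docs)
                # Hand failures to the retry queue and stop here, so documents
                # that already succeeded are not re-sent with the whole batch
                enqueue_retry(failed_docs, delay=2 ** attempt * 60)
                return False
            log_success(batch_id, checksum, len(docs))
            return True
        except requests.exceptions.RequestException as exc:
            # Covers timeouts, connection errors and HTTP error statuses
            log_error(f"Request failed on attempt {attempt + 1}: {exc}")
            time.sleep(2 ** attempt)
    # All retries failed
    alert_admin(batch_id, docs)
    return False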
Retry queue: Separate persistent queue (SQLite or Redis) for failed documents.
Monitoring: Grafana dashboard tracking:
Successful/failed bulk operations per hour.
Average latency per batch size.
Retry queue depth.
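The persistent retry queue mentioned above could look roughly like this if SQLite is chosen. This is a sketch under assumptions: the class name RetryQueue, the table layout and the method names are all hypothetical, and a production version would also need locking if several workers share the file. The depth() method returns the value the "Retry queue depth" dashboard panel would plot.

```python
import json
import sqlite3
import time


class RetryQueue:
    """Minimal SQLite-backed queue of failed documents with a per-item delay."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS retry_queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "attempts INTEGER NOT NULL DEFAULT 0, "
            "not_before REAL NOT NULL)"
        )
        self.conn.commit()

    def enqueue(self, doc, delay=0.0, attempts=0, now=None):
        """Store a document that becomes eligible for retry after delay seconds."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT INTO retry_queue (payload, attempts, not_before) "
            "VALUES (?, ?, ?)",
            (json.dumps(doc), attempts, now + delay),
        )
        self.conn.commit()

    def pop_due(self, limit=100, now=None):
        """Remove and return up to limit (doc, attempts) pairs that are due."""
        now = time.time() if now is None else now
        rows = self.conn.execute(
            "SELECT id, payload, attempts FROM retry_queue "
            "WHERE not_before <= ? ORDER BY not_before LIMIT ?",
            (now, limit),
        ).fetchall()
        self.conn.executemany(
            "DELETE FROM retry_queue WHERE id = ?", [(row[0],) for row in rows]
        )
        self.conn.commit()
        return [(json.loads(payload), attempts) for _, payload, attempts in rows]

    def depth(self):
        """Number of documents still waiting in the queue."""
        return self.conn.execute("SELECT COUNT(*) FROM retry_queue").fetchone()[0]
```

Passing a file path instead of ":memory:" makes the queue survive restarts of the push script.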
7.5. Verification Strategy
After each bulk push:
Count verification:
curl "$ES_URL/ipa_index/_count?q=embedding_version:v4_20251201"
Compare with expected count from Pitt.
Sample verification:
Randomly sample 100 ipa_id values.
Retrieve from Elasticsearch.
Compare checksums with Pitt source data.
Timestamp check:
Query for documents with last_updated < expected_date.
Alert if any found (indicates missed updates).
Store metadata:
// Pitt maintains sync state
{
  "batch_id": "uuid",
  "timestamp": "2025-11-17T10:30:00Z",
  "embedding_version": "v4_20251201",
  "doc_count": 1234567,
  "checksum": "sha256_hash",
  "status": "verified"
}
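The sample verification step can be sketched as a pure comparison, leaving the actual retrieval from Elasticsearch (for example through a multi-get request) to the caller. The function names and the shape of the inputs (dictionaries mapping ipa_id to the document body) are assumptions made for illustration. Checksums are computed over a canonical JSON serialisation so that key order does not matter. Floating-point embeddings may need rounding or a separate comparison if they do not round-trip exactly.

```python
import hashlib
import json
import random


def doc_checksum(doc):
    """SHA-256 of a canonical (sorted-key, compact) JSON form of the document."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pick_sample(ids, k=100, seed=None):
    """Choose up to k ids at random; a seed makes the sample reproducible."""
    ids = list(ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def find_mismatches(source_docs, es_docs, sample_ids):
    """Return sampled ids that are missing remotely or whose checksum differs."""
    mismatched = []
    for doc_id in sample_ids:
        remote = es_docs.get(doc_id)
        if remote is None or doc_checksum(remote) != doc_checksum(source_docs[doc_id]):
            mismatched.append(doc_id)
    return mismatched
```

An empty result from find_mismatches is what would allow the sync state record to be marked "verified".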