5. Data Flow¶
5.2. WHG-Contributed Dataset Ingestion¶
Scholarly datasets contributed by WHG users follow a similar staged workflow, but with on-the-fly embedding generation due to smaller volumes.
5.2.1. Pipeline¶
1. Export dataset from WHG Django/PostgreSQL
↓
2. Convert to canonical JSON format
- Same schema as authority ingestion
- Preserve dataset provenance and attribution
↓
3. Extract unique toponyms from dataset
- Deduplicate by name@lang key (see the sketch after this list)
- Check which toponyms already exist in the index
↓
4. For each new unique toponym:
- Generate embedding via Siamese BiLSTM model
↓
5. Index places and new toponyms to staging or production
↓
6. Validate dataset integrity
↓
7. Create snapshot (if staging on Slurm worker)
↓
8. Restore to production / switch aliases
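The deduplication in step 3 can be sketched as follows. This is a minimal illustration, assuming each place record carries a list of name variants with a language tag; the field names, the 'und' fallback language, and the use of the name@lang key as the toponym ID are assumptions for illustration, not the actual WHG schema.

def unique_toponyms(places):
    """Collect one candidate toponym per name@lang key across a dataset."""
    seen = {}
    for place in places:
        # 'names' and 'lang' are illustrative field names, not the WHG schema
        for variant in place.get('names', []):
            key = f"{variant['name']}@{variant.get('lang', 'und')}"
            if key not in seen:
                seen[key] = {
                    'toponym_id': key,  # assumes the name@lang key doubles as the ID
                    'name': variant['name'],
                    'lang': variant.get('lang', 'und'),
                }
    return list(seen.values())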
5.2.2. Rationale for Flexible Staging¶
WHG-contributed datasets are much smaller than the authority datasets (~200k places) and arrive incrementally. Depending on volume:
Small batches: Can be indexed directly on the VM during low-usage periods
Larger contributions: Use Slurm staging to avoid production impact
New toponyms from contributions are checked against the existing toponyms index; only genuinely new name@lang combinations require embedding generation.
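The existence check can be a simple document lookup against the toponyms index. The sketch below assumes the elasticsearch-py client, a 'toponyms' index name, and documents keyed by the name@lang identifier; none of these details are confirmed by this document.

from elasticsearch import Elasticsearch

def toponym_exists(es_client: Elasticsearch, toponym_id: str) -> bool:
    """Return True if a toponym document is already present in the index.

    The 'toponyms' index name and the use of the name@lang key as the
    document _id are assumptions for illustration.
    """
    return bool(es_client.exists(index='toponyms', id=toponym_id))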
5.3. Embedding Generation¶
Siamese BiLSTM embeddings are generated for unique toponyms only, prior to snapshot transfer.
5.3.2. Incremental Generation (WHG Contributions)¶
Run during ingestion for new unique toponyms only:
# Check which toponyms are genuinely new
new_toponyms = [
    t for t in dataset_toponyms
    if not toponym_exists(es_client, t['toponym_id'])
]

# Generate embeddings only for new toponyms
for toponym in new_toponyms:
    embedding = siamese_bilstm_model.embed(toponym['name'])
    toponym['embedding_bilstm'] = embedding
    index_toponym(es_client, toponym)
This deduplication means a contribution of 10,000 places might require embedding generation for only a few hundred genuinely new name@lang combinations.
The same trained Siamese BiLSTM model is deployed to both staging (Slurm worker) and production (VM), ensuring consistent embeddings across the corpus.
5.4. Incremental Updates¶
For new dataset ingestion after initial population:
1. Ingest new places/toponyms to versioned indices
2. Generate embeddings for new unique toponyms only
3. Merge with existing data or create a new index version
4. Switch aliases when ready
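The alias switch in the last step can be done atomically. A minimal sketch, assuming the elasticsearch-py 8.x client and illustrative index and alias names:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')  # connection details are illustrative

# Detach the alias from the old index version and attach it to the new one
# in a single request, so queries never see a half-switched state.
es.indices.update_aliases(actions=[
    {'remove': {'index': 'places_v1', 'alias': 'places'}},
    {'add': {'index': 'places_v2', 'alias': 'places'}},
])

Because both actions are applied in one request, the switch is effectively instantaneous from a reader's point of view, and rolling back is just the reverse pair of actions.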
Quarterly full re-embedding ensures consistency across the corpus as the model improves.