2. Components¶
2.1. Online Components (DigitalOcean)¶
Django: Provides application logic and API.
PostgreSQL/PostGIS: Stores Places, geometry, and relational structure.
Elasticsearch:
place_index(existing)toponym_index(new, handles many-to-many Place↔Toponym)ipa_index(new, deduplicated IPA forms)
Real-time phonetic query system:
Epitran-based G2P converter for query IPA generation.
Lightweight inference model (quantized/distilled version of trained Siamese BiLSTM) for generating query embeddings on-the-fly.
PanPhon for phonetic feature extraction.
2.2. Offline Components (Pitt CRC)¶
Access to full Wikidata and Geonames datasets.
Virtually unlimited compute for:
Bulk G2P → IPA transcription using Epitran.
PanPhon feature extraction and normalisation.
Training and retraining of Siamese BiLSTM models.
Generating phonetic embeddings.
Outbound-only HTTP access for pushing bulk updates to Elasticsearch.
2.3. Rationale for Separate Indices¶
toponym_index: New index to handle many-to-many relationships (one toponym → many places; one place → many toponyms). This is critical because “Springfield” maps to 50+ distinct places.ipa_index: Global deduplication layer. Stores each unique IPA form once, dramatically reducing embedding storage and computation costs.place_index: Core gazetteer records with geometry and metadata (existing).