1.8. Query Pipeline
1.8.1. Search Strategy
Phonetic search uses a multi-stage approach with graceful fallback:
1. User input: "Springfeld" (typo or variant spelling)
↓
2. Language detection (optional):
- Use UI locale hint
- Or attempt automatic detection
- Default: treat as multilingual
↓
3. Generate query embedding:
- Siamese BiLSTM model encodes query string directly
- Same model used for indexing (consistency)
- No IPA conversion required at query time
- Inference latency target: <10ms
↓
4. Elasticsearch multi-stage search on toponyms index:
Stage A: Vector kNN search (primary)
- Index: toponyms
- Field: embedding_bilstm
- k: 100 candidates
- Similarity: cosine
Stage B: Text search (fallback/boost)
- Fuzzy match on name field
- Edge n-gram on name for prefix matching
- Exact match on name_lower for precise queries
Stage C: Completion suggest (autocomplete)
- Type-ahead suggestions with language context
↓
5. Score combination:
- Weighted blend: 0.7 × vector_score + 0.3 × text_score
- Return top-scoring toponym_ids (name@lang)
↓
6. Join to places:
- Query places index for documents referencing matched toponyms
- Apply geographic/temporal filters if specified
- Deduplicate and rank by best toponym match score
↓
7. Return ranked place results with matched toponym and confidence scores
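Steps 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `ToponymHit` type, its `place_id` field, and the assumption that both scores are already normalised to [0, 1] are hypothetical. Only the 0.7 / 0.3 weights and the "best toponym per place" rule come from the steps above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

VECTOR_WEIGHT = 0.7  # weights from step 5
TEXT_WEIGHT = 0.3


@dataclass(frozen=True)
class ToponymHit:
    """Hypothetical container for one matched toponym."""
    toponym_id: str      # "name@lang", as in step 5
    place_id: str        # assumed to be known after the join in step 6
    vector_score: float  # assumed normalised to [0, 1]
    text_score: float    # assumed normalised to [0, 1]


def blend(hit: ToponymHit) -> float:
    """Weighted blend of the vector and text scores (step 5)."""
    return VECTOR_WEIGHT * hit.vector_score + TEXT_WEIGHT * hit.text_score


def rank_places(hits: List[ToponymHit]) -> List[Tuple[str, str, float]]:
    """Keep each place's best-scoring toponym, then sort descending (step 6)."""
    best: Dict[str, Tuple[str, float]] = {}
    for hit in hits:
        score = blend(hit)
        if hit.place_id not in best or score > best[hit.place_id][1]:
            best[hit.place_id] = (hit.toponym_id, score)
    ranked = [(pid, tid, score) for pid, (tid, score) in best.items()]
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked
```

Each returned tuple is `(place_id, matched toponym_id, blended score)`, which corresponds to the "matched toponym and confidence scores" of step 7.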
1.8.2. Elasticsearch Query Structure
1.8.2.1. Vector Search
{
"knn": {
"field": "embedding_bilstm",
"query_vector": [0.12, -0.34, "..."],
"k": 100,
"num_candidates": 500
},
"_source": ["toponym_id", "name", "lang"]
}
1.8.2.2. Hybrid Search (Vector + Text)
{
"query": {
"bool": {
"should": [
{
"knn": {
"field": "embedding_bilstm",
"query_vector": [0.12, -0.34, "..."],
"k": 50,
"boost": 0.7
}
},
{
"match": {
"name": {
"query": "Springfeld",
"fuzziness": "AUTO",
"boost": 0.3
}
}
}
]
}
}
}
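For callers that assemble this request in code, the body above could be built along these lines. The function name and its parameters are illustrative assumptions. The field names, fuzziness setting, and default weights are copied from the JSON above.

```python
from typing import Any, Dict, Sequence


def build_hybrid_query(
    query_text: str,
    query_vector: Sequence[float],
    k: int = 50,
    vector_weight: float = 0.7,
    text_weight: float = 0.3,
) -> Dict[str, Any]:
    """Return a request body mirroring the hybrid search example."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "knn": {
                            "field": "embedding_bilstm",
                            "query_vector": list(query_vector),
                            "k": k,
                            "boost": vector_weight,
                        }
                    },
                    {
                        "match": {
                            "name": {
                                "query": query_text,
                                "fuzziness": "AUTO",
                                "boost": text_weight,
                            }
                        }
                    },
                ]
            }
        }
    }
```

The returned dictionary is plain data, so it can be serialised with `json.dumps` or passed to whichever Elasticsearch client the service uses.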
1.8.2.3. Completion Suggest
{
"suggest": {
"toponym-suggest": {
"prefix": "Lond",
"completion": {
"field": "suggest",
"size": 10,
"contexts": {
"lang": ["en"]
}
}
}
}
}
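Elasticsearch returns completion suggestions under `suggest.<name>[].options[]`, so the suggestion texts can be pulled out with a small helper like the one below. The helper name is an assumption, and the sample response used when trying it out would be hand-written rather than real output.

```python
from typing import Any, Dict, List


def suggestion_texts(
    response: Dict[str, Any], name: str = "toponym-suggest"
) -> List[str]:
    """Collect the suggested strings, in the order Elasticsearch returned them."""
    texts: List[str] = []
    for entry in response.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            texts.append(option["text"])
    return texts
```

A response with no `suggest` section, for example after a timeout, simply yields an empty list.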
1.8.3. Spatial Containment Filtering
POST /api/search and POST /api/reconcile can filter results to places that
fall inside a region. The region is defined either by a set of existing
place_ids (contained_in — e.g. a country un:ita, or any administrative
polygon such as osm:r…) or by raw GeoJSON (bounds). This powers the Atlas
Area filter and augments the WHG Reconciliation API. The query string is
optional: omit it for a pure-spatial query (“everything in this region”),
or combine it with a query for text + spatial.
| Field | Values | Meaning |
|---|---|---|
| contained_in | place_ids | Region = union of these places’ geometries |
| bounds | GeoJSON | Region from raw geometry |
| | fuzzy / exact | H3-cell test vs true Shapely geometry |
| relation | intersects / within | Any overlap vs candidate fully inside |
1.8.3.1. Two-phase design (H3 prefilter → exact refine)
The filter requires no Elasticsearch reindex; it reuses fields already
indexed per geometry: repr_point (a representative point guaranteed to lie
within the geometry), h3_cover (a compacted, multi-resolution set of Uber
H3 cells covering the geometry’s extent), and bounds.
Full polygons are not held in Elasticsearch; they live in the geometry
store and are read on demand only for exact tests.
1. Resolve region: build the region’s H3 cover from the contained_in places’ own h3_cover (cached across requests; the Atlas pattern reuses the same country repeatedly).
2. Gather candidates: run an Elasticsearch pre-filter that matches repr_point intersecting the region bounding box OR h3_cover matching the region’s cells (the latter recovers large polygons that overlap the region far from their representative point).
3. Refine:
   - fuzzy: test each candidate’s H3 cells for membership in the region cover. Cheap and tolerant (precision is bounded by the coarsest region cell).
   - exact: test the candidate’s real polygon (loaded from the geometry store) against the region with Shapely. Because repr_point is guaranteed within the geometry, a representative point inside the region already proves intersects, so the expensive polygon load runs only on the remainder.
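The fuzzy membership test on multi-resolution covers can be illustrated as follows. This sketch deliberately avoids the real H3 library: cells are toy strings whose parent is the string minus its last character, and `toy_parent`, `strict_ancestors`, and `fuzzy_intersects` are hypothetical names. A real implementation would use the H3 library's parent lookup instead.

```python
from typing import Callable, Iterable, Iterator, Optional, Set

ParentFn = Callable[[str], Optional[str]]


def toy_parent(cell: str) -> Optional[str]:
    """Toy hierarchy, NOT real H3 indexing: the parent of 'a12' is 'a1'."""
    return cell[:-1] if len(cell) > 1 else None


def self_and_ancestors(cell: str, parent: ParentFn) -> Iterator[str]:
    """Yield the cell itself, then each coarser ancestor up to the root."""
    current: Optional[str] = cell
    while current is not None:
        yield current
        current = parent(current)


def strict_ancestors(cover: Iterable[str], parent: ParentFn) -> Set[str]:
    """All cells strictly coarser than some cell in the cover.

    Computed once per region, so it can be cached alongside the region cover.
    """
    found: Set[str] = set()
    for cell in cover:
        for ancestor in self_and_ancestors(cell, parent):
            if ancestor != cell:
                found.add(ancestor)
    return found


def fuzzy_intersects(
    candidate_cover: Iterable[str],
    region_cover: Set[str],
    region_ancestors: Set[str],
    parent: ParentFn,
) -> bool:
    """True if any candidate cell overlaps any region cell."""
    for cell in candidate_cover:
        # Candidate cell is coarser than a region cell that it contains.
        if cell in region_ancestors:
            return True
        # Candidate cell equals a region cell or lies inside one.
        if any(a in region_cover for a in self_and_ancestors(cell, parent)):
            return True
    return False
```

Because both covers are compacted to mixed resolutions, the check has to walk the hierarchy in both directions. A plain set intersection would miss a fine candidate cell that sits inside a coarse region cell.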
1.8.3.1.1. intersects vs within (fully-contained)
relation chooses between any overlap (intersects, the default) and
fully contained (within — the candidate’s whole geometry must lie inside
the region; a polygon straddling the border is excluded).
For point places the two are identical (a point has no extent to straddle).
- exact + within is the precise, reliable fully-contained test (true polygon-in-polygon). It applies the same repr_point shortcut in reverse: a representative point outside the region proves the place cannot be contained, so those candidates are rejected without loading their polygon, and the costly geometry load runs only on the genuine candidates.
- fuzzy + within is an approximation (H3 cells only, no polygon load): it is conservative and resolution-bounded, so for guaranteed full containment prefer exact.
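The repr_point shortcut for both relations can be sketched as one loop. All names here are hypothetical, and the three callables stand in for the real point test, the geometry-store read, and the Shapely predicate. The point is only to show when the expensive load is skipped.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

G = TypeVar("G")  # whatever geometry type the store returns
Point = Tuple[float, float]


def refine_exact(
    candidates: Iterable[Tuple[str, Point]],
    relation: str,
    point_in_region: Callable[[Point], bool],
    load_geometry: Callable[[str], G],
    geometry_matches: Callable[[G, str], bool],
) -> List[str]:
    """Return the place ids that satisfy `relation` against the region.

    `candidates` yields (place_id, repr_point) pairs from the prefilter.
    """
    if relation not in ("intersects", "within"):
        raise ValueError(f"unknown relation: {relation!r}")

    kept: List[str] = []
    for place_id, repr_point in candidates:
        inside = point_in_region(repr_point)
        if relation == "intersects" and inside:
            # repr_point lies in the geometry, so overlap is already proven.
            kept.append(place_id)
            continue
        if relation == "within" and not inside:
            # Part of the geometry is outside, so it cannot be contained.
            continue
        # Only the undecided remainder pays for the geometry load.
        if geometry_matches(load_geometry(place_id), relation):
            kept.append(place_id)
    return kept
```

For point places the geometry is the representative point itself, so `geometry_matches` agrees with the point test and both relations return the same result.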
1.8.3.2. Why H3? (performance)
H3 is used as a cheap prefilter so the expensive exact test runs on as few candidates as possible. Per-candidate cost, measured on a country-scale polygon (~2,000 vertices):
| Operation | Cost / candidate |
|---|---|
| H3 fuzzy cell test (membership) | ~5 µs |
| Shapely prepared point-in-polygon | ~13 µs |
| Building one polygon into Shapely (what exact must pay per polygon) | ~thousands of µs |
So H3 is only ~3× faster than a prepared point test on the operation alone —
but the real win is that fuzzy reuses the small h3_cover already returned by
Elasticsearch and touches no geometry I/O, whereas exact must fetch and
parse each candidate’s polygon. For polygon-bearing candidates that makes H3
roughly two orders of magnitude faster, and at scale the difference is
sub-millisecond filtering versus seconds. The exact pass is reserved for when
true geometric precision is required, run only on the prefilter’s survivors.
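The arithmetic behind these ratios can be checked with a back-of-envelope calculation. The 5 µs and 13 µs figures come from the table. The 2,000 µs polygon-build cost is an assumed stand-in for "~thousands of µs", and the candidate count of 100 is an arbitrary illustration.

```python
FUZZY_US = 5              # H3 cell membership test, from the table
PREPARED_POINT_US = 13    # Shapely prepared point-in-polygon, from the table
POLYGON_BUILD_US = 2_000  # ASSUMPTION: stand-in for "~thousands of µs"


def total_ms(per_candidate_us: int, candidates: int) -> float:
    """Total filtering time in milliseconds for a batch of candidates."""
    return per_candidate_us * candidates / 1_000


point_ratio = PREPARED_POINT_US / FUZZY_US   # 2.6, i.e. the "~3x" above
polygon_ratio = POLYGON_BUILD_US / FUZZY_US  # 400.0 under the assumed cost

fuzzy_100 = total_ms(FUZZY_US, 100)          # 0.5 ms
exact_100 = total_ms(POLYGON_BUILD_US, 100)  # 200.0 ms
```

Under these assumptions the gap comes almost entirely from building polygons, not from the membership test itself, which is why the exact pass is worth restricting to the prefilter's survivors.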
1.8.4. Error Handling
| Condition | Response |
|---|---|
| Embedding generation fails | Fall back to text-only search |
| No vector results | Expand to fuzzy text search |
| Elasticsearch timeout | Return partial results with warning |
| Empty results | Progressively relax filters (language, geography, time) |
| | |
| Geometry store unavailable (exact) | Exact silently degrades to the fuzzy test |
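The first two rows of the table amount to a short fallback chain, sketched below. The function and its three injected callables are hypothetical placeholders for the encoder, the kNN search, and the fuzzy text search. Timeout handling and filter relaxation are left out.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger(__name__)


def search_with_fallback(
    query: str,
    embed: Callable[[str], Sequence[float]],
    vector_search: Callable[[Sequence[float]], List[dict]],
    text_search: Callable[[str], List[dict]],
) -> Tuple[List[dict], str]:
    """Return (results, mode), where mode is "vector" or "text"."""
    try:
        vector = embed(query)
    except Exception:
        # Broad catch is deliberate in this sketch: any encoder failure
        # should degrade to text search rather than fail the request.
        logger.warning("embedding failed; using text-only search", exc_info=True)
        return text_search(query), "text"

    results = vector_search(vector)
    if not results:
        # No vector results: expand to fuzzy text search.
        return text_search(query), "text"
    return results, "vector"
```

Returning the mode alongside the results lets the caller report which path produced them, for example in a warning field of the response.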
1.8.5. Performance Optimisation
- Query embedding cache: Cache embeddings for frequent queries (LRU, 10k entries)
- Pre-computed embeddings: Top 10k toponyms cached in memory
- Index optimisation: HNSW parameters tuned for recall/latency tradeoff
- Connection pooling: Persistent Elasticsearch connections
- Spatial containment: H3-cell prefilter before exact geometry tests, with resolved regions cached across requests (see Spatial Containment Filtering)
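A query embedding cache of the size described above could be as simple as the standard library's `functools.lru_cache`. In this sketch `_encode` is a hypothetical stand-in for the BiLSTM encoder and produces meaningless numbers. Only the LRU policy and the 10k-entry limit come from the list above.

```python
from functools import lru_cache
from typing import Tuple


def _encode(query: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for the real encoder; values are meaningless."""
    return tuple(float(ord(ch) % 7) for ch in query)


@lru_cache(maxsize=10_000)
def cached_embedding(query: str) -> Tuple[float, ...]:
    """Return the embedding for `query`, encoding it at most once while cached."""
    return _encode(query)
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller cannot accidentally modify the embedding that later callers receive. `cached_embedding.cache_info()` exposes hit and miss counts for monitoring.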