6. Elasticsearch Index Design
6.1. Index Schemas
The phonetic search system uses two indices with a many-to-many relationship: places reference toponyms by their name@lang identifier, and each unique toponym is stored once.
6.1.1. Places Index
See schemas/places.json for full schema.
Key fields for phonetic search:
{
"place_id": "keyword",
"label": "text",
"toponyms": "keyword[]",
"locations": [{
"geometry": "geo_shape",
"rep_point": "geo_point"
}]
}
The toponyms array contains name@lang identifiers (e.g., ["London@en", "Londra@it"]) that reference documents in the toponyms index.
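As a rough illustration of how this join might be followed in both directions, the sketch below builds two plain request bodies: a multi-get of the toponym documents a place references, and a terms query for the places that reference a given toponym. The helper names, the place_id value "p1", and the index names places and toponyms are assumptions made for the example; only the ids form of the multi-get body and the terms query are standard Elasticsearch.

```python
def toponym_mget_body(place_doc):
    """Body for a multi-get (GET /toponyms/_mget) of a place's toponyms."""
    # De-duplicate while preserving order, in case the source data
    # repeats an identifier within one place.
    ids = list(dict.fromkeys(place_doc.get("toponyms", [])))
    return {"ids": ids}


def places_by_toponym_query(toponym_id):
    """Body for a search (GET /places/_search) of places using one toponym."""
    return {"query": {"terms": {"toponyms": [toponym_id]}}}


# Hypothetical place document with the key fields listed above.
place = {
    "place_id": "p1",
    "label": "London",
    "toponyms": ["London@en", "Londra@it"],
}

mget_body = toponym_mget_body(place)
search_body = places_by_toponym_query("London@en")
```

Because the join key is an ordinary keyword value, both directions are simple lookups and no parent/child or nested mapping is needed.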
6.1.2. Toponyms Index
See schemas/toponyms.json for full schema.
Each document represents a unique name@language combination:
{
"toponym_id": "keyword",
"name": "text",
"name_lower": "keyword",
"lang": "keyword",
"embedding_bilstm": {
"type": "dense_vector",
"dims": 128,
"index": true,
"similarity": "cosine"
},
"suggest": {
"type": "completion",
"contexts": [{ "name": "lang", "type": "category" }]
}
}
The toponym_id is the name@lang string (e.g., “London@en”), serving as the document ID and the join key to places.
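To make the document shape concrete, here is a minimal sketch of assembling one fully populated toponym document in Python. The function name build_toponym_doc is hypothetical, the vector is a placeholder rather than a real BiLSTM embedding, and the suggest value assumes the usual completion-suggester layout of an input list plus a contexts object keyed by the context name.

```python
EMBEDDING_DIMS = 128  # matches "dims" in the mapping summary above


def build_toponym_doc(name, lang, embedding):
    """Assemble one toponyms-index document for a name/language pair."""
    if len(embedding) != EMBEDDING_DIMS:
        raise ValueError(f"expected {EMBEDDING_DIMS} dims, got {len(embedding)}")
    return {
        "toponym_id": f"{name}@{lang}",
        "name": name,
        "name_lower": name.lower(),
        "lang": lang,
        "embedding_bilstm": list(embedding),
        # Completion input tagged with its language, so that suggestions
        # can be filtered by the "lang" category context.
        "suggest": {"input": [name], "contexts": {"lang": [lang]}},
    }


# Placeholder vector: non-zero so that cosine similarity is defined.
doc = build_toponym_doc("London", "en", [0.0] * 127 + [1.0])
```

Checking the vector length on the client side surfaces dimension mismatches before the indexing request is sent.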
6.2. Ingest Pipelines
6.2.1. Toponym Pipeline
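The extract_language pipeline parses the name@lang format of toponym_id into the name, name_lower, and lang fields. The script shown does not assign _id itself, so the name@lang string must also be supplied as the document ID at index time: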
{
"description": "Extract language from toponym@lang format",
"processors": [
{
"script": {
"lang": "painless",
"source": "if (ctx.toponym_id != null && ctx.toponym_id.contains('@')) { String[] parts = ctx.toponym_id.splitOnToken('@'); if (parts.length == 2) { ctx.name = parts[0]; ctx.name_lower = parts[0].toLowerCase(); ctx.lang = parts[1]; } }"
}
}
]
}
This enables ingestion of toponyms with ID "London@en", automatically populating:
toponym_id: “London@en” (document ID, join key)
name: “London”
name_lower: “london”
lang: “en”
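The script's behaviour can be approximated outside Elasticsearch, which is handy for checking edge cases before ingestion. The Python function below is a sketch that mirrors the Painless logic on a plain dict; it is not the pipeline itself, and Python's str.lower() may differ from Java's toLowerCase() for a few characters.

```python
def extract_language(doc):
    """Approximate the pipeline's Painless script on a dict-shaped document."""
    toponym_id = doc.get("toponym_id")
    if toponym_id is not None and "@" in toponym_id:
        parts = toponym_id.split("@")
        # Like the script, only act when there is exactly one '@'.
        if len(parts) == 2:
            doc["name"] = parts[0]
            doc["name_lower"] = parts[0].lower()
            doc["lang"] = parts[1]
    return doc


parsed = extract_language({"toponym_id": "London@en"})
skipped = extract_language({"toponym_id": "a@b@c"})   # more than one '@'
untouched = extract_language({"toponym_id": "NoLang"})  # no '@' at all
```

Note the consequence of the length check: an identifier with more than one @ is left without name, name_lower, and lang, so names that themselves contain @ would need separate handling.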
6.3. HNSW Configuration
The Siamese BiLSTM embedding field uses Elasticsearch’s HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest neighbour search.
Default settings (can be tuned):
{
"embedding_bilstm": {
"type": "dense_vector",
"dims": 128,
"index": true,
"similarity": "cosine",
"index_options": {
"type": "hnsw",
"m": 16,
"ef_construction": 100
}
}
}
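For orientation, a query against this field could be built roughly as follows. This is a sketch using the top-level knn option of the Elasticsearch search API; the function name is hypothetical, and the k and num_candidates values are illustrative assumptions rather than settings taken from this system.

```python
def knn_search_body(query_vector, k=10, num_candidates=100):
    """Build a search body for approximate kNN over embedding_bilstm."""
    if len(query_vector) != 128:
        raise ValueError("query vector must have 128 dims, like the mapping")
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": "embedding_bilstm",
            "query_vector": list(query_vector),
            "k": k,  # neighbours returned
            "num_candidates": num_candidates,  # candidates examined per shard
        },
        # Return only the fields needed to identify each toponym.
        "_source": ["toponym_id", "name", "lang"],
    }


# Placeholder query vector; a real one would come from the BiLSTM encoder.
body = knn_search_body([0.1] * 128)
```

Raising num_candidates trades query latency for recall at search time, independently of the build-time m and ef_construction settings.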
Parameter guidance:
m: Graph connectivity (higher = better recall, more memory)
ef_construction: Build-time quality (higher = better index, slower build)
For 80M vectors at 128 dims: expect ~50GB storage for HNSW structure
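The storage figure can be sanity-checked with back-of-envelope arithmetic. The sketch below counts raw vectors plus bottom-layer graph links, assuming 4-byte floats, 4-byte neighbour ids, and about 2 × m links per node on the bottom layer. Upper layers, segment overhead, and any compression are ignored, so actual on-disk size will differ.

```python
def estimate_hnsw_bytes(num_vectors, dims, m, bytes_per_float=4, bytes_per_link=4):
    """Rough size of the raw vectors and the bottom-layer HNSW links."""
    vector_bytes = num_vectors * dims * bytes_per_float
    # Assumption: roughly 2*m neighbour links per node on the bottom layer;
    # the much smaller upper layers are left out of the estimate.
    graph_bytes = num_vectors * 2 * m * bytes_per_link
    return vector_bytes, graph_bytes


vector_bytes, graph_bytes = estimate_hnsw_bytes(80_000_000, 128, 16)
total_gb = (vector_bytes + graph_bytes) / 1e9
print(f"vectors: {vector_bytes / 1e9:.1f} GB, graph: {graph_bytes / 1e9:.1f} GB, "
      f"total: {total_gb:.1f} GB")
```

Under these assumptions the vectors account for about 41 GB and the graph links for about 10 GB, which is in line with the ~50GB figure; doubling m would roughly double only the graph portion.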
6.4. Analysers
6.4.1. Name Field
Standard multilingual text analysis with edge n-grams for prefix matching:
{
"analysis": {
"analyzer": {
"edge_ngram_analyzer": {
"tokenizer": "edge_ngram_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"edge_ngram_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
}
}
}
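To see which tokens this configuration would index, the analyser can be approximated in a few lines of Python. This is a sketch only: it assumes the tokenizer's default behaviour of treating the whole input as one token (no token_chars are set above), and it lowercases before slicing, which matches the real analysis chain for ordinary text but may differ for unusual Unicode case mappings.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Approximate the tokens emitted by edge_ngram_analyzer for one name."""
    lowered = text.lower()
    longest = min(max_gram, len(lowered))
    # Prefixes of length min_gram..longest, all anchored at the start.
    return [lowered[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("London")
print(grams)  # ['lo', 'lon', 'lond', 'londo', 'london']
```

With min_gram set to 2, a single-character name yields no tokens at all, so such toponyms would not be reachable through this analyser. Prefixes longer than 20 characters are likewise not indexed.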