11. Future Extensions

11.1. Short-Term Enhancements

11.1.1. Additional Language Support

Expand Epitran language mappings to improve training data coverage:

  • African languages (Swahili, Yoruba, Amharic)

  • Southeast Asian languages (Thai, Vietnamese, Burmese)

  • Indigenous languages (where G2P models exist)

Better language coverage in training data improves model generalisation, even for languages not directly supported by Epitran.

11.1.2. Query Refinement

  • Language-weighted search (boost user’s preferred languages)

  • Geographic context (boost places near user’s focus area)

  • Temporal filtering (match historical period of interest)

11.1.3. User Feedback Integration

  • Capture “did you mean?” corrections

  • Track search refinements and click-through patterns

  • Use feedback to improve training data for model updates

11.2. Medium-Term Development

11.2.1. Transliteration Fallback

For scripts without Epitran support during training:

  • Implement rule-based transliteration to Latin script

  • Use transliterated form for initial clustering

  • Iterative refinement will improve coverage

11.2.2. Multi-Model Ensemble

Combine multiple embedding approaches:

  • Siamese BiLSTM character embeddings (current)

  • Transformer-based embeddings (BERT variants)

  • Traditional phonetic codes (Soundex, Metaphone) as features

Score fusion for improved recall and precision.

11.2.3. Historical Phonology

Model sound changes over time:

  • Great Vowel Shift effects on English place names

  • Grimm’s Law for Germanic comparisons

  • Known regional pronunciation shifts

Enable “sounds like it would have in 1500” queries.

11.3. Long-Term Vision

11.3.1. Speech-to-Text Integration

Allow audio queries:

  • User speaks place name

  • Speech recognition generates candidates

  • Phonetic matching finds best place matches

  • Handles accented speech and pronunciation variations

11.3.2. Crowdsourced Pronunciation Data

  • Record native speaker pronunciations

  • Build pronunciation corpus for model training

  • Improve training data clustering accuracy

  • Community-driven language coverage expansion

11.3.3. Cross-Gazetteer Linking

Use phonetic similarity for:

  • Automated candidate matching across authority sources

  • Confidence scoring for potential duplicates

  • Semi-automated deduplication workflows

11.3.4. Scholarly Analysis Tools

  • Phonetic distance matrices for toponymic studies

  • Etymology clustering based on sound patterns

  • Visualisation of phonetic variation by region/period