3.9. Implementation in ArangoDB¶
3.9.1. Overview¶
This document details the implementation of the WHG v4 data model using ArangoDB’s multi-model architecture. ArangoDB naturally supports the attestation-based graph model through its combination of document collections and edge collections.
3.9.2. Document Collections¶
We use seven primary collections:
things — unified conceptual entities (locations, historical entities, collections, periods, routes, itineraries, networks)
names — name labels with multiple semantic types and vector embeddings
geometries — spatial representations with derived fields
timespans — temporal bounds with PeriodO integration
attestations — evidentiary nodes (document collection) containing metadata about claims
edges — typed connections between all entities (edge collection)
authorities — unified reference data for sources, datasets, relation types, periods, and certainty levels
Critical Distinction:
attestations is a document collection (nodes/vertices), NOT an edge collection
edges is the edge collection that connects attestations to other entities
This separation enables attestations to act as junction points in the graph
3.9.2.1. Things Collection¶
// Document in 'things' collection
{
"_key": "constantinople",
"_id": "things/constantinople",
"_rev": "_abc123",
"thing_type": "location",
"description": "Major Byzantine/Ottoman city on the Bosphorus",
"namespace": "whg",
"primary_name": "Constantinople", // denormalized for quick access
"representative_point": [28.98, 41.01], // denormalized for spatial queries
"created": "2025-01-15T10:30:00Z",
"modified": "2025-03-20T14:22:00Z"
}
3.9.2.2. Names Collection¶
// Document in 'names' collection
{
"_key": "name-istanbul-tr",
"_id": "names/name-istanbul-tr",
"name": "İstanbul",
"language": "tr",
"script": "Latn",
"ipa": "isˈtanbuɫ",
"name_type": ["preferred", "toponym"],
"embedding": [0.234, -0.567, 0.123, ...], // 256-dimensional vector
"transliteration_system": null,
"romanized": "Istanbul"
}
3.9.2.3. Geometries Collection¶
// Document in 'geometries' collection
{
"_key": "geom-constantinople-city",
"_id": "geometries/geom-constantinople-city",
"geom": {
"type": "MultiPolygon",
"coordinates": [
[[[28.94, 41.01], [29.00, 41.01], [29.00, 41.05], [28.94, 41.05], [28.94, 41.01]]],
[[[28.90, 41.00], [28.92, 41.00], [28.92, 41.02], [28.90, 41.02], [28.90, 41.00]]]
]
},
"representative_point": [28.97, 41.03],
"bbox": [28.90, 41.00, 29.00, 41.05],
"precision": ["historical_approximate", "uncertain_boundary"],
"precision_km": [5.0, 2.0],
"source_crs": "EPSG:4326"
}
Note on GeometryCollection: ArangoDB does not support the GeoJSON GeometryCollection type. For places with heterogeneous geometry sets (e.g., both point and polygon), store multiple geometry attestations—one per geometry type. This aligns naturally with our attestation model where each geometry claim is a separate evidential statement.
3.9.2.4. Timespans Collection¶
// Document in 'timespans' collection
{
"_key": "timespan-byzantine-period",
"_id": "timespans/timespan-byzantine-period",
"start_earliest": -11644444800000, // Unix timestamp: 330 CE
"start_latest": -11612908800000, // Unix timestamp: 331 CE
"end_earliest": 693878400000, // Unix timestamp: 1453 CE
"end_latest": 694483200000, // Unix timestamp: 1453 CE
"label": "Byzantine Period in Constantinople",
"precision": "year",
"precision_value": 1,
"periodo_id": "periodo:p0byzantine"
}
Field Naming Note: WHG uses end_earliest and end_latest (not stop_earliest/stop_latest) for consistency with W3C Time Ontology. This applies to both internal storage and all exports.
3.9.2.5. Attestations Collection (Document Collection)¶
// Document in 'attestations' collection
// This is a DOCUMENT collection, NOT an edge collection
{
"_key": "att-001",
"_id": "attestations/att-001",
"_rev": "_xyz789",
"sequence": null, // For ordered sequences in routes/itineraries
"connection_metadata": null, // For network relationships
"certainty": 0.95, // Confidence level (0.0-1.0)
"certainty_note": "Well-documented in primary sources",
"notes": "Additional context",
"created": "2024-01-15T10:30:00Z",
"modified": "2024-02-20T14:45:00Z",
"contributor": "researcher@example.edu"
}
What’s NOT in attestations:
No
_fromor_tofields (those are in edges)No
subject_id,object_id, orrelation_typefieldsNo embedded relationships
Relationships are expressed through the edges collection below.
3.9.2.7. Edges Collection (Edge Collection)¶
// Generic edge collection for ALL graph relationships
// This is an EDGE collection with _from and _to fields
// Thing to Attestation edge
{
"_key": "edge-001",
"_id": "edges/edge-001",
"_from": "things/constantinople",
"_to": "attestations/att-001",
"edge_type": "subject_of",
"created": "2025-01-15T10:30:00Z"
}
// Attestation to Name edge
{
"_key": "edge-002",
"_id": "edges/edge-002",
"_from": "attestations/att-001",
"_to": "names/name-istanbul-tr",
"edge_type": "attests_name",
"created": "2025-01-15T10:30:00Z"
}
// Attestation to Geometry edge
{
"_key": "edge-003",
"_id": "edges/edge-003",
"_from": "attestations/att-001",
"_to": "geometries/geom-constantinople-city",
"edge_type": "attests_geometry",
"created": "2025-01-15T10:30:00Z"
}
// Attestation to Timespan edge
{
"_key": "edge-004",
"_id": "edges/edge-004",
"_from": "attestations/att-001",
"_to": "timespans/timespan-byzantine-period",
"edge_type": "attests_timespan",
"created": "2025-01-15T10:30:00Z"
}
// Attestation to Source (Authority) edge
{
"_key": "edge-005",
"_id": "edges/edge-005",
"_from": "attestations/att-001",
"_to": "authorities/source-al-tabari",
"edge_type": "sourced_by",
"created": "2025-01-15T10:30:00Z"
}
// Thing-to-Thing relationship via attestation
{
"_key": "edge-006",
"_id": "edges/edge-006",
"_from": "attestations/att-capital-of",
"_to": "authorities/relation-capital-of",
"edge_type": "typed_by",
"created": "2025-01-15T10:30:00Z"
}
{
"_key": "edge-007",
"_id": "edges/edge-007",
"_from": "attestations/att-capital-of",
"_to": "things/ottoman-empire",
"edge_type": "relates_to",
"created": "2025-01-15T10:30:00Z"
}
// Meta-attestation edge
{
"_key": "edge-008",
"_id": "edges/edge-008",
"_from": "attestations/att-meta",
"_to": "attestations/att-001",
"edge_type": "meta_attestation",
"properties": {
"meta_type": "contradicts"
},
"created": "2025-01-15T10:30:00Z"
}
Edge Types Summary:
Edge Type |
From |
To |
Purpose |
|---|---|---|---|
|
Thing |
Attestation |
Links attestation to the Thing it describes |
|
Attestation |
Name |
Links attestation to a Name claim |
|
Attestation |
Geometry |
Links attestation to a Geometry claim |
|
Attestation |
Timespan |
Links attestation to temporal bounds |
|
Attestation |
Authority |
Links attestation to source citation |
|
Attestation |
Authority |
Links attestation to relation type definition |
|
Attestation |
Thing |
Links attestation to related Thing |
|
Attestation |
Attestation |
Links meta-attestation to target attestation |
|
Authority |
Authority |
Links Source to parent Dataset |
3.9.3. Indexing Strategy¶
3.9.3.1. Things Collection Indexes¶
// Primary key (automatic)
db.things.ensureIndex({ type: "primary", fields: ["_key"] });
// Full-text search on description
db.things.ensureIndex({
type: "fulltext",
fields: ["description"],
minLength: 3
});
// Thing type for filtering
db.things.ensureIndex({
type: "persistent",
fields: ["thing_type"]
});
// Geospatial index on representative_point
db.things.ensureIndex({
type: "geo",
fields: ["representative_point"],
geoJson: false // using [lon, lat] array format
});
3.9.3.2. Names Collection Indexes¶
// Full-text search on name
db.names.ensureIndex({
type: "fulltext",
fields: ["name"],
minLength: 2
});
// Vector index for phonetic similarity (FAISS-backed)
db.names.ensureIndex({
type: "vector",
fields: ["embedding"],
params: {
metric: "cosine",
dimension: 256,
lists: 1000 // IVF parameter for clustering
}
});
// Language and script for filtering
db.names.ensureIndex({
type: "persistent",
fields: ["language", "script"]
});
// IPA for phonetic search
db.names.ensureIndex({
type: "persistent",
fields: ["ipa"]
});
// Name type for filtering
db.names.ensureIndex({
type: "persistent",
fields: ["name_type[*]"]
});
3.9.3.3. Geometries Collection Indexes¶
// Geospatial index on main geometry
db.geometries.ensureIndex({
type: "geo",
fields: ["geom"],
geoJson: true
});
// Geospatial index on representative point
db.geometries.ensureIndex({
type: "geo",
fields: ["representative_point"],
geoJson: false
});
// Bounding box for quick spatial filters
db.geometries.ensureIndex({
type: "persistent",
fields: ["bbox"]
});
// Precision for quality filtering
db.geometries.ensureIndex({
type: "persistent",
fields: ["precision[*]"]
});
3.9.3.4. Timespans Collection Indexes¶
// Temporal range indexes for point-in-time queries
db.timespans.ensureIndex({
type: "persistent",
fields: ["start_latest", "end_earliest"]
});
// Temporal bounds for overlap queries
db.timespans.ensureIndex({
type: "persistent",
fields: ["start_earliest", "end_latest"]
});
// Full-text search on label
db.timespans.ensureIndex({
type: "fulltext",
fields: ["label"]
});
// Precision for temporal certainty filtering
db.timespans.ensureIndex({
type: "persistent",
fields: ["precision"]
});
3.9.3.6. Attestations Collection Indexes¶
// Primary key (automatic)
db.attestations.ensureIndex({ type: "primary", fields: ["_key"] });
// Certainty for confidence filtering
db.attestations.ensureIndex({
type: "persistent",
fields: ["certainty"]
});
// Sequence for ordered relationships (routes/itineraries)
db.attestations.ensureIndex({
type: "persistent",
fields: ["sequence"]
});
// Created timestamp for temporal queries
db.attestations.ensureIndex({
type: "persistent",
fields: ["created"]
});
3.9.3.7. Edges Collection Indexes¶
// Edge indexes (automatic for _from and _to)
db.edges.ensureIndex({ type: "edge" });
// Edge type for filtering
db.edges.ensureIndex({
type: "persistent",
fields: ["edge_type"]
});
// Composite index for efficient traversals
db.edges.ensureIndex({
type: "persistent",
fields: ["_from", "edge_type"]
});
db.edges.ensureIndex({
type: "persistent",
fields: ["_to", "edge_type"]
});
3.9.4. Query Patterns in AQL¶
ArangoDB’s unified query language (AQL) integrates graph traversal, document filtering, geospatial queries, and vector similarity seamlessly.
3.9.4.1. Name Resolution Over Time¶
Query: “What was Chang’an called in 700 AD?”
LET query_date = DATE_TIMESTAMP("700-01-01")
FOR thing IN things
FILTER thing._key == "changan"
// Traverse to attestations via edges
FOR e1 IN edges
FILTER e1._from == thing._id
FILTER e1.edge_type == "subject_of"
LET att = DOCUMENT(e1._to)
// Get the name via edges
FOR e2 IN edges
FILTER e2._from == att._id
FILTER e2.edge_type == "attests_name"
LET name = DOCUMENT(e2._to)
// Check temporal validity via edges
FOR e3 IN edges
FILTER e3._from == att._id
FILTER e3.edge_type == "attests_timespan"
LET ts = DOCUMENT(e3._to)
FILTER ts.start_latest <= query_date
FILTER ts.end_earliest >= query_date
RETURN {
name: name.name,
language: name.language,
certainty: att.certainty,
timespan: ts.label
}
3.9.4.2. Spatial Queries with Temporal Filter¶
Query: “Places within 100km of Constantinople in the 13th century”
LET constantinople_point = [28.98, 41.01]
LET query_start = DATE_TIMESTAMP("1200-01-01")
LET query_end = DATE_TIMESTAMP("1300-12-31")
FOR thing IN things
// Spatial filter using representative_point
FILTER GEO_DISTANCE(thing.representative_point, constantinople_point) <= 100000
// Verify temporal validity via graph traversal
LET temporal_check = (
FOR e1 IN edges
FILTER e1._from == thing._id
FILTER e1.edge_type == "subject_of"
LET att = DOCUMENT(e1._to)
FOR e2 IN edges
FILTER e2._from == att._id
FILTER e2.edge_type == "attests_geometry"
FOR e3 IN edges
FILTER e3._from == att._id
FILTER e3.edge_type == "attests_timespan"
LET ts = DOCUMENT(e3._to)
FILTER ts.start_latest <= query_end
FILTER ts.end_earliest >= query_start
RETURN true
)
FILTER LENGTH(temporal_check) > 0
// Get full geometries
LET geometries = (
FOR e1 IN edges
FILTER e1._from == thing._id
FILTER e1.edge_type == "subject_of"
LET att = DOCUMENT(e1._to)
FOR e2 IN edges
FILTER e2._from == att._id
FILTER e2.edge_type == "attests_geometry"
LET geom = DOCUMENT(e2._to)
FILTER GEO_DISTANCE(geom.representative_point, constantinople_point) <= 100000
RETURN geom
)
FILTER LENGTH(geometries) > 0
RETURN {
thing: thing,
distance: GEO_DISTANCE(thing.representative_point, constantinople_point),
geometries: geometries
}
3.9.4.3. Vector Similarity Search for Toponyms¶
Query: “Find names similar to ‘Chang’an’ across languages”
LET query_embedding = @query_vector // passed as bind parameter
FOR name IN names
LET similarity = APPROX_NEAR_COSINE(name.embedding, query_embedding)
FILTER similarity > 0.8
SORT similarity DESC
LIMIT 10
RETURN {
name: name.name,
language: name.language,
script: name.script,
similarity: similarity,
name_type: name.name_type
}
Important: Use APPROX_NEAR_COSINE() for index-accelerated searches. The non-indexed COSINE_SIMILARITY() function is available but will be much slower at scale.
3.9.4.4. Network Connection Query¶
Query: “All trade connections from Constantinople 1200-1300 CE”
LET query_start = DATE_TIMESTAMP("1200-01-01")
LET query_end = DATE_TIMESTAMP("1300-12-31")
FOR thing IN things
FILTER thing._key == "constantinople"
// Find outgoing attestations via edges
FOR e1 IN edges
FILTER e1._from == thing._id
FILTER e1.edge_type == "subject_of"
LET att = DOCUMENT(e1._to)
// Check if it's a connection via typed_by edge
FOR e2 IN edges
FILTER e2._from == att._id
FILTER e2.edge_type == "typed_by"
LET rel_type = DOCUMENT(e2._to)
FILTER rel_type.label == "connected_to"
FILTER att.connection_metadata LIKE "%trade%"
// Get the connected Thing via relates_to edge
FOR e3 IN edges
FILTER e3._from == att._id
FILTER e3.edge_type == "relates_to"
LET connected_thing = DOCUMENT(e3._to)
// Check temporal validity via attests_timespan edge
FOR e4 IN edges
FILTER e4._from == att._id
FILTER e4.edge_type == "attests_timespan"
LET ts = DOCUMENT(e4._to)
FILTER ts.start_latest <= query_end
FILTER ts.end_earliest >= query_start
RETURN {
connected_place: connected_thing,
connection_type: att.connection_metadata,
certainty: att.certainty,
timespan: ts
}
3.9.5. Handling Temporal Nulls and Geological Time¶
3.9.5.1. Sentinel Values¶
For start_earliest, start_latest, end_earliest, end_latest fields representing infinity or deep time:
Modern era unbounded:
Unknown start:
-9999-01-01→-315619200000(Unix timestamp)Ongoing/present:
9999-12-31→253402300799000(Unix timestamp)
Geological time:
Billion years BCE:
-999999999-01-01→-31556889832000000000(approximate)Use large negative/positive integer values
PeriodO identifiers:
Store PeriodO URIs in AUTHORITY documents
Import PeriodO temporal bounds into Timespan records
3.9.5.2. Query Logic¶
// Point-in-time query
FOR ts IN timespans
FILTER ts.start_latest <= @query_date
FILTER ts.end_earliest >= @query_date
RETURN ts
// Overlap query
FOR ts IN timespans
FILTER ts.start_latest <= @query_end
FILTER ts.end_earliest >= @query_start
RETURN ts
// Unknown bounds handling
FOR ts IN timespans
FILTER (ts.start_earliest == null OR ts.start_earliest <= @query_date)
FILTER (ts.end_latest == null OR ts.end_latest >= @query_date)
RETURN ts
3.9.6. ArangoDB Capabilities Assessment¶
3.9.6.1. What ArangoDB Handles Well¶
Property Graph Queries:
✅ Native graph database with efficient edge traversals
✅ Multi-hop queries optimized (1-10+ hops)
✅ Bidirectional traversals (OUTBOUND, INBOUND, ANY)
✅ Named graphs for domain separation
✅ Shortest path algorithms built-in
Spatial Queries:
✅ Native GeoJSON support (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon)
✅ S2-based geospatial indexing for spherical geometry
✅
GEO_DISTANCE,GEO_CONTAINS,GEO_INTERSECTSfunctions✅ Efficient spatial filtering with geo indexes
❌ No
GeometryCollectionsupport (workaround: multiple geometry attestations)
Vector/Phonetic Search:
✅ Vector indexes powered by FAISS
✅
APPROX_NEAR_COSINE()for index-accelerated similarity✅ Cosine, Euclidean, and other distance metrics
⚠️ Performance validation needed for 10M+ vectors
Temporal Range Queries:
✅ Efficient range queries on numeric timestamp fields
✅ Multi-field predicates for complex temporal logic
✅ Fast point-in-time and overlap queries with proper indexing
Document Search:
✅ Full-text search with language analyzers
✅ JSON document storage with flexible schemas
✅ Complex filtering on nested fields
Unified Query Language:
✅ AQL integrates all capabilities (graph, document, spatial, vector)
✅ Single query syntax reduces cognitive load
✅ No need to combine multiple query languages or systems
Operational Benefits:
✅ Single system for all data types
✅ No data synchronization between systems
✅ Unified backup/recovery strategy
✅ Real-time consistency (no eventual consistency issues)
3.9.6.2. Considerations and Tradeoffs¶
Vector Search Maturity:
⚠️ Vector indexes added more recently than core features
⚠️ Requires early benchmarking with WHG’s phonetic embedding workload
⚠️ Validate performance with 10M+ name embeddings
GeometryCollection Limitation:
❌ Cannot store heterogeneous geometry collections in single document
✅ Workaround: Multiple geometry attestations (aligns with attestation model)
✅ Alternative: Convert to MultiPolygon by buffering points/lines
Licensing:
⚠️ Community Edition limited to 100 GiB (insufficient for WHG)
⚠️ Enterprise Edition required for production use (expected 500GB-1TB dataset)
⚠️ Academic licensing terms require negotiation
Vendor Ecosystem:
⚠️ Smaller community than PostgreSQL
⚠️ Fewer third-party tools and integrations
⚠️ Limited pool of developers with ArangoDB experience
⚠️ Proprietary query language (AQL) creates some lock-in
Django/PostgreSQL Still Needed For:
✅ User accounts, sessions, authentication
✅ Admin interface (Django Admin)
✅ Audit trails and provenance changelog
✅ Namespace schema mapping
✅ Traditional CRUD operations on application metadata
3.9.7. Summary¶
3.9.7.1. ArangoDB Strengths for WHG¶
✅ Native property graph - Direct mapping to attestation model with Attestations as document nodes
✅ Unified multi-model - Graph + document + geospatial + vector in single system
✅ Superior GeoJSON - Native support for complex historical geometries
✅ Elegant queries - AQL integrates all capabilities seamlessly
✅ Operational simplicity - Single system for small team
✅ Real-time consistency - No sync issues between systems
✅ Flexible schema - JSON documents adapt easily to evolving requirements
3.9.7.2. Key Considerations¶
⚠️ Licensing required - Enterprise Edition needed for production dataset size
⚠️ Vector search maturity - Early benchmarking essential for phonetic embedding workload
⚠️ GeometryCollection limitation - Workaround via multiple attestations (aligns with model)
⚠️ Smaller ecosystem - Fewer third-party tools than PostgreSQL
⚠️ Vendor lock-in - Proprietary AQL creates some switching costs
3.9.7.3. Implementation Priorities¶
Core collections - Things, Names, Geometries, Timespans, Attestations (documents), Edges, Authorities
Essential indexes - Vector (FAISS), geospatial, temporal, edge
Django integration - Signals for real-time sync
Query patterns - Name resolution, spatio-temporal, vector similarity
Dynamic Clustering - Multi-dimensional similarity scoring
Migration pipeline - From Postgres + ElasticSearch
closeMatch migration - Preserve 38K curated relationships
Monitoring - Query performance, index health, resource usage
Performance testing - Validate at production scale
Documentation - Query examples, operational procedures
Related Documentation:
WHG v4 Data Model Overview - Core data model specification
Attestations & Relations - Detailed attestation patterns