4.1. Database Technology Assessment¶
4.1.1. TL;DR¶
WHG v4 is transitioning to a property graph data model centered on attestations (source-backed claims about historical places). We need a database that natively supports:
Property graphs with rich node and edge metadata
GeoJSON geometries for complex historical boundaries
Vector similarity search for phonetic name matching
Temporal queries
Single integrated system (solo technical team)
Assessment finding: ArangoDB is the optimal technical fit - native multi-model architecture matches our requirements precisely, operational simplicity suits our team size, and superior GeoJSON support handles complex historical geometries.
Critical dependency: Sustainable licensing for academic/non-commercial research infrastructure funded primarily by University of Pittsburgh with grant support.
Alternative: PostgreSQL with extensions provides fallback but requires accepting greater complexity.
Hosting: Prefer self-hosted at Pitt, but open to ArangoDB-managed solution depending on terms.
Next step: Licensing discussion to determine viability.
Data model details: See WHG v4 Data Model Documentation
4.1.2. Executive Summary¶
The World Historical Gazetteer (WHG) is undergoing a fundamental redesign of its data model for v4, transitioning from a traditional relational structure to a property graph model centered on attestations. This document evaluates database technologies suitable for this new architecture.
What is WHG?
Free, open-access research infrastructure for historical geography
Serves researchers worldwide with temporal-aware place data
Supports discovery, linking, and analysis of historical place information
Non-commercial academic project
Understanding the Data Model
WHG v4 is built around Things (entities such as locations, historical entities, collections, periods, routes, itineraries, and networks) connected through Attestations (nodes that bundle source-backed claims with temporal bounds and certainty assessments).
Complete technical details: WHG v4 Data Model Documentation
Key concepts:
Thing: Any entity that can be described (what we used to call “places” but now expanded)
Attestation: A node that bundles claims connecting a Thing to information (names, geometries, timespans, types, or other Things)
Authority: Unified reference data for sources, datasets, relation types, periods, and certainty levels
Provenance: Every claim cites sources via AUTHORITY nodes, has certainty assessment, and temporal bounds
Graph structure: Attestations are nodes (not just edges) in a property graph where edges connect Things → Attestations → Names/Geometries/Timespans/Things
Key Requirements:
Property graph with Attestations as first-class nodes (not just edge metadata)
Native GeoJSON support for complex historical geometries
Vector similarity search for phonetic name matching across scripts/languages
Temporal query capabilities
Single integrated system (critical for small technical team)
Sustainable licensing for academic/non-commercial use
Project Context:
Funding: Primarily supported by University of Pittsburgh with grant funding
Technical Team: Currently one developer
Scale: Anticipated 500GB-1TB working dataset, 73M+ nodes
Mission: Free, open-access research infrastructure for historical geography
Hosting: Intended self-hosted at University of Pittsburgh infrastructure
4.1.3. Data Volume Projection¶
4.1.3.1. Source Data Estimates¶
Source |
Records |
Avg Size |
Total |
|---|---|---|---|
GeoNames |
12M |
2KB |
24GB |
OpenStreetMap (subset) |
6M |
1KB |
6GB |
Wikidata Places |
10M |
3KB |
30GB |
Pleiades |
38K |
50KB |
2GB |
Contributed Data |
1-5M |
10KB |
10-50GB |
Subtotal |
~28M |
- |
72-112GB |
4.1.3.2. WHG Attestation Layer¶
Per Thing Estimates (average):
Node Collections:
5 NAME references (shared nodes, not duplicated) ≈ negligible per Thing
2 GEOMETRY nodes × 2KB (GeoJSON + derived fields) = 4KB
3 TIMESPAN nodes × 0.5KB (4-field temporal model) = 1.5KB
5 ATTESTATION nodes × 0.8KB (certainty, notes, metadata) = 4KB
3 AUTHORITY references (shared across many Things) ≈ negligible per Thing
Edge Collection:
~25 edges per Thing × 0.2KB = 5KB
Thing → Attestation (subject_of): 5 edges
Attestation → Name/Geometry/Timespan (attests_*): 10 edges
Attestation → Authority (typed_by, sourced_by): 8 edges
Attestation → Thing (relates_to for relations): 2 edges
Total per Thing: ~14.5KB
For 28M Things:
Node data (Geometries, Timespans, Attestations): 73M × 9.5KB = ~266GB
Edge data: 73M × 5KB = ~140GB
Subtotal per-Thing data: ~406GB
Shared Collections (across all Things):
NAME nodes: ~20GB (15M unique names with embeddings)
AUTHORITY nodes: ~0.1GB (56K shared authorities)
Subtotal shared data: ~426GB
Vector Embeddings:
Estimated unique NAME nodes: ~15M
Embedding storage:
15M unique names × 256 dimensions × 4 bytes = ~15GB raw embedding data
Plus name metadata: ~4.5GB
Total NAME collection: ~20GB
4.1.3.4. Total Raw Data¶
Component |
Size |
|---|---|
Source data (Things + basic attributes) |
72-112GB |
Attestation layer (per-Thing nodes + edges) |
406GB |
Shared NAME nodes (with embeddings) |
20GB |
Shared AUTHORITY nodes |
0.1GB |
Total uncompressed |
~500-540GB |
4.1.3.5. Database-Specific Storage Characteristics¶
4.1.3.5.1. ArangoDB¶
Compression:
Document compression: 2-3x typical ratio
Edge compression: Similar 2-3x ratio
Vector indexes: Quantization can reduce by 4-8x
Estimated working set: 200-300GB
Index Overhead:
Geo indexes (S2): ~20% of geometry data = ~25GB
Vector indexes (FAISS IVF): ~30% of NAME embeddings = ~5GB
Full-text indexes: ~15% of text data = ~15GB
Edge indexes (automatic): ~10% of edge data = ~14GB
Total indexes: ~60GB
Final ArangoDB estimate: 260-360GB
4.1.3.5.2. PostgreSQL + Extensions¶
Index Overhead:
Vector index overhead (HNSW): ~40-50% of embedding data = ~8GB
PostGIS indexes (GIST): ~25% of geometry data = ~31GB
AGE graph indexes: ~15% of graph data = ~60GB
Full-text indexes: ~20% of text data = ~20GB
Total indexes: ~120GB
Final PostgreSQL estimate: 400-500GB
Note: PostgreSQL may require more storage than ArangoDB due to:
Multiple extension indexes with some overlap
AGE graph metadata duplication (stores both relational and graph views)
Less efficient edge storage (normalized tables vs. native edge collections)
4.1.3.6. Growth Projections¶
Timeline |
Dataset Size |
Storage (ArangoDB) |
Storage (PostgreSQL) |
|---|---|---|---|
Year 1 |
28M Things baseline |
260-360GB |
400-500GB |
Year 5 |
+5M contributed Things |
400-500GB |
550-650GB |
Growth characteristics: Modest incremental growth after initial bulk ingestion (~3-5% new Things annually) combined with attestation enrichment (~5-10% per year).
4.1.3.7. Memory Requirements¶
For optimal performance, working set should fit in RAM:
ArangoDB:
Active data: 200-300GB
Vector indexes: 5GB (frequently accessed)
Geo indexes: 25GB
Hot edge indexes: 14GB
Recommended RAM: 128GB (allows OS cache + query working memory)
Minimum acceptable: 64GB with aggressive caching
PostgreSQL:
Shared buffers: 25-40% of RAM (50-75GB)
Vector index cache: 8GB
PostGIS geometry cache: 31GB
AGE graph cache: 60GB
OS page cache: remainder
Recommended RAM: 192GB (for comfortable operation with all extensions)
Minimum acceptable: 128GB with careful tuning
Note: ArangoDB’s lower memory requirements reflect more efficient integrated indexing vs. PostgreSQL’s multiple independent extension caches.
4.1.3.8. Storage I/O Requirements¶
Random reads (most queries involve):
Graph traversals (edge lookups)
Vector similarity search (index navigation)
Spatial queries (geo index traversal)
NVMe SSD strongly recommended:
~100K IOPS for concurrent queries
<1ms latency for graph hops
Sustained throughput for bulk operations
Capacity: 5-10TB to accommodate:
Database growth (5 years)
Backup snapshots
WAL/journal files
Temporary query space
4.1.4. Option 1: ArangoDB¶
4.1.4.1. Technical Advantages¶
1. Optimal Data Model Alignment
ArangoDB’s native multi-model architecture (graph + document + geospatial) maps directly to WHG’s attestation model where Attestations are nodes (not just edges):
// Things as documents
{
"_key": "constantinople",
"_id": "things/constantinople",
"thing_type": "location",
"description": "Major Byzantine/Ottoman city on the Bosphorus",
"primary_name": "Constantinople", // denormalized
"representative_point": [28.98, 41.01] // denormalized
}
// Names as documents with embeddings
{
"_key": "name-istanbul-tr",
"_id": "names/name-istanbul-tr",
"name": "İstanbul",
"language": "tr",
"script": "Latn",
"embedding": [0.234, -0.567, ...] // 256-dimensional
}
// Attestations as nodes (not edges)
{
"_key": "att-001",
"_id": "attestations/att-001",
"certainty": 1.0,
"certainty_note": "Official administrative name change",
"notes": "Adopted after establishment of Turkish Republic"
}
// Edges connect the graph
{
"_from": "things/constantinople",
"_to": "attestations/att-001",
"edge_type": "subject_of"
}
{
"_from": "attestations/att-001",
"_to": "names/name-istanbul-tr",
"edge_type": "attests_name"
}
{
"_from": "attestations/att-001",
"_to": "authorities/source-turkish-gov",
"edge_type": "sourced_by"
}
Key advantage: Attestations as nodes enables meta-attestations (attestations about attestations), rich provenance chains, and natural modeling of scholarly discourse.
2. Unified AUTHORITY Collection
Single table inheritance via authority_type discriminator replaces multiple collections:
// Source authority
{
"_id": "authorities/al-tabari",
"authority_type": "source",
"citation": "Al-Tabari, History of the Prophets and Kings",
"source_type": "manuscript"
}
// Dataset authority
{
"_id": "authorities/dataset-123",
"authority_type": "dataset",
"title": "Islamic Cities Database",
"doi": "doi:10.83427/whg-dataset-123"
}
// Relation type authority
{
"_id": "authorities/member-of",
"authority_type": "relation_type",
"label": "member_of",
"inverse": "contains",
"description": "Subject is part of object entity"
}
// Period authority (PeriodO integration)
{
"_id": "authorities/abbasid",
"authority_type": "period",
"label": "Abbasid Caliphate",
"uri": "periodo:p0abbasid",
"start_earliest": "750-01-01",
"stop_latest": "1258-12-31"
}
This eliminates multiple table joins and simplifies queries.
3. Superior GeoJSON Support
ArangoDB provides native GeoJSON storage and querying with support for Point, MultiPoint, LineString, MultiLineString, Polygon, and MultiPolygon types:
// Store complex geometries
db.geometries.insert({
_key: "geom-constantinople",
geom: {
type: "MultiPolygon",
coordinates: [
// First polygon: city center
[[
[28.94, 41.01],
[29.00, 41.01],
[29.00, 41.05],
[28.94, 41.05],
[28.94, 41.01]
]],
// Second polygon: disputed boundary region
[[
[28.90, 41.00],
[28.92, 41.00],
[28.92, 41.02],
[28.90, 41.02],
[28.90, 41.00]
]]
]
},
representative_point: [28.97, 41.03],
precision: ["historical_approximate"]
})
// Geospatial queries
FOR geom IN geometries
FILTER GEO_DISTANCE(geom.representative_point, @query_point) < 50000
RETURN geom
Key advantages:
S2-based geospatial indexing for spherical geometry
Supports all standard GeoJSON types (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon)
Native handling of complex historical geometries without conversion
No encoding or workarounds required
Note on GeometryCollection: ArangoDB does not support the GeometryCollection type. For WHG data containing heterogeneous geometry sets (e.g., a place represented by both a point and a polygon), we would either:
Store multiple geometry attestations (one per geometry type) - aligns naturally with our attestation model where each geometry is a separate claim
Convert to MultiPolygon by buffering points and lines into small polygons where single-geometry representation is required
Comparison: PostgreSQL PostGIS supports GeometryCollection natively; Neo4j spatial plugin has limited GeoJSON support; Vespa has no GeoJSON support.
4. Unified Query Language (AQL)
AQL integrates graph traversal, document filtering, geospatial queries, and vector similarity in a single, coherent language. With our model where Attestations are nodes:
// Complex query: Find places with similar names near Constantinople in 13th century
LET query_embedding = @query_vector
LET query_point = [28.98, 41.01]
LET query_start = DATE_TIMESTAMP("1200-01-01")
LET query_end = DATE_TIMESTAMP("1300-12-31")
FOR thing IN things
// Spatial constraint
FILTER GEO_DISTANCE(thing.representative_point, query_point) < 50000
// Find names via graph traversal
LET matching_names = (
FOR att IN attestations
FOR e1 IN edges
FILTER e1._from == thing._id
FILTER e1._to == att._id
FILTER e1.edge_type == "subject_of"
FOR e2 IN edges
FILTER e2._from == att._id
FILTER e2.edge_type == "attests_name"
LET name = DOCUMENT(e2._to)
// Vector similarity (index-accelerated)
LET similarity = APPROX_NEAR_COSINE(name.embedding, query_embedding)
FILTER similarity > 0.8
// Check temporal validity
FOR e3 IN edges
FILTER e3._from == att._id
FILTER e3.edge_type == "attests_timespan"
LET ts = DOCUMENT(e3._to)
FILTER ts.start_latest <= query_end
FILTER ts.stop_earliest >= query_start
RETURN {name: name.name, similarity: similarity}
)
FILTER LENGTH(matching_names) > 0
// Find network connections
LET trade_partners = (
FOR att IN attestations
FOR e1 IN edges
FILTER e1._from == thing._id
FILTER e1._to == att._id
FOR e2 IN edges
FILTER e2._from == att._id
FILTER e2.edge_type == "typed_by"
LET rel_type = DOCUMENT(e2._to)
FILTER rel_type.label == "connected_to"
FOR e3 IN edges
FILTER e3._from == att._id
FILTER e3.edge_type == "relates_to"
LET partner = DOCUMENT(e3._to)
RETURN partner
)
LIMIT 10
RETURN {
thing: thing,
names: matching_names,
trade_partners: trade_partners
}
This eliminates the need to combine multiple query languages or systems. The graph traversal naturally handles the Attestation-as-node pattern.
5. Operational Simplicity
Critical for small team: Single system handles:
✅ Graph queries (Attestations as nodes, multi-hop traversals)
✅ Document search (full-text, filtering)
✅ Vector similarity (name matching via embeddings)
✅ Geospatial (GeoJSON native)
✅ Temporal queries (efficient range queries)
Impact:
One database to maintain and monitor
Single backup/recovery strategy
Unified deployment process
No data synchronization between systems
Reduced cognitive load for solo developer
6. JSON-Native Architecture
Every node is a JSON document, matching WHG’s data modeling approach and facilitating:
Flexible schema evolution (AUTHORITY discriminator pattern)
Natural representation of complex temporal structures
Direct mapping from Linked Places Format (LPF)
Easy integration with Django REST framework
7. Vector Indexes for Phonetic Name Matching
ArangoDB provides vector indexes for approximate nearest neighbor search, powered by the FAISS library:
db.names.ensureIndex({
type: "vector",
fields: ["embedding"],
params: {
metric: "cosine",
dimension: 384 // or dimension appropriate for model
}
})
AQL Vector Search uses the APPROX_NEAR_COSINE() function for index-accelerated queries:
FOR name IN names
LET similarity = APPROX_NEAR_COSINE(name.embedding, @query_embedding)
FILTER similarity > 0.8
SORT similarity DESC
LIMIT 10
RETURN {name: name.name, similarity: similarity}
WHG Use Case: Our embeddings are derived from phonetic representations of toponyms using a Siamese BiLSTM model trained on GeoNames data with PanPhon feature sequences of IPA transcriptions. These phonetic embeddings encode pronunciation similarity rather than semantic meaning.
Names across different scripts (e.g., “Constantinople” ↔ “Κωνσταντινούπολις” ↔ “القسطنطينية”)
Transliteration variations (e.g., “Samarkand” ↔ “Samarqand” ↔ “Самарканд”)
Historical name forms that sound similar but are spelled differently
Scale: Name similarity search across 10M+ name variants with phonetic embeddings.
Note: AQL also provides a non-indexed COSINE_SIMILARITY() function for exact similarity calculations, but APPROX_NEAR_COSINE() is the index-accelerated function required for performance at scale. ArangoDB’s vector search capabilities were added more recently than its core graph and geospatial features, so early benchmarking with WHG’s specific phonetic embedding workload is recommended to validate performance characteristics.
4.1.4.2. Deployment Considerations¶
Preferred Approach: Self-Hosted at University of Pittsburgh
WHG intends to host ArangoDB on University of Pittsburgh infrastructure (Pitt CRC).
Hardware Requirements (based on 260-360GB working set):
CPU:
Minimum: 16 cores
Recommended: 24-32 cores
Purpose: Query parallelization, vector index operations, graph traversals, concurrent user requests
Note: Multi-core critical for AQL query optimization and FAISS vector search
RAM:
Minimum: 64GB
Recommended: 128GB
Breakdown:
Active data in memory: ~200-300GB working set target
Vector indexes (hot): ~5GB
Geo indexes: ~25GB
Edge indexes: ~14GB
ArangoDB processes: ~10-15GB
OS and cache: ~20-30GB
Rationale: Keeping frequently-accessed data in RAM is critical for sub-100ms query response
Storage:
Primary Database: 1.5TB NVMe SSD
Backup Storage: 3-4TB (separate volume)
3x full database snapshots
Incremental backups
Export archives
Performance Requirements:
NVMe SSD strongly recommended (not SATA SSD)
~50K+ random IOPS for graph traversals
<1ms latency for edge lookups
Sustained read: 2-3GB/s
Sustained write: 1-2GB/s
Network:
Internal: 10Gbps minimum for single-server deployment
If clustered (3+ nodes): 25Gbps low-latency interconnect
External: Standard university network connection sufficient for web application traffic
Software Requirements:
OS: Linux (Ubuntu 22.04 LTS or RHEL 9 recommended)
Database: ArangoDB Enterprise Edition 3.12+
Monitoring: Prometheus + Grafana (for metrics dashboards)
Backup: ArangoDB’s built-in tools + custom scripts
Optional: Docker/Kubernetes for containerized deployment
Operational Requirements:
System administration: Pitt CRC support for:
Initial server provisioning
OS-level maintenance and security patches
Network configuration
Storage management
Database administration (can be handled by WHG developer with training):
Query optimization
Index tuning
Backup management
Monitoring and alerting configuration
Backup strategy:
Automated nightly snapshots
Weekly full backups
Offsite replication (Pitt backup infrastructure)
30-day retention policy
Monitoring and alerting:
Query performance metrics
Resource utilization (CPU, RAM, disk I/O)
Error rates and warnings
Backup success/failure notifications
Alternative: ArangoDB-Managed Hosting (ArangoGraph)
Given practical considerations, we are open to ArangoDB-managed hosting as an alternative to self-hosting.
Benefits of managed hosting:
Reduces operational burden on solo developer
Professional 24/7 database administration
Automated backups and updates
Better resource scaling and elasticity
Faster time to production
No hardware procurement lead time
Included monitoring and alerting
Tradeoffs:
Potentially higher ongoing costs vs. self-hosted
Less control over infrastructure
Dependency on vendor availability
Data egress costs if migrating away
Consideration: Managed hosting pricing should be part of licensing discussion, as it may provide better total cost of ownership than self-hosted given staffing constraints and solo developer operational burden.
Recommendation: Start with pricing quote for managed hosting (ArangoGraph) to compare against self-hosted TCO before making final deployment decision.
4.1.4.3. Limitations¶
1. Licensing Constraints
Community Edition (v3.12+):
100 GiB dataset limit per cluster
Restricted to non-commercial use under BSL 1.1
Insufficient for WHG (requires 600-750GB)
Enterprise Edition:
Required for WHG’s dataset size
Licensing terms require negotiation
Academic/non-commercial pricing available but not published
Uncertainty in long-term cost sustainability
2. GeometryCollection Not Supported
ArangoDB does not support the GeoJSON GeometryCollection type. For WHG:
Workaround: Store separate geometry attestations for each geometry type - this aligns perfectly with our attestation model where each geometry is a separate claim with its own provenance
Alternative: Convert to MultiPolygon where needed by buffering points/lines
Impact: Minimal - our attestation-as-node model naturally accommodates multiple geometries per Thing, each with separate sources and temporal bounds
3. Vendor Ecosystem
Smaller community than PostgreSQL or Neo4j
Proprietary query language (AQL) creates some lock-in
Fewer third-party tools and integrations
Limited pool of developers with ArangoDB experience
4. Vector Search Maturity
Vector indexes powered by FAISS library
Index-accelerated similarity search via
APPROX_NEAR_COSINE()functionLess battle-tested than specialized systems (Vespa, pgvector) for very large vector datasets
Performance characteristics at 10M+ vectors should be validated
Critical: Early benchmarking with WHG’s phonetic embedding workload (10M+ name variants) is essential to validate that performance meets requirements before committing to this architecture
4.1.4.4. Assessment for WHG¶
ArangoDB represents the best technical fit for WHG’s requirements:
Attestations as nodes enable rich provenance and meta-attestations
AUTHORITY collection pattern simplifies queries
Graph traversal naturally handles complex relationship patterns
Excellent GeoJSON support (GeometryCollection limitation addressed by our model)
Unified architecture for small team
Scales to anticipated dataset size
Single system reduces operational complexity
Critical question: Can sustainable licensing be negotiated for academic/research use?
4.1.5. Option 2: PostgreSQL + Extensions¶
4.1.5.1. Architecture¶
PostgreSQL 16+ with:
PostGIS: Geospatial (GeoJSON, spatial indexes)
pgvector: Vector similarity search (embeddings)
Apache AGE: Graph database (Cypher-compatible queries)
pg_trgm: Fuzzy text matching
Full-text search: Built-in (tsquery/tsvector)
4.1.5.2. Advantages¶
1. Licensing & Sustainability ⭐
PostgreSQL License (permissive, OSI-approved)
No dataset size limits
No commercial restrictions
No vendor lock-in
Sustainable for academic use indefinitely
2. Proven at Scale
Powers massive applications (Instagram, Spotify, Reddit)
Organizations successfully run multi-TB Postgres deployments
Excellent query optimizer
Mature partitioning and sharding
30+ years of continuous development
3. Excellent Geospatial (PostGIS)
Industry standard for GIS
Full GeoJSON support:
geometry(GEOMETRY, 4326)columnsSupports GeometryCollection: Can store heterogeneous geometry types per record
Sophisticated spatial indexing (GIST, SP-GIST)
Rich spatial functions
Superior to ArangoDB for GeometryCollection support
Example:
-- Store GeoJSON including GeometryCollection
CREATE TABLE geometries (
id UUID PRIMARY KEY,
geom GEOMETRY(GEOMETRYCOLLECTION, 4326),
precision VARCHAR(20)
);
INSERT INTO geometries VALUES (
gen_random_uuid(),
ST_GeomFromGeoJSON('{
"type": "GeometryCollection",
"geometries": [
{"type": "Point", "coordinates": [28.98, 41.01]},
{"type": "Polygon", "coordinates": [[[28.94, 41.01], ...]]}
]
}'),
'mixed'
);
-- Geospatial query
SELECT * FROM geometries
WHERE ST_DWithin(
geom,
ST_SetSRID(ST_MakePoint(28.98, 41.01), 4326),
50000 -- 50km
);
4. Strong Vector Search (pgvector)
HNSW indexes for approximate nearest neighbor search
Handles millions of vectors efficiently
Active development (major improvements 2023-2024)
Mature enough for production use
Suitable for phonetic name embeddings
Example:
CREATE TABLE names (
id UUID PRIMARY KEY,
name TEXT,
phonetic_embedding vector(384) -- from phonetic representation
);
CREATE INDEX ON names USING hnsw (phonetic_embedding vector_cosine_ops);
-- Phonetic similarity query
SELECT name, 1 - (phonetic_embedding <=> query_vector) AS similarity
FROM names
ORDER BY phonetic_embedding <=> query_vector
LIMIT 10;
5. Graph Support via Apache AGE
Apache AGE enables property graph modeling in PostgreSQL with Cypher queries. For our Attestation-as-node model:
-- Create graph schema
SELECT create_graph('whg_graph');
-- Insert nodes (Things, Attestations, Names, Authorities)
SELECT * FROM cypher('whg_graph', $$
CREATE (t:Thing {id: 'constantinople', type: 'location'})
CREATE (a:Attestation {id: 'att-001', certainty: 1.0})
CREATE (n:Name {id: 'name-istanbul', name: 'İstanbul'})
CREATE (auth:Authority {id: 'source-123', type: 'source'})
CREATE (t)-[:SUBJECT_OF]->(a)
CREATE (a)-[:ATTESTS_NAME]->(n)
CREATE (a)-[:SOURCED_BY]->(auth)
$$) AS (result agtype);
-- Query graph
SELECT * FROM cypher('whg_graph', $$
MATCH (t:Thing)-[:SUBJECT_OF]->(a:Attestation)-[:ATTESTS_NAME]->(n:Name)
WHERE t.id = 'constantinople'
RETURN t, a, n
$$) AS (thing agtype, attestation agtype, name agtype);
Note: AGE is newer than other PostgreSQL extensions and less mature than Neo4j, but actively developed.
6. Massive Ecosystem
Huge community (largest of any database)
Extensive tooling (monitoring, backup, replication)
Universal hosting provider support
Easy to recruit developers
Abundant learning resources
Well-known to university IT departments
4.1.5.3. Deployment Considerations¶
Self-Hosted at University of Pittsburgh (Pitt CRC)
Hardware Requirements (based on 400-500GB working set):
CPU:
Minimum: 16 cores
Recommended: 32 cores
Purpose: Parallel query execution, multiple extension operations (PostGIS + pgvector + AGE), indexing
Note: PostgreSQL benefits significantly from higher core counts when running multiple extensions simultaneously
RAM:
Minimum: 128GB (with careful tuning)
Recommended: 192GB
Rationale: Multiple extensions each maintain separate caches; higher RAM reduces extension coordination overhead
Note: PostgreSQL + extensions typically requires more RAM than ArangoDB for equivalent dataset due to less integrated architecture
Storage:
Primary Database: 2TB NVMe SSD
Backup Storage: 4-5TB (separate volume)
3x full database dumps
Incremental WAL archives
Extension-specific backups
Performance Requirements:
NVMe SSD required (SATA SSD insufficient for graph + spatial queries)
~60K+ random IOPS for combined extension operations
<1ms latency for index operations
Sustained read: 3-4GB/s
Sustained write: 2-3GB/s
Network:
Internal: 10Gbps minimum
If replicated (primary + standby): 10Gbps+ low-latency
External: Standard university network connection
Software Requirements:
OS: Linux (Ubuntu 22.04 LTS or RHEL 9)
Database: PostgreSQL 16+
Extensions:
PostGIS 3.4+
pgvector 0.5+
Apache AGE 1.5+
pg_trgm (included)
Connection Pooler: PgBouncer (required for AGE)
Monitoring: Prometheus + Grafana + PostgreSQL-specific exporters
Backup: pg_dump, WAL archiving, Barman (recommended)
Operational Requirements:
Initial setup complexity: Higher than ArangoDB due to multiple extensions
Configuration tuning: Each extension requires separate parameter tuning
Backup strategy:
Full database dumps (pg_dump): weekly
WAL archiving: continuous
Extension metadata backups: daily
Point-in-time recovery: 30-day window
Monitoring: Must track metrics for each extension separately
Maintenance: VACUUM, ANALYZE, REINDEX schedules must account for extension overhead
Note on Extension Coordination:
PostGIS, pgvector, and AGE are developed independently
Updates to one extension may impact others
Compatibility testing required before upgrading any component
More complex troubleshooting when issues span multiple extensions
Note: May require more resources than ArangoDB due to extension overhead
Software Requirements:
Linux (Ubuntu 20.04+ or RHEL 8+)
PostgreSQL 16+
PostGIS extension
pgvector extension
Apache AGE extension
Connection pooler (PgBouncer)
Monitoring (PostgreSQL-specific tools well-established)
4.1.5.4. Limitations¶
1. Graph Queries Less Elegant
Apache AGE provides Cypher queries over PostgreSQL, but:
Newer extension (2021), less mature than Neo4j or ArangoDB
Graph traversals slower than native graph databases
Combining SQL and Cypher requires switching contexts
Multi-hop queries can require careful optimization
Attestations-as-nodes pattern adds complexity - more node hops than if attestations were just edge properties
2. Multi-Extension Complexity
Three extensions to coordinate (PostGIS, pgvector, AGE), each with its own query patterns, index types, and performance characteristics.
Impact: More cognitive overhead than unified system, though manageable.
3. Unified Query Complexity
Combining all capabilities requires mixing SQL, Cypher, and spatial functions - significantly more verbose than ArangoDB’s unified AQL.
Example of the same query as ArangoDB above would require:
Cypher for graph traversal (Attestations as nodes)
SQL for vector similarity (pgvector functions)
PostGIS for spatial filtering
Complex CTEs or subqueries to combine results
4. Graph Performance Ceiling
Not optimized for graph traversals
Deep graph queries (4+ hops) may require denormalization
Attestation-as-node pattern increases hop count (Thing → Attestation → Name is 2 hops vs. 1 edge)
Network analysis algorithms less efficient than native graph databases
4.1.5.5. Assessment for WHG¶
PostgreSQL provides a viable alternative with:
✅ Guaranteed sustainability (free, permissive license)
✅ All required capabilities (including GeometryCollection)
✅ Proven at scale
✅ Familiar to university IT environments
Trade-offs:
More complex query patterns (especially with Attestations as nodes)
Requires more careful optimization
Multiple extensions to coordinate
Graph performance ceiling for deep traversals
4.1.6. Option 3: Neo4j¶
4.1.6.1. Advantages¶
1. Best Graph Performance
Native graph storage and processing
Optimized for multi-hop traversals
Attestations as nodes is natural pattern
2. Mature Graph Capabilities
Cypher query language well-established
Rich graph algorithms library
Excellent for complex relationship queries
4.1.6.2. Limitations¶
1. Weaker Spatial Support
Limited GeoJSON support, no GeometryCollection support
Spatial plugin less mature than PostGIS
Would likely need secondary system for complex spatial queries
2. Vector Search in Enterprise Only
Enterprise Edition required for vector indexes (phonetic embeddings)
Introduces licensing costs
No vector search in Community Edition
3. Multi-System Architecture Required
Would need separate database for spatial/vector operations
Problematic for solo developer
Data synchronization complexity
4.1.6.3. Assessment for WHG¶
Not recommended given:
Need for secondary system (spatial/vector)
Licensing costs for Enterprise Edition
Operational complexity for small team
4.1.7. Option 4: Vespa¶
4.1.7.1. Why Vespa Was Initially Attractive¶
Vespa was selected for WHG v4 when the architectural priorities emphasized:
1. Vector Search Excellence
Vespa offers industry-leading vector similarity search capabilities:
Native approximate nearest neighbor (ANN) search with HNSW indexes
Optimized for billions of vectors with sub-10ms query latency
Multiple distance metrics (cosine, euclidean, angular, hamming)
Hybrid search combining BM25 text ranking with vector similarity
Real-time updates and indexing
For WHG’s phonetic name matching use case, Vespa represented the gold standard:
Cross-linguistic toponym matching via phonetic embeddings
Matching across different scripts (Latin, Arabic, Chinese, Cyrillic, etc.)
Handling transliteration variations
Scale to 10M+ name variants with high recall
2. Unified Search Platform
Vespa positioned itself as a single system for:
Full-text search (BM25, linguistic features)
Vector similarity search
Structured data queries
Real-time indexing and updates
This “all-in-one” promise aligned with our goal of operational simplicity for a small team.
3. Scalability and Performance
Battle-tested at massive scale (Yahoo, Spotify, OkCupid)
Content distribution capabilities for global access
Low-latency queries even with complex ranking
Horizontal scaling for growing datasets
4. Open Source with Strong Backing
Apache 2.0 license (no licensing concerns)
Backed by Verizon Media/Yahoo
Active development and community
Professional support available
4.1.7.2. Why the Data Model Redesign Changed Everything¶
The evolution toward an attestation-based property graph model fundamentally altered our requirements in ways that expose Vespa’s limitations:
4.1.7.3. Critical Limitation 1: Not a Graph Database¶
The Problem:
Vespa is a document search engine, not a graph database. Our attestation model requires:
Attestations as first-class nodes with their own identifiers
Multi-hop graph traversals: Thing → Attestation → Name → Authority → Period
Bidirectional relationships: Finding all Things connected to a Source, or all Sources supporting an Attestation
Meta-attestations: Attestations about other Attestations (scholarly discourse)
Provenance chains: Tracing claims through multiple levels of citation
Vespa’s Approach:
Vespa treats everything as documents with embedded relationships. To model our graph:
// Vespa document - all relationships embedded
{
"thing_id": "constantinople",
"names": [
{
"name": "Constantinople",
"attestation_id": "att-001",
"sources": ["source-123"],
"certainty": 0.9,
"timespan": {...}
},
{
"name": "İstanbul",
"attestation_id": "att-002",
"sources": ["source-456"],
"certainty": 1.0,
"timespan": {...}
}
],
"geometries": [...],
"relations": [...]
}
Problems with this approach:
Attestations lose identity: They become anonymous nested objects, not addressable entities
No meta-attestations: Cannot create attestations about attestations when attestations aren’t nodes
Duplication: Same source/timespan data duplicated across many documents
Query complexity: Finding “all Things citing source X” requires scanning all documents
Update complexity: Changing a source citation requires updating potentially thousands of documents
No graph algorithms: Cannot run centrality analysis, shortest path, community detection on embedded data
Graph Database Approach (ArangoDB/Neo4j):
// Attestation is a node
{
"_id": "attestations/att-001",
"certainty": 0.9,
"notes": "..."
}
// Connected via edges
Thing ──[subject_of]──> Attestation ──[attests_name]──> Name
│
└──[sourced_by]──> Authority
│
└──[attests_timespan]──> Timespan
Benefits:
Each entity addressable
Bidirectional traversal native
Meta-attestations natural (new edges to attestation nodes)
Updates affect only changed nodes
Graph algorithms work natively
4.1.7.4. Critical Limitation 2: No GeoJSON Support¶
The Problem:
Historical places have complex geometries:
Multiple disputed boundaries for the same period
Territorial changes over time
Points, polygons, and linestrings for the same entity
Imprecise historical regions
Required GeoJSON types:
Point (settlements, monuments)
MultiPoint (scattered settlement patterns)
LineString (roads, rivers, walls)
MultiLineString (trade routes, river systems)
Polygon (territories, regions)
MultiPolygon (non-contiguous territories, islands)
GeometryCollection (mixed geometry types)
Vespa’s Spatial Capabilities:
Vespa only supports 2D positions (latitude/longitude pairs):
No polygon support
No linestring support
No multi-geometry support
No GeometryCollection
Workaround Required:
To store a historical territory boundary in Vespa:
{
"location": [28.98, 41.01], // Single representative point
"boundary_encoded": "base64_encoded_geojson_string", // Custom encoding
"boundary_wkt": "POLYGON((...))" // Alternative encoding
}
Problems:
No spatial queries: Cannot query “places within this polygon” or “territories intersecting this region”
Custom decoding: Application must decode geometries for display
No spatial indexing: Representative point only, not actual boundaries
Geometry comparison: Cannot compute overlaps, containment, distance to polygons
Result: Would need a secondary spatial database (PostGIS) anyway, defeating the “single system” goal.
4.1.7.5. Critical Limitation 3: Document Model vs. Attestation Model¶
WHG’s Attestation Philosophy:
Every claim is:
Explicitly sourced (one or more sources)
Temporally bounded (valid during specific timespan)
Uncertainty-qualified (certainty level with explanation)
Independently assessable (can be challenged, supported, superseded)
Example: “Constantinople was called İstanbul (1930-present)”
This is one attestation node connecting:
Thing: Constantinople
Name: İstanbul
Authority (source): Turkish Geographic Board
Authority (relation type): has_name
Timespan: 1930-present
Certainty: 1.0
Vespa’s Document Model:
Would require embedding this information structure within a Constantinople document:
{
"thing_id": "constantinople",
"names": [{
"name": "İstanbul",
"language": "tr",
"sources": ["Turkish Geographic Board"],
"start_year": 1930,
"certainty": 1.0
}]
}
What’s Lost:
Source reification: “Turkish Geographic Board” is just a string, not a linked entity with its own metadata (authority type, URI, dates)
Relation types as data: “has_name” is implicit in the structure, not an entity with inverse relationships, domain/range constraints
Temporal precision: Reduced to simple year fields instead of four-field PeriodO model with earliest/latest bounds
Provenance depth: Cannot represent that “Source A cited Source B” or “Attestation X supersedes Attestation Y”
Scholarly discourse: Cannot model debates, challenges, revisions to attestations
4.1.7.6. Critical Limitation 4: Query Pattern Mismatch¶
Query Examples Our Model Requires:
“What sources support the claim that X was called Y in period Z?”
Graph: Traverse Thing → Attestations → Names (filter) → Sources + Timespans
Vespa: Scan all documents, filter embedded names array, extract source strings
“Which places cite source S and have uncertain boundaries?”
Graph: Traverse Source → Attestations (filter certainty) → Things → Geometries (filter precision)
Vespa: Full corpus scan checking embedded source strings and geometry metadata
“Show scholarly disagreement about location X’s temporal bounds”
Graph: Find all Timespan attestations for X, compare, identify conflicts
Vespa: Parse embedded temporal data from document, requires application logic
“Find attestations about attestation A” (meta-attestations)
Graph: Native - just query edges pointing to attestation node A
Vespa: Impossible - attestations aren’t addressable entities
4.1.7.7. What Vespa Does Exceptionally Well (But We Don’t Need)¶
1. Content Ranking and Recommendation
Vespa excels at:
Hybrid BM25 + vector search with sophisticated ranking models
Learning-to-rank with ML models for relevance tuning
Personalized recommendations
A/B testing ranking strategies
WHG’s needs are different: While we do need to rank candidates within reconciliation clusters (by combined similarity scores: name + spatial + temporal), this is straightforward scoring logic that can be implemented in AQL queries. Vespa’s advanced learning-to-rank and personalization features (designed for search engines and recommendation systems) exceed our simpler multi-dimensional similarity requirements.
2. Real-Time Streaming Updates
Vespa handles:
High-velocity data ingestion
Immediate index updates
Partial document updates
WHG doesn’t need this: Historical data changes slowly; batch updates are sufficient.
3. Content Distribution
Vespa provides:
Multi-region deployment
Content replication
Edge caching
WHG doesn’t need this: Single-region deployment at Pitt is adequate; we’re not a CDN.
4.1.7.8. The Bottom Line¶
Vespa was the right choice for a vector-search-first architecture with embedded relationships.
Vespa is the wrong choice for a property graph architecture with reified attestations.
The data model redesign privileged:
Historiographical precision over search ranking
Provenance depth over query speed
Scholarly discourse modeling over content distribution
Graph semantics over document embeddings
These priorities make Vespa fundamentally unsuitable, despite its vector search excellence.
4.1.7.9. Why Not Use Vespa for Vectors + Another DB for Graph?¶
Considered Multi-System Architecture:
Vespa: Vector search on names
Neo4j/ArangoDB: Graph relationships
Sync embeddings from graph to Vespa
Problems:
Complexity explosion: Two databases to maintain, monitor, backup
Synchronization overhead: Keeping vector index aligned with graph
Query splitting: Every search requires querying both systems and merging results
Operational burden: Unacceptable for solo developer
Data integrity: Risk of desynchronization between systems
ArangoDB’s Answer:
FAISS-backed vector indexes provide 80% of Vespa’s vector performance with native graph support, eliminating the need for multiple systems. The 20% performance gap is acceptable given operational simplicity.
4.1.7.10. Assessment¶
Vespa is unsuitable for the attestation-based property graph model that defines WHG v4. While its vector search capabilities are best-in-class, the fundamental mismatch between document-oriented search and graph-oriented historical knowledge representation makes it an architectural dead-end for our requirements.
The shift from Vespa to ArangoDB reflects the evolution from “gazetteer as search engine” to “gazetteer as historical knowledge graph.”
4.1.8. Detailed Comparison Matrix¶
Feature |
ArangoDB |
PostgreSQL + Extensions |
Neo4j Enterprise |
Vespa |
|---|---|---|---|---|
Licensing Model |
BSL 1.1 CE / Commercial EE |
PostgreSQL (permissive) |
GPL CE / Commercial EE |
Apache 2.0 |
Academic Viability |
Requires negotiation |
✅ Guaranteed |
Requires negotiation |
✅ Guaranteed |
Dataset Limits |
100GB CE / None EE |
✅ None |
✅ None |
✅ None |
Attestations as Nodes |
✅✅✅ Natural |
✅ Via AGE |
✅✅✅ Natural |
❌ Manual |
Graph Queries |
✅✅ Native AQL |
⚠️ AGE/Cypher |
✅✅✅ Native Cypher |
❌ Manual |
Graph Performance |
✅✅ Excellent |
⚠️ Good with tuning |
✅✅✅ Best |
❌ Poor |
Vector Search |
✅ Good (v3.10+) |
✅ Good (pgvector) |
✅ EE only (v5.11+) |
✅✅✅ Best |
Phonetic Embeddings |
✅ Supported |
✅ Supported |
✅ EE only |
✅✅✅ Ideal |
GeoJSON Support |
✅✅ Native (6 types) |
✅✅✅ PostGIS (all types) |
⚠️ Plugin, limited |
❌ None |
GeometryCollection |
❌ Not supported |
✅✅✅ Fully supported |
❌ Not supported |
❌ Not supported |
Single System |
✅✅✅ Yes |
✅✅✅ Yes |
❌ Need additional DB |
✅ Yes (but inadequate) |
Query Elegance |
✅✅✅ Unified AQL |
⚠️ SQL + Cypher + spatial |
✅✅ Cypher |
❌ Complex custom code |
Operational Complexity |
✅ Low |
✅ Low-Medium |
❌ High (multi-system) |
⚠️ Medium |
Fit for WHG Model |
✅✅✅ Excellent |
✅✅ Very Good |
⚠️ Graph only |
❌ Poor |
4.1.9. Recommendation¶
4.1.9.1. Primary Option: ArangoDB¶
ArangoDB represents the best technical fit for WHG v4 based on:
Data Model Alignment: Native property graph with Attestations as nodes (not just edge metadata) directly matches our architecture (see WHG v4 Data Model)
AUTHORITY Pattern: Single collection with discriminator pattern simplifies queries
GeoJSON Support: Native support for 6 GeoJSON types sufficient for our needs; GeometryCollection limitation naturally addressed by our multi-attestation model where each geometry is a separate claim
Query Integration: Unified AQL for graph traversal + spatial + vector + temporal operations reduces complexity
Operational Simplicity: Single system manageable by small team
Adequate Scale: Handles anticipated ~500GB dataset
Vector Capabilities: Suitable for phonetic name embedding search
Critical dependency: Securing sustainable licensing terms for academic/research use.
WHG Project Context:
Non-commercial research infrastructure
Free access to all users
Primary funding from University of Pittsburgh
Grant-dependent for operational costs
Solo technical team
Long-term sustainability essential
Deployment Preference: Self-hosted at Pitt, but open to ArangoDB-managed hosting if terms are favorable and reduce operational burden.
Request: Academic licensing discussion with ArangoDB to determine viability, including both self-hosted and managed hosting options.
4.1.9.2. Alternative: PostgreSQL + Extensions¶
PostgreSQL provides a viable fallback with:
Guaranteed long-term sustainability
Full GeometryCollection support via PostGIS
Graph capabilities via Apache AGE (with Attestations as nodes)
All required capabilities (with greater complexity)
Scenario: If ArangoDB licensing proves unsustainable, PostgreSQL is a defensible alternative that meets all requirements, though with more complex query patterns for our graph model.
4.1.10. Next Steps¶
Initiate licensing discussion with ArangoDB regarding academic use case
Provide project details: scale, funding model, technical team size, mission
Explore deployment options: self-hosted vs. managed hosting pricing
Benchmark vector search: Test
APPROX_NEAR_COSINE()performance with realistic phonetic embeddings (10M+ vectors) early in evaluation processValidate graph queries: Test Attestation-as-node pattern with multi-hop queries combining graph traversal, geospatial constraints, and vector filtering
Decision timeline: Required for v4 development planning (target: Q2 2025)
4.1.11. Conclusion¶
The shift to an attestation-based property graph model with Attestations as nodes (not just edge metadata) necessitates reconsideration of the database backend documented in previous WHG v4 plans (Vespa). For full details on the data model, see WHG v4 Data Model Documentation.
ArangoDB emerges as the optimal technical solution, offering native support for:
Property graphs where Attestations are first-class nodes
Document collections for Things, Names, Geometries, Timespans, and Authorities
Unified AUTHORITY collection pattern via discriminator
GeoJSON geometries for complex historical boundaries
Vector similarity search for phonetic name matching (via Siamese BiLSTM embeddings)
All capabilities integrated in a single system with unified AQL query language
The lack of GeometryCollection support is adequately addressed by our attestation model, which naturally accommodates multiple geometry attestations per Thing—each with its own provenance, temporal bounds, and certainty assessment.
The viability of ArangoDB depends on establishing sustainable licensing appropriate for an academic, non-commercial research infrastructure project funded primarily by University of Pittsburgh.
PostgreSQL with extensions provides a proven alternative that guarantees sustainability and includes full GeometryCollection support, but requires accepting greater query complexity (especially for graph traversals with Attestations as nodes) and careful performance optimization.
The database choice is a critical architectural decision that will shape WHG v4’s development timeline, operational requirements, and long-term sustainability.