4.1. Database Technology Assessment¶

4.1.1. TL;DR¶

WHG v4 is transitioning to a property graph data model centered on attestations (source-backed claims about historical places). We need a database that natively supports:

Property graphs with rich node and edge metadata
GeoJSON geometries for complex historical boundaries
Vector similarity search for phonetic name matching
Temporal queries
Single integrated system (solo technical team)

Assessment finding: ArangoDB is the optimal technical fit - native multi-model architecture matches our requirements precisely, operational simplicity suits our team size, and superior GeoJSON support handles complex historical geometries.

Critical dependency: Sustainable licensing for academic/non-commercial research infrastructure funded primarily by University of Pittsburgh with grant support.

Alternative: PostgreSQL with extensions provides fallback but requires accepting greater complexity.

Hosting: Prefer self-hosted at Pitt, but open to ArangoDB-managed solution depending on terms.

Next step: Licensing discussion to determine viability.

Data model details: See WHG v4 Data Model Documentation

4.1.2. Executive Summary¶

The World Historical Gazetteer (WHG) is undergoing a fundamental redesign of its data model for v4, transitioning from a traditional relational structure to a property graph model centered on attestations. This document evaluates database technologies suitable for this new architecture.

What is WHG?

Free, open-access research infrastructure for historical geography
Serves researchers worldwide with temporal-aware place data
Supports discovery, linking, and analysis of historical place information
Non-commercial academic project

Understanding the Data Model

WHG v4 is built around Things (entities such as locations, historical entities, collections, periods, routes, itineraries, and networks) connected through Attestations (nodes that bundle source-backed claims with temporal bounds and certainty assessments).

Complete technical details: WHG v4 Data Model Documentation

Key concepts:

Thing: Any entity that can be described (what we used to call “places” but now expanded)
Attestation: A node that bundles claims connecting a Thing to information (names, geometries, timespans, types, or other Things)
Authority: Unified reference data for sources, datasets, relation types, periods, and certainty levels
Provenance: Every claim cites sources via AUTHORITY nodes, has certainty assessment, and temporal bounds
Graph structure: Attestations are nodes (not just edges) in a property graph where edges connect Things → Attestations → Names/Geometries/Timespans/Things

Key Requirements:

Property graph with Attestations as first-class nodes (not just edge metadata)
Native GeoJSON support for complex historical geometries
Vector similarity search for phonetic name matching across scripts/languages
Temporal query capabilities
Single integrated system (critical for small technical team)
Sustainable licensing for academic/non-commercial use

Project Context:

Funding: Primarily supported by University of Pittsburgh with grant funding
Technical Team: Currently one developer
Scale: Anticipated 500GB-1TB working dataset, 73M+ nodes
Mission: Free, open-access research infrastructure for historical geography
Hosting: Intended self-hosted at University of Pittsburgh infrastructure

4.1.3. Data Volume Projection¶

4.1.3.1. Source Data Estimates¶

Source	Records	Avg Size	Total
GeoNames	12M	2KB	24GB
OpenStreetMap (subset)	6M	1KB	6GB
Wikidata Places	10M	3KB	30GB
Pleiades	38K	50KB	2GB
Contributed Data	1-5M	10KB	10-50GB
Subtotal	~28M	-	72-112GB

4.1.3.2. WHG Attestation Layer¶

Per Thing Estimates (average):

Node Collections:

5 NAME references (shared nodes, not duplicated) ≈ negligible per Thing
2 GEOMETRY nodes × 2KB (GeoJSON + derived fields) = 4KB
3 TIMESPAN nodes × 0.5KB (4-field temporal model) = 1.5KB
5 ATTESTATION nodes × 0.8KB (certainty, notes, metadata) = 4KB
3 AUTHORITY references (shared across many Things) ≈ negligible per Thing

Edge Collection:

~25 edges per Thing × 0.2KB = 5KB
- Thing → Attestation (subject_of): 5 edges
- Attestation → Name/Geometry/Timespan (attests_*): 10 edges
- Attestation → Authority (typed_by, sourced_by): 8 edges
- Attestation → Thing (relates_to for relations): 2 edges

Total per Thing: ~14.5KB

For 28M Things:

Node data (Geometries, Timespans, Attestations): 73M × 9.5KB = ~266GB
Edge data: 73M × 5KB = ~140GB
Subtotal per-Thing data: ~406GB

Shared Collections (across all Things):

NAME nodes: ~20GB (15M unique names with embeddings)
AUTHORITY nodes: ~0.1GB (56K shared authorities)
Subtotal shared data: ~426GB

Vector Embeddings:

Estimated unique NAME nodes: ~15M

Embedding storage:

15M unique names × 256 dimensions × 4 bytes = ~15GB raw embedding data
Plus name metadata: ~4.5GB
Total NAME collection: ~20GB

4.1.3.3. AUTHORITY Collection (Shared)¶

Estimate:

50K datasets and sources
500 relation types
5K period definitions (PeriodO)
Total: ~56K AUTHORITY records × 2KB = ~112MB

(Negligible compared to attestation layer - authorities are heavily reused)

4.1.3.4. Total Raw Data¶

Component	Size
Source data (Things + basic attributes)	72-112GB
Attestation layer (per-Thing nodes + edges)	406GB
Shared NAME nodes (with embeddings)	20GB
Shared AUTHORITY nodes	0.1GB
Total uncompressed	~500-540GB

4.1.3.5. Database-Specific Storage Characteristics¶

4.1.3.5.1. ArangoDB¶

Compression:

Document compression: 2-3x typical ratio
Edge compression: Similar 2-3x ratio
Vector indexes: Quantization can reduce by 4-8x
Estimated working set: 200-300GB

Index Overhead:

Geo indexes (S2): ~20% of geometry data = ~25GB
Vector indexes (FAISS IVF): ~30% of NAME embeddings = ~5GB
Full-text indexes: ~15% of text data = ~15GB
Edge indexes (automatic): ~10% of edge data = ~14GB
Total indexes: ~60GB

Final ArangoDB estimate: 260-360GB

4.1.3.5.2. PostgreSQL + Extensions¶

Index Overhead:

Vector index overhead (HNSW): ~40-50% of embedding data = ~8GB
PostGIS indexes (GIST): ~25% of geometry data = ~31GB
AGE graph indexes: ~15% of graph data = ~60GB
Full-text indexes: ~20% of text data = ~20GB
Total indexes: ~120GB

Final PostgreSQL estimate: 400-500GB

Note: PostgreSQL may require more storage than ArangoDB due to:

Multiple extension indexes with some overlap
AGE graph metadata duplication (stores both relational and graph views)
Less efficient edge storage (normalized tables vs. native edge collections)

4.1.3.6. Growth Projections¶

Timeline	Dataset Size	Storage (ArangoDB)	Storage (PostgreSQL)
Year 1	28M Things baseline	260-360GB	400-500GB
Year 5	+5M contributed Things	400-500GB	550-650GB

Growth characteristics: Modest incremental growth after initial bulk ingestion (~3-5% new Things annually) combined with attestation enrichment (~5-10% per year).

4.1.3.7. Memory Requirements¶

For optimal performance, working set should fit in RAM:

ArangoDB:

Active data: 200-300GB
Vector indexes: 5GB (frequently accessed)
Geo indexes: 25GB
Hot edge indexes: 14GB
Recommended RAM: 128GB (allows OS cache + query working memory)
Minimum acceptable: 64GB with aggressive caching

PostgreSQL:

Shared buffers: 25-40% of RAM (50-75GB)
Vector index cache: 8GB
PostGIS geometry cache: 31GB
AGE graph cache: 60GB
OS page cache: remainder
Recommended RAM: 192GB (for comfortable operation with all extensions)
Minimum acceptable: 128GB with careful tuning

Note: ArangoDB’s lower memory requirements reflect more efficient integrated indexing vs. PostgreSQL’s multiple independent extension caches.

4.1.3.8. Storage I/O Requirements¶

Random reads (most queries involve):

Graph traversals (edge lookups)
Vector similarity search (index navigation)
Spatial queries (geo index traversal)

NVMe SSD strongly recommended:

~100K IOPS for concurrent queries
<1ms latency for graph hops
Sustained throughput for bulk operations

Capacity: 5-10TB to accommodate:

Database growth (5 years)
Backup snapshots
WAL/journal files
Temporary query space

4.1.4. Option 1: ArangoDB¶

4.1.4.1. Technical Advantages¶

1. Optimal Data Model Alignment

ArangoDB’s native multi-model architecture (graph + document + geospatial) maps directly to WHG’s attestation model where Attestations are nodes (not just edges):

// Things as documents
{
  "_key": "constantinople",
        "_id": "things/constantinople",
        "thing_type": "location",
        "description": "Major Byzantine/Ottoman city on the Bosphorus",
        "primary_name": "Constantinople",  // denormalized
        "representative_point": [28.98, 41.01]  // denormalized
}

// Names as documents with embeddings
{
  "_key": "name-istanbul-tr",
        "_id": "names/name-istanbul-tr",
        "name": "İstanbul",
        "language": "tr",
        "script": "Latn",
        "embedding": [0.234, -0.567, ...]  // 256-dimensional
}

// Attestations as nodes (not edges)
{
  "_key": "att-001",
        "_id": "attestations/att-001",
        "certainty": 1.0,
        "certainty_note": "Official administrative name change",
        "notes": "Adopted after establishment of Turkish Republic"
}

// Edges connect the graph
{
  "_from": "things/constantinople",
        "_to": "attestations/att-001",
        "edge_type": "subject_of"
}

{
  "_from": "attestations/att-001",
        "_to": "names/name-istanbul-tr",
        "edge_type": "attests_name"
}

{
  "_from": "attestations/att-001",
        "_to": "authorities/source-turkish-gov",
        "edge_type": "sourced_by"
}

Key advantage: Attestations as nodes enables meta-attestations (attestations about attestations), rich provenance chains, and natural modeling of scholarly discourse.

2. Unified AUTHORITY Collection

Single table inheritance via authority_type discriminator replaces multiple collections:

// Source authority
{
    "_id": "authorities/al-tabari",
    "authority_type": "source",
    "citation": "Al-Tabari, History of the Prophets and Kings",
    "source_type": "manuscript"
}

// Dataset authority
{
    "_id": "authorities/dataset-123",
    "authority_type": "dataset",
    "title": "Islamic Cities Database",
    "doi": "doi:10.83427/whg-dataset-123"
}

// Relation type authority
{
    "_id": "authorities/member-of",
    "authority_type": "relation_type",
    "label": "member_of",
    "inverse": "contains",
    "description": "Subject is part of object entity"
}

// Period authority (PeriodO integration)
{
    "_id": "authorities/abbasid",
    "authority_type": "period",
    "label": "Abbasid Caliphate",
    "uri": "periodo:p0abbasid",
    "start_earliest": "750-01-01",
    "stop_latest": "1258-12-31"
}

This eliminates multiple table joins and simplifies queries.

3. Superior GeoJSON Support

ArangoDB provides native GeoJSON storage and querying with support for Point, MultiPoint, LineString, MultiLineString, Polygon, and MultiPolygon types:

// Store complex geometries
db.geometries.insert({
    _key: "geom-constantinople",
    geom: {
        type: "MultiPolygon",
        coordinates: [
            // First polygon: city center
            [[
                [28.94, 41.01],
                [29.00, 41.01],
                [29.00, 41.05],
                [28.94, 41.05],
                [28.94, 41.01]
            ]],
            // Second polygon: disputed boundary region
            [[
                [28.90, 41.00],
                [28.92, 41.00],
                [28.92, 41.02],
                [28.90, 41.02],
                [28.90, 41.00]
            ]]
        ]
    },
    representative_point: [28.97, 41.03],
    precision: ["historical_approximate"]
})

// Geospatial queries
FOR geom IN geometries
  FILTER GEO_DISTANCE(geom.representative_point, @query_point) < 50000
  RETURN geom

Key advantages:

S2-based geospatial indexing for spherical geometry
Supports all standard GeoJSON types (Point, MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon)
Native handling of complex historical geometries without conversion
No encoding or workarounds required

Note on GeometryCollection: ArangoDB does not support the GeometryCollection type. For WHG data containing heterogeneous geometry sets (e.g., a place represented by both a point and a polygon), we would either:

Store multiple geometry attestations (one per geometry type) - aligns naturally with our attestation model where each geometry is a separate claim
Convert to MultiPolygon by buffering points and lines into small polygons where single-geometry representation is required

Comparison: PostgreSQL PostGIS supports GeometryCollection natively; Neo4j spatial plugin has limited GeoJSON support; Vespa has no GeoJSON support.

4. Unified Query Language (AQL)

AQL integrates graph traversal, document filtering, geospatial queries, and vector similarity in a single, coherent language. With our model where Attestations are nodes:

// Complex query: Find places with similar names near Constantinople in 13th century
LET query_embedding = @query_vector
LET query_point = [28.98, 41.01]
LET query_start = DATE_TIMESTAMP("1200-01-01")
LET query_end = DATE_TIMESTAMP("1300-12-31")

FOR thing IN things
  // Spatial constraint
  FILTER GEO_DISTANCE(thing.representative_point, query_point) < 50000
  
  // Find names via graph traversal
  LET matching_names = (
    FOR att IN attestations
      FOR e1 IN edges
        FILTER e1._from == thing._id
        FILTER e1._to == att._id
        FILTER e1.edge_type == "subject_of"
        
        FOR e2 IN edges
          FILTER e2._from == att._id
          FILTER e2.edge_type == "attests_name"
          LET name = DOCUMENT(e2._to)
          
          // Vector similarity (index-accelerated)
          LET similarity = APPROX_NEAR_COSINE(name.embedding, query_embedding)
          FILTER similarity > 0.8
          
          // Check temporal validity
          FOR e3 IN edges
            FILTER e3._from == att._id
            FILTER e3.edge_type == "attests_timespan"
            LET ts = DOCUMENT(e3._to)
            FILTER ts.start_latest <= query_end
            FILTER ts.stop_earliest >= query_start
            
            RETURN {name: name.name, similarity: similarity}
  )
  
  FILTER LENGTH(matching_names) > 0
  
  // Find network connections
  LET trade_partners = (
    FOR att IN attestations
      FOR e1 IN edges
        FILTER e1._from == thing._id
        FILTER e1._to == att._id
        
        FOR e2 IN edges
          FILTER e2._from == att._id
          FILTER e2.edge_type == "typed_by"
          LET rel_type = DOCUMENT(e2._to)
          FILTER rel_type.label == "connected_to"
          
          FOR e3 IN edges
            FILTER e3._from == att._id
            FILTER e3.edge_type == "relates_to"
            LET partner = DOCUMENT(e3._to)
            RETURN partner
  )
  
  LIMIT 10
  RETURN {
    thing: thing,
    names: matching_names,
    trade_partners: trade_partners
  }

This eliminates the need to combine multiple query languages or systems. The graph traversal naturally handles the Attestation-as-node pattern.

5. Operational Simplicity

Critical for small team: Single system handles:

✅ Graph queries (Attestations as nodes, multi-hop traversals)
✅ Document search (full-text, filtering)
✅ Vector similarity (name matching via embeddings)
✅ Geospatial (GeoJSON native)
✅ Temporal queries (efficient range queries)

Impact:

One database to maintain and monitor
Single backup/recovery strategy
Unified deployment process
No data synchronization between systems
Reduced cognitive load for solo developer

6. JSON-Native Architecture

Every node is a JSON document, matching WHG’s data modeling approach and facilitating:

Flexible schema evolution (AUTHORITY discriminator pattern)
Natural representation of complex temporal structures
Direct mapping from Linked Places Format (LPF)
Easy integration with Django REST framework

7. Vector Indexes for Phonetic Name Matching

ArangoDB provides vector indexes for approximate nearest neighbor search, powered by the FAISS library:

db.names.ensureIndex({
  type: "vector",
  fields: ["embedding"],
  params: {
    metric: "cosine",
    dimension: 384  // or dimension appropriate for model
  }
})

AQL Vector Search uses the APPROX_NEAR_COSINE() function for index-accelerated queries:

FOR name IN names
  LET similarity = APPROX_NEAR_COSINE(name.embedding, @query_embedding)
  FILTER similarity > 0.8
  SORT similarity DESC
  LIMIT 10
  RETURN {name: name.name, similarity: similarity}

WHG Use Case: Our embeddings are derived from phonetic representations of toponyms using a Siamese BiLSTM model trained on GeoNames data with PanPhon feature sequences of IPA transcriptions. These phonetic embeddings encode pronunciation similarity rather than semantic meaning.

Names across different scripts (e.g., “Constantinople” ↔ “Κωνσταντινούπολις” ↔ “القسطنطينية”)
Transliteration variations (e.g., “Samarkand” ↔ “Samarqand” ↔ “Самарканд”)
Historical name forms that sound similar but are spelled differently

Scale: Name similarity search across 10M+ name variants with phonetic embeddings.

Note: AQL also provides a non-indexed COSINE_SIMILARITY() function for exact similarity calculations, but APPROX_NEAR_COSINE() is the index-accelerated function required for performance at scale. ArangoDB’s vector search capabilities were added more recently than its core graph and geospatial features, so early benchmarking with WHG’s specific phonetic embedding workload is recommended to validate performance characteristics.

4.1.4.2. Deployment Considerations¶

Preferred Approach: Self-Hosted at University of Pittsburgh

WHG intends to host ArangoDB on University of Pittsburgh infrastructure (Pitt CRC).

Hardware Requirements (based on 260-360GB working set):

CPU:

Minimum: 16 cores
Recommended: 24-32 cores
Purpose: Query parallelization, vector index operations, graph traversals, concurrent user requests
Note: Multi-core critical for AQL query optimization and FAISS vector search

RAM:

Minimum: 64GB
Recommended: 128GB
Breakdown:
- Active data in memory: ~200-300GB working set target
- Vector indexes (hot): ~5GB
- Geo indexes: ~25GB
- Edge indexes: ~14GB
- ArangoDB processes: ~10-15GB
- OS and cache: ~20-30GB
Rationale: Keeping frequently-accessed data in RAM is critical for sub-100ms query response

Storage:

Primary Database: 1.5TB NVMe SSD
Backup Storage: 3-4TB (separate volume)
- 3x full database snapshots
- Incremental backups
- Export archives
Performance Requirements:
- NVMe SSD strongly recommended (not SATA SSD)
- ~50K+ random IOPS for graph traversals
- <1ms latency for edge lookups
- Sustained read: 2-3GB/s
- Sustained write: 1-2GB/s

Network:

Internal: 10Gbps minimum for single-server deployment
If clustered (3+ nodes): 25Gbps low-latency interconnect
External: Standard university network connection sufficient for web application traffic

Software Requirements:

OS: Linux (Ubuntu 22.04 LTS or RHEL 9 recommended)
Database: ArangoDB Enterprise Edition 3.12+
Monitoring: Prometheus + Grafana (for metrics dashboards)
Backup: ArangoDB’s built-in tools + custom scripts
Optional: Docker/Kubernetes for containerized deployment

Operational Requirements:

System administration: Pitt CRC support for:
- Initial server provisioning
- OS-level maintenance and security patches
- Network configuration
- Storage management
Database administration (can be handled by WHG developer with training):
- Query optimization
- Index tuning
- Backup management
- Monitoring and alerting configuration
Backup strategy:
- Automated nightly snapshots
- Weekly full backups
- Offsite replication (Pitt backup infrastructure)
- 30-day retention policy
Monitoring and alerting:
- Query performance metrics
- Resource utilization (CPU, RAM, disk I/O)
- Error rates and warnings
- Backup success/failure notifications

Alternative: ArangoDB-Managed Hosting (ArangoGraph)

Given practical considerations, we are open to ArangoDB-managed hosting as an alternative to self-hosting.

Benefits of managed hosting:

Reduces operational burden on solo developer
Professional 24/7 database administration
Automated backups and updates
Better resource scaling and elasticity
Faster time to production
No hardware procurement lead time
Included monitoring and alerting

Tradeoffs:

Potentially higher ongoing costs vs. self-hosted
Less control over infrastructure
Dependency on vendor availability
Data egress costs if migrating away

Consideration: Managed hosting pricing should be part of licensing discussion, as it may provide better total cost of ownership than self-hosted given staffing constraints and solo developer operational burden.

Recommendation: Start with pricing quote for managed hosting (ArangoGraph) to compare against self-hosted TCO before making final deployment decision.

4.1.4.3. Limitations¶

1. Licensing Constraints

Community Edition (v3.12+):

100 GiB dataset limit per cluster
Restricted to non-commercial use under BSL 1.1
Insufficient for WHG (requires 600-750GB)

Enterprise Edition:

Required for WHG’s dataset size
Licensing terms require negotiation
Academic/non-commercial pricing available but not published
Uncertainty in long-term cost sustainability

2. GeometryCollection Not Supported

ArangoDB does not support the GeoJSON GeometryCollection type. For WHG:

Workaround: Store separate geometry attestations for each geometry type - this aligns perfectly with our attestation model where each geometry is a separate claim with its own provenance
Alternative: Convert to MultiPolygon where needed by buffering points/lines
Impact: Minimal - our attestation-as-node model naturally accommodates multiple geometries per Thing, each with separate sources and temporal bounds

3. Vendor Ecosystem

Smaller community than PostgreSQL or Neo4j
Proprietary query language (AQL) creates some lock-in
Fewer third-party tools and integrations
Limited pool of developers with ArangoDB experience

4. Vector Search Maturity

Vector indexes powered by FAISS library
Index-accelerated similarity search via APPROX_NEAR_COSINE() function
Less battle-tested than specialized systems (Vespa, pgvector) for very large vector datasets
Performance characteristics at 10M+ vectors should be validated
Critical: Early benchmarking with WHG’s phonetic embedding workload (10M+ name variants) is essential to validate that performance meets requirements before committing to this architecture

4.1.4.4. Assessment for WHG¶

ArangoDB represents the best technical fit for WHG’s requirements:

Attestations as nodes enable rich provenance and meta-attestations
AUTHORITY collection pattern simplifies queries
Graph traversal naturally handles complex relationship patterns
Excellent GeoJSON support (GeometryCollection limitation addressed by our model)
Unified architecture for small team
Scales to anticipated dataset size
Single system reduces operational complexity

Critical question: Can sustainable licensing be negotiated for academic/research use?

4.1.5. Option 2: PostgreSQL + Extensions¶

4.1.5.1. Architecture¶

PostgreSQL 16+ with:

PostGIS: Geospatial (GeoJSON, spatial indexes)
pgvector: Vector similarity search (embeddings)
Apache AGE: Graph database (Cypher-compatible queries)
pg_trgm: Fuzzy text matching
Full-text search: Built-in (tsquery/tsvector)

4.1.5.2. Advantages¶

1. Licensing & Sustainability ⭐

PostgreSQL License (permissive, OSI-approved)
No dataset size limits
No commercial restrictions
No vendor lock-in
Sustainable for academic use indefinitely

2. Proven at Scale

Powers massive applications (Instagram, Spotify, Reddit)
Organizations successfully run multi-TB Postgres deployments
Excellent query optimizer
Mature partitioning and sharding
30+ years of continuous development

3. Excellent Geospatial (PostGIS)

Industry standard for GIS
Full GeoJSON support: geometry(GEOMETRY, 4326) columns
Supports GeometryCollection: Can store heterogeneous geometry types per record
Sophisticated spatial indexing (GIST, SP-GIST)
Rich spatial functions
Superior to ArangoDB for GeometryCollection support

Example:

-- Store GeoJSON including GeometryCollection
CREATE TABLE geometries (
  id UUID PRIMARY KEY,
  geom GEOMETRY(GEOMETRYCOLLECTION, 4326),
  precision VARCHAR(20)
);

INSERT INTO geometries VALUES (
  gen_random_uuid(),
  ST_GeomFromGeoJSON('{
    "type": "GeometryCollection",
    "geometries": [
      {"type": "Point", "coordinates": [28.98, 41.01]},
      {"type": "Polygon", "coordinates": [[[28.94, 41.01], ...]]}
    ]
  }'),
  'mixed'
);

-- Geospatial query
SELECT * FROM geometries
WHERE ST_DWithin(
  geom,
  ST_SetSRID(ST_MakePoint(28.98, 41.01), 4326),
  50000  -- 50km
);

4. Strong Vector Search (pgvector)

HNSW indexes for approximate nearest neighbor search
Handles millions of vectors efficiently
Active development (major improvements 2023-2024)
Mature enough for production use
Suitable for phonetic name embeddings

Example:

CREATE TABLE names (
  id UUID PRIMARY KEY,
  name TEXT,
  phonetic_embedding vector(384)  -- from phonetic representation
);

CREATE INDEX ON names USING hnsw (phonetic_embedding vector_cosine_ops);

-- Phonetic similarity query
SELECT name, 1 - (phonetic_embedding <=> query_vector) AS similarity
FROM names
ORDER BY phonetic_embedding <=> query_vector
LIMIT 10;

5. Graph Support via Apache AGE

Apache AGE enables property graph modeling in PostgreSQL with Cypher queries. For our Attestation-as-node model:

-- Create graph schema
SELECT create_graph('whg_graph');

-- Insert nodes (Things, Attestations, Names, Authorities)
SELECT * FROM cypher('whg_graph', $$
  CREATE (t:Thing {id: 'constantinople', type: 'location'})
  CREATE (a:Attestation {id: 'att-001', certainty: 1.0})
  CREATE (n:Name {id: 'name-istanbul', name: 'İstanbul'})
  CREATE (auth:Authority {id: 'source-123', type: 'source'})
  CREATE (t)-[:SUBJECT_OF]->(a)
  CREATE (a)-[:ATTESTS_NAME]->(n)
  CREATE (a)-[:SOURCED_BY]->(auth)
$$) AS (result agtype);

-- Query graph
SELECT * FROM cypher('whg_graph', $$
  MATCH (t:Thing)-[:SUBJECT_OF]->(a:Attestation)-[:ATTESTS_NAME]->(n:Name)
  WHERE t.id = 'constantinople'
  RETURN t, a, n
$$) AS (thing agtype, attestation agtype, name agtype);

Note: AGE is newer than other PostgreSQL extensions and less mature than Neo4j, but actively developed.

6. Massive Ecosystem

Huge community (largest of any database)
Extensive tooling (monitoring, backup, replication)
Universal hosting provider support
Easy to recruit developers
Abundant learning resources
Well-known to university IT departments

4.1.5.3. Deployment Considerations¶

Self-Hosted at University of Pittsburgh (Pitt CRC)

Hardware Requirements (based on 400-500GB working set):

CPU:

Minimum: 16 cores
Recommended: 32 cores
Purpose: Parallel query execution, multiple extension operations (PostGIS + pgvector + AGE), indexing
Note: PostgreSQL benefits significantly from higher core counts when running multiple extensions simultaneously

RAM:

Minimum: 128GB (with careful tuning)
Recommended: 192GB
Rationale: Multiple extensions each maintain separate caches; higher RAM reduces extension coordination overhead
Note: PostgreSQL + extensions typically requires more RAM than ArangoDB for equivalent dataset due to less integrated architecture

Storage:

Primary Database: 2TB NVMe SSD
Backup Storage: 4-5TB (separate volume)
- 3x full database dumps
- Incremental WAL archives
- Extension-specific backups
Performance Requirements:
- NVMe SSD required (SATA SSD insufficient for graph + spatial queries)
- ~60K+ random IOPS for combined extension operations
- <1ms latency for index operations
- Sustained read: 3-4GB/s
- Sustained write: 2-3GB/s

Network:

Internal: 10Gbps minimum
If replicated (primary + standby): 10Gbps+ low-latency
External: Standard university network connection

Software Requirements:

OS: Linux (Ubuntu 22.04 LTS or RHEL 9)
Database: PostgreSQL 16+
Extensions:
- PostGIS 3.4+
- pgvector 0.5+
- Apache AGE 1.5+
- pg_trgm (included)
Connection Pooler: PgBouncer (required for AGE)
Monitoring: Prometheus + Grafana + PostgreSQL-specific exporters
Backup: pg_dump, WAL archiving, Barman (recommended)

Operational Requirements:

Initial setup complexity: Higher than ArangoDB due to multiple extensions
Configuration tuning: Each extension requires separate parameter tuning
Backup strategy:
- Full database dumps (pg_dump): weekly
- WAL archiving: continuous
- Extension metadata backups: daily
- Point-in-time recovery: 30-day window
Monitoring: Must track metrics for each extension separately
Maintenance: VACUUM, ANALYZE, REINDEX schedules must account for extension overhead

Note on Extension Coordination:

PostGIS, pgvector, and AGE are developed independently
Updates to one extension may impact others
Compatibility testing required before upgrading any component
More complex troubleshooting when issues span multiple extensions
Note: May require more resources than ArangoDB due to extension overhead

Software Requirements:

Linux (Ubuntu 20.04+ or RHEL 8+)
PostgreSQL 16+
PostGIS extension
pgvector extension
Apache AGE extension
Connection pooler (PgBouncer)
Monitoring (PostgreSQL-specific tools well-established)

4.1.5.4. Limitations¶

1. Graph Queries Less Elegant

Apache AGE provides Cypher queries over PostgreSQL, but:

Newer extension (2021), less mature than Neo4j or ArangoDB
Graph traversals slower than native graph databases
Combining SQL and Cypher requires switching contexts
Multi-hop queries can require careful optimization
Attestations-as-nodes pattern adds complexity - more node hops than if attestations were just edge properties

2. Multi-Extension Complexity

Three extensions to coordinate (PostGIS, pgvector, AGE), each with its own query patterns, index types, and performance characteristics.

Impact: More cognitive overhead than unified system, though manageable.

3. Unified Query Complexity

Combining all capabilities requires mixing SQL, Cypher, and spatial functions - significantly more verbose than ArangoDB’s unified AQL.

Example of the same query as ArangoDB above would require:

Cypher for graph traversal (Attestations as nodes)
SQL for vector similarity (pgvector functions)
PostGIS for spatial filtering
Complex CTEs or subqueries to combine results

4. Graph Performance Ceiling

Not optimized for graph traversals
Deep graph queries (4+ hops) may require denormalization
Attestation-as-node pattern increases hop count (Thing → Attestation → Name is 2 hops vs. 1 edge)
Network analysis algorithms less efficient than native graph databases

4.1.5.5. Assessment for WHG¶

PostgreSQL provides a viable alternative with:

✅ Guaranteed sustainability (free, permissive license)
✅ All required capabilities (including GeometryCollection)
✅ Proven at scale
✅ Familiar to university IT environments

Trade-offs:

More complex query patterns (especially with Attestations as nodes)
Requires more careful optimization
Multiple extensions to coordinate
Graph performance ceiling for deep traversals

4.1.6. Option 3: Neo4j¶

4.1.6.1. Advantages¶

1. Best Graph Performance

Native graph storage and processing
Optimized for multi-hop traversals
Attestations as nodes is natural pattern

2. Mature Graph Capabilities

Cypher query language well-established
Rich graph algorithms library
Excellent for complex relationship queries

4.1.6.2. Limitations¶

1. Weaker Spatial Support

Limited GeoJSON support, no GeometryCollection support
Spatial plugin less mature than PostGIS
Would likely need secondary system for complex spatial queries

2. Vector Search in Enterprise Only

Enterprise Edition required for vector indexes (phonetic embeddings)
Introduces licensing costs
No vector search in Community Edition

3. Multi-System Architecture Required

Would need separate database for spatial/vector operations
Problematic for solo developer
Data synchronization complexity

4.1.6.3. Assessment for WHG¶

Not recommended given:

Need for secondary system (spatial/vector)
Licensing costs for Enterprise Edition
Operational complexity for small team

4.1.7. Option 4: Vespa¶

4.1.7.1. Why Vespa Was Initially Attractive¶

Vespa was selected for WHG v4 when the architectural priorities emphasized:

1. Vector Search Excellence

Vespa offers industry-leading vector similarity search capabilities:

Native approximate nearest neighbor (ANN) search with HNSW indexes
Optimized for billions of vectors with sub-10ms query latency
Multiple distance metrics (cosine, euclidean, angular, hamming)
Hybrid search combining BM25 text ranking with vector similarity
Real-time updates and indexing

For WHG’s phonetic name matching use case, Vespa represented the gold standard:

Cross-linguistic toponym matching via phonetic embeddings
Matching across different scripts (Latin, Arabic, Chinese, Cyrillic, etc.)
Handling transliteration variations
Scale to 10M+ name variants with high recall

2. Unified Search Platform

Vespa positioned itself as a single system for:

Full-text search (BM25, linguistic features)
Vector similarity search
Structured data queries
Real-time indexing and updates

This “all-in-one” promise aligned with our goal of operational simplicity for a small team.

3. Scalability and Performance

Battle-tested at massive scale (Yahoo, Spotify, OkCupid)
Content distribution capabilities for global access
Low-latency queries even with complex ranking
Horizontal scaling for growing datasets

4. Open Source with Strong Backing

Apache 2.0 license (no licensing concerns)
Backed by Verizon Media/Yahoo
Active development and community
Professional support available

4.1.7.2. Why the Data Model Redesign Changed Everything¶

The evolution toward an attestation-based property graph model fundamentally altered our requirements in ways that expose Vespa’s limitations:

4.1.7.3. Critical Limitation 1: Not a Graph Database¶

The Problem:

Vespa is a document search engine, not a graph database. Our attestation model requires:

Attestations as first-class nodes with their own identifiers
Multi-hop graph traversals: Thing → Attestation → Name → Authority → Period
Bidirectional relationships: Finding all Things connected to a Source, or all Sources supporting an Attestation
Meta-attestations: Attestations about other Attestations (scholarly discourse)
Provenance chains: Tracing claims through multiple levels of citation

Vespa’s Approach:

Vespa treats everything as documents with embedded relationships. To model our graph:

// Vespa document - all relationships embedded
{
  "thing_id": "constantinople",
  "names": [
    {
      "name": "Constantinople",
      "attestation_id": "att-001",
      "sources": ["source-123"],
      "certainty": 0.9,
      "timespan": {...}
    },
    {
      "name": "İstanbul", 
      "attestation_id": "att-002",
      "sources": ["source-456"],
      "certainty": 1.0,
      "timespan": {...}
    }
  ],
  "geometries": [...],
  "relations": [...]
}

Problems with this approach:

Attestations lose identity: They become anonymous nested objects, not addressable entities
No meta-attestations: Cannot create attestations about attestations when attestations aren’t nodes
Duplication: Same source/timespan data duplicated across many documents
Query complexity: Finding “all Things citing source X” requires scanning all documents
Update complexity: Changing a source citation requires updating potentially thousands of documents
No graph algorithms: Cannot run centrality analysis, shortest path, community detection on embedded data

Graph Database Approach (ArangoDB/Neo4j):

// Attestation is a node
{
  "_id": "attestations/att-001",
  "certainty": 0.9,
  "notes": "..."
}

// Connected via edges
Thing ──[subject_of]──> Attestation ──[attests_name]──> Name
                            │
                            └──[sourced_by]──> Authority
                            │
                            └──[attests_timespan]──> Timespan

Benefits:

Each entity addressable
Bidirectional traversal native
Meta-attestations natural (new edges to attestation nodes)
Updates affect only changed nodes
Graph algorithms work natively

4.1.7.4. Critical Limitation 2: No GeoJSON Support¶

The Problem:

Historical places have complex geometries:

Multiple disputed boundaries for the same period
Territorial changes over time
Points, polygons, and linestrings for the same entity
Imprecise historical regions

Required GeoJSON types:

Point (settlements, monuments)
MultiPoint (scattered settlement patterns)
LineString (roads, rivers, walls)
MultiLineString (trade routes, river systems)
Polygon (territories, regions)
MultiPolygon (non-contiguous territories, islands)
GeometryCollection (mixed geometry types)

Vespa’s Spatial Capabilities:

Vespa only supports 2D positions (latitude/longitude pairs):

No polygon support
No linestring support
No multi-geometry support
No GeometryCollection

Workaround Required:

To store a historical territory boundary in Vespa:

{
  "location": [28.98, 41.01],  // Single representative point
  "boundary_encoded": "base64_encoded_geojson_string",  // Custom encoding
  "boundary_wkt": "POLYGON((...))"  // Alternative encoding
}

Problems:

No spatial queries: Cannot query “places within this polygon” or “territories intersecting this region”
Custom decoding: Application must decode geometries for display
No spatial indexing: Representative point only, not actual boundaries
Geometry comparison: Cannot compute overlaps, containment, distance to polygons

Result: Would need a secondary spatial database (PostGIS) anyway, defeating the “single system” goal.

4.1.7.5. Critical Limitation 3: Document Model vs. Attestation Model¶

WHG’s Attestation Philosophy:

Every claim is:

Explicitly sourced (one or more sources)
Temporally bounded (valid during specific timespan)
Uncertainty-qualified (certainty level with explanation)
Independently assessable (can be challenged, supported, superseded)

Example: “Constantinople was called İstanbul (1930-present)”

This is one attestation node connecting:

Thing: Constantinople
Name: İstanbul
Authority (source): Turkish Geographic Board
Authority (relation type): has_name
Timespan: 1930-present
Certainty: 1.0

Vespa’s Document Model:

Would require embedding this information structure within a Constantinople document:

{
  "thing_id": "constantinople",
  "names": [{
    "name": "İstanbul",
    "language": "tr",
    "sources": ["Turkish Geographic Board"],
    "start_year": 1930,
    "certainty": 1.0
  }]
}

What’s Lost:

Source reification: “Turkish Geographic Board” is just a string, not a linked entity with its own metadata (authority type, URI, dates)
Relation types as data: “has_name” is implicit in the structure, not an entity with inverse relationships, domain/range constraints
Temporal precision: Reduced to simple year fields instead of four-field PeriodO model with earliest/latest bounds
Provenance depth: Cannot represent that “Source A cited Source B” or “Attestation X supersedes Attestation Y”
Scholarly discourse: Cannot model debates, challenges, revisions to attestations

4.1.7.6. Critical Limitation 4: Query Pattern Mismatch¶

Query Examples Our Model Requires:

“What sources support the claim that X was called Y in period Z?”

Graph: Traverse Thing → Attestations → Names (filter) → Sources + Timespans
Vespa: Scan all documents, filter embedded names array, extract source strings

“Which places cite source S and have uncertain boundaries?”

Graph: Traverse Source → Attestations (filter certainty) → Things → Geometries (filter precision)
Vespa: Full corpus scan checking embedded source strings and geometry metadata

“Show scholarly disagreement about location X’s temporal bounds”

Graph: Find all Timespan attestations for X, compare, identify conflicts
Vespa: Parse embedded temporal data from document, requires application logic

“Find attestations about attestation A” (meta-attestations)

Graph: Native - just query edges pointing to attestation node A
Vespa: Impossible - attestations aren’t addressable entities

4.1.7.7. What Vespa Does Exceptionally Well (But We Don’t Need)¶

1. Content Ranking and Recommendation

Vespa excels at:

Hybrid BM25 + vector search with sophisticated ranking models
Learning-to-rank with ML models for relevance tuning
Personalized recommendations
A/B testing ranking strategies

WHG’s needs are different: While we do need to rank candidates within reconciliation clusters (by combined similarity scores: name + spatial + temporal), this is straightforward scoring logic that can be implemented in AQL queries. Vespa’s advanced learning-to-rank and personalization features (designed for search engines and recommendation systems) exceed our simpler multi-dimensional similarity requirements.

2. Real-Time Streaming Updates

Vespa handles:

High-velocity data ingestion
Immediate index updates
Partial document updates

WHG doesn’t need this: Historical data changes slowly; batch updates are sufficient.

3. Content Distribution

Vespa provides:

Multi-region deployment
Content replication
Edge caching

WHG doesn’t need this: Single-region deployment at Pitt is adequate; we’re not a CDN.

4.1.7.8. The Bottom Line¶

Vespa was the right choice for a vector-search-first architecture with embedded relationships.

Vespa is the wrong choice for a property graph architecture with reified attestations.

The data model redesign privileged:

Historiographical precision over search ranking
Provenance depth over query speed
Scholarly discourse modeling over content distribution
Graph semantics over document embeddings

These priorities make Vespa fundamentally unsuitable, despite its vector search excellence.

4.1.7.9. Why Not Use Vespa for Vectors + Another DB for Graph?¶

Considered Multi-System Architecture:

Vespa: Vector search on names
Neo4j/ArangoDB: Graph relationships
Sync embeddings from graph to Vespa

Problems:

Complexity explosion: Two databases to maintain, monitor, backup
Synchronization overhead: Keeping vector index aligned with graph
Query splitting: Every search requires querying both systems and merging results
Operational burden: Unacceptable for solo developer
Data integrity: Risk of desynchronization between systems

ArangoDB’s Answer:

FAISS-backed vector indexes provide 80% of Vespa’s vector performance with native graph support, eliminating the need for multiple systems. The 20% performance gap is acceptable given operational simplicity.

4.1.7.10. Assessment¶

Vespa is unsuitable for the attestation-based property graph model that defines WHG v4. While its vector search capabilities are best-in-class, the fundamental mismatch between document-oriented search and graph-oriented historical knowledge representation makes it an architectural dead-end for our requirements.

The shift from Vespa to ArangoDB reflects the evolution from “gazetteer as search engine” to “gazetteer as historical knowledge graph.”

4.1.8. Detailed Comparison Matrix¶

Feature	ArangoDB	PostgreSQL + Extensions	Neo4j Enterprise	Vespa
Licensing Model	BSL 1.1 CE / Commercial EE	PostgreSQL (permissive)	GPL CE / Commercial EE	Apache 2.0
Academic Viability	Requires negotiation	✅ Guaranteed	Requires negotiation	✅ Guaranteed
Dataset Limits	100GB CE / None EE	✅ None	✅ None	✅ None
Attestations as Nodes	✅✅✅ Natural	✅ Via AGE	✅✅✅ Natural	❌ Manual
Graph Queries	✅✅ Native AQL	⚠️ AGE/Cypher	✅✅✅ Native Cypher	❌ Manual
Graph Performance	✅✅ Excellent	⚠️ Good with tuning	✅✅✅ Best	❌ Poor
Vector Search	✅ Good (v3.10+)	✅ Good (pgvector)	✅ EE only (v5.11+)	✅✅✅ Best
Phonetic Embeddings	✅ Supported	✅ Supported	✅ EE only	✅✅✅ Ideal
GeoJSON Support	✅✅ Native (6 types)	✅✅✅ PostGIS (all types)	⚠️ Plugin, limited	❌ None
GeometryCollection	❌ Not supported	✅✅✅ Fully supported	❌ Not supported	❌ Not supported
Single System	✅✅✅ Yes	✅✅✅ Yes	❌ Need additional DB	✅ Yes (but inadequate)
Query Elegance	✅✅✅ Unified AQL	⚠️ SQL + Cypher + spatial	✅✅ Cypher	❌ Complex custom code
Operational Complexity	✅ Low	✅ Low-Medium	❌ High (multi-system)	⚠️ Medium
Fit for WHG Model	✅✅✅ Excellent	✅✅ Very Good	⚠️ Graph only	❌ Poor

4.1.9. Recommendation¶

4.1.9.1. Primary Option: ArangoDB¶

ArangoDB represents the best technical fit for WHG v4 based on:

Data Model Alignment: Native property graph with Attestations as nodes (not just edge metadata) directly matches our architecture (see WHG v4 Data Model)
AUTHORITY Pattern: Single collection with discriminator pattern simplifies queries
GeoJSON Support: Native support for 6 GeoJSON types sufficient for our needs; GeometryCollection limitation naturally addressed by our multi-attestation model where each geometry is a separate claim
Query Integration: Unified AQL for graph traversal + spatial + vector + temporal operations reduces complexity
Operational Simplicity: Single system manageable by small team
Adequate Scale: Handles anticipated ~500GB dataset
Vector Capabilities: Suitable for phonetic name embedding search

Critical dependency: Securing sustainable licensing terms for academic/research use.

WHG Project Context:

Non-commercial research infrastructure
Free access to all users
Primary funding from University of Pittsburgh
Grant-dependent for operational costs
Solo technical team
Long-term sustainability essential

Deployment Preference: Self-hosted at Pitt, but open to ArangoDB-managed hosting if terms are favorable and reduce operational burden.

Request: Academic licensing discussion with ArangoDB to determine viability, including both self-hosted and managed hosting options.

4.1.9.2. Alternative: PostgreSQL + Extensions¶

PostgreSQL provides a viable fallback with:

Guaranteed long-term sustainability
Full GeometryCollection support via PostGIS
Graph capabilities via Apache AGE (with Attestations as nodes)
All required capabilities (with greater complexity)

Scenario: If ArangoDB licensing proves unsustainable, PostgreSQL is a defensible alternative that meets all requirements, though with more complex query patterns for our graph model.

4.1.10. Next Steps¶

Initiate licensing discussion with ArangoDB regarding academic use case
Provide project details: scale, funding model, technical team size, mission
Explore deployment options: self-hosted vs. managed hosting pricing
Benchmark vector search: Test APPROX_NEAR_COSINE() performance with realistic phonetic embeddings (10M+ vectors) early in evaluation process
Validate graph queries: Test Attestation-as-node pattern with multi-hop queries combining graph traversal, geospatial constraints, and vector filtering
Decision timeline: Required for v4 development planning (target: Q2 2025)

4.1.11. Conclusion¶

The shift to an attestation-based property graph model with Attestations as nodes (not just edge metadata) necessitates reconsideration of the database backend documented in previous WHG v4 plans (Vespa). For full details on the data model, see WHG v4 Data Model Documentation.

ArangoDB emerges as the optimal technical solution, offering native support for:

Property graphs where Attestations are first-class nodes
Document collections for Things, Names, Geometries, Timespans, and Authorities
Unified AUTHORITY collection pattern via discriminator
GeoJSON geometries for complex historical boundaries
Vector similarity search for phonetic name matching (via Siamese BiLSTM embeddings)
All capabilities integrated in a single system with unified AQL query language

The lack of GeometryCollection support is adequately addressed by our attestation model, which naturally accommodates multiple geometry attestations per Thing—each with its own provenance, temporal bounds, and certainty assessment.

The viability of ArangoDB depends on establishing sustainable licensing appropriate for an academic, non-commercial research infrastructure project funded primarily by University of Pittsburgh.

PostgreSQL with extensions provides a proven alternative that guarantees sustainability and includes full GeometryCollection support, but requires accepting greater query complexity (especially for graph traversals with Attestations as nodes) and careful performance optimization.

The database choice is a critical architectural decision that will shape WHG v4’s development timeline, operational requirements, and long-term sustainability.