Architecture¶
Alopex DB is built in Rust for safety, performance, and reliability. This page covers the technical architecture and design decisions.
System Overview¶
```mermaid
graph TB
    subgraph "API Layer"
        REST[REST API]
        GRPC[gRPC API]
        WIRE[Postgres Wire Protocol]
        LIB[Embedded Library]
    end
    subgraph "Query Layer"
        PARSER[SQL Parser]
        ANALYZER[Semantic Analyzer]
        PLANNER[Query Planner]
        OPTIMIZER[Cost-Based Optimizer]
        EXECUTOR[Query Executor]
    end
    subgraph "Transaction Layer"
        TXN[Transaction Manager]
        MVCC[MVCC Engine]
        LOCK[Lock Manager]
        WAL[Write-Ahead Log]
    end
    subgraph "Storage Layer"
        MEMTABLE[MemTable]
        LSM[LSM-Tree]
        SSTABLE[SSTable Files]
        VECIDX[Vector Index]
        BLOOM[Bloom Filters]
    end
    subgraph "Cluster Layer"
        RAFT[Raft Consensus]
        SHARD[Range Sharding]
        CHIRPS[Chirps Mesh]
        META[Metadata Store]
    end
    REST --> PARSER
    GRPC --> PARSER
    WIRE --> PARSER
    LIB --> PARSER
    PARSER --> ANALYZER
    ANALYZER --> PLANNER
    PLANNER --> OPTIMIZER
    OPTIMIZER --> EXECUTOR
    EXECUTOR --> TXN
    TXN --> MVCC
    MVCC --> LOCK
    TXN --> WAL
    MVCC --> MEMTABLE
    MEMTABLE --> LSM
    LSM --> SSTABLE
    LSM --> VECIDX
    LSM --> BLOOM
    LSM --> RAFT
    RAFT --> SHARD
    SHARD --> CHIRPS
    CHIRPS --> META
```

Storage Engine¶
LSM-Tree Design¶
Alopex uses a Log-Structured Merge-Tree (LSM-Tree) optimized for both key-value and vector workloads.
```mermaid
graph TB
    subgraph "Write Path"
        W[Write] --> WAL[WAL]
        WAL --> MT[MemTable]
        MT -->|Flush| L0[Level 0 SSTable]
    end
    subgraph "Compaction"
        L0 -->|Compact| L1[Level 1]
        L1 -->|Compact| L2[Level 2]
        L2 -->|Compact| LN[Level N]
    end
    subgraph "Read Path"
        R[Read] --> MT
        R --> L0
        R --> L1
        R --> LN
    end
```

Key Components¶
| Component | Description |
|---|---|
| MemTable | In-memory sorted map (skip list) for recent writes |
| WAL | Write-ahead log for durability |
| SSTable | Sorted String Table, immutable on-disk files |
| Bloom Filter | Probabilistic structure for fast key lookups |
| Block Cache | LRU cache for frequently accessed blocks |
Write Path¶
- WAL Append: Write logged for durability
- MemTable Insert: Data added to in-memory structure
- MemTable Flush: When full, written as L0 SSTable
- Compaction: Background merge of SSTables
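The four write-path steps can be sketched in a few lines of Rust. This is a minimal illustration, not Alopex's actual API: the `Engine` struct, its fields, and the flush threshold are all hypothetical, and the WAL is modeled as an in-memory vector where a real engine would append and fsync a log file.

```rust
use std::collections::BTreeMap;

// Illustrative write path: WAL append, MemTable insert, and a flush to an
// immutable L0 table once a size threshold is reached. All names are
// stand-ins for the real engine's types.
struct Engine {
    wal: Vec<(Vec<u8>, Vec<u8>)>,          // stand-in for the on-disk log
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,  // sorted in-memory map
    l0: Vec<BTreeMap<Vec<u8>, Vec<u8>>>,   // flushed L0 "SSTables"
    memtable_limit: usize,
}

impl Engine {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        // 1. WAL append for durability (a real engine syncs to disk here).
        self.wal.push((key.to_vec(), value.to_vec()));
        // 2. MemTable insert into the sorted in-memory structure.
        self.memtable.insert(key.to_vec(), value.to_vec());
        // 3. MemTable flush: when full, it becomes an immutable L0 table.
        if self.memtable.len() >= self.memtable_limit {
            let full = std::mem::take(&mut self.memtable);
            self.l0.push(full);
            // 4. Compaction of L0..LN would then run in the background.
        }
    }
}
```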
Read Path¶
- MemTable Check: Search recent writes
- Block Cache: Check cached data blocks
- Bloom Filter: Probabilistic key existence check
- SSTable Search: Binary search within files
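The lookup order above can be sketched as follows. The structures are illustrative stand-ins: the Bloom-filter step is modeled with an exact `contains_key` check, and a `BTreeMap` lookup stands in for the binary search within an SSTable.

```rust
use std::collections::{BTreeMap, HashMap};

// Illustrative read path: MemTable first (most recent data), then the
// block cache, then each SSTable from newest to oldest.
struct ReadPath {
    memtable: BTreeMap<String, String>,
    block_cache: HashMap<String, String>,
    sstables: Vec<BTreeMap<String, String>>, // newest first
}

impl ReadPath {
    fn get(&self, key: &str) -> Option<String> {
        // 1. MemTable check: recent writes always win.
        if let Some(v) = self.memtable.get(key) {
            return Some(v.clone());
        }
        // 2. Block cache: frequently accessed blocks skip disk entirely.
        if let Some(v) = self.block_cache.get(key) {
            return Some(v.clone());
        }
        // 3. Bloom filter (modeled here as an exact membership test),
        // 4. then a sorted lookup within the table that may hold the key.
        for table in &self.sstables {
            if table.contains_key(key) {
                return table.get(key).cloned();
            }
        }
        None
    }
}
```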
Vector Index Integration¶
Vector indexes are co-located with data for locality:
```mermaid
graph LR
    subgraph "SSTable"
        DATA[Data Blocks]
        META[Metadata]
        BLOOM[Bloom Filter]
        VEC[Vector Index Segment]
    end
    DATA --> VEC
```

HNSW Implementation¶
```
Layer 3: [sparse connections for long-range jumps]
         ↓
Layer 2: [more connections]
         ↓
Layer 1: [dense connections]
         ↓
Layer 0: [all nodes, maximum connectivity]
```
Search algorithm:

1. Enter at the top layer
2. Greedily traverse to the nearest neighbor
3. Descend to the next layer
4. Repeat until layer 0
5. Return the top-k neighbors
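The greedy layered descent can be demonstrated on a tiny hand-built graph of 1-D points. This is a simplification of real HNSW, which keeps a candidate beam (`ef`) per layer and returns the top-k results; here a single best node is tracked and returned.

```rust
// Simplified greedy descent: `layers[l][node]` lists node's neighbors on
// layer l, `points` holds 1-D coordinates. Descends from the top layer,
// moving to the closest neighbor until reaching a local minimum, then
// drops a layer; the final node on layer 0 is the answer.
fn greedy_search(layers: &[Vec<Vec<usize>>], points: &[f64], query: f64) -> usize {
    let dist = |i: usize| (points[i] - query).abs();
    let mut current = 0; // entry point in the top layer
    for layer in layers.iter().rev() {
        loop {
            // Greedily move to the closest neighbor on this layer.
            let best = layer[current].iter().copied()
                .min_by(|&a, &b| dist(a).partial_cmp(&dist(b)).unwrap());
            match best {
                Some(n) if dist(n) < dist(current) => current = n,
                _ => break, // local minimum: descend to the next layer
            }
        }
    }
    current
}
```

The upper layers make long-range jumps cheap; layer 0 then refines the result among densely connected nearby nodes.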
Transaction Layer¶
MVCC (Multi-Version Concurrency Control)¶
Alopex uses MVCC for snapshot isolation:
```mermaid
sequenceDiagram
    participant T1 as Transaction 1
    participant DB as Database
    participant T2 as Transaction 2
    T1->>DB: BEGIN (ts=100)
    T2->>DB: BEGIN (ts=101)
    T1->>DB: UPDATE row (version=100)
    T2->>DB: READ row
    Note over T2,DB: Sees original version<br/>(snapshot at ts=101)
    T1->>DB: COMMIT
    T2->>DB: READ row
    Note over T2,DB: Still sees original<br/>(snapshot isolation)
    T2->>DB: COMMIT
```

Transaction States¶
| State | Description |
|---|---|
| Active | Transaction in progress |
| Committed | Successfully committed |
| Aborted | Rolled back |
| Prepared | 2PC prepared state |
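The snapshot-visibility rule from the sequence diagram can be expressed compactly: a reader at snapshot timestamp `read_ts` sees the newest version committed at or before that timestamp, and never sees uncommitted versions. The struct layout below is illustrative, not Alopex's actual version record.

```rust
// A row version: `commit_ts` is None while the writing transaction is
// still active, Some(ts) once committed.
struct Version { value: i64, commit_ts: Option<u64> }

// Return the newest committed value visible at `read_ts`, if any.
fn visible_value(versions: &[Version], read_ts: u64) -> Option<i64> {
    versions.iter()
        .filter_map(|v| v.commit_ts.map(|ts| (ts, v.value)))  // skip uncommitted
        .filter(|&(ts, _)| ts <= read_ts)                     // within snapshot
        .max_by_key(|&(ts, _)| ts)                            // newest wins
        .map(|(_, value)| value)
}
```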
Isolation Levels¶
| Level | Dirty Read | Non-Repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Snapshot (default) | Prevented | Prevented | Prevented |
| Serializable | Prevented | Prevented | Prevented |
Query Layer¶
SQL Parser¶
Built on sqlparser-rs with extensions for vectors:
```sql
-- Extended grammar
vector_type     ::= VECTOR '(' dimension ')'
vector_literal  ::= '[' number (',' number)* ']'
vector_function ::= cosine_similarity | l2_distance | inner_product
```
Query Planner¶
Cost-based optimizer with vector-aware planning:
```mermaid
graph TD
    SQL[SQL Query] --> PARSE[Parse Tree]
    PARSE --> LOGICAL[Logical Plan]
    LOGICAL --> OPTIMIZE[Optimization Rules]
    OPTIMIZE --> PHYSICAL[Physical Plan]
    PHYSICAL --> EXEC[Execution]
    subgraph "Optimization Rules"
        R1[Predicate Pushdown]
        R2[Join Reordering]
        R3[Index Selection]
        R4[Vector Index Routing]
    end
```

Execution Engine¶
Volcano-style iterator model:
```rust
trait Executor {
    fn open(&mut self) -> Result<()>;
    fn next(&mut self) -> Result<Option<Row>>;
    fn close(&mut self) -> Result<()>;
}
```
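A toy operator pair shows how the pull-based model composes: each parent calls `next()` on its child, so rows flow through the plan one at a time. `Row`, the `Result` alias, and the operator names below are simplified stand-ins, not Alopex's real types.

```rust
// Simplified stand-ins for the engine's row and error types.
type Row = Vec<i64>;
type Result<T> = std::result::Result<T, String>;

trait Executor {
    fn open(&mut self) -> Result<()>;
    fn next(&mut self) -> Result<Option<Row>>;
    fn close(&mut self) -> Result<()>;
}

// Leaf operator: scans an in-memory row set.
struct MemScan { rows: Vec<Row>, pos: usize }

impl Executor for MemScan {
    fn open(&mut self) -> Result<()> { self.pos = 0; Ok(()) }
    fn next(&mut self) -> Result<Option<Row>> {
        let row = self.rows.get(self.pos).cloned();
        self.pos += 1;
        Ok(row)
    }
    fn close(&mut self) -> Result<()> { Ok(()) }
}

// Filter pulls from its child on demand: the Volcano "pull" model.
struct Filter<E: Executor> { child: E, pred: fn(&Row) -> bool }

impl<E: Executor> Executor for Filter<E> {
    fn open(&mut self) -> Result<()> { self.child.open() }
    fn next(&mut self) -> Result<Option<Row>> {
        while let Some(row) = self.child.next()? {
            if (self.pred)(&row) { return Ok(Some(row)); }
        }
        Ok(None)
    }
    fn close(&mut self) -> Result<()> { self.child.close() }
}
```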
Cluster Layer¶
Raft Consensus¶
Each data range forms a Raft group:
```mermaid
graph TB
    subgraph "Range [a-m)"
        L1[Leader]
        F1[Follower]
        F2[Follower]
    end
    subgraph "Range [m-z)"
        L2[Leader]
        F3[Follower]
        F4[Follower]
    end
    L1 <-.->|Raft| F1
    L1 <-.->|Raft| F2
    L2 <-.->|Raft| F3
    L2 <-.->|Raft| F4
```

Raft Operations¶
| Operation | Description |
|---|---|
| Leader Election | Automatic failover on leader failure |
| Log Replication | Synchronous replication to followers |
| Snapshot | Periodic state snapshots for recovery |
| Membership Change | Dynamic cluster reconfiguration |
Range Sharding¶
Data partitioned by key range:
```
Key Space: [0000...FFFF]
├── Range 1: [0000...3FFF] → Node 1 (Leader), Node 2, Node 3
├── Range 2: [4000...7FFF] → Node 2 (Leader), Node 3, Node 1
├── Range 3: [8000...BFFF] → Node 3 (Leader), Node 1, Node 2
└── Range 4: [C000...FFFF] → Node 1 (Leader), Node 3, Node 2
```
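Routing a key to its owning range reduces to a binary search over the sorted range start keys. The sketch below is illustrative (the real key space is not `u16`, and the routing table lives in the metadata store), but the lookup itself is the standard technique:

```rust
// Find the index of the range owning `key`, given the sorted start key of
// each range. The owning range is the last one whose start key does not
// exceed the key.
fn route(range_starts: &[u16], key: u16) -> usize {
    // partition_point performs a binary search: it returns how many
    // start keys are <= key.
    range_starts.partition_point(|&start| start <= key) - 1
}
```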
Chirps Mesh¶
Chirps is the control plane foundation for Alopex DB clusters. It provides node discovery, membership management, and message transport with a three-layer architecture:
```mermaid
graph TB
    subgraph "Chirps Architecture"
        API[API Layer<br/>send_to / broadcast / subscribe]
        ROUTER[Routing Layer<br/>Message Profiles]
        subgraph "Backend Layer"
            QUIC[QUIC Backend]
            IGGY[Iggy Backend]
        end
    end
    subgraph "Alopex DB"
        RAFT[Raft Messages]
        DATA[Data Events]
    end
    RAFT -->|Control Profile| API
    DATA -->|Durable Profile| API
    API --> ROUTER
    ROUTER -->|Low Latency| QUIC
    ROUTER -->|Persistent| IGGY
    style QUIC fill:#1E3A5F,color:#fff
    style IGGY fill:#5FB4C9,color:#fff
```

Message Profiles¶
| Profile | Purpose | Backend | Use Case |
|---|---|---|---|
| Control | Raft consensus, cluster control | QUIC | Priority streams, < 1ms latency |
| Ephemeral | Gossip, transient data | QUIC | Best effort, no persistence |
| Durable | Event streams, audit logs | Iggy | Persistent, replayable (v0.9+) |
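The profile-to-backend mapping in the table amounts to a simple dispatch. The enum names below mirror the documented profiles, but the code is only a sketch of the table's logic, not Chirps' actual routing layer:

```rust
#[derive(Debug, PartialEq)]
enum Backend { Quic, Iggy }

enum Profile { Control, Ephemeral, Durable }

// Route a message profile to its transport backend, per the table above.
fn backend_for(profile: &Profile) -> Backend {
    match profile {
        // Latency-sensitive control and gossip traffic rides on QUIC.
        Profile::Control | Profile::Ephemeral => Backend::Quic,
        // Durable event streams are persisted and replayable via Iggy.
        Profile::Durable => Backend::Iggy,
    }
}
```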
Key Features¶
- SWIM Protocol: Failure detection via ping/ack/ping-req
- QUIC Transport: TLS 1.3, 0-RTT resumption, multiplexed streams
- Priority Streams: Raft messages get dedicated high-priority channels
- Membership Events: `on_node_join`, `on_node_leave`, and `on_status_change` hooks
Learn More: see the dedicated Chirps documentation for detailed architecture and configuration.
Memory Management¶
Buffer Pool¶
```
┌─────────────────────────────────────┐
│             Buffer Pool             │
├─────────┬─────────┬─────────┬───────┤
│ Block 1 │ Block 2 │ Block 3 │  ...  │
└─────────┴─────────┴─────────┴───────┘
     ↑         ↑         ↑
     │         │         │
   Page      Page      Page
   Table     Table     Table
```
Memory Budget¶
| Component | Default | Configurable |
|---|---|---|
| Buffer Pool | 25% of RAM | buffer_pool_size |
| MemTable | 64 MB | memtable_size |
| Block Cache | 512 MB | block_cache_size |
| Vector Index | 1 GB | vector_index_memory |
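Putting the documented settings together, a configuration might look like the fragment below. The keys come from the table above, but the file name, section layout, and size-string syntax are assumptions, not Alopex's actual config format:

```toml
# Hypothetical memory-budget configuration; only the key names are
# taken from the documentation above.
[memory]
buffer_pool_size = "4GB"       # default: 25% of RAM
memtable_size = "64MB"
block_cache_size = "512MB"
vector_index_memory = "1GB"
```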
File Format¶
SSTable Structure¶
```
┌──────────────────────────────────┐
│           Data Blocks            │
├──────────────────────────────────┤
│           Meta Blocks            │
├──────────────────────────────────┤
│           Index Block            │
├──────────────────────────────────┤
│           Bloom Filter           │
├──────────────────────────────────┤
│       Vector Index Segment       │
├──────────────────────────────────┤
│             Footer               │
└──────────────────────────────────┘
```
WAL Format¶
```
┌──────────┬──────────┬─────────┬──────┐
│  Length  │ Checksum │  Type   │ Data │
│ (4 bytes)│ (4 bytes)│ (1 byte)│ (var)│
└──────────┴──────────┴─────────┴──────┘
```
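Encoding one record in this layout is straightforward framing. The sketch below assumes little-endian integers, and since this document does not specify the checksum algorithm, a trivial additive checksum stands in for whatever the real WAL uses:

```rust
// Frame one WAL record: 4-byte length, 4-byte checksum, 1-byte type,
// then the variable-length payload.
fn encode_record(record_type: u8, data: &[u8]) -> Vec<u8> {
    // Placeholder checksum: the real algorithm is not specified here.
    let checksum: u32 = data.iter().map(|&b| b as u32).sum();
    let mut out = Vec::with_capacity(9 + data.len());
    out.extend_from_slice(&(data.len() as u32).to_le_bytes()); // Length
    out.extend_from_slice(&checksum.to_le_bytes());            // Checksum
    out.push(record_type);                                     // Type
    out.extend_from_slice(data);                               // Data
    out
}
```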
Performance Characteristics¶
| Operation | Complexity | Notes |
|---|---|---|
| Point Read | O(log N) | With bloom filter: O(1) expected |
| Range Scan | O(log N + K) | K = result count |
| Write | O(1) amortized | Plus WAL sync |
| Vector Search (HNSW) | O(log N) | Approximate |
| Vector Search (Flat) | O(N × D) | Exact, D = dimensions |