Lattice Filesystem design
Status: Unimplemented design (roadmap M21). Depends on M20 (CAS).
A replicated, CoW, snapshot-capable filesystem built as a Lattice store type. Borrows architectural ideas from ZFS (Merkle block trees, copy-on-write, datasets, snapshots, scrubbing) and maps them onto Lattice’s two-channel model: the Intention DAG for causally ordered metadata, CAS for bulk data.
Design Goals
- FUSE-mountable: Appears as a regular directory on Linux/macOS. Standard tools (cp, vim, rsync) work without modification.
- Offline-first: Full read/write capability without network. Changes sync when connectivity returns.
- Snapshot-capable: Instant, cheap snapshots (ZFS-style). A snapshot pins a root CID at O(1) cost.
- Replicated: Directory tree converges across peers via the DAG. File content distributes via CAS with CRUSH-based redundancy.
- Auditable: Every file change is a signed intention in the DAG with author attribution and causal history.
- Conflict-visible: Concurrent edits to the same file produce conflict markers (not silent overwrites).
Architecture: Two-Channel Split
The filesystem uses the DAG and CAS for fundamentally different purposes, mirroring the reliability spectrum described in the CAS design:
| Concern | Channel | Replication | Read Guarantee |
|---|---|---|---|
| Directory tree structure | DAG (intentions) | Eager — all peers | Always succeeds |
| File metadata (size, mtime, mode) | DAG (intentions) | Eager — all peers | Always succeeds |
| Snapshots, renames, deletes | DAG (intentions) | Eager — all peers | Always succeeds |
| Block pointer tree (indirect blocks) | CAS | Push to CRUSH peers | Recoverable |
| Data blocks (file content) | CAS | Push to CRUSH peers | Recoverable |
The analogy to ZFS is direct:
| ZFS Component | Lattice Equivalent | Channel |
|---|---|---|
| Uberblock (root pointer) | Intention: UpdateRoot { root_cid, txg } | DAG |
| ZIL (intent log) | The Intention DAG itself | DAG |
| Indirect blocks (block pointer tree) | CAS blobs containing child CID lists | CAS |
| Data blocks | CAS blobs (leaf content) | CAS |
| Space maps | Local-only (each node manages its own CAS storage) | Neither |
The ZIL Analogy
The ZFS Intent Log records pending synchronous writes so they survive a crash before the next transaction group commits. Two of its write modes matter here: WR_COPIED (small writes, with the data inlined in the log record) and WR_INDIRECT (large writes, where the log record contains only a block pointer and the data is written separately).
Lattice intentions follow the same split naturally:
| ZIL Mode | Lattice Equivalent |
|---|---|
| WR_COPIED (data + metadata inline, < 32 KB) | Small file operations inlined in intention payload |
| WR_INDIRECT (metadata + pointer to data block) | Intention payload contains CID reference, data in CAS |
The threshold is configurable per store. Below it, the full operation (including small file content) travels through the DAG for maximum reliability. Above it, only the CID pointer enters the DAG and the bulk data travels through CAS.
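The threshold decision can be sketched as a single comparison. This is a minimal illustration, not the store's API: `WriteMode`, `choose_write_mode`, and `DEFAULT_INLINE_THRESHOLD` are hypothetical names, and treating a write exactly at the threshold as indirect is an assumption the design does not specify.

```rust
/// Where a write's bytes travel. Variant names mirror the ZIL modes in the table above.
#[derive(Debug, PartialEq, Eq)]
pub enum WriteMode {
    /// Content is inlined in the intention payload (DAG only).
    Copied,
    /// Only a CID enters the DAG; the bytes go to CAS.
    Indirect,
}

/// Hypothetical per-store default; 32 KiB mirrors the ZIL figure quoted above.
pub const DEFAULT_INLINE_THRESHOLD: usize = 32 * 1024;

/// Strictly below the threshold the data rides in the intention;
/// at or above it (an assumption), the data goes to CAS.
pub fn choose_write_mode(len: usize, inline_threshold: usize) -> WriteMode {
    if len < inline_threshold {
        WriteMode::Copied
    } else {
        WriteMode::Indirect
    }
}
```

A caller would pass the store's configured threshold rather than the constant.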
CRUSH Replication Classes
Not all CAS blobs are equally critical. The filesystem store can assign different CRUSH replication factors by blob class:
| Blob Class | Content | Default CRUSH Factor | Rationale |
|---|---|---|---|
| metadata | Indirect blocks (block pointer tree nodes) | All peers (100%) | Loss means orphaned data blocks. Small, critical. |
| data | File content chunks | Store default (e.g., 3) | Loss is recoverable from other peers. Large, bulk. |
| snapshot | Snapshot manifest blobs | All peers | Loss means snapshot is unresolvable. Small, critical. |
The StateMachine::referenced_blobs() implementation returns CIDs along with their class, allowing the CasManager to apply different CRUSH policies per class. This mirrors ZFS’s distinction between metadata and data redundancy (e.g., copies=2 for metadata, copies=1 for data).
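One way the per-class policy could be expressed is sketched below. `BlobClass`, `Replication`, `replication_for`, and `target_copies` are illustrative names, not the CasManager interface; capping the data factor at the current peer count is also an assumption.

```rust
/// The three blob classes from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlobClass {
    Metadata,
    Data,
    Snapshot,
}

/// A replication policy: either every peer, or a fixed number of copies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Replication {
    AllPeers,
    Factor(u32),
}

/// Maps a blob class to its default policy, as listed in the table.
pub fn replication_for(class: BlobClass, store_default: u32) -> Replication {
    match class {
        BlobClass::Metadata | BlobClass::Snapshot => Replication::AllPeers,
        BlobClass::Data => Replication::Factor(store_default),
    }
}

/// Number of peers that should hold a blob, never more than the peers that exist.
pub fn target_copies(policy: Replication, peer_count: u32) -> u32 {
    match policy {
        Replication::AllPeers => peer_count,
        Replication::Factor(n) => n.min(peer_count),
    }
}
```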
Data Model
Inode Table (TABLE_DATA)
Every filesystem object (file, directory, symlink) is an inode. The inode table lives in state.db (projected from intentions) using the existing redb TABLE_DATA:
Key: inode/{inode_id}
Value: InodeEntry (protobuf)
message InodeEntry {
uint64 inode_id = 1;
InodeType type = 2; // FILE, DIRECTORY, SYMLINK
uint32 mode = 3; // Unix permissions
uint64 size = 4; // File size in bytes
uint64 created_at = 5; // HLC wall_time
uint64 modified_at = 6; // HLC wall_time
bytes content_cid = 7; // Root CID of content Merkle tree (files only)
string symlink_target = 8; // Target path (symlinks only)
repeated ChunkRef chunks = 9; // Inline chunk list for small files (< tree threshold)
}
message ChunkRef {
bytes cid = 1;
uint64 offset = 2;
uint64 length = 3;
}
Directory Table
Directory entries are stored as a separate key prefix to enable efficient readdir:
Key: dir/{parent_inode_id}/{name}
Value: uint64 child_inode_id
A readdir becomes a prefix scan on dir/{inode_id}/ — a single redb range query, entirely local.
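The prefix scan can be illustrated with an ordered map standing in for the redb table. This sketch uses `std::collections::BTreeMap` rather than redb itself, and `DirTable`, `dir_key`, and `readdir` are hypothetical helper names.

```rust
use std::collections::BTreeMap;

/// Stand-in for the redb table: an ordered map from key string to child inode ID.
pub type DirTable = BTreeMap<String, u64>;

/// Builds a directory-entry key in the `dir/{parent_inode_id}/{name}` layout.
pub fn dir_key(parent: u64, name: &str) -> String {
    format!("dir/{}/{}", parent, name)
}

/// Lists (name, child inode) pairs under `parent` with one ordered range scan.
pub fn readdir(table: &DirTable, parent: u64) -> Vec<(String, u64)> {
    let prefix = format!("dir/{}/", parent);
    table
        // Start at the first key >= prefix, then stop at the first non-matching key.
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, child)| (key[prefix.len()..].to_string(), *child))
        .collect()
}
```

The trailing slash in the prefix matters: it keeps a scan of `dir/1/` from picking up entries under `dir/10/`.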
Root Pointer
The filesystem root inode is a well-known entry:
Key: root
Value: uint64 root_inode_id
Snapshot Table
Key: snapshot/{name}
Value: SnapshotEntry (protobuf)
message SnapshotEntry {
string name = 1;
uint64 root_inode_id = 2; // Inode ID of the root at snapshot time
bytes root_content_state = 3; // Opaque state reference for the frozen tree
uint64 created_at = 4; // HLC wall_time
bytes creating_intention = 5; // Hash of the snapshot intention (for causal anchoring)
}
Transaction Counter
Key: txg
Value: uint64 transaction_group_number
Monotonically increasing per node. Used to batch multiple writes into a single intention (see Write Path below).
Block Tree Structure
File content is stored as a Merkle tree of CAS blobs, similar to ZFS’s indirect block tree:
[Root CID] ← InodeEntry.content_cid
/ \
[Indirect L1] [Indirect L1] ← CAS blobs: list of child CIDs
/ | \ |
[Data] [Data] [Data] [Data] ← CAS blobs: raw file content chunks
- Data blocks (leaves): Raw file content, produced by content-defined chunking (FastCDC). Target size: 1 MiB, min 256 KiB, max 4 MiB.
- Indirect blocks (internal nodes): CAS blobs containing an ordered list of ChunkRef { cid, offset, length } entries. Each indirect block covers a contiguous byte range of the file.
- Root CID: The top of the Merkle tree, stored in InodeEntry.content_cid.
Small file optimization: Files under the minimum chunk size (256 KiB) are stored as a single CAS blob. The InodeEntry.chunks field holds the single ChunkRef inline — no indirect blocks, no tree. Files under an even smaller threshold (configurable, e.g. 4 KiB) could embed content directly in the intention payload (WR_COPIED mode).
Integrity: Every CAS blob is content-addressed (CIDv1, SHA2-256). The Merkle tree provides transitive integrity: verifying the root CID verifies the entire file. This plays the same role as ZFS's block checksum tree, with a cryptographic hash at every level.
Scrubbing: Walk the Merkle tree from root, verify each CAS blob’s hash against its CID. Corrupt or missing blobs are re-fetched from CRUSH duty holders. This can run as a background maintenance task.
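The scrub walk can be sketched over an in-memory blob map. Everything here is a simplification under stated assumptions: `Cid` is a `u64` produced by the standard library's `DefaultHasher` purely as a stand-in for CIDv1/SHA2-256 (it is not cryptographic), and `Blob`, `put`, and `scrub` are hypothetical names. Re-fetching from CRUSH duty holders is left out; the function only reports what needs repair.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Stand-in content ID. A real CID would be a SHA2-256 multihash.
pub type Cid = u64;

#[derive(Hash, Clone, Debug)]
pub enum Blob {
    /// Leaf: raw file content.
    Data(Vec<u8>),
    /// Internal node: ordered child CIDs.
    Indirect(Vec<Cid>),
}

/// Derives the ID from the blob's content (non-cryptographic stand-in).
pub fn cid_of(blob: &Blob) -> Cid {
    let mut hasher = DefaultHasher::new();
    blob.hash(&mut hasher);
    hasher.finish()
}

/// Stores a blob under its content-derived ID and returns that ID.
pub fn put(store: &mut HashMap<Cid, Blob>, blob: Blob) -> Cid {
    let cid = cid_of(&blob);
    store.insert(cid, blob);
    cid
}

/// Walks the tree from `root` and returns every CID that is missing
/// or whose stored content no longer hashes to its key.
pub fn scrub(store: &HashMap<Cid, Blob>, root: Cid) -> Vec<Cid> {
    let mut bad = Vec::new();
    let mut seen = HashSet::new();
    let mut stack = vec![root];
    while let Some(cid) = stack.pop() {
        // Shared subtrees are verified only once.
        if !seen.insert(cid) {
            continue;
        }
        match store.get(&cid) {
            None => bad.push(cid),
            Some(blob) if cid_of(blob) != cid => bad.push(cid),
            Some(Blob::Indirect(children)) => stack.extend(children.iter().copied()),
            Some(Blob::Data(_)) => {}
        }
    }
    bad
}
```

A corrupt indirect block is reported without descending into it, since its child list can no longer be trusted.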
Operations
Intention Payload
message FsPayload {
repeated FsOp ops = 1;
uint64 txg = 2; // Transaction group number (for batching)
}
message FsOp {
oneof op_type {
CreateOp create = 1;
MkdirOp mkdir = 2;
UpdateFileOp update_file = 3;
UnlinkOp unlink = 4;
RenameOp rename = 5;
ChmodOp chmod = 6;
SnapshotOp snapshot = 7;
BatchCreateOp batch_create = 8;
}
}
message CreateOp {
uint64 parent_inode = 1;
string name = 2;
uint32 mode = 3;
InodeType type = 4;
bytes content_cid = 5; // Initial content (for files created with data)
repeated ChunkRef chunks = 6; // Chunk list
uint64 size = 7;
}
message UpdateFileOp {
uint64 inode_id = 1;
bytes content_cid = 2; // New root CID of content tree
repeated ChunkRef chunks = 3; // Updated chunk list
uint64 size = 4;
}
message RenameOp {
uint64 src_parent = 1;
string src_name = 2;
uint64 dst_parent = 3;
string dst_name = 4;
}
message UnlinkOp {
uint64 parent_inode = 1;
string name = 2;
}
message SnapshotOp {
string name = 1;
}
message BatchCreateOp {
repeated CreateOp creates = 1; // Batch file creation (for import, cp -r)
}
Write Path (Buffered, Transaction-Group Model)
Writes are buffered locally and flushed on fsync or periodically (like ZFS transaction groups). This is critical — without buffering, a cp 1GB_file ~/lattice/ would generate thousands of individual intentions.
FUSE write(inode, offset, buf)
→ Buffer in local page cache (mark dirty). No intention, no CAS write.
FUSE fsync(fd) [or periodic flush, or close()]
→ Collect dirty pages for inode
→ Content-defined chunking (FastCDC) on modified regions
→ cas.put_batch(new_chunks) → Vec<Cid>
→ Build updated block tree (CoW: new indirect blocks for modified paths,
reuse unchanged subtrees by CID)
→ cas.put(new_indirect_blocks) → new root CID
→ Submit intention: FsPayload {
ops: [UpdateFileOp { inode_id, content_cid: new_root, chunks, size }],
txg: next_txg()
}
→ CasManager pushes new CAS blobs to CRUSH duty holders (async)
Transaction groups: Multiple writes across multiple files can be batched into a single intention (single FsPayload with multiple FsOps). The FUSE layer accumulates dirty state and flushes at a configurable interval (default: 5 seconds, matching ZFS’s txg commit interval). This amortizes the Ed25519 signing and DAG overhead across many file operations.
Metadata-only operations (rename, chmod, mkdir, unlink) are submitted immediately as tiny intentions — no buffering needed since they don’t involve CAS blobs.
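The batching behaviour might look like the following sketch. `TxgBuffer` is a hypothetical type, `UpdateFileOp` is trimmed to three fields, and coalescing repeated updates to the same inode within one transaction group is an assumption, not something the design mandates.

```rust
/// Trimmed version of the UpdateFileOp message (chunk list omitted).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UpdateFileOp {
    pub inode_id: u64,
    pub content_cid: Vec<u8>,
    pub size: u64,
}

/// Trimmed version of FsPayload: the ops of one transaction group.
#[derive(Debug, PartialEq, Eq)]
pub struct FsPayload {
    pub ops: Vec<UpdateFileOp>,
    pub txg: u64,
}

/// Accumulates file updates between flushes.
#[derive(Default)]
pub struct TxgBuffer {
    pending: Vec<UpdateFileOp>,
    next_txg: u64,
}

impl TxgBuffer {
    /// Queues an update. A later update to the same inode replaces the
    /// earlier one (assumed coalescing), so each inode appears once per txg.
    pub fn push(&mut self, op: UpdateFileOp) {
        let existing = self.pending.iter().position(|p| p.inode_id == op.inode_id);
        match existing {
            Some(i) => self.pending[i] = op,
            None => self.pending.push(op),
        }
    }

    /// Drains the buffer into a single payload, or returns None if nothing is dirty.
    pub fn flush(&mut self) -> Option<FsPayload> {
        if self.pending.is_empty() {
            return None;
        }
        let ops = std::mem::take(&mut self.pending);
        let txg = self.next_txg;
        self.next_txg += 1;
        Some(FsPayload { ops, txg })
    }
}
```

A periodic timer or an fsync handler would call `flush` and submit the returned payload as one signed intention.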
Read Path
FUSE read(inode, offset, len)
→ state.db lookup: inode/{id} → InodeEntry { content_cid, chunks }
→ Resolve offset to chunk:
If inline chunks: binary search ChunkRef list
If tree: walk indirect blocks (CAS reads) to find leaf
→ cas.get_range(leaf_cid, local_offset, local_len)
→ If CAS hit: return data
→ If CAS miss: request from CRUSH peers, block until available (or EIO on timeout)
Caching: An in-memory ARC cache (keyed by (cid, range)) serves hot reads without CAS filesystem I/O. Size the cache to available RAM (configurable, default 256 MiB).
Readahead: Detect sequential access patterns and prefetch upcoming chunks from CAS (or from peers if not local). This enables streaming video playback from a FUSE-mounted Lattice filesystem.
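The "resolve offset to chunk" step for the inline case is a binary search over the chunk list. The sketch below assumes the list is sorted by offset and non-overlapping; `resolve_offset` is a hypothetical helper, and `ChunkRef` mirrors the protobuf message.

```rust
/// Mirrors the ChunkRef protobuf message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChunkRef {
    pub cid: Vec<u8>,
    pub offset: u64,
    pub length: u64,
}

/// Returns the chunk containing file byte `offset`, plus the offset
/// local to that chunk. Returns None when `offset` is past the last chunk.
/// Assumes `chunks` is sorted by `offset` and non-overlapping.
pub fn resolve_offset(chunks: &[ChunkRef], offset: u64) -> Option<(&ChunkRef, u64)> {
    // Index of the first chunk that starts strictly after `offset`.
    let idx = chunks.partition_point(|c| c.offset <= offset);
    // The candidate is the chunk just before it, if any.
    let chunk = chunks.get(idx.checked_sub(1)?)?;
    let local = offset - chunk.offset;
    if local < chunk.length {
        Some((chunk, local))
    } else {
        None
    }
}
```

The returned pair maps directly onto the `cas.get_range(leaf_cid, local_offset, local_len)` call in the read path.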
Copy-on-Write
Every write produces new CAS blobs (new data blocks + new indirect blocks on the modified path). Unchanged subtrees are shared by CID. This is identical to ZFS’s CoW semantics:
Before: Root_v1 → [Indirect_A] → [Data_1, Data_2, Data_3]
Write to offset in Data_2:
After: Root_v2 → [Indirect_A'] → [Data_1, Data_2', Data_3]
(shared) (new) (shared)
Root_v1, Indirect_A, Data_2 remain in CAS (immutable).
A snapshot pinning Root_v1 retains the old tree.
Without a snapshot, old blobs become GC-eligible.
Snapshots
A snapshot is a single intention that records the current root state:
FsOp::Snapshot { name: "before-upgrade" }
apply():
→ Read current root inode state
→ Store SnapshotEntry { name, root_inode_id, root_content_state, created_at }
→ Cost: one intention (~100 bytes). No data copies.
Because all CAS blobs are immutable and content-addressed, the snapshot is just a metadata marker. The Merkle trees reachable from the snapshot's root_content_state are preserved as long as the snapshot exists. GC cannot collect blobs reachable from any live snapshot.
Browsing snapshots: Snapshots are read-only views. A FUSE mount could expose them as .snapshots/{name}/ virtual directories (like ZFS .zfs/snapshot/), resolving reads against the frozen inode table state.
Incremental replication (analogy to zfs send/recv): Diff two snapshots by walking their Merkle trees from the root. CIDs present in the new tree but absent from the old tree are the incremental changeset. Transfer only those blobs. The DAG provides the snapshot bookkeeping; CAS provides the block transfer.
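The snapshot diff can be sketched as a pruned tree walk. This is an illustration only: `Cid` is a small integer, `Tree` is a hypothetical in-memory map from each blob to its children, and `incremental_changeset` is not an existing API.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in content ID.
pub type Cid = u32;
/// Maps each blob to its child CIDs; data blocks map to an empty list.
pub type Tree = HashMap<Cid, Vec<Cid>>;

/// Collects every CID reachable from `root`.
pub fn reachable(tree: &Tree, root: Cid) -> HashSet<Cid> {
    let mut seen = HashSet::new();
    let mut stack = vec![root];
    while let Some(cid) = stack.pop() {
        if seen.insert(cid) {
            if let Some(children) = tree.get(&cid) {
                stack.extend(children.iter().copied());
            }
        }
    }
    seen
}

/// CIDs present under `new_root` but absent under `old_root`, sorted.
pub fn incremental_changeset(tree: &Tree, old_root: Cid, new_root: Cid) -> Vec<Cid> {
    let old = reachable(tree, old_root);
    let mut seen = HashSet::new();
    let mut changed = Vec::new();
    let mut stack = vec![new_root];
    while let Some(cid) = stack.pop() {
        // A CID already in the old tree brings its whole subtree with it,
        // so the walk can stop descending there.
        if old.contains(&cid) || !seen.insert(cid) {
            continue;
        }
        changed.push(cid);
        if let Some(children) = tree.get(&cid) {
            stack.extend(children.iter().copied());
        }
    }
    changed.sort_unstable();
    changed
}
```

Applied to the copy-on-write example above, the changeset is the new root, the rewritten indirect block, and the one rewritten data block; the two shared data blocks are never transferred.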
Conflict Resolution
The filesystem store does not use LWW for file content. It implements a tree CRDT with conflict markers.
File Content Conflicts
When two peers edit the same file offline, each produces a new UpdateFileOp with a different content_cid. These are concurrent intentions (neither author had seen the other’s write).
The state machine detects this via the heads mechanism (same as KVTable, but per-inode): |heads(inode_id)| > 1 means concurrent content updates.
Resolution strategy: Conflict files.
report.docx ← One version (most recent by HLC)
report.conflict-alice-2026-03-23T14:30.docx ← The other version
The state machine creates a synthetic directory entry pointing to the losing version’s CID. The user sees both files and chooses which to keep. Deleting the conflict file (or overwriting the main file) produces an intention that causally depends on both heads, collapsing the conflict.
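The naming scheme for the losing version can be sketched as a small string function. `conflict_name` is a hypothetical helper; inserting the marker before the last extension and treating a leading dot as part of the name are assumptions inferred from the example above.

```rust
/// Builds the conflict-file name for a losing version, placing the
/// marker before the final extension so the file type stays recognizable.
pub fn conflict_name(name: &str, author: &str, timestamp: &str) -> String {
    match name.rfind('.') {
        // A dot at index 0 (".bashrc") is part of the name, not an extension.
        Some(i) if i > 0 => format!(
            "{}.conflict-{}-{}{}",
            &name[..i],
            author,
            timestamp,
            &name[i..]
        ),
        _ => format!("{}.conflict-{}-{}", name, author, timestamp),
    }
}
```

For example, `conflict_name("report.docx", "alice", "2026-03-23T14:30")` yields `report.conflict-alice-2026-03-23T14:30.docx`, matching the listing above.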
Structural Conflicts (Tree CRDT)
Directory operations introduce global conflicts that no single inode's head list can detect (see Conflict Resolution Architecture, filesystem section). The state machine handles these at apply() time:
| Conflict | Detection | Resolution |
|---|---|---|
| Name collision (two peers create /docs/report.txt) | dir/{parent}/{name} has multiple heads | Auto-rename: report.txt, report (2).txt |
| Orphan (delete parent while child is created) | Child’s parent_inode points to a deleted inode | Re-parent to lost+found/ |
| Cycle (move A into B, move B into A concurrently) | Parent-chain walk during apply() | LWW on move operations — one move wins, other is reverted |
| Move + rename (move file, rename same file concurrently) | Inode has multiple parent pointers in heads | LWW on the inode’s parent/name, conflict file for content if applicable |
The DagQueries handle passed to apply() provides find_lca() and is_ancestor() for the causal context needed to resolve these correctly.
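The parent-chain walk used for cycle detection can be sketched as follows. `ParentMap`, `would_create_cycle`, and `apply_move` are hypothetical names, and rejecting the later move in a deterministic apply order is only one assumed way to realize "one move wins".

```rust
use std::collections::HashMap;

/// Child inode ID -> parent inode ID. The root has no entry.
pub type ParentMap = HashMap<u64, u64>;

/// True if re-parenting `moved` under `new_parent` would make `moved`
/// its own ancestor.
pub fn would_create_cycle(parents: &ParentMap, moved: u64, new_parent: u64) -> bool {
    let mut current = Some(new_parent);
    let mut steps = 0usize;
    while let Some(node) = current {
        if node == moved {
            return true;
        }
        // An acyclic chain visits at most parents.len() + 1 nodes; anything
        // longer means the state is already cyclic, so refuse the move.
        steps += 1;
        if steps > parents.len() + 1 {
            return true;
        }
        current = parents.get(&node).copied();
    }
    false
}

/// Applies the move only if it keeps the tree acyclic; returns whether it was applied.
pub fn apply_move(parents: &mut ParentMap, moved: u64, new_parent: u64) -> bool {
    if would_create_cycle(parents, moved, new_parent) {
        return false;
    }
    parents.insert(moved, new_parent);
    true
}
```

With directories A and B under the root, applying "move A into B" and then "move B into A" leaves A inside B; the second move is rejected because B is already an ancestor of A.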
FUSE Daemon (lattice-fs)
Following the Service Architecture “Kernel & User-Space” model, the FUSE daemon is an external process:
lattice-fs (daemon)
│
├── Connects to lattice-node via UDS (Unix Domain Socket)
├── Authenticates with a Service Token
├── Opens the target store (must be core:filestore type)
├── Mounts FUSE at a configured path
│
├── Read path: FUSE → inode lookup (state.db) → CAS get_range
├── Write path: FUSE → local page cache → fsync → CAS put_batch → submit intention
└── Watch path: subscribes to FsState events for live updates from other peers
Live updates: When another peer’s intention arrives and is projected, the FUSE daemon receives a WatchEvent (via the StreamProvider subscription) and invalidates affected inodes in its local cache. The next stat() or read() sees the updated state.
Crash isolation: If lattice-fs crashes, the Lattice node is unaffected. Unflushed writes in the page cache are lost (standard POSIX behavior — same as if the kernel crashed with dirty pages). CAS blobs from completed put_batch calls that lack a corresponding intention are orphans, cleaned up by GC after the grace period.
Selective Sync
Not all peers need all file content. A device can sync the full directory tree (metadata, via the DAG) but only fetch content for selected subtrees:
system/fs/sync_policy → SyncPolicy proto
message SyncPolicy {
repeated SyncRule rules = 1;
}
message SyncRule {
string path_prefix = 1; // e.g., "/photos/", "/videos/"
SyncMode mode = 2;
}
enum SyncMode {
FULL = 0; // Fetch all content CAS blobs
METADATA_ONLY = 1; // DAG only, no CAS blobs (directory listing works, reads fail)
ON_DEMAND = 2; // Fetch CAS blobs on first access, cache locally
}
The referenced_blobs() implementation filters CIDs based on the sync policy: only CIDs for FULL subtrees are reported to the CasManager as required. METADATA_ONLY and ON_DEMAND subtrees still show in readdir and stat (size and mtime come from inode metadata in state.db). For METADATA_ONLY, read() returns EIO; for ON_DEMAND, read() triggers a fetch on first access.
This enables the multi-device tiering pattern:
| Device | Policy | Behavior |
|---|---|---|
| Phone | /photos/ → ON_DEMAND, everything else METADATA_ONLY | Browse tree, fetch photos on tap |
| Laptop | /documents/ → FULL, /videos/ → ON_DEMAND | Documents always available, videos on demand |
| NAS | / → FULL | Everything, always. The cold archive. |
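Resolving a path to its sync mode might look like the sketch below. The design does not say how overlapping rules are ordered, so longest-prefix-wins and the caller-supplied default are assumptions; `mode_for` and `is_required` are hypothetical helpers.

```rust
/// Mirrors the SyncMode enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncMode {
    Full,
    MetadataOnly,
    OnDemand,
}

/// Mirrors the SyncRule message.
pub struct SyncRule {
    pub path_prefix: String,
    pub mode: SyncMode,
}

/// Picks the mode of the longest matching prefix (assumed precedence),
/// falling back to `default` when no rule matches.
pub fn mode_for(rules: &[SyncRule], path: &str, default: SyncMode) -> SyncMode {
    rules
        .iter()
        .filter(|rule| path.starts_with(rule.path_prefix.as_str()))
        .max_by_key(|rule| rule.path_prefix.len())
        .map(|rule| rule.mode)
        .unwrap_or(default)
}

/// Only FULL subtrees have their CIDs reported as required.
pub fn is_required(mode: SyncMode) -> bool {
    mode == SyncMode::Full
}
```

The phone row of the table corresponds to a single `/photos/` rule with `OnDemand` and a default of `MetadataOnly`.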
Performance Characteristics
Throughput Estimates
| Operation | Mechanism | Expected Performance |
|---|---|---|
| readdir (100 entries) | redb prefix scan, pure local | < 1 ms |
| stat | redb point lookup | < 0.1 ms |
| Sequential read (local) | ARC cache + CAS filesystem read | 200-500 MB/s (SSD-bound) |
| Sequential read (LAN peer) | QUIC chunk streaming | 50-200 MB/s (network-bound) |
| Random read (cached) | In-memory ARC | Sub-millisecond |
| Random read (CAS miss, local) | redb lookup + filesystem seek | ~0.1-1 ms (SSD) |
| File write (buffered) | RAM page cache | Memory speed |
| fsync (1 MB dirty) | CAS put + sign + commit | ~5-15 ms |
| fsync (100 MB dirty, txg flush) | Batch CAS put + sign + commit | ~200-500 ms |
| mkdir / rename / unlink | Single intention (tiny) | ~1-5 ms |
| Snapshot | Single intention (tiny) | ~1-5 ms |
Write Amplification
Per file save (fsync):
- CAS blob writes (one per new/modified chunk + indirect blocks)
- Intention write (log.db — one redb transaction)
- Witness record write (log.db — batched with above)
- Projection write (state.db — one redb transaction)
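Total: 3 durable writes per fsync, counting the CAS blob batch as one, plus one log.db transaction (intention and witness record together) and one state.db transaction. This is comparable to a journaling filesystem (ext4: 1-2 writes). The transaction-group batching model amortizes the cost across many file operations.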
Scaling Limits
- Intention signing: Ed25519 at ~60 us/op = ~16K intentions/sec. With txg batching (many file ops per intention), this is not the bottleneck for filesystem workloads.
- Inode count: redb handles millions of keys. 1M files = ~1M inode entries + ~1M directory entries = manageable.
- CAS blob count: At 1 MiB average chunk size, 1 TB of data = ~1M blobs. The sharded blobs/{shard}/{cid} layout handles this (256 shards, ~4K files per shard).
- DAG depth: A filesystem with 1M file operations over its lifetime has a deep DAG. LCA computation is O(depth) but only needed for conflict resolution, not normal reads.
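The arithmetic behind these estimates can be checked with a few helper functions. The names are illustrative, the calculation takes 1 TB as 2^40 bytes, ignores indirect blocks, and assumes blobs spread evenly across shards.

```rust
/// Signatures per second given the cost of one signature in microseconds.
pub fn signs_per_sec(micros_per_sign: u64) -> u64 {
    1_000_000 / micros_per_sign
}

/// Number of data blobs for `data_bytes` at an average chunk size (rounded up).
pub fn blob_count(data_bytes: u64, avg_chunk_bytes: u64) -> u64 {
    (data_bytes + avg_chunk_bytes - 1) / avg_chunk_bytes
}

/// Blobs per shard directory, assuming an even spread (rounded up).
pub fn blobs_per_shard(blobs: u64, shards: u64) -> u64 {
    (blobs + shards - 1) / shards
}
```

At 60 microseconds per signature this gives 16,666 intentions per second; 2^40 bytes at 1 MiB per chunk gives 1,048,576 blobs, or 4,096 per shard across 256 shards.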
Comparison with Alternatives
| Feature | Syncthing | Dropbox | Lattice FS |
|---|---|---|---|
| Architecture | Peer-to-peer | Client-server | Peer-to-peer |
| Conflict handling | Conflict files | Conflict copies (server merge) | Conflict files + tree CRDT |
| Large file sync | 128 KiB block-level | Proprietary chunking | Content-defined chunking (FastCDC) |
| Selective sync | Per-folder ignore patterns | Selective Sync (paid) | Per-subtree sync policy |
| Versioning | Optional per-folder file versioning (trash can, simple, staggered, external) | 180-day history (paid) | Full DAG history + snapshots |
| Offline support | Full | Limited (cached files) | Full (local-first) |
| Audit trail | None | Limited | Full — every change is a signed intention |
| Snapshots | No | No | Yes (ZFS-style, instant, cheap) |
| Integrity verification | Block-level hash | Opaque | Merkle tree (transitive, cryptographic) |
| Encryption | TLS in transit | At rest + transit (server has keys) | Store-level E2E (client-side) |
Store Type Registration
pub const STORE_TYPE_FILESTORE: &str = "core:filestore";
// In lattice-runtime:
self.with_opener(STORE_TYPE_FILESTORE, || {
direct_opener::<SystemLayer<FsState>>()
})
FsState implements StateLogic with type Event = FsEvent. The SystemLayer<FsState> wrapping provides the system leg (peers, hierarchy, invites) automatically. The FsState::apply() decodes FsPayload from the intention and mutates the inode/directory tables in TABLE_DATA.
Open Questions
- Chunk manifest threshold: At what file size should the inline ChunkRef list in InodeEntry be replaced by a CAS blob manifest? Candidate: when the chunk list exceeds ~64 KB (roughly 1500 chunks at 40 bytes each, corresponding to a ~1.5 GB file). Above this, the intention payload should contain only a manifest CID.
- Hardlinks: ZFS supports hardlinks (multiple directory entries pointing to the same inode). This is feasible (the inode table is separate from the directory table) but adds refcounting complexity for GC. Defer to a later milestone unless required.
- Extended attributes (xattrs): Useful for macOS metadata, SELinux labels, etc. Could be stored as additional keys under xattr/{inode_id}/{name}. Not critical for MVP.
- mmap support: FUSE mmap requires the file to be locally available in full. With lazy CAS replication, this means either blocking until all chunks are fetched, or limiting mmap to files with SyncMode::FULL. Needs investigation.
- Deduplication across files: Content-defined chunking provides implicit dedup within a store (same chunk content = same CID). Cross-store dedup is not supported (per-store CAS sidecar). Within-store dedup is free and automatic.