Description
Problem
The Cartesi Machine emulator uses backing files for memory ranges (flash drives, RAM, etc.) and maps these files into host memory.
When a sparse file is given as a backing image, the emulator does not take advantage of its sparseness: holes are indistinguishable from zero-filled regions during initialization, and freed blocks inside a guest never become holes in the backing file on the host. This wastes both disk space and initialization time.
We propose to:
- Read side: When loading a sparse backing file, skip holes during initialization (avoid page faults and unnecessary hashing).
- Write side: When memory pages become zero (e.g., because a guest filesystem freed blocks), convert them back into holes in the backing file.
Proper use of sparse files would allow us to define a Cartesi Machine with truly massive, initially zeroed-out drives without any impact on emulator performance and host resource usage.
This, in turn, would allow applications that can grow their guest state essentially indefinitely, and even applications that only temporarily use additional guest/host space.
Hash-Tree-Based Hole Punching
Detection of holes at run-time
During hash tree updates, the emulator already iterates over dirty pages and calls is_pristine() on each one to detect all-zero pages (using SIMD-optimized checks). Pages that are pristine receive precomputed hashes rather than being hashed from their contents. This is existing behavior.
The key insight is that we can piggyback on this existing mechanism. When a dirty page is detected as pristine during a hash tree update, we additionally call fallocate(FALLOC_FL_PUNCH_HOLE) on the corresponding region of the backing file. This converts the zero-filled region into a hole, making the file sparse.
This approach works for all memory ranges in the emulator, not just flash drives. Any backing file for any memory range (RAM, flash drives, CMIO buffers, etc.) benefits automatically. The emulator does not need to know why a page became zero. It simply observes the zeros during the hash tree update it was already performing. There is no added cost: the is_pristine() check is already happening, and fallocate is only called for pages that are both dirty and pristine.
In practice, the main beneficiary is flash drives (pmem-backed), since they carry filesystem data where blocks are frequently allocated and freed. RAM backing files also benefit when large zero regions appear. CMIO buffers are typically small enough that the optimization has minimal impact on them.
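The piggybacking described above can be sketched as follows. This is a minimal illustration, assuming a 4KB page size; `page_is_pristine` and `punch_if_pristine` are hypothetical names standing in for the emulator's SIMD-optimized check and the proposed punch step.

```cpp
#include <cstddef>
#include <cstdint>
#ifdef __linux__
#include <fcntl.h>
#endif

static constexpr size_t PAGE_SIZE = 4096;

// All-zero check for one page (the role played by is_pristine()).
static bool page_is_pristine(const uint8_t *page) {
    for (size_t i = 0; i < PAGE_SIZE; ++i) {
        if (page[i] != 0) {
            return false;
        }
    }
    return true;
}

// Called for each dirty page during a hash tree update: if the page is
// pristine, convert the matching region of the backing file into a hole.
static void punch_if_pristine(int backing_fd, const uint8_t *page, uint64_t file_offset) {
    if (!page_is_pristine(page)) {
        return;
    }
#ifdef __linux__
    // Errors are ignored: a failed punch just leaves the zeros on disk.
    (void) fallocate(backing_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
        static_cast<off_t>(file_offset), PAGE_SIZE);
#else
    (void) backing_fd;
    (void) file_offset;
#endif
}
```

Since the pristine check already ran for hashing purposes, the only added work per zero page is the `fallocate` call itself.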
Read-side optimization
On the read side, the emulator will use lseek with SEEK_HOLE/SEEK_DATA to build a map of data vs. hole regions in the backing file before initialization. Hole regions are known to be zero, so the emulator can skip them entirely: no page faults, no memory reads, no hashing. They receive pristine hashes directly.
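A minimal sketch of that region scan, assuming Linux `lseek` semantics; `scan_data_regions` is an illustrative stand-in for the proposed `os::data_regions` view. On filesystems without `SEEK_HOLE`/`SEEK_DATA` support, the kernel's generic fallback reports the whole file as data, which is the safe answer.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <unistd.h>
#include <vector>

struct file_region {
    uint64_t offset;
    uint64_t length;
};

// Enumerate the data regions of fd in [0, length) using cheap kernel
// metadata queries (no I/O on the file contents).
static std::vector<file_region> scan_data_regions(int fd, uint64_t length) {
    std::vector<file_region> regions;
    uint64_t pos = 0;
    while (pos < length) {
        off_t data = lseek(fd, static_cast<off_t>(pos), SEEK_DATA);
        if (data < 0 || static_cast<uint64_t>(data) >= length) {
            break; // no more data before the end of the range
        }
        off_t hole = lseek(fd, data, SEEK_HOLE); // every file ends in a virtual hole
        if (hole < 0) {
            regions.push_back({static_cast<uint64_t>(data),
                length - static_cast<uint64_t>(data)});
            break;
        }
        uint64_t end = std::min<uint64_t>(static_cast<uint64_t>(hole), length);
        regions.push_back({static_cast<uint64_t>(data),
            end - static_cast<uint64_t>(data)});
        pos = end;
    }
    return regions;
}
```

Everything outside the returned regions is a hole, so those pages can be assigned pristine hashes directly without ever being faulted in.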
Write-side optimization
When the emulator stores a machine to disk, it writes the contents of each memory range to a file. The existing create_file() already preserves sparsity by checking is_pristine() on each page and skipping the write for zero pages. However, this still reads every page to perform the check, causing page faults for hole regions that the read-side optimization carefully avoided. The fix is to use the hash-tree again to identify and skip hole regions entirely when writing, without touching the memory at all. The ftruncate that sets the output file size already ensures skipped regions remain holes.
Guest-side options for generating zeros
For memory ranges under direct user control (e.g., application data in RAM), writing zeros is sufficient. The hash tree will detect the pristine pages and punch holes automatically.
For filesystems, however, the situation is different. When a filesystem frees blocks (e.g., when files are deleted), it typically only updates its metadata without zeroing the freed blocks. The stale data remains in memory and the hash tree will not detect those pages as pristine. The filesystem must be made to write zeros when it discards blocks. The following options all result in zero pages that the hash tree detects, ordered from most to least preferred:
Without DAX
- Device mapper target on pmem: A small custom DM target sits on top of the pmem block device. It passes reads and writes through to pmem, but intercepts `REQ_OP_DISCARD` bios and zeros the underlying memory. No kernel driver patches required. `fstrim` and `blkdiscard` work through standard block layer infrastructure.
- Patch the pmem driver: The Linux `nd_pmem` driver (`drivers/nvdimm/pmem.c`) does not currently support `REQ_OP_DISCARD`. A small patch adds discard support to `pmem_submit_bio()`, where the discard handler zeros the memory region. After this patch, `fstrim` and `blkdiscard` work and produce zero pages.
With DAX (not currently available)
DAX (CONFIG_FS_DAX) is not enabled in the current Cartesi kernel configuration because DAX support is still questionable even in recent kernels. These options are listed here for completeness, for when DAX support becomes available.
- pmem with DAX: When DAX is enabled, the pmem driver's `pmem_dax_zero_page_range()` callback already writes zeros to memory. The filesystem calls this through the DAX path when it needs to zero blocks (e.g., during `fallocate(FALLOC_FL_PUNCH_HOLE)`). No pmem patch is needed.
- DAXFS: DAXFS is a filesystem that mounts directly on physical memory (no block device layer). When DAXFS is merged, we could patch it to implement `fallocate`/`punch_hole` and zero freed pages directly. No pmem driver or block layer involvement at all.
HTIF-Based Optimization
The hash-tree-based approach requires the guest to write zeros to memory for every discarded block. This works correctly but is expensive in emulated CPU cycles: the guest must execute store instructions for every byte being zeroed, and the emulator must interpret each one.
This cost can be eliminated by notifying the emulator directly. Instead of writing zeros, the guest sends a message through HTIF telling the emulator to zero a memory range. The emulator then zeros the host memory directly (a fast host-side memset, or faster still: punching a hole in the backing file for a shared mapping, or remapping anonymous zero pages over a private one, see below).
Injecting the optimization for filesystem operations
For filesystem-triggered discards, the optimization is injected at the same point where the hash-tree-based approach would write zeros. Instead of zeroing all memory directly, the guest-side component optimizes this by using an SBI call, which OpenSBI routes to HTIF.
Limitation for non-filesystem memory ranges
The hash-tree-based approach is automatic for all memory ranges because it detects zeros regardless of their origin. The HTIF-based optimization, however, only works where the notification path has been wired up. For non-filesystem memory ranges (e.g., application code zeroing a region of RAM), there is no filesystem operation to intercept. The only way to extend the HTIF optimization to arbitrary memory would be to expose it as a new syscall (or similar interface) that guest applications can invoke directly to request that a memory range be discarded.
HTIF Mechanism
A single HTIF command zeros a naturally aligned, power-of-two-sized region of physical memory. The minimum clear size is 64KB, which amortizes the overhead of the SBI ecall and HTIF round-trip.
Encoding
The HTIF `tohost` register is 64 bits:
- Device (8 bits): a new device that zeros physical memory
- Command (8 bits): `n`, meaning the clear size is `2^n * 64KB`
- Reason + Data (48 bits): the starting physical address, split across the existing REASON field (bits 47:32) and DATA field (bits 31:0); the address must be aligned to the clear size

The entire region must fit inside the same PMA address range.
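As a sketch, packing and unpacking this command might look as follows. The field positions are taken from the layout above; the constant names (`HTIF_DEV_ZERO`, `LOG2_MIN_CLEAR`) are illustrative, with device number 3 as proposed in Part 6.

```cpp
#include <cstdint>

static constexpr uint64_t HTIF_DEV_ZERO = 3;
static constexpr int LOG2_MIN_CLEAR = 16; // 64KB minimum clear size

// Pack a zero command: clear 2^n * 64KB starting at paddr.
// Device in bits 63:56, command in bits 55:48, address in bits 47:0.
static uint64_t encode_zero_cmd(uint64_t paddr, uint64_t n) {
    return (HTIF_DEV_ZERO << 56) | (n << 48) | (paddr & 0xffffffffffffULL);
}

// Unpack on the emulator side.
struct zero_cmd {
    uint64_t paddr;
    int log2_size;
};

static zero_cmd decode_zero_cmd(uint64_t tohost) {
    zero_cmd c;
    c.paddr = tohost & 0xffffffffffffULL; // REASON (47:32) + DATA (31:0)
    c.log2_size = static_cast<int>((tohost >> 48) & 0xff) + LOG2_MIN_CLEAR;
    return c;
}

// Validation: the address must be aligned to the clear size.
static bool is_valid_zero_cmd(const zero_cmd &c) {
    uint64_t size = uint64_t{1} << c.log2_size;
    return (c.paddr & (size - 1)) == 0;
}
```

The remaining validation step (region fits within one PMA) needs the PMA table and is omitted here.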
Emulator behavior
When the emulator receives this command, it:
- Zeros the specified region in host memory by either:
  a. for a shared mapping, punching a hole in the backing file for the zeroed region via `fallocate(FALLOC_FL_PUNCH_HOLE)`;
  b. for a private mapping, remapping the zeroed region with `mmap(MAP_ANONYMOUS | MAP_FIXED)`.
- Marks the corresponding pages as dirty and pristine in the hash tree.
Clearing arbitrary regions
To zero an arbitrary region that is not naturally aligned or whose size is not a power-of-two multiple of 64KB, the guest decomposes it into a sequence of operations. The region may have a misaligned head and/or tail smaller than 64KB, which the guest clears itself via memset. The interior is covered by HTIF commands using chunks that grow in size up to the largest naturally aligned power-of-two that fits, then shrink back down toward the end.
For example, to clear a 500KB region starting at address 0xD000 (52KB, i.e. 12KB below the next 64KB boundary):
- `memset` the first 12KB (below the 64KB alignment boundary)
- HTIF clear 64KB
- HTIF clear 128KB
- HTIF clear 256KB
- `memset` the final 40KB (remainder below 64KB)
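The decomposition above can be sketched as a small pure function. This is illustrative (names like `decompose` and `clear_op` are not from the emulator): at each step it takes the largest power-of-two chunk that is both naturally aligned at the current address and no larger than what remains, which produces the grow-then-shrink sequence automatically.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

static constexpr uint64_t MIN_CHUNK = 64 * 1024; // 64KB HTIF minimum

struct clear_op {
    bool htif; // true: HTIF clear command, false: guest-side memset
    uint64_t addr;
    uint64_t len;
};

static uint64_t floor_pow2(uint64_t x) {
    uint64_t p = 1;
    while (p * 2 <= x) p *= 2;
    return p;
}

static std::vector<clear_op> decompose(uint64_t addr, uint64_t len) {
    std::vector<clear_op> ops;
    // Misaligned head below the first 64KB boundary: memset.
    uint64_t head = (MIN_CHUNK - (addr % MIN_CHUNK)) % MIN_CHUNK;
    head = std::min(head, len);
    if (head > 0) {
        ops.push_back({false, addr, head});
        addr += head;
        len -= head;
    }
    // Interior: aligned power-of-two HTIF chunks, growing then shrinking.
    while (len >= MIN_CHUNK) {
        uint64_t align = addr ? (addr & (0 - addr)) : ~uint64_t{0};
        uint64_t chunk = std::min(align, floor_pow2(len));
        ops.push_back({true, addr, chunk});
        addr += chunk;
        len -= chunk;
    }
    // Sub-64KB tail: memset.
    if (len > 0) {
        ops.push_back({false, addr, len});
    }
    return ops;
}
```

Running it on the worked example (500KB at 0xD000) reproduces the five steps listed above.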
Implementation in the emulator
A new `zero_memory(uint64_t paddr, int log2_size, uint64_t pma_index)` method is added to `i_state_access`, following the same pattern as `read_memory_word`/`write_memory_word` where the caller resolves the PMA and passes its index. Unlike `write_memory_with_padding` (which writes data followed by zero padding), `zero_memory` is a dedicated zeroing operation that can be optimized to skip backing file holes entirely. The `write_memory_with_padding` method stays untouched.
The `zero_memory` method is implemented in each state access class:
- `state_access` (direct execution): forwards to `machine::zero_memory()`, the central implementation that finds the PMA, uses the backing file's hole map to skip already-zero regions, and zeros only data regions.
- `record_step_state_access` (step recording): records the zero range (address and size) and forwards to `machine::zero_memory()`; it does not call `touch_page()` for pages in the range. Pages already touched before `zero_memory` are already in the log. The zero range is stored in the log so that `get_sibling_hashes_impl()` can refine the tree at the zero range boundaries, ensuring the sibling hashes needed for replay are captured. See Part 7 for details.
- `replay_step_state_access` (step replay): zeros logged pages within the range and replaces sibling hashes within the range with pristine hashes at the appropriate levels. See Part 7 for details.
The HTIF zero device also operates through the microarchitecture bridge. The `machine_uarch_bridge_state_access` issues a new ECALL (`ua_zero_memory_ECALL`), which `uarch_step` dispatches to `i_uarch_state_access::zero_memory()`. All signatures include `pma_index`. The uarch state access classes (`uarch_state_access`, `uarch_record_state_access`, `uarch_replay_state_access`, `collect_uarch_cycle_hashes_state_access`) each implement `do_zero_memory()`, following the same pattern as `write_tlb` and `mark_dirty_page`. The non-replay implementations forward to `machine::zero_memory()`.
Implementation Plan
Part 1: OS sparse file utilities
Modify: `src/os-features.h`
- Add `HAVE_PUNCH_HOLE` - defined on Linux (has `fallocate`) and macOS (has `fcntl(F_PUNCHHOLE)`). Higher-level code doesn't care about the mechanism.

Modify: `src/os.h`
- Add `void punch_hole(int fd, uint64_t offset, uint64_t length)` - guarded by `HAVE_PUNCH_HOLE`, platform-specific.
- Add `data_regions_view data_regions(int fd, uint64_t length)` - always available (unconditional). Lazy view usable in range-based for loops: `for (auto [offset, length] : os::data_regions(fd, len))`. The iterator calls `lseek(SEEK_DATA/SEEK_HOLE)` on each advance. Falls back to yielding a single `(0, length)` region on platforms without `SEEK_HOLE`/`SEEK_DATA`. No allocation.
- Add `hole_regions_view hole_regions(int fd, uint64_t length)` - same lazy iterator pattern but for holes. Seeks `SEEK_HOLE` then `SEEK_DATA`. Falls back to yielding nothing on unsupported platforms. Used by the Part 3 read side to find pages to pre-mark as clean/pristine.

Modify: `src/os.cpp`
- `punch_hole`: platform-specific (`fallocate` on Linux, `fcntl(F_PUNCHHOLE)` on macOS), only compiled under `HAVE_PUNCH_HOLE`.
- `data_regions`: unconditional, uses `lseek(SEEK_HOLE/SEEK_DATA)` with fallback to a full-file region.
- `hole_regions`: unconditional, uses `lseek(SEEK_HOLE/SEEK_DATA)` with fallback to an empty region.
Part 2: Expose backing file descriptor and add `address_range::zero_memory()`
- `src/os-mapped-memory.h` - Add `int get_backing_fd() const noexcept` (returns `m_backing_fd` on HAVE_MMAP, -1 otherwise). Still needed by Parts 3, 4, and 5 for querying the hole/data layout via `os::hole_regions()` and `os::data_regions()`.
- `src/address-range.h`:
  - Add non-virtual `int get_backing_fd() const noexcept` that calls `do_get_backing_fd()`, and protected `virtual int do_get_backing_fd() const noexcept { return -1; }`.
  - Add non-virtual `void zero_memory(uint64_t offset, uint64_t length)` that calls `do_zero_memory(offset, length)`, and protected `virtual void do_zero_memory(uint64_t offset, uint64_t length)` with a default that does a plain `memset` to zero.
- `src/memory-address-range.h`:
  - Override `do_get_backing_fd()` to delegate to `m_mapped.get_backing_fd()`.
  - Override `do_zero_memory(offset, length)`: for shared mappings, call `os::punch_hole(fd, offset, length)` on the backing file — the mmap reflects the zeros automatically. For private mappings, call `mmap(MAP_ANONYMOUS | MAP_FIXED)` over the range, replacing the existing pages so that subsequent reads fault in fresh zeroed pages without touching the backing file.
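The private-mapping branch can be sketched and exercised without a backing file at all. This is illustrative (`zero_by_remap` is a hypothetical helper, and a 4KB page size is assumed); the shared-mapping branch would call the proposed `os::punch_hole()` on the backing fd instead.

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

static constexpr size_t PG = 4096;

// Zero [base+offset, base+offset+length) by mapping fresh anonymous zero
// pages over it with MAP_FIXED; offset and length must be page-aligned.
// Reads then see zeros without the backing file ever being touched.
static bool zero_by_remap(unsigned char *base, size_t offset, size_t length) {
    void *target = base + offset;
    void *r = mmap(target, length, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return r == target;
}

// Round-trip demo: map anonymous memory, dirty a page, remap it to zeros.
static bool demo_zero_by_remap() {
    size_t len = 4 * PG;
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return false;
    }
    auto *base = static_cast<unsigned char *>(p);
    memset(base + PG, 0xab, PG); // dirty the second page
    if (!zero_by_remap(base, PG, PG)) {
        return false;
    }
    bool ok = base[PG] == 0 && base[2 * PG - 1] == 0 && base[0] == 0;
    munmap(base, len);
    return ok;
}
```

Replacing the pages (rather than `memset`) also releases the dirtied anonymous memory back to the host kernel.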
Part 3: Read side - skip holes during init
Pre-mark hole pages as clean with pristine hashes before the first `update_hash_tree()`, avoiding page faults entirely for hole regions. This extends the optimization suggested by the existing TODO in `memory-address-range.cpp` (which considers pre-initializing newly created backing stores as clean) to the more general case of any sparse backing file with holes.
How it works:
- The DPT constructor prepares the tree (all leaves dirty) but does NOT call `set_initialized()`.
- During hash tree init, for each address range: check `dpt.is_initialized()`. If not:
  a. Iterate `os::hole_regions(fd, length)` to find hole intervals via `lseek(SEEK_HOLE/SEEK_DATA)` (cheap kernel metadata, no I/O).
  b. For each page in a hole region:
     - Write the pristine page hash into the DHT leaf via `node_hash_view(offset, HASH_TREE_LOG2_PAGE_SIZE)`.
     - Mark it clean in the DPT via `mark_clean_page_and_up(offset)` (propagates upward: a parent becomes clean when both children are clean).
  c. Walk the DHT bottom-up level by level. At each level, for nodes that are clean in the DPT (both children are clean/pristine), write the pristine hash for that level. Nodes still dirty (at least one data-page descendant) are left for the normal `update_dense_trees()` to handle later.
  d. Call `dpt.set_initialized()`.
- When `update_hash_tree()` later runs, it only iterates dirty pages = data regions. No page faults for holes. `update_dense_trees()` only visits inner nodes above data pages.
- On subsequent loads, `dpt.is_initialized()` is already true (from the stored file), so the scan is skipped.

Modify:
- `src/dirty-page-tree.h` - Make `is_initialized()` and `set_initialized()` public. Remove the `set_initialized()` call from the constructor.
- `src/hash-tree.cpp` - Add a method (e.g., `init_pristine_holes`) that, for each address range with a backing fd and an uninitialized DPT: scans for holes, writes pristine hashes at the leaf level, marks clean in the DPT, then walks the DHT bottom-up writing pristine hashes at inner nodes where both children are pristine, and finally calls `set_initialized()`. Called from the hash tree constructor. The hash tree already has `m_pristine_hashes` (indexed by level) and access to all address ranges.
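The upward clean-propagation in step (b) can be modeled with a toy dirty page tree. This is a sketch only: a complete binary tree in 1-based heap layout over a dirty bitmap, with illustrative names, not the emulator's dirty-page-tree data structure.

```cpp
#include <cstddef>
#include <vector>

struct toy_dpt {
    int log2_leaves;
    std::vector<bool> dirty; // 1-based heap layout: node 1 is the root

    // Everything starts dirty, as after the DPT constructor.
    explicit toy_dpt(int n) : log2_leaves(n), dirty(size_t{2} << n, true) {}

    size_t leaf_node(size_t page) const {
        return (size_t{1} << log2_leaves) + page;
    }

    // Mark one page clean and propagate upward: a parent becomes clean
    // only once both of its children are clean.
    void mark_clean_page_and_up(size_t page) {
        size_t node = leaf_node(page);
        dirty[node] = false;
        while (node > 1) {
            size_t parent = node / 2;
            if (dirty[2 * parent] || dirty[2 * parent + 1]) {
                break; // sibling subtree still dirty, stop propagating
            }
            dirty[parent] = false;
            node = parent;
        }
    }

    bool is_dirty(size_t node) const { return dirty[node]; }
};
```

After all hole pages are marked this way, the clean inner nodes are exactly the subtrees that can receive precomputed pristine hashes in step (c).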
Part 4: Write side - hole punching in hash tree update
When dirty pages are detected as pristine during hash tree update, coalesce and punch holes in the backing file.
Integration point: hash_tree::update(), inside if (update_succeeded), before clean() calls. At this point the DHT is fully updated and the DPT still knows which pages were dirty.
Algorithm (for each changed address range with shared backing file):
- Iterate dirty pages left-to-right in the DPT. For each dirty page whose DHT hash equals the pristine hash:
- While this node is a left child and its right sibling is also pristine in the DHT (regardless of dirty status - DHT is fully up-to-date): move up a level.
- Punch a single hole at the (possibly coalesced) size. Skip all remaining dirty pages covered by this hole.
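The climb-and-coalesce loop can be sketched on a page bitmap. This is a toy: "subtree is pristine" is checked by scanning a bitmap here, whereas the real pass compares DHT node hashes against the per-level pristine hashes; names are illustrative.

```cpp
#include <cstddef>
#include <vector>

struct hole { size_t page; size_t count; };

static bool all_pristine(const std::vector<bool> &pristine, size_t begin, size_t count) {
    for (size_t i = begin; i < begin + count; ++i) {
        if (!pristine[i]) return false;
    }
    return true;
}

// dirty.size() is assumed a power of two. For each dirty pristine page,
// climb while the current node is a left child and its right sibling
// subtree is pristine too (dirty or not: the DHT is fully up to date),
// then emit one coalesced hole and skip everything it covers.
static std::vector<hole> coalesce_holes(const std::vector<bool> &dirty,
    const std::vector<bool> &pristine) {
    std::vector<hole> holes;
    size_t n = dirty.size();
    for (size_t p = 0; p < n; ++p) {
        if (!dirty[p] || !pristine[p]) continue;
        size_t begin = p;
        size_t count = 1;
        while (count < n && begin % (2 * count) == 0
            && all_pristine(pristine, begin + count, count)) {
            count *= 2; // left child with a pristine sibling: move up a level
        }
        holes.push_back({begin, count});
        p = begin + count - 1; // skip remaining pages under this hole
    }
    return holes;
}
```

Note that a coalesced hole may cover non-dirty pristine pages; punching them again is harmless, since they are already zero on disk.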
Modify:
- `src/hash-tree.cpp` (update) - add a hole-punching pass between update and clean.
- `src/hash-tree.h` - any needed interface changes (e.g., access to pristine hashes for comparison).
Part 5: Sparse-aware file writing
`os::create_file()` in `os-filesystem.cpp` already preserves sparsity by skipping `pwrite` for pristine pages. However, it still reads every page to call `is_pristine()`, causing page faults for hole regions we carefully avoided on the read side. The previous idea of using `os::data_regions(fd)` to skip holes only works for shared mappings, because private mappings may have dirty pages that diverged from the backing file's hole layout.
The approach instead is to use the hash tree (DHT), which works regardless of mapping type. Before storing, the hash tree must be up to date (call update_hash_tree() first). Then, when writing each address range to disk:
- Walk the DHT leaves. For each page whose DHT hash equals the pristine page hash, skip it (don't write, don't read the memory).
- To skip even larger regions without checking every leaf: walk up the DHT from pristine leaves. If a node's hash equals the pristine hash for its level, the entire subtree is zero — skip the whole range in one step. This is the same coalescing idea as Part 4, but applied to file writing instead of hole punching.
- For non-pristine pages, `pwrite` them to the output file as before. `ftruncate` already ensures the output file has the correct size, so skipped regions remain holes.
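The top-down skip can be sketched as a recursion over page ranges. This is a toy: the "hash equals pristine hash for its level" comparison is modeled by a zero-page bitmap (`subtree_pristine`), whereas the real code compares DHT node hashes against `m_pristine_hashes` and never reads guest memory.

```cpp
#include <cstddef>
#include <vector>

struct write_range { size_t page; size_t count; };

static bool subtree_pristine(const std::vector<bool> &zero, size_t begin, size_t count) {
    for (size_t i = begin; i < begin + count; ++i) {
        if (!zero[i]) return false;
    }
    return true;
}

// Walk the tree top-down over [begin, begin+count): a pristine subtree is
// skipped in one step (left as a hole in the output file); only
// non-pristine pages are emitted for pwrite.
static void collect_writes(const std::vector<bool> &zero, size_t begin, size_t count,
    std::vector<write_range> &out) {
    if (subtree_pristine(zero, begin, count)) {
        return; // whole subtree is zero: skip without visiting leaves
    }
    if (count == 1) {
        out.push_back({begin, 1}); // non-pristine page: pwrite it
        return;
    }
    collect_writes(zero, begin, count / 2, out);
    collect_writes(zero, begin + count / 2, count / 2, out);
}
```

With one non-zero page among eight, only that page is written; the two large pristine subtrees on either side are skipped at inner-node level.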
Modify:
- `src/os-filesystem.h` / `src/os-filesystem.cpp` - Add a DHT-aware overload of `create_file` (or modify the existing one) that accepts the DHT and pristine hashes, skipping pristine regions by walking the tree instead of reading memory.
- `src/machine.cpp` (`store_address_range`) - Update the hash tree before storing, and pass the DHT to `create_file`.
Part 6: HTIF zero device and i_state_access::zero_memory()
New HTIF device (device=3) that zeros a power-of-two aligned region of physical memory. Implemented via a new `zero_memory(uint64_t paddr, int log2_size, uint64_t pma_index)` method in `i_state_access`, where `paddr` is aligned to `2^log2_size`. The caller resolves the PMA and passes its index, following the same pattern as `read_memory_word`/`write_memory_word`. This is a dedicated zeroing operation; `write_memory_with_padding` stays untouched.
Encoding: Device=3, Command=n (clear size = 2^(n+16) bytes), Reason+Data (48 bits) = physical address
- `src/htif-defines.h` - Add `#define HTIF_DEV_ZERO_DEF 3`.
- `src/htif-constants.h` - Add `HTIF_DEV_ZERO` enum.
- `src/i-state-access.h` - Add `zero_memory(paddr, log2_size, pma_index)` that calls `derived().do_zero_memory(paddr, log2_size, pma_index)`.
- `src/i-device-state-access.h` - Add a virtual `read_pma(index)` method (so HTIF handlers can find PMAs), and a virtual `zero_memory(paddr, log2_size, pma_index)` method.
- `src/device-state-access.h` - Implement `read_pma` and `zero_memory` by delegating to `m_a`.
- `src/htif-address-range.cpp` - Add an `htif_zero()` handler:
  - Decode: `paddr = (HTIF_REASON_FIELD(tohost) << 32) | HTIF_DATA_FIELD(tohost)`, `log2_size = cmd + 16`.
  - Find the PMA using `find_pma` (from `find-pma.h`) via the new `a->read_pma()`, obtaining `pma_index`.
  - Validate: `paddr` aligned to `2^log2_size`, and the region fits within that PMA. If validation fails, return `execute_status::failure` so `interpret` sees the failure from `write_device` and reports it, same as any other invalid device write.
  - Call `a->zero_memory(paddr, log2_size, pma_index)`.
- `src/state-access.h` - Implement `do_zero_memory(paddr, log2_size, pma_index)` by forwarding to `m_m.zero_memory(paddr, log2_size, pma_index)`.
- `src/machine.h` / `src/machine.cpp` - Add `machine::zero_memory(uint64_t paddr, int log2_size, uint64_t pma_index)` as the central implementation. Uses `pma_index` to find the address range and delegates to `address_range::zero_memory(offset, length)`, which handles shared vs. private mappings internally (see Part 2). Used by the various `i_state_access` and `i_uarch_state_access` implementations, following the same pattern as `machine::write_unverified_tlb()`.
- `uarch/uarch-ecall.h` / `uarch/uarch-ecall.c` - Add `ua_zero_memory_ECALL(uint64_t paddr, uint64_t log2_size, uint64_t pma_index)` that issues an ECALL to the host.
- `uarch/machine-uarch-bridge-state-access.h` - Implement `do_zero_memory()` by calling `ua_zero_memory_ECALL(paddr, log2_size, pma_index)`.
- `src/uarch-solidity-compat.h` - Add a `zeroMemoryECALL(a, paddr, log2_size, pma_index)` wrapper that calls `a.zero_memory(paddr, log2_size, pma_index)`, following the pattern of `markDirtyPageECALL`, `writeTlbECALL`, etc.
- `src/uarch-step.h` - Handle the new ECALL by calling `zeroMemoryECALL(a, paddr, log2_size, pma_index)`.
- `src/i-uarch-state-access.h` - Add `zero_memory(paddr, log2_size, pma_index)` that calls `derived().do_zero_memory(paddr, log2_size, pma_index)`.
- `src/uarch-state-access.h` - Implement `do_zero_memory(paddr, log2_size, pma_index)` by forwarding to `m_m.zero_memory(paddr, log2_size, pma_index)`.
- `src/uarch-record-state-access.h` - Implement `do_zero_memory(paddr, log2_size, pma_index)`: logs the access and forwards to `m_m.zero_memory(paddr, log2_size, pma_index)`.
- `src/uarch-replay-state-access.h` - Implement `do_zero_memory(paddr, log2_size, pma_index)`: verifies from the log.
- `src/collect-uarch-cycle-hashes-state-access.h` - Implement `do_zero_memory(paddr, log2_size, pma_index)` as needed.
Part 7: Step log support for zero_memory
Problem
The step log's tree structure is determined by which pages are in the log. The top-down traversal (get_sibling_hashes_impl during recording, compute_root_hash_impl during replay) recurses into ranges that contain logged pages and emits sibling hashes for ranges that don't. When zero_memory zeroes a region, we don't want to log every page in that region. But the zeroed region may be smaller than the sibling node the traversal would normally emit for that area, so we can't simply replace a sibling with a pristine hash — the sibling might cover more than the zeroed region.
Solution: zero ranges as a second source of tree refinement
Zero ranges (naturally aligned, power-of-two-sized) always correspond to nodes in the binary tree. We store them in the log alongside pages. Both the recording and replay traversals use page indices and zero ranges to determine tree structure. Zero ranges force the traversal to split at their boundaries, even when no logged pages are present, producing sibling hashes at the right granularity.
Log structure change
Add a zero_ranges array to the step log binary format, after the existing siblings:
```
[existing header: root_hash_before, mcycle_count, root_hash_after, hash_function]
[page_count][pages...]
[sibling_count][siblings...]
[zero_range_count]                                                               // NEW
[(page_index: uint64, log2_page_count: uint8, root_sibling_index: uint64), ...]  // NEW, sorted by page_index
```
Each zero range entry stores `page_index`, `log2_page_count`, and `root_sibling_index`. The `root_sibling_index` field is scratch space, like the page hash field in page entries: the log allocates room for it, but replay overwrites it with the correct value derived from the verified traversal. The recording side can write any value (e.g., zero).
Recording
`record_step_state_access::do_zero_memory(paddr, log2_size, pma_index)`:
- Records `(paddr >> AR_LOG2_PAGE_SIZE, log2_size - AR_LOG2_PAGE_SIZE)` in the zero ranges list.
- Forwards to `m_m.zero_memory(paddr, log2_size, pma_index)` to actually zero memory.
- Does NOT call `touch_page()` for pages in the range. Pages already touched before `zero_memory` are already in the log with their original (pre-zero) contents. Pages touched after `zero_memory` will be logged with zero contents.
Modified `get_sibling_hashes_impl()`: The traversal gains a `next_zero` iterator over the sorted zero ranges, in addition to the existing `next_page` iterator. At each node:
- Exact match: If this node exactly matches the next zero range:
  - If there are no logged pages within: emit one sibling hash for the whole range and advance `next_zero`.
  - If there are logged pages within: advance `next_zero`, then fall through to recursion. The recursion refines the zero range's subtree, producing sub-siblings that capture the internal structure around the logged pages.
- No pages, no zeros in range: emit one sibling hash (unchanged from today).
- Otherwise: recurse into the left and right halves. This is triggered when the range contains logged pages, or when it contains a zero range that doesn't match exactly (forcing a split at the zero range boundary).
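The structure-selection logic can be sketched as a toy traversal over page indices, ignoring the hashes themselves and recording only where sibling hashes would be emitted. Names are illustrative; ranges are in pages.

```cpp
#include <cstddef>
#include <vector>

struct zrange { size_t page; size_t count; }; // count is a power of two
struct span { size_t begin; size_t end; };    // an emitted sibling range

struct traversal {
    const std::vector<size_t> &pages;  // sorted logged page indices
    const std::vector<zrange> &zeros;  // sorted, node-aligned zero ranges
    size_t next_page = 0;
    size_t next_zero = 0;
    std::vector<span> siblings;

    bool has_pages_in(size_t b, size_t e) const {
        return next_page < pages.size() && pages[next_page] >= b && pages[next_page] < e;
    }

    void visit(size_t b, size_t e) {
        bool zero_match = next_zero < zeros.size() && zeros[next_zero].page == b
            && zeros[next_zero].count == e - b;
        bool zero_inside = next_zero < zeros.size() && zeros[next_zero].page < e;
        if (zero_match) {
            ++next_zero; // consume the range either way
            if (!has_pages_in(b, e)) {
                siblings.push_back({b, e}); // one sibling for the whole range
                return;
            }
            // fall through: recursion refines around the logged pages
        } else if (!has_pages_in(b, e) && !zero_inside) {
            siblings.push_back({b, e}); // unchanged from today
            return;
        }
        if (e - b == 1) {
            ++next_page; // leaf holding a logged page
            return;
        }
        size_t m = b + (e - b) / 2;
        visit(b, m);
        visit(m, e);
    }
};
```

With no logged pages at all, a zero range covering pages [2, 4) of an 8-page tree forces the split [0,2), [2,4), [4,8) instead of a single [0,8) sibling, which is exactly the refinement replay needs.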
Replay
Modified `compute_root_hash_impl()`: Mirrors the recording traversal exactly, using the same `next_zero` iterator logic. This ensures the replay traversal constructs the same tree shape and consumes pages, siblings, and zero ranges in the same order. When the traversal reaches a zero range node, it writes the current `next_sibling` value into the entry's `root_sibling_index` field — same pattern as the page hash scratch area.
replay_step_state_access::do_zero_memory(paddr, log2_size, pma_index): First, look up the zero range entry matching (paddr, log2_size). If no matching entry exists, the log is invalid — throw an error (same as "too few sibling hashes in log"). Then, two steps:
- Zero logged pages: For each page in the log within `[page_index, page_index + page_count)`: `memset` its data to zero and set its hash to the pristine page hash.
- Replace sub-siblings with pristine hashes: Read `root_sibling_index` from the zero range entry (written by `compute_root_hash_impl` during initial root hash verification). Walk the tree within the zero range using page indices to determine structure and replace each encountered sibling with the pristine hash for its level:
```
replace_siblings_with_pristine(page_index, log2_page_count, &next_page, &next_sibling):
    page_count = 1 << log2_page_count
    end = page_index + page_count
    has_pages = (next_page < total && pages[next_page].index < end)
    if !has_pages:
        // Sub-range has no logged pages — it is a sibling, replace with pristine
        siblings[next_sibling] = pristine_hashes[log2_page_count + AR_LOG2_PAGE_SIZE]
        next_sibling++
        return
    if log2_page_count > 0:
        replace_siblings_with_pristine(page_index, log2_page_count - 1, ...)
        replace_siblings_with_pristine(page_index + (page_count >> 1), log2_page_count - 1, ...)
    else:
        // Leaf: logged page already zeroed in step 1
        if pages[next_page].index == page_index:
            next_page++
        else:
            // Leaf with no logged page — sibling, replace with pristine
            siblings[next_sibling] = pristine_page_hash
            next_sibling++
```
Correctness
Initial root hash verification (first compute_root_hash): Uses original page data and original sibling hashes. Pages within the zero range that were logged before zero_memory contribute to the tree normally — their hashes are verified through the tree structure. Sub-ranges within the zero range that have no logged pages use sibling hashes representing the original state. The computed root hash matches. As a side effect, the correct root_sibling_index values are written into the zero range entries.
After do_zero_memory: Logged pages within the range are zeroed and their hashes set to pristine. Sub-siblings within the range are replaced with pristine hashes at the appropriate levels. All data within the zero range now hashes to pristine.
Final root hash computation (second compute_root_hash): Uses the modified pages and siblings. The zero range's subtree correctly hashes to the pristine hash for its level. The rest of the tree is unchanged. The computed root hash matches the expected post-step hash.
Files
- `src/record-step-state-access.h` - Add an `m_zero_ranges` list, implement `do_zero_memory()`, modify `get_sibling_hashes_impl()` and `get_sibling_hashes()` to pass and consume zero ranges, write zero ranges in `finish()`.
- `src/replay-step-state-access.h` - Parse zero ranges from the log, add `find_zero_range()` (binary search by `page_index`, throws if not found), modify `compute_root_hash_impl()` to use zero ranges for tree structure and write `root_sibling_index` into entries, implement `do_zero_memory()` with `replace_siblings_with_pristine()`.
Part 8: Cartesi syscall for memory zeroing
A new Linux syscall (or ioctl on a Cartesi device) that allows guest userspace applications to efficiently zero a potentially large region of memory by issuing HTIF zero commands. This extends the HTIF-based optimization to non-filesystem memory ranges, where there is no filesystem discard operation to intercept.
Interface:
- `syscall(__NR_cartesi_zero_memory, void *addr, size_t length)` - zeros the virtual memory range `[addr, addr+length)` from the guest's perspective.
Core kernel helper (exported for use by other kernel modules):
- Takes a physical address and length
- Decomposes into the power-of-two aligned sequence described in "Clearing arbitrary regions": memset for the sub-64KB head and tail, SBI ecalls for the interior chunks (growing then shrinking powers of two)
- Issues one legacy SBI call per chunk, which OpenSBI encodes into the HTIF tohost register
Syscall implementation:
- Uses standard kernel APIs (e.g., `get_user_pages()`) to translate the virtual range to physical pages.
- For each contiguous physical region, calls the core kernel helper.
Files (in the Linux kernel tree, not the emulator):
- New built-in driver with the core helper function and syscall/ioctl
- Userspace header with the syscall number / ioctl definition
- The core helper is exported (`EXPORT_SYMBOL`) for use by the DM target in Part 9.
Part 9: Device mapper target on pmem
A small custom device mapper target that sits on top of the pmem block device. It passes reads and writes through to pmem unchanged, but intercepts `REQ_OP_DISCARD` bios and uses the core kernel helper from Part 8 to issue HTIF zero commands. This allows standard Linux filesystem discard operations (`fstrim`, `blkdiscard`, and mount-time `-o discard`) to transparently reclaim backing file space through the emulator's hole punching, without any kernel driver patches.
Implementation:
- A built-in driver implementing `struct dm_target` with `.map` and `.io_hints` callbacks.
- `.map` callback: for `REQ_OP_DISCARD` bios, translate the bio's sector offset to a physical address (`pmem_start + bio->bi_sector * 512`), then call the core kernel helper from Part 8. For all other bio types, pass through to the underlying pmem device.
- `.io_hints` callback: advertise discard support with page-size (4KB) `discard_granularity`. The core helper handles the decomposition: memset for the sub-64KB head/tail, HTIF for aligned interior chunks. The filesystem doesn't need to know about the 64KB HTIF minimum.
Files (in the Linux kernel tree, not the emulator):
- A new built-in driver under `drivers/md/`.
- Depends on the exported helper from Part 8.
Implementation order
- Parts 1+2 (OS utilities + expose fd) - foundation, no behavioral change
- Part 3 (read-side optimization)
- Part 4 (write-side hole punching)
- Part 5 (sparse-aware file writing)
- Parts 6+7 (HTIF zero device + step log support)
- Part 8 (Cartesi syscall + core kernel helper)
- Part 9 (DM target on pmem)