RFC: Memory Barriers API Design #642

Author(s): @KurtWu10
Date of last update: 27 Feb 2026

This RFC is a work in progress.

Overview

Programmers write sddf code in imperative languages such as C, pancake or Rust. However, the actual runtime behaviour on hardware can differ from a simple sequential execution of the program text: for example, both the compiler and the hardware may reorder accesses to memory and device registers. These reorderings are optimisations, performed by the compiler and the hardware to improve execution performance and to simplify hardware implementation, among other reasons.

A common case where explicit barriers are NOT required is for ordering accesses to device registers only: sddf's uncached memory maps to strongly-ordered device memory of the corresponding architectures, which prevents reordering by the hardware; sddf also uses volatile accesses to prevent reordering by the compiler.

However, barriers are sometimes required.

Example

Consider a snippet in tx_provide() of the imx Ethernet driver's transmit path:

update_ring_slot(&tx, idx, buffer.io_or_offset, buffer.len, stat);
tx.tail++;
eth->tdar = TDAR_TDAR;

The update_ring_slot() function first writes the address and packet length to a descriptor (in device memory), then updates the 'own' bit of the descriptor to transfer the ownership of the descriptor to device.

After that, the write to eth->tdar (L153 of the driver source) triggers the device by writing to its doorbell register. Since there is no barrier between the update to the 'own' bit inside the update_ring_slot() function and the doorbell write at L153, the two stores could be reordered on Arm and RISC-V, such that when the device is woken, it finds that it does not own the descriptor and cannot start transmission immediately.

(I have discussed this issue with Courtney, and we agree that we accept this relaxed behaviour, because the device is guaranteed to see the update to the own bit when a second packet occurs.)

To prevent these sometimes undesired executions, programmers are required to explicitly express the requirements in sddf to both the compiler and the hardware. This could be achieved using fence operations (a.k.a. barriers).
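A minimal sketch of where such a fence would sit in the transmit path from the example above. The descriptor layout, the OWN_BIT value, the doorbell parameter and the io_write_fence() macro are all hypothetical names for illustration, not sddf's actual definitions. The instruction choices follow the draft table later in this RFC and are assumptions, not a vetted implementation.

```c
#include <stdint.h>

/* Hypothetical fence: orders earlier stores before a later MMIO store.
 * Instruction choices are assumptions taken from the draft table in this RFC. */
#if defined(__aarch64__)
#define io_write_fence() __asm__ volatile("dmb st" ::: "memory")
#elif defined(__riscv)
#define io_write_fence() __asm__ volatile("fence w,o" ::: "memory")
#else
/* e.g. x86: a compiler barrier is assumed to be sufficient */
#define io_write_fence() __asm__ volatile("" ::: "memory")
#endif

#define OWN_BIT (1u << 15) /* hypothetical position of the 'own' bit */

/* Simplified, hypothetical descriptor; a real driver defines its own layout. */
struct descr {
    volatile uint16_t len;
    volatile uint16_t stat;
    volatile uint32_t addr;
};

/* Fill one descriptor, hand it to the device, then ring the doorbell. */
static void tx_provide_one(struct descr *d, volatile uint32_t *doorbell,
                           uint32_t addr, uint16_t len, uint32_t doorbell_val)
{
    d->addr = addr;
    d->len = len;
    d->stat = (uint16_t)(d->stat | OWN_BIT); /* ownership passes to the device */
    io_write_fence();         /* keep the 'own' store ahead of the doorbell store */
    *doorbell = doorbell_val; /* wake the device */
}
```

Without the io_write_fence() call, this function has the same shape as the snippet above: nothing orders the 'own' store against the doorbell store.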

Current Implementation

In C, sddf currently uses fence operations provided by the C standard library (for the C11 standard; see footnote 1). In pancake, Foreign Function Interface (FFI) calls are required where a fence is needed. Similar fence operations exist in Rust.

Issues

The major issue with using these operations is that they are not designed for device interaction in general. These operations are designed for shared-memory communications between processors (on normal memory), and they could be insufficient for processors communicating with devices:

  • On RISC-V, for example, FENCE instructions with the I and O bits are used for ordering device operations, but the compiler will not use these bits for fence operations, as they are irrelevant to thread synchronisation. This means that using these fence operations does not provide the required ordering for devices.

On other architectures, while the code generated by today's compilers is usually correct, the inherent mismatch between processor communication and device communication complicates reasoning.
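The RISC-V case can be sketched as follows. The function names are hypothetical, and the claim that a C11 release fence becomes `fence rw,w` on RISC-V reflects the conventional compiler mapping rather than anything sddf guarantees. The non-RISC-V branch is only a placeholder so that the sketch compiles elsewhere.

```c
#include <stdatomic.h>
#include <stdint.h>

/* C11 fence: orders earlier normal-memory accesses before later normal-memory
 * writes. On RISC-V, compilers are expected to emit `fence rw,w`, which does
 * not name the device-output (O) set. */
static inline void release_fence_c11(void)
{
    atomic_thread_fence(memory_order_release);
}

/* Hypothetical device-aware fence: orders earlier memory writes (W) before
 * later device output (O). */
static inline void write_before_device_output(void)
{
#if defined(__riscv)
    __asm__ volatile("fence w,o" ::: "memory");
#else
    /* Placeholder on other targets, for illustration only. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Publish a payload in memory, then notify a device through an MMIO register. */
static void notify_device(uint32_t *payload, volatile uint32_t *mmio_reg,
                          uint32_t value)
{
    *payload = value;
    /* release_fence_c11() here would not cover the MMIO store on RISC-V. */
    write_before_device_output();
    *mmio_reg = 1;
}
```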

Linux Implementation

Linux provides various kernel memory barriers for the same purpose. These operations provide a common API that abstracts the differences among architectures.

For the purpose of this RFC, the relevant operations are barriers for DMA devices dma_wmb()/dma_rmb() and accessors for MMIO registers readX()/writeX()/readX_relaxed()/writeX_relaxed().

Proposal

Solution 1

A simple solution is to use an architecture-independent API for barriers (dma_wmb()/...), which is coarse-grained and may be implemented by more than one instruction. This simplifies reasoning about their correctness (see footnote 2). The current implementation for cache_clean/cache_invalidate is an example of this.

For this solution, we need to understand whether the memory in sddf uses the same attributes as Linux (e.g. Device-nGnRnE on Armv8), because the Linux barriers are designed only for memory with default attributes. (In fact it does not: sddf's uncached memory is Device-nGnRnE, while Linux uses Device-nGnRE.)

Strictly following the Linux implementation may incur some additional overhead. For example,

  • the Linux specification for readX() and writeX() operations explicitly mentions that

    A writeX() issued by a CPU thread holding a spinlock is ordered before a writeX() to the same peripheral from another CPU thread issued after a later acquisition of the same spinlock.

    while the PDs in sddf are always single-threaded. On RISC-V, the corresponding mmiowb() can be removed.

  • Another case is that the readX() operation in Linux is required to complete before any subsequent delay() loop. If sddf does not rely on it,

    • on Arm64, the extra code in Linux's read I/O barrier can be removed
    • on RISC-V, implementation of the I/O read barrier could be simplified.

A drawback is that, on pancake, using an architecture-independent API may introduce additional overhead for fence operations that are compiler barriers, as the language has already enforced the ordering between shared-memory operations.

Such a common interface may be complex, and platform-specific routines could be required.

Example: I/O accessors

Assuming that load tearing and store tearing do not occur with volatile accesses, the current implementation naturally maps to readX_relaxed() and writeX_relaxed() operations. In cases where ordering between memory and I/O accesses is required, I/O barriers including __io_br()/__io_ar()/__io_bw()/__io_aw() could be used.
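A sketch of how relaxed and ordered accessors could be layered, loosely modelled on Linux's readl_relaxed()/writel_relaxed() and readl()/writel(). All names below are hypothetical and are not an existing sddf API. The barrier macro is a compiler-only placeholder: a real implementation would substitute architecture-specific fences at each of the four marked positions.

```c
#include <stdint.h>

/* Placeholder for the I/O barriers; compiler-only, for illustration. */
#define mmio_barrier_sketch() __asm__ volatile("" ::: "memory")

/* Relaxed accessors: a single volatile access, assumed not to tear. */
static inline uint32_t mmio_read32_relaxed(const volatile uint32_t *reg)
{
    return *reg;
}

static inline void mmio_write32_relaxed(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;
}

/* Ordered read: barriers at the __io_br() and __io_ar() positions. */
static inline uint32_t mmio_read32(const volatile uint32_t *reg)
{
    mmio_barrier_sketch();              /* position of __io_br() */
    uint32_t val = mmio_read32_relaxed(reg);
    mmio_barrier_sketch();              /* position of __io_ar() */
    return val;
}

/* Ordered write: barriers at the __io_bw() and __io_aw() positions. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t val)
{
    mmio_barrier_sketch();              /* position of __io_bw() */
    mmio_write32_relaxed(reg, val);
    mmio_barrier_sketch();              /* position of __io_aw() */
}
```

A driver that only touches device registers would keep using the relaxed forms, which match today's plain volatile accesses, and switch to the ordered forms only where a register access must be ordered against memory accesses.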

Draft Implementation

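| macro     | x86                | arm64  | risc-v     |
|-----------|--------------------|--------|------------|
| dma_rmb() | (compiler barrier) | dmb ld | fence r, r |
| dma_wmb() | (compiler barrier) | dmb st | fence w, w |
| iormb()   | (compiler barrier) | dmb ld | fence i, r |
| iowmb()   | (compiler barrier) | dmb st | fence w, o |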
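The draft table could be transcribed into a header along these lines. This is a sketch only: it copies the table's instruction choices without validating them, and compiler_barrier() and read_len_if_done() are hypothetical helpers added for illustration.

```c
#include <stdint.h>

/* Compiler-only barrier: prevents reordering by the compiler. */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

#if defined(__aarch64__)
#define dma_rmb() __asm__ volatile("dmb ld" ::: "memory")
#define dma_wmb() __asm__ volatile("dmb st" ::: "memory")
#define iormb()   __asm__ volatile("dmb ld" ::: "memory")
#define iowmb()   __asm__ volatile("dmb st" ::: "memory")
#elif defined(__riscv)
#define dma_rmb() __asm__ volatile("fence r,r" ::: "memory")
#define dma_wmb() __asm__ volatile("fence w,w" ::: "memory")
#define iormb()   __asm__ volatile("fence i,r" ::: "memory")
#define iowmb()   __asm__ volatile("fence w,o" ::: "memory")
#elif defined(__x86_64__) || defined(__i386__)
#define dma_rmb() compiler_barrier()
#define dma_wmb() compiler_barrier()
#define iormb()   compiler_barrier()
#define iowmb()   compiler_barrier()
#else
#error "no barrier definitions for this architecture"
#endif

/* Usage sketch: read a DMA-written length only after observing the
 * device's 'done' flag. Returns 1 and stores the length if done, else 0. */
static inline int read_len_if_done(const volatile uint32_t *done,
                                   const volatile uint32_t *len,
                                   uint32_t *out)
{
    if (!*done) {
        return 0;
    }
    dma_rmb(); /* order the flag read before the payload read */
    *out = *len;
    return 1;
}
```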

Interaction with Cache Maintenance Operations

Cache maintenance operations (cache_clean() and cache_invalidate()) should be at least as strong as the corresponding memory barriers (dma_wmb()+iowmb() and dma_rmb()+iormb() respectively), even on architectures with coherent DMA. This means a driver that performs cache maintenance does not also need to issue a separate, redundant barrier.

Solution 2

Another potential implementation is to explicitly use architecture-specific inline assembly, like

#if defined(__aarch64__)
#define dmb() asm volatile("dmb sy" ::: "memory")
#endif

without going through the dmb_mb abstraction, i.e. the main difference compared to the previous solution is in macro names. This makes the programmer's requirement even more explicit, but it may be unclear which operation a device driver should use.

TODOs

  1. implementation.
  2. further explanations.

Footnotes

  1. The current sddf actually uses compiler builtins for atomics from gcc, which follow the C++11 memory model that is highly similar to the C11 memory model.

  2. To the best of my knowledge, there is no formal semantics for device interactions on any architecture that is supported by sddf. The Linux kernel memory consistency model, for example, explicitly excludes DMA. Arm's architecture reference manual (section B2.3, version M.a.a) and RISC-V's RVWMO memory consistency model similarly exclude device interactions. We can at best follow the Linux implementation and assume that it is correct.
