RFC: Memory Barriers API Design

| Author(s)           | @KurtWu10   |
|:-------------------:|:-----------:|
| Date of last update | 27 Feb 2026 |

This RFC is a work in progress.

## Overview
Programmers write sddf code in imperative languages such as C, Pancake or Rust. However, the actual runtime behaviour on hardware can differ from a simple sequential execution of the program text: for example, both the [compiler](https://lwn.net/Articles/793253/) and the hardware may reorder accesses to memory and device registers. These reorderings are optimisations that aim, among other things, to improve performance and to simplify hardware implementation.

A common case where explicit barriers are NOT required is for ordering accesses to device registers only: sddf's uncached memory maps to strongly-ordered device memory of the corresponding architectures, which prevents reordering by the hardware; sddf also uses volatile accesses to prevent reordering by the compiler.

However, barriers are sometimes required.

### Example
Consider a snippet in `tx_provide()` of the `imx` Ethernet driver's transmit path:
https://github.com/au-ts/sddf/blob/544bc7c2932c03456c0a0316ccb89b687588a72e/drivers/network/imx/ethernet.c#L151-L153

The [`update_ring_slot()` function](https://github.com/au-ts/sddf/blob/544bc7c2932c03456c0a0316ccb89b687588a72e/drivers/network/imx/ethernet.c#L62-L74) first writes the address and packet length to a descriptor (in device memory), then updates the 'own' bit of the descriptor to transfer ownership of the descriptor to the device.

After that, L153 triggers the device by writing to a device doorbell register. Since there is no barrier between the update to the 'own' bit inside the `update_ring_slot()` function and the update at L153 to the device doorbell register, they could be reordered on Arm and RISC-V, such that when the device is awakened, it finds that the descriptor is not owned by it and cannot start transmission immediately.

(I have discussed this issue with Courtney, and we agree that we accept this relaxed behaviour, because the device is guaranteed to see the update to the own bit when a second packet occurs.)

To prevent these sometimes undesired executions, programmers must explicitly express their ordering requirements in sddf to both the compiler and the hardware. This can be achieved using fence operations (a.k.a. barriers).
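
To illustrate where such a barrier would sit if the ordering were required, here is a minimal sketch. It is not the actual `imx` driver code: the descriptor layout, `EXAMPLE_DESC_OWN`, `example_iowmb()` and `example_tx_provide()` are hypothetical names, and the barrier instructions are simply copied from the `iowmb()` row of the draft table in this RFC, so they carry the same caveats.

```c
#include <stdint.h>

/* Hypothetical write barrier for this sketch (not an existing sddf macro).
 * Instructions follow the iowmb() row of the draft table in this RFC. */
#if defined(__aarch64__)
#define example_iowmb() __asm__ volatile("dmb st" ::: "memory")
#elif defined(__riscv)
#define example_iowmb() __asm__ volatile("fence w, o" ::: "memory")
#else
/* Assumption: a compiler barrier is sufficient on x86. */
#define example_iowmb() __asm__ volatile("" ::: "memory")
#endif

/* Hypothetical descriptor layout and ownership bit. */
#define EXAMPLE_DESC_OWN ((uint16_t)(1u << 15))

struct example_desc {
    volatile uint32_t addr;
    volatile uint16_t len;
    volatile uint16_t stat;
};

static void example_tx_provide(struct example_desc *desc,
                               volatile uint32_t *doorbell,
                               uint32_t addr, uint16_t len)
{
    /* Fill in the descriptor, then hand it to the device. */
    desc->addr = addr;
    desc->len = len;
    desc->stat = EXAMPLE_DESC_OWN;

    /* Order the 'own' bit update before the doorbell write, so the device
     * sees an owned descriptor when it is triggered. */
    example_iowmb();

    *doorbell = 1;
}
```

Whether the write of the 'own' bit also needs to be ordered after the address and length writes depends on the memory type of the descriptor ring; the sketch only covers the ordering discussed above.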

## Current Implementation
In C, sddf currently uses [fence operations](https://github.com/au-ts/sddf/blob/544bc7c2932c03456c0a0316ccb89b687588a72e/include/sddf/util/fence.h#L11-L38) provided by the C standard library (for the [C11 standard](https://open-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf)[^1]). In Pancake, Foreign Function Interface (FFI) calls are required (if necessary). Similar fence operations exist in Rust.

### Issues
The major issue with using these operations is that they are not designed for device interaction in general. These operations are designed for shared-memory communications between processors (on _normal_ memory), and they could be insufficient for processors communicating with devices:
- On RISC-V, for example, [`FENCE` instructions](https://docs.riscv.org/reference/isa/unpriv/rv32.html#fence) with the I and O bits are used for ordering device operations, but the compiler will not use these bits for fence operations, as they are irrelevant to thread synchronisation. This means that using these fence operations does not provide the required ordering for devices.

On other architectures, while the code generated by today's compilers is usually correct, the inherent incompatibility of processor communication and device communication [complicates](https://github.com/au-ts/sddf/pull/641#discussion_r2838885976) reasoning.
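
The RISC-V point can be made concrete with a small sketch. Both function names are hypothetical. The comment about the generated instruction assumes the compiler follows the C/C++ mapping recommended in the RISC-V unprivileged specification; actual output depends on the compiler and its version.

```c
#include <stdatomic.h>

/* Thread-oriented fence from C11. Assuming the recommended RISC-V mapping,
 * this becomes `fence rw, w`: only the memory (R/W) bits are set, so
 * accesses to I/O regions are not ordered by it. */
static inline void example_thread_release_fence(void)
{
    atomic_thread_fence(memory_order_release);
}

/* Hypothetical device-oriented fence: orders earlier memory writes (W)
 * before later device output (O), which the C11 fence cannot express. */
static inline void example_device_write_fence(void)
{
#if defined(__riscv)
    __asm__ volatile("fence w, o" ::: "memory");
#else
    /* Placeholder so the sketch builds on other architectures: this is a
     * compiler barrier only and does not order hardware accesses. */
    __asm__ volatile("" ::: "memory");
#endif
}
```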

## Linux Implementation
Linux provides various [Linux kernel memory barriers](https://github.com/torvalds/linux/blob/v6.19/Documentation/memory-barriers.txt) for the same purpose. These operations provide a common API that abstracts the differences among architectures.

For the purpose of this RFC, the relevant operations are barriers for DMA devices `dma_wmb()/dma_rmb()` and accessors for MMIO registers `readX()/writeX()/readX_relaxed()/writeX_relaxed()`.

## Proposal

### Solution 1
A simple solution is to use an _architecture-independent_ API for barriers (`dma_wmb()/...`), which is coarse-grained and may be implemented by more than one instruction. This simplifies reasoning about their correctness[^2]. The current implementation for `cache_clean/cache_invalidate` is an example of this.

For this solution, we need to understand whether the memory in sddf uses the same attributes as Linux (e.g. Device-nGnRE on Armv8), because the Linux barriers are designed only for memory with default attributes. (In fact it does not: sddf's uncached memory is [Device-nGnR**nE**](https://github.com/seL4/seL4/blob/7dc04b9a4c84cfb27d14d635cd48d513a48ecca2/src/arch/arm/64/kernel/vspace.c#L698), while Linux uses [Device-nGnR**E**](https://github.com/torvalds/linux/blob/v6.19/arch/arm64/include/asm/io.h#L270).)

Strictly following the Linux implementation may incur some additional overhead. For example,
- the Linux specification for `readX()` and `writeX()` operations explicitly mentions that
  > A `writeX()` issued by a CPU thread holding a spinlock is ordered before a `writeX()` to the same peripheral from another CPU thread issued after a later acquisition of the same spinlock.

  while the PDs in sddf are always single-threaded. On RISC-V, the corresponding `mmiowb()` can be removed.
- Another case is that the `readX()` operation in Linux is required to complete before any subsequent `delay()` loop. If sddf does not rely on it,
  - on Arm64, the [extra code](https://github.com/torvalds/linux/blob/v6.19/arch/arm64/include/asm/io.h#L105-L113) in Linux's read I/O barrier can be removed
  - on RISC-V, implementation of the I/O read barrier could be simplified.

A drawback is that, in Pancake, using an architecture-independent API may introduce additional overhead for fence operations that are compiler barriers, as the language already [enforces](https://github.com/CakeML/cakeml/blob/master/pancake/NEWS.md#shared-memory-operations) the ordering between shared-memory operations.

Such a common interface may be [complex](https://lore.kernel.org/all/20180629142539.GH17271@n2100.armlinux.org.uk/t/#u), and platform-specific routines could be required.

#### Example: I/O accessors
**Assuming** that [load tearing and store tearing](https://lwn.net/Articles/793253/) do not occur with volatile accesses, the current implementation naturally maps to `readX_relaxed()` and `writeX_relaxed()` operations. In cases where ordering between memory and I/O accesses is required, I/O barriers including `__io_br()/__io_ar()/__io_bw()/__io_aw()` could be used.
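
A minimal sketch of how such accessors could be layered is given below. All `example_*` names are hypothetical, and the argument order of the write accessor mirrors Linux's `writeX(value, addr)`. The two barrier hooks are placeholders defined as compiler barriers only so that the sketch builds anywhere; a real implementation would substitute the architecture's I/O barrier, and only the after-read and before-write hooks are shown.

```c
#include <stdint.h>

/* Placeholder hooks (assumption): compiler barriers only. They do NOT order
 * hardware accesses on Arm or RISC-V and must be replaced per architecture. */
#define example_io_ar() __asm__ volatile("" ::: "memory") /* after read   */
#define example_io_bw() __asm__ volatile("" ::: "memory") /* before write */

/* Relaxed accessors: a single volatile access of the register's width,
 * relying on the no-tearing assumption stated above. */
static inline uint32_t example_read32_relaxed(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

static inline void example_write32_relaxed(uint32_t val, volatile void *addr)
{
    *(volatile uint32_t *)addr = val;
}

/* Ordered read: the register read is followed by a barrier, so later
 * memory accesses are not performed before it. */
static inline uint32_t example_read32(const volatile void *addr)
{
    uint32_t val = example_read32_relaxed(addr);
    example_io_ar();
    return val;
}

/* Ordered write: a barrier precedes the register write, so earlier memory
 * writes are visible before the device observes it. */
static inline void example_write32(uint32_t val, volatile void *addr)
{
    example_io_bw();
    example_write32_relaxed(val, addr);
}
```

A driver that only needs ordering among register accesses could keep using the relaxed variants, and pay for the barrier only where memory and I/O accesses interact.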

#### Draft Implementation
| macro       | x86                | arm64   | risc-v       |
|:-----------:|:------------------:|:--------:|:------------:|
| `dma_rmb()` | (compiler barrier) | `dmb ld` | `fence r, r` |
| `dma_wmb()` | (compiler barrier) | `dmb st` | `fence w, w` |
| `iormb()`   | (compiler barrier) | `dmb ld` | `fence i, r` |
| `iowmb()`   | (compiler barrier) | `dmb st` | `fence w, o` |
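
The table could be transcribed into a header along the following lines. This is a sketch under the assumption that the draft instructions above are adequate; it does not add any verification of them. `struct example_rx_desc` and `example_rx_poll()` are hypothetical and only show one possible use of `dma_rmb()`.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__aarch64__)
#define dma_rmb() __asm__ volatile("dmb ld" ::: "memory")
#define dma_wmb() __asm__ volatile("dmb st" ::: "memory")
#define iormb()   __asm__ volatile("dmb ld" ::: "memory")
#define iowmb()   __asm__ volatile("dmb st" ::: "memory")
#elif defined(__riscv)
#define dma_rmb() __asm__ volatile("fence r, r" ::: "memory")
#define dma_wmb() __asm__ volatile("fence w, w" ::: "memory")
#define iormb()   __asm__ volatile("fence i, r" ::: "memory")
#define iowmb()   __asm__ volatile("fence w, o" ::: "memory")
#elif defined(__x86_64__) || defined(__i386__)
/* Compiler barriers only, as in the x86 column of the table. */
#define dma_rmb() __asm__ volatile("" ::: "memory")
#define dma_wmb() __asm__ volatile("" ::: "memory")
#define iormb()   __asm__ volatile("" ::: "memory")
#define iowmb()   __asm__ volatile("" ::: "memory")
#else
#error "no barrier definitions for this architecture"
#endif

/* Hypothetical receive descriptor shared with a DMA device. */
struct example_rx_desc {
    uint32_t len;
    uint32_t own; /* non-zero while the device owns the descriptor */
};

/* Returns true and stores the length if the device has released the
 * descriptor; returns false if the device still owns it. */
static inline bool example_rx_poll(const volatile struct example_rx_desc *d,
                                   uint32_t *len)
{
    if (d->own) {
        return false;
    }
    /* Read the ownership flag before reading the rest of the descriptor. */
    dma_rmb();
    *len = d->len;
    return true;
}
```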

#### Interaction with Cache Maintenance Operations
Cache maintenance operations (`cache_clean()` and `cache_invalidate()`) should be stronger than the corresponding memory barriers (`dma_wmb()+iowmb()` and `dma_rmb()+iormb()` resp.), even on architectures with coherent DMA. This is to prevent redundant barrier operations like [this](https://github.com/au-ts/sddf/pull/641#discussion_r2838766441).

### Solution 2
Another potential implementation is to explicitly use _architecture-specific_ inline assembly, like
```C
#if defined(__aarch64__)
#define dmb() asm volatile("dmb sy" ::: "memory")
#endif
```

without going through an abstraction such as `dma_wmb()`, i.e. the main difference compared to the previous solution is in macro names. This makes the programmer's requirement even more explicit, but it may be unclear which operation a device driver should use.

## TODOs
1. implementation.
2. further explanations.

[^1]: The current sddf actually uses [compiler builtins for atomics](https://gcc.gnu.org/onlinedocs/gcc-15.2.0/gcc/_005f_005fatomic-Builtins.html) from GCC, which follow the C++11 memory model that is highly similar to the C11 memory model.
[^2]: To the best of my knowledge, there is no formal semantics for device interactions on any architecture that is supported by sddf. The Linux kernel memory consistency model, for example, explicitly [excludes](https://github.com/torvalds/linux/blob/v6.19/tools/memory-model/Documentation/explanation.txt#L86-L87) DMA. Arm's architecture reference manual (section B2.3, version M.a.a) and RISC-V's RVWMO memory consistency model similarly exclude device interactions. We can at best follow the Linux implementation and assume that it is correct.


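Snippet from `tx_provide()` referenced in the Example section (`drivers/network/imx/ethernet.c`, lines 151 to 153):

```C
update_ring_slot(&tx, idx, buffer.io_or_offset, buffer.len, stat);
tx.tail++;
eth->tdar = TDAR_TDAR;
```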
