diff --git a/src/proposed/0200-domain-schedules.md b/src/proposed/0200-domain-schedules.md
new file mode 100644
index 0000000..b21c5e6
--- /dev/null
+++ b/src/proposed/0200-domain-schedules.md
@@ -0,0 +1,341 @@

# Runtime Domain Schedules

- Authors: Gerwin Klein, Rafal Kolanski, Indan Zupancic
- Proposed: 2025-09-10

## Summary

We propose to add the ability to set, maintain, and switch between domain
schedules at runtime, authorised by the existing `DomainCap`, while retaining
the current information flow proofs and behaviour when the `DomainCap` no
longer exists in the system or is not used.

## Motivation

Currently the domain schedule is provided as a `.h` file and compiled into the
kernel. This was an initial stop-gap implementation; it is inconvenient for SDK
approaches such as the Microkit, and it is overall inflexible and hard to use.

For instance, it is currently not possible to take a verified kernel binary and
modify it with a new domain schedule without invalidating the verification.

It is also currently not possible to run a system in multiple modes, such as
"wheels up" and "wheels down" in aviation.

The intended use for the proposed mechanism is that the initialiser sets up
domain schedules at boot time, for instance the two schedules in the example
above, or a single domain schedule as is currently provided in the `.h` file.
After that, if the system is to remain static, the `DomainCap` can be deleted
as it is now, or it can be given to a high-privilege mode-control component
that can atomically switch between schedules.

Updating a running system to tweak an existing schedule or add a new mode at
runtime would also become possible.

## Guide-level explanation

At a conceptual level, we propose to mostly keep the current domain scheduler
implementation as it is: a compile-time sized array of entries consisting of
domain and duration. Instead of representing a single schedule as before, the
array now represents multiple schedules. The new components are:

- entries with the value `(0, 0)` are end markers for a domain schedule
- there is a new kernel global that contains the start index for the current
  domain schedule

### Example

The following diagram shows an example.

```none
-----------------------------------------------------------------
| (d_0, t_0) | (d_1, t_1) | ... | (0, 0) | (d_x, t_x) | .. | .. |
-----------------------------------------------------------------
  0            ^                  n        n+1

ksDomainStart = 0
ksDomScheduleIdx = 1
```

The example has an overall array of length `config_DomScheduleLength`,
configured at compile time, a current domain start index of 0, and a currently
active domain schedule index of 1. The current entry will run domain `d_1` for
a duration of `t_1 > 0`, and the schedule will keep going until it hits index
`n`, which has an entry with value `(0, 0)`. Instead of running the entry at
index `n`, it proceeds at `ksDomainStart`, i.e. index 0.

To populate the entries, a user thread with the `DomainCap` can invoke the
`setDomainEntry` method to set duration and domain for a specific index,
potentially with duration 0 to create an end marker.

To atomically switch to the second schedule in the example, a user thread with
the `DomainCap` can set `ksDomainStart` to `n+1`, which will start running
domain `d_x` for a duration `t_x`. The schedule will keep going until either
another entry with duration `0` occurs, or the index hits the end of the array.
In both cases, the index of the next entry will again be `ksDomainStart`.
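To make the wrap-around behaviour concrete, the following is a minimal C sketch
of how the kernel could select the next schedule entry when the current
domain's time expires. It is illustrative only, not the proposed kernel
implementation: the typedefs stand in for the kernel's own definitions, and the
function name `nextDomainEntry` and the field name `duration` are assumptions
for this sketch.

```c
/* Illustrative sketch only, not the proposed kernel implementation. */
typedef unsigned long word_t;
typedef unsigned char dom_t;

#define config_DomScheduleLength 256 /* default value proposed in this RFC */

typedef struct {
    dom_t  domain;   /* domain to run */
    word_t duration; /* in timer ticks; 0 marks the end of a schedule */
} dschedule_t;

static dschedule_t ksDomSchedule[config_DomScheduleLength];
static word_t ksDomScheduleIdx; /* index of the currently running entry */
static word_t ksDomainStart;    /* start index of the current schedule */

/* Called when the current domain's time expires: advance to the next entry;
 * an end marker (0, 0) or the end of the array wraps back to ksDomainStart.
 * The invariant schedule[ksDomainStart].duration != 0 guarantees that the
 * resulting entry is runnable. */
static void nextDomainEntry(void)
{
    word_t next = ksDomScheduleIdx + 1;
    if (next >= config_DomScheduleLength || ksDomSchedule[next].duration == 0) {
        next = ksDomainStart;
    }
    ksDomScheduleIdx = next;
}
```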
### Conditions and Invariants

For both operations, setting the domain start and setting the value of an
entry, the kernel will prevent the user from creating a schedule where the
entry at `ksDomainStart` has duration 0. This would signify an empty domain
schedule. It would be possible to give this situation a useful meaning: not
switching domains until a new schedule is set. However, implementing it would
be more invasive than the currently proposed changes, and the same effect can
already be achieved with a one-element schedule of long duration.

Apart from the value `(0, 0)`, the kernel will prevent the creation of entries
with duration 0. The duration is in timer ticks, and on MCS the duration at the
API level must be `>= MIN_BUDGET`.

The kernel initialises with an array where all entries are `(0, 0)`, apart from
the entry at index 0, which will run domain 0 for the maximum expressible time.

The kernel will preserve the following invariants:

- `ksDomainStart < config_DomScheduleLength`
- `ksDomScheduleIdx < config_DomScheduleLength`
- `schedule[ksDomainStart].duration != 0`
- for all `i` with `0 <= i < config_DomScheduleLength`,
  if `schedule[i].duration = 0` then `schedule[i].domain = 0`
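As an illustration, the checks implied by these conditions could be collected
in a single validation function along the following lines. This is a hedged
sketch, not kernel code: the function name `validDomainEntryUpdate` and its
factoring are assumptions, while `CONFIG_NUM_DOMAINS`, `MIN_BUDGET`, and
`CONFIG_KERNEL_MCS` are existing kernel configuration names. It reuses the
types and globals from the sketch above.

```c
/* Sketch of the validity check for setDomainEntry; not the proposed code. */
#ifndef CONFIG_NUM_DOMAINS
#define CONFIG_NUM_DOMAINS 16 /* stand-in for the kernel's config value */
#endif

typedef int bool_t;

static bool_t validDomainEntryUpdate(word_t index, dom_t domain, word_t duration)
{
    if (index >= config_DomScheduleLength || domain >= CONFIG_NUM_DOMAINS) {
        return 0; /* seL4_RangeError */
    }
    if (duration == 0 && domain != 0) {
        return 0; /* seL4_InvalidArgument: end markers must be (0, 0) */
    }
    if (duration == 0 && index == ksDomainStart) {
        return 0; /* seL4_InvalidArgument: would empty the current schedule */
    }
#ifdef CONFIG_KERNEL_MCS
    if (duration != 0 && duration < MIN_BUDGET) {
        return 0; /* seL4_RangeError on MCS */
    }
#endif
    return 1;
}
```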
## Reference-level explanation

### Invocations

The current invocations of the `DomainCap` remain as they are.
There are two new invocations:

#### Set_Domain_Start

`static inline int seL4_Set_Domain_Start(seL4_DomainSet _service, seL4_Word index)`

Set the start index of the current domain schedule. The schedule entry at this
index must not be the end marker `(0, 0)`. The domain at the schedule entry
with the specified index will start running immediately after the kernel call
completes.

| Type | Name | Description |
| ----------- | ---------- | ----------------------------------------------- |
| `seL4_DomainSet` | `_service` | Domain capability to authorise the call |
| `seL4_Word` | `index` | The new start index. Must not point to `(0, 0)` and must be smaller than `config_DomScheduleLength` |

The function returns `seL4_NoError` on success. The following error codes are
returned otherwise:

| Error | Possible cause |
| ---------------------- | ----------------------------------------------- |
| `seL4_InvalidCapability` | the provided capability is not the domain capability |
| `seL4_InvalidArgument` | the entry at the provided index is an end marker |
| `seL4_RangeError` | the index is not less than `config_DomScheduleLength` |

#### Set_Domain_Entry

`static inline int seL4_Set_Domain_Entry(seL4_DomainSet _service, seL4_Word index, seL4_Uint8 domain, seL4_Word duration)`

Set an entry in the domain schedule at the specified index to the specified
domain and duration. If the duration is 0, the domain must also be 0 to
indicate an end marker. On MCS, the duration must be `>= MIN_BUDGET`. If the
index is the current schedule start index, the entry must not be an end marker
`(0, 0)`.

The change to the schedule takes effect when the entry is next read by the
kernel. In particular, when the duration or the domain of the currently running
schedule entry is changed, the change will only take effect after the current
domain time slice has expired and the schedule reaches the current index again.

Section [Initialisation](#initialisation) contains a scenario where setting the
entry at the currently running index may be useful, but generally one should
avoid updating the currently running schedule. Instead, create new schedules
beyond the current end marker and then set the schedule start once the system
is ready to switch.

| Type | Name | Description |
| ----------- | ---------- | ----------------------------------------------- |
| `seL4_DomainSet` | `_service` | Domain capability to authorise the call |
| `seL4_Word` | `index` | The index of the entry to set. Must be smaller than `config_DomScheduleLength`. |
| `seL4_Uint8` | `domain` | The domain of the schedule entry. Must be smaller than `CONFIG_NUM_DOMAINS`. Must be 0 if the duration is 0. |
| `seL4_Word` | `duration` | The duration for the entry. On MCS, must be 0 or `>= MIN_BUDGET`. |

The function returns `seL4_NoError` on success. The following error codes are
returned otherwise:

| Error | Possible cause |
| ---------------------- | ----------------------------------------------- |
| `seL4_InvalidCapability` | the provided capability is not the domain capability |
| `seL4_InvalidArgument` | the index is the current domain start index and the duration is 0, or the duration is 0 but the domain is not 0 |
| `seL4_RangeError` | the index is not less than `config_DomScheduleLength`, the domain is not less than `CONFIG_NUM_DOMAINS`, or, on MCS, the duration is non-zero and less than `MIN_BUDGET` |
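To show the intended calling pattern, the following hedged sketch builds a
second schedule behind the current end marker and then switches to it
atomically. It assumes the new libsel4 stubs exist with the signatures given
above; `domain_cap`, `n`, and the concrete domains and durations are
illustrative only.

```c
#include <sel4/sel4.h>

/* Sketch only: assumes the stubs proposed in this RFC. domain_cap is a
 * hypothetical capability slot holding the DomainCap; n is the index of
 * the current schedule's end marker. */
void switch_to_new_schedule(seL4_DomainSet domain_cap, seL4_Word n)
{
    int err;

    /* Build the new schedule beyond the current end marker at index n. */
    err = seL4_Set_Domain_Entry(domain_cap, n + 1, 1, 10000); /* (d_x, t_x) */
    err = seL4_Set_Domain_Entry(domain_cap, n + 2, 2, 10000);
    err = seL4_Set_Domain_Entry(domain_cap, n + 3, 0, 0);     /* end marker */

    /* Atomic switch: the entry at n + 1 starts running immediately after
     * this call completes. */
    err = seL4_Set_Domain_Start(domain_cap, n + 1);
    (void)err; /* error handling elided for brevity */
}
```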
### Configuration Options

The new config option `config_DomScheduleLength` determines the static size of
the overall domain schedule array, and thereby the longest domain schedule that
can be configured at runtime. The default value for `config_DomScheduleLength`
is 256.

### Initialisation

At system startup, the array contains two active entries: entry 0 with domain 0
and a long duration, and entry 1 with the end marker `(0, 0)`. With the
following scheme it is possible to use the full length of the array even though
these two entries are already in use.

At kernel boot time, given a user-provided schedule `[(d_0, t_0), (d_1, t_1),
...]` that satisfies the requirements [described
above](#conditions-and-invariants), the root task could install the provided
schedule as follows:

1. First, set up the rest of the system as before, including starting all
   threads.
2. Set all `(d_i, t_i)` according to the schedule where `i > 1`.
3. Then, overwrite the two active schedule entries 0 and 1 with
   `(d_0, t_0)` and `(d_1, t_1)`.
4. Set the schedule start to the desired start value, e.g. index 0.
5. Suspend/stop the initialiser.

With this, the first run of domain 0 after the initialiser will not get its
full time slice, because step 5 already runs in the user-provided schedule, but
after that, the user-provided schedule will be in force.

This works if the initialiser can finish its work within the duration of entry
0. Since that duration is set to the maximum expressible time, this should in
practice never be an issue. Even if the time is not sufficient, the procedure
will still work unless the domain time of the initialiser happens to expire
during the execution of step 3. If that is a possibility, the initialiser could
include a suitable delay before step 3 to make this impossible.

The reason the scheme works is that the kernel will not act on new values in
the schedule before the current domain slice has expired, whereas setting the
start index in step 4 takes effect immediately.

### capDL initialiser

The capDL initialiser will change to also initialise the domain schedule if one
is provided.

The schedule is provided in a separate new section of the capDL input
specification, specifying the schedule as a comma-separated list of pairs
(domain, duration). If no end marker is provided at the end of the list, an
implicit end marker is assumed.
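The concrete syntax of this section is not fixed by this RFC. Purely as an
illustration, assuming a hypothetical `domain_schedule` block and the pair
notation above, such a section might look like:

```none
domain_schedule {
  (0, 10000), (1, 5000), (2, 5000), (0, 0),
  (1, 20000), (2, 20000)
}
```

Here the first schedule ends with an explicit end marker `(0, 0)`, while the
second schedule relies on the implicit end marker at the end of the list.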
## Drawbacks

The main drawback is that this is a breaking change for users. Setting and
initialising a domain schedule now works differently, and build scripts may
need to be updated to no longer generate/add the former `.h` file that
contained the schedule.

In theory a tool could be provided that converts current `.h` file schedules
into the capDL sections, but since `.h` files can contain anything, we are not
proposing such a tool as part of this RFC.

Tests for the domain scheduler in sel4test will need to be changed to create
the test schedules at initialisation time instead of expecting them to be
compiled in.

Tools and applications that do not use the domain scheduler should be
unaffected.

## Rationale and alternatives

The proposed design was chosen for its conceptual simplicity, flexibility,
implementation simplicity, and low proof impact. There are no changes to how
domains are currently treated in the implementation at runtime or in the
proofs for information flow. The current information flow proofs should be
fully preserved in systems that no longer have the domain capability, which is
already assumed in the current proofs.

The two extreme choices in the design space would be to leave the static
schedule as it is, or to remove the domain scheduler completely. This RFC has
already motivated why the former is not a good option. Removing the domain
scheduler from the kernel completely and delegating partition scheduling to
user level would invalidate the information flow theorems for seL4 without
even a conceptual formal remedy for the statement of these theorems. Isolation
would no longer be a property of seL4.

Other points in the design space that have been considered are the following.

### Batched setting of the domain schedule

The current proposal requires multiple kernel calls to set a new schedule --
one for each entry in the new domain schedule. An alternative would be to
batch multiple such calls and collect an entire schedule or larger parts of a
schedule in the IPC buffer. However, setting new schedules, as opposed to
switching between existing schedules, is unlikely to be a performance-critical
operation. That means the added complexity of a batched API with
variable-length arguments brings no tangible benefit for system performance or
even user convenience.

### Restricting to two schedules

Instead of using one array with end markers, the domain schedule could be
represented as two (or more) separate arrays: one active, running schedule, and
one (or more) inactive schedules that can be edited. The representation
proposed in this RFC achieves the same overall API with more flexibility and
less storage.

### Allowing zero-length schedules

The API proposed in this RFC forbids schedules of length zero -- these are
schedules where the start directly points to an end marker. An alternative
would be to allow such schedules and to treat all entries with duration 0 as
end markers. A zero-length schedule with an end marker for a domain `d` would
then mean that the kernel will stay in domain `d` indefinitely until a new
schedule is set, practically disabling domain scheduling at runtime. The RFC
did not choose this option, because a one-element schedule with a very long
domain time (e.g. years) already achieves the same result in practice and has
a simpler, more efficient implementation. Allowing zero-length schedules would
require additional state to be checked any time the kernel reduces the
remaining domain time.

### Point of activation of setting domain/duration

In the proposed API, domain schedule *switches* take effect immediately, but
setting the domain and duration of an entry in the currently running schedule
takes effect only the next time the schedule wraps around to the current
index.

This behaviour has the simpler implementation and is also simpler
conceptually. Making domain/duration changes take effect immediately would
mean that the system call might have to switch immediately to a new domain
when either the new domain does not match the current domain or the new
duration is shorter than the already expired domain time. Either would mean
that the kernel has already violated the new domain schedule and is attempting
to catch up and rectify the violation. If the semantics is instead that the
change takes effect the next time the index is visited, the kernel never needs
to violate the schedule.

The behaviour difference is largely theoretical -- the recommended use of the
API is to never edit the current schedule, but instead to edit a new schedule
and switch to that schedule atomically.

## Prior art

Most separation kernels provide some form of configurable static scheduling.

The motivation for the original domain scheduler design in seL4 was an
explicit customer/user request for a static separation kernel configuration of
seL4 with static scheduling in a security and information flow context. This
also motivated the current form of the information flow theorem. For the
application at the time, the current API was sufficient, but for a more
general use case it is too restrictive and has too much build system impact,
because it needs recompilation for schedule changes.

Another application area with similar requirements is aviation, in particular
the ARINC 653 standard, which would be better supported by this RFC than by
the current implementation.

This RFC is not attempting to provide a full implementation of any particular
standard; it merely aims to make the existing API more usable and to improve
the overall developer experience on seL4.

## Unresolved questions

The API is expressed in terms of timer ticks in expectation of an additional
upcoming RFC to change the rest of the API to timer ticks as well. Depending on
the outcome of that discussion, the API could also be in terms of `time_t`
instead if required for consistency.