missing offset metrics in the kernel and prometheus node exporter

We monitor our servers with the [Prometheus node exporter](https://github.com/prometheus/node_exporter/), which provides a `node_timex_offset_seconds` metric, itself derived from the kernel's `timex.offset` field (see [timex.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/timex.h)).

As far as I can tell, ntpsec and timesyncd update that field, but neither chrony nor ntpd-rs does. Or at least, with those two, the node exporter systematically reports this field as zero. I'm not sure *what* value it should be, since this is of course server-dependent (at least as far as I understand it), but I'm pretty sure the offset is never exactly zero.
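To rule out the node exporter itself, the field can be read straight from the kernel with a read-only `adjtimex(2)` call. Here is a minimal Python sketch of that, assuming a 64-bit Linux host (so `c_long` matches `__kernel_long_t`) and a libc that exports `adjtimex`. The helper names `offset_to_seconds` and `read_kernel_offset` are mine, not from any project. As far as I understand it, `offset` is in microseconds unless the `STA_NANO` status flag is set, in which case it is in nanoseconds.

```python
import ctypes
import os

STA_NANO = 0x2000  # status flag: offset is in nanoseconds, not microseconds


class Timeval(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


class Timex(ctypes.Structure):
    # Field order follows struct timex in include/uapi/linux/timex.h.
    # Assumption: 64-bit Linux, where c_long matches __kernel_long_t.
    _fields_ = [
        ("modes", ctypes.c_uint),
        ("offset", ctypes.c_long),
        ("freq", ctypes.c_long),
        ("maxerror", ctypes.c_long),
        ("esterror", ctypes.c_long),
        ("status", ctypes.c_int),
        ("constant", ctypes.c_long),
        ("precision", ctypes.c_long),
        ("tolerance", ctypes.c_long),
        ("time", Timeval),
        ("tick", ctypes.c_long),
        ("ppsfreq", ctypes.c_long),
        ("jitter", ctypes.c_long),
        ("shift", ctypes.c_int),
        ("stabil", ctypes.c_long),
        ("jitcnt", ctypes.c_long),
        ("calcnt", ctypes.c_long),
        ("errcnt", ctypes.c_long),
        ("stbcnt", ctypes.c_long),
        ("tai", ctypes.c_int),
        ("_reserved", ctypes.c_int * 11),  # trailing padding in the kernel struct
    ]


def offset_to_seconds(raw_offset, status):
    """Convert a raw timex.offset to seconds, honouring the STA_NANO flag."""
    divisor = 1e9 if status & STA_NANO else 1e6
    return raw_offset / divisor


def read_kernel_offset():
    """Query the kernel read-only (modes == 0) and return the offset in seconds."""
    libc = ctypes.CDLL(None, use_errno=True)
    tx = Timex()  # zero-initialised, so modes == 0 and nothing is modified
    if libc.adjtimex(ctypes.byref(tx)) == -1:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return offset_to_seconds(tx.offset, tx.status)
```

Calling `read_kernel_offset()` on a host running each daemon should show whether the kernel field itself stays at zero or whether the zero only appears in the exported metric.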

Here is, for example, the output of `ntp-ctl` on my home server:

```
anarcat@marcos:~$ ntp-ctl  status
Synchronization status:
Dispersion: 0.000947s, Delay: 0.029867s
Stratum: 5

Sources:
ntpd-rs.pool.ntp.org:123/51.161.47.242:123 (1): -0.000387±0.001606(±0.044428)s
    poll interval: 256s, missing polls: 0
    root dispersion: 0.000015s, root delay:0.000015s
ntpd-rs.pool.ntp.org:123/216.128.178.20:123 (2): +0.002955±0.002456(±0.030494)s
    poll interval: 1024s, missing polls: 0
    root dispersion: 0.008835s, root delay:0.019379s
ntpd-rs.pool.ntp.org:123/208.81.1.244:123 (3): +0.009297±0.001234(±0.050555)s
    poll interval: 1024s, missing polls: 0
    root dispersion: 0.023193s, root delay:0.033997s
ntpd-rs.pool.ntp.org:123/167.160.187.12:123 (4): +0.002160±0.001569(±0.028479)s
    poll interval: 1024s, missing polls: 0
    root dispersion: 0.000763s, root delay:0.001389s

Servers:
```

Here, I think the "offset" should be `0.000947s`, or about 947µs. But I don't actually know: this interface in the Linux kernel is not well documented, to say the least...

Is it possible that ntpd-rs is not reporting those numbers correctly to the Linux kernel?

In our case, we use the following alerts to monitor for clock errors on our servers (they currently never fire when using ntpsec). We worry that one of those alerts would stop working and fail to detect certain error conditions:

```yaml
  - alert: HostClockSkew
    expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Host clock skew on {{ $labels.alias }}"
      description: |
        The kernel's clock is skewed by more than 0.05s ({{ $value | humanizeDuration }})
        and continuing to drift. Ensure NTP is configured correctly on {{ $labels.alias }}.
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#host-clock-desynchronized"

  - alert: HostClockNotSynchronizing
    expr: (min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "NTP is failing to synchronize the host clock on {{ $labels.alias }}"
      description: |
        Clock not synchronising. Ensure NTP is configured and running properly on {{ $labels.alias }}.
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#host-clock-desynchronized"
```
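To make the worry concrete, here is a rough Python sketch of the `HostClockSkew` condition. It is only a model of the boolean logic in the expression above: it ignores the `for: 15m` clause and the join on `node_uname_info`, and the function name `host_clock_skew_fires` is hypothetical.

```python
def host_clock_skew_fires(offset, deriv, threshold=0.05):
    """Model of the HostClockSkew expression for a single sample.

    offset: node_timex_offset_seconds
    deriv:  deriv(node_timex_offset_seconds[5m])
    """
    ahead_and_not_recovering = offset > threshold and deriv >= 0
    behind_and_not_recovering = offset < -threshold and deriv <= 0
    return ahead_and_not_recovering or behind_and_not_recovering


# A few (offset, deriv) samples; the first is what a daemon that never
# updates timex.offset would produce.
samples = [(0.0, 0.0), (0.06, 0.001), (-0.06, -0.001), (0.06, -0.001)]
for offset, deriv in samples:
    print(offset, deriv, host_clock_skew_fires(offset, deriv))
```

If the metric is pinned at zero, the first sample is the only case that ever occurs, so this alert can never fire no matter how far off the clock actually is. `HostClockNotSynchronizing` does not look at the offset, so it should not be affected in the same way.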

We've reviewed ntpd-rs as a replacement for timesyncd and ntpsec (and also compared it with chrony), and it seems pretty darn good, by the way! See my comment here: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41936#note_3394545
