missing offset metrics in the kernel and prometheus node exporter #2202

@anarcat

We monitor our servers with the Prometheus node exporter which provides a node_timex_offset_seconds metric, itself derived from the kernel's timex.offset field (see timex.h).

As far as I can tell, ntpsec and timesyncd update that field, but neither chrony nor ntpd-rs does. Or at least the node exporter systematically reports this field as being zero. I'm not sure what the value should be, since it is of course server-dependent (at least as far as I understand it), but I'm pretty sure the offset is never exactly zero.
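
To cross-check the kernel value independently of the node exporter, the field can be read directly with a read-only `adjtimex(2)` call. The following is a minimal sketch, assuming 64-bit Linux with glibc; the `Timex` layout is transcribed from `struct timex`, and the helper names `offset_seconds` and `read_kernel_offset` are made up for illustration. The raw `offset` is in microseconds unless the `STA_NANO` status bit is set, in which case it is in nanoseconds; as far as I understand, the node exporter applies a similar conversion before exporting `node_timex_offset_seconds`.

```python
import ctypes
import ctypes.util
import os

STA_NANO = 0x2000  # status bit: offset is in nanoseconds instead of microseconds


class Timeval(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


class Timex(ctypes.Structure):
    # Field order follows struct timex (assumption: 64-bit Linux, glibc).
    _fields_ = [
        ("modes", ctypes.c_uint),
        ("offset", ctypes.c_long),
        ("freq", ctypes.c_long),
        ("maxerror", ctypes.c_long),
        ("esterror", ctypes.c_long),
        ("status", ctypes.c_int),
        ("constant", ctypes.c_long),
        ("precision", ctypes.c_long),
        ("tolerance", ctypes.c_long),
        ("time", Timeval),
        ("tick", ctypes.c_long),
        ("ppsfreq", ctypes.c_long),
        ("jitter", ctypes.c_long),
        ("shift", ctypes.c_int),
        ("stabil", ctypes.c_long),
        ("jitcnt", ctypes.c_long),
        ("calcnt", ctypes.c_long),
        ("errcnt", ctypes.c_long),
        ("stbcnt", ctypes.c_long),
        ("tai", ctypes.c_int),
        ("_reserved", ctypes.c_int * 11),
    ]


def offset_seconds(raw_offset, status):
    """Convert the raw timex.offset to seconds, honoring STA_NANO."""
    divisor = 1e9 if status & STA_NANO else 1e6
    return raw_offset / divisor


def read_kernel_offset():
    """Return (clock_state, offset_in_seconds) from a read-only adjtimex call."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    tx = Timex()  # zero-initialized; modes == 0 means "query only, change nothing"
    state = libc.adjtimex(ctypes.byref(tx))
    if state == -1:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return state, offset_seconds(tx.offset, tx.status)
```

Calling `read_kernel_offset()` on a host needs no special privileges because nothing is modified. If it also returns an offset of exactly zero while the NTP daemon reports a non-zero offset, that would suggest the daemon is not passing the value to the kernel, rather than the node exporter misreading it.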

Here is, for example, the output of ntp-ctl on my home server:

anarcat@marcos:~$ ntp-ctl  status
Synchronization status:
Dispersion: 0.000947s, Delay: 0.029867s
Stratum: 5

Sources:
ntpd-rs.pool.ntp.org:123/51.161.47.242:123 (1): -0.000387±0.001606(±0.044428)s
    poll interval: 256s, missing polls: 0
    root dispersion: 0.000015s, root delay:0.000015s
ntpd-rs.pool.ntp.org:123/216.128.178.20:123 (2): +0.002955±0.002456(±0.030494)s
    poll interval: 1024s, missing polls: 0
    root dispersion: 0.008835s, root delay:0.019379s
ntpd-rs.pool.ntp.org:123/208.81.1.244:123 (3): +0.009297±0.001234(±0.050555)s
    poll interval: 1024s, missing polls: 0
    root dispersion: 0.023193s, root delay:0.033997s
ntpd-rs.pool.ntp.org:123/167.160.187.12:123 (4): +0.002160±0.001569(±0.028479)s
    poll interval: 1024s, missing polls: 0
    root dispersion: 0.000763s, root delay:0.001389s

Servers:

Here, I think the "offset" should be 0.000947s, or about 947µs. But I don't actually know: this interface in the Linux kernel is not well documented, to say the least...

Is it possible that ntpd-rs is not reporting those numbers correctly to the linux kernel?

In our case, we use the following alerts to monitor for clock errors on our servers (they currently never fire when using ntpsec). We worry that one of those alerts would stop working and fail to detect certain error conditions:

  - alert: HostClockSkew
    expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Host clock skew on {{ $labels.alias }}"
      description: |
        The kernel's clock is skewed by more than 0.05s ({{ $value | humanizeDuration }})
        and continuing to drift. Ensure NTP is configured correctly on {{ $labels.alias }}.
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#host-clock-desynchronized"

  - alert: HostClockNotSynchronizing
    expr: (min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "NTP is failing to synchronize the host clock on {{ $labels.alias }}"
      description: |
        Clock not synchronising. Ensure NTP is configured and running properly on {{ $labels.alias }}.
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#host-clock-desynchronized"
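
To make the concern concrete, here is a rough sketch of the boolean logic behind the two expressions above, written as plain functions. It assumes the `deriv(...)` result is passed in as a precomputed slope and leaves out the `node_uname_info` join, which only adds labels; the function names are hypothetical.

```python
def host_clock_skew(offset, slope, threshold=0.05):
    """HostClockSkew: offset beyond the threshold and not moving back toward zero."""
    return (offset > threshold and slope >= 0) or (offset < -threshold and slope <= 0)


def host_clock_not_synchronizing(min_sync_status, maxerror, limit=16):
    """HostClockNotSynchronizing: unsynchronized for the window and maxerror at the limit."""
    return min_sync_status == 0 and maxerror >= limit


# With the offset pinned at zero, the skew alert can never fire,
# whatever the real clock error is.
print(host_clock_skew(0.0, 0.0))              # False
print(host_clock_skew(0.2, 0.001))            # True
print(host_clock_not_synchronizing(0, 16))    # True
```

If `node_timex_offset_seconds` stays at zero, `HostClockSkew` is effectively disabled, while `HostClockNotSynchronizing` still depends on the separate sync status and maxerror fields.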

We've reviewed ntpd-rs as a replacement for timesyncd and ntpsec (and also compared it with chrony), and it seems pretty darn good, by the way! See my comment here: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41936#note_3394545
