Fix bug in 'systemd' adapter timeout-related configuration code #912

@mikej888

lib/ood_core/job/adapters/systemd.rb defines:

      def self.build_systemd(config)
        c = config.to_h.symbolize_keys
        debug = c.fetch(:debug, false)
        max_timeout = c.fetch(:max_timeout, nil)
        ssh_hosts = c.fetch(:ssh_hosts, [c[:submit_host]])
        strict_host_checking = c.fetch(:strict_host_checking, true)
        submit_host = c[:submit_host]
        ssh_keyfile = c.fetch(:ssh_keyfile, "")

        Adapters::LinuxSystemd.new(
          ssh_hosts: ssh_hosts,
          launcher: Adapters::LinuxSystemd::Launcher.new(
            debug: debug,
            max_timeout: max_timeout,
            ssh_hosts: ssh_hosts,
            strict_host_checking: strict_host_checking,
            submit_host: submit_host,
            ssh_keyfile: ssh_keyfile,
          )
        )
      end
    end

Of note is that max_timeout is extracted from config and then passed, as the max_timeout keyword, to the Adapters::LinuxSystemd::Launcher.new constructor.

However, lib/ood_core/job/adapters/systemd/launcher.rb defines the class and constructor as:

class OodCore::Job::Adapters::LinuxSystemd::Launcher
  attr_reader :debug, :site_timeout, :session_name_label, :ssh_hosts,
    :strict_host_checking, :username, :ssh_keyfile

  def initialize(
    debug: false,
    site_timeout: nil,
    ssh_hosts:,
    strict_host_checking: false,
    submit_host:,
    ssh_keyfile: "",
    **_
  )
    @debug = !! debug
    @site_timeout = site_timeout.to_i
    @session_name_label = 'ondemand'
    @ssh_hosts = ssh_hosts
    @strict_host_checking = strict_host_checking
    @submit_host = submit_host
    @username = Etc.getlogin
    @ssh_keyfile = ssh_keyfile
  end

Here, there is no max_timeout parameter in the constructor. Instead, the constructor expects site_timeout. The max_timeout argument passed to the constructor is swallowed, and ignored, by the **_ parameter, so site_timeout keeps its default of nil, which to_i turns into 0.
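The swallowing behaviour can be reproduced in isolation. The following is a minimal sketch only: DemoLauncher is a hypothetical stand-in that copies just the keyword-signature shape of the real Launcher, not its behaviour.

```ruby
# Hypothetical stand-in for the real Launcher: same keyword-signature shape,
# nothing else.
class DemoLauncher
  attr_reader :site_timeout

  def initialize(site_timeout: nil, **_)
    # nil.to_i is 0, so a missing site_timeout silently becomes 0.
    @site_timeout = site_timeout.to_i
  end
end

# max_timeout matches no named keyword, so **_ absorbs it without any error.
swallowed = DemoLauncher.new(max_timeout: 1234)
puts swallowed.site_timeout   # prints 0

# Only a keyword literally named site_timeout reaches the instance variable.
honoured = DemoLauncher.new(site_timeout: 5678)
puts honoured.site_timeout    # prints 5678
```

Without the **_ catch-all, the first call would raise ArgumentError (unknown keyword), which would have surfaced the mismatch immediately.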

Confusing things further, the systemd documentation page (source at systemd.rst in OSC/ood-documentation) states that there is a timeout parameter, which doesn't seem to be supported in code, while its 'Example Cluster Configuration' uses site_timeout.

As things stand, there is no way of setting site_timeout from a cluster configuration. This can be demonstrated by creating a cluster configuration file that uses the 'systemd' adapter and defines values for max_timeout, site_timeout and, for completeness, timeout, e.g.:

---
v2:
  metadata:
    title: "my-host (systemd)"
    hidden: false
  login:
    host: "my-host"
  job:
    adapter: "systemd"
    submit_host: "my-host"
    ssh_hosts:
      - my-host.myorg.ac.uk
    max_timeout: 1234
    site_timeout: 5678
    timeout: 9123
    debug: true
    strict_host_checking: false
    ssh_keyfile: "~/.ssh/id_rsa"

Update lib/ood_core/job/adapters/systemd.rb's self.build_systemd(config) function to print the value of the max_timeout parameter it extracts and passes to the Adapters::LinuxSystemd::Launcher.new constructor, as well as the site_timeout and timeout values present in the configuration:

        puts("DEBUG (max_timeout): " + max_timeout.to_s)
        puts("DEBUG (site_timeout): " + c.fetch(:site_timeout, nil).to_s)
        puts("DEBUG (timeout): " + c.fetch(:timeout, nil).to_s)

Update lib/ood_core/job/adapters/systemd/launcher.rb's constructor to print the value of its site_timeout parameter:

    puts("DEBUG (launcher's site_timeout): " + @site_timeout.to_s)

Running an app that uses this cluster configuration logs the configured values, as expected, from systemd.rb:

App 2862586 output: DEBUG (max_timeout): 1234
App 2862586 output: DEBUG (site_timeout): 5678
App 2862586 output: DEBUG (timeout): 9123

But from launcher.rb:

App 2870202 output: DEBUG (launcher's site_timeout): 0

Defining max_timeout only in the cluster configuration gives:

App 2874507 output: DEBUG (max_timeout): 1234
App 2874507 output: DEBUG (site_timeout): 
App 2874507 output: DEBUG (timeout): 
App 2874507 output: DEBUG (launcher's site_timeout): 0

Defining site_timeout only in the cluster configuration gives:

App 2877021 output: DEBUG (max_timeout): 
App 2877021 output: DEBUG (site_timeout): 5678
App 2877021 output: DEBUG (timeout): 
App 2877021 output: DEBUG (launcher's site_timeout): 0

And defining timeout only in the cluster configuration gives:

App 2879361 output: DEBUG (max_timeout): 
App 2879361 output: DEBUG (site_timeout): 
App 2879361 output: DEBUG (timeout): 9123
App 2879361 output: DEBUG (launcher's site_timeout): 0

In no case is the launcher's site_timeout set: it is always 0, whichever of the three keys is configured.

It looks like this bug is what gave rise to the problems described in the Discourse thread "Systemd adaptor job is not timing out" (2025).

A fix is to:

The tests in test/job/adapters/systemd_test.rb and test/job/adapters/systemd_launcher_test.rb use empty configurations, so it doesn't look like any changes are needed there.
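For illustration, one possible shape for such a fix is sketched below. This is a sketch under assumptions, not the project's code: SketchLauncher and build_launcher_sketch are hypothetical stand-ins for the real launcher and build_systemd, transform_keys replaces symbolize_keys so the snippet runs in plain Ruby, and falling back to max_timeout when site_timeout is absent is an assumption made here rather than something the adapter is known to do.

```ruby
# Hypothetical stand-in for Adapters::LinuxSystemd::Launcher (keywords trimmed).
class SketchLauncher
  attr_reader :site_timeout

  def initialize(site_timeout: nil, ssh_hosts:, submit_host:, **_)
    @site_timeout = site_timeout.to_i
    @ssh_hosts = ssh_hosts
    @submit_host = submit_host
  end
end

# Hypothetical variant of build_systemd that forwards the timeout under the
# keyword the launcher actually declares. The max_timeout fallback is an
# assumption, so that configurations using either key would take effect.
def build_launcher_sketch(config)
  c = config.to_h.transform_keys(&:to_sym) # plain-Ruby stand-in for symbolize_keys
  site_timeout = c.fetch(:site_timeout) { c.fetch(:max_timeout, nil) }

  SketchLauncher.new(
    site_timeout: site_timeout,
    ssh_hosts: c.fetch(:ssh_hosts, [c[:submit_host]]),
    submit_host: c[:submit_host]
  )
end

launcher = build_launcher_sketch({ "submit_host" => "my-host", "site_timeout" => 5678 })
puts launcher.site_timeout  # prints 5678
```

Whichever key name is settled on, the configuration key, the keyword passed by the builder, and the constructor parameter would all need to agree, and the documentation would need to name the same key.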
