Description
lib/ood_core/job/adapters/systemd.rb defines:
  def self.build_systemd(config)
    c = config.to_h.symbolize_keys
    debug = c.fetch(:debug, false)
    max_timeout = c.fetch(:max_timeout, nil)
    ssh_hosts = c.fetch(:ssh_hosts, [c[:submit_host]])
    strict_host_checking = c.fetch(:strict_host_checking, true)
    submit_host = c[:submit_host]
    ssh_keyfile = c.fetch(:ssh_keyfile, "")
    Adapters::LinuxSystemd.new(
      ssh_hosts: ssh_hosts,
      launcher: Adapters::LinuxSystemd::Launcher.new(
        debug: debug,
        max_timeout: max_timeout,
        ssh_hosts: ssh_hosts,
        strict_host_checking: strict_host_checking,
        submit_host: submit_host,
        ssh_keyfile: ssh_keyfile,
      )
    )
  end
end
Of note is that max_timeout is extracted from config and then passed to the Adapters::LinuxSystemd::Launcher.new constructor.
However lib/ood_core/job/adapters/systemd/launcher.rb defines the class and constructor as:
class OodCore::Job::Adapters::LinuxSystemd::Launcher
  attr_reader :debug, :site_timeout, :session_name_label, :ssh_hosts,
    :strict_host_checking, :username, :ssh_keyfile
  def initialize(
    debug: false,
    site_timeout: nil,
    ssh_hosts:,
    strict_host_checking: false,
    submit_host:,
    ssh_keyfile: "",
    **_
  )
    @debug = !! debug
    @site_timeout = site_timeout.to_i
    @session_name_label = 'ondemand'
    @ssh_hosts = ssh_hosts
    @strict_host_checking = strict_host_checking
    @submit_host = submit_host
    @username = Etc.getlogin
    @ssh_keyfile = ssh_keyfile
  end
Here, there is no max_timeout parameter in the constructor. Instead, the constructor expects site_timeout. The max_timeout argument passed to the constructor is swallowed, and ignored, by the **_ parameter.
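The swallowing behaviour can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical stand-in class (DemoLauncher is not part of ood_core) that keeps only the site_timeout keyword and the catch-all from the real constructor:

```ruby
# Hypothetical stand-in for the launcher, reduced to the parameters that matter here.
class DemoLauncher
  attr_reader :site_timeout

  # **_ accepts and discards any keyword that is not named explicitly.
  def initialize(site_timeout: nil, **_)
    # nil.to_i is 0, so an unset site_timeout ends up as 0.
    @site_timeout = site_timeout.to_i
  end
end

ignored = DemoLauncher.new(max_timeout: 1234)   # unknown keyword, silently dropped
honoured = DemoLauncher.new(site_timeout: 5678) # keyword the constructor knows about

puts ignored.site_timeout
puts honoured.site_timeout
```

Without the catch-all, Ruby would raise an ArgumentError for the unknown max_timeout keyword, which would have surfaced the mismatch immediately.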
Confusing things further, the systemd documentation page (source at systemd.rst in OSC/ood-documentation) states that there is a timeout parameter, which doesn't seem to be supported in code, while its 'Example Cluster Configuration' uses site_timeout.
There is no way of setting site_timeout. This can be demonstrated by creating a cluster configuration file that uses the 'systemd' adapter and defines values for max_timeout, site_timeout and, for completeness, timeout, e.g.:
---
v2:
  metadata:
    title: "my-host (systemd)"
    hidden: false
  login:
    host: "my-host"
  job:
    adapter: "systemd"
    submit_host: "my-host"
    ssh_hosts:
      - my-host.myorg.ac.uk
    max_timeout: 1234
    site_timeout: 5678
    timeout: 9123
    debug: true
    strict_host_checking: false
    ssh_keyfile: "~/.ssh/id_rsa"
Update lib/ood_core/job/adapters/systemd.rb's self.build_systemd(config) function to print the value of the max_timeout parameter it extracts and passes to the Adapters::LinuxSystemd::Launcher.new constructor, as well as the values of the other parameters:
puts("DEBUG (max_timeout): " + max_timeout.to_s)
puts("DEBUG (site_timeout): " + c.fetch(:site_timeout, nil).to_s)
puts("DEBUG (timeout): " + c.fetch(:timeout, nil).to_s)
Update lib/ood_core/job/adapters/systemd/launcher.rb's constructor to print the value of its site_timeout parameter:
puts("DEBUG (launcher's site_timeout): " + @site_timeout.to_s)
Running an app that uses this cluster configuration logs, as expected, from systemd.rb:
App 2862586 output: DEBUG (max_timeout): 1234
App 2862586 output: DEBUG (site_timeout): 5678
App 2862586 output: DEBUG (timeout): 9123
But from launcher.rb:
App 2870202 output: DEBUG (launcher's site_timeout): 0
Defining max_timeout only in the cluster configuration gives:
App 2874507 output: DEBUG (max_timeout): 1234
App 2874507 output: DEBUG (site_timeout):
App 2874507 output: DEBUG (timeout):
App 2874507 output: DEBUG (launcher's site_timeout): 0
Defining site_timeout only in the cluster configuration gives:
App 2877021 output: DEBUG (max_timeout):
App 2877021 output: DEBUG (site_timeout): 5678
App 2877021 output: DEBUG (timeout):
App 2877021 output: DEBUG (launcher's site_timeout): 0
And defining timeout only in the cluster configuration gives:
App 2879361 output: DEBUG (max_timeout):
App 2879361 output: DEBUG (site_timeout):
App 2879361 output: DEBUG (timeout): 9123
App 2879361 output: DEBUG (launcher's site_timeout): 0
In no case is site_timeout set: the launcher always falls back to its default of nil, which to_i converts to 0.
It looks like this bug is what gave rise to the problems described in the Discourse thread "Systemd adaptor job is not timing out" (2025).
A fix is to:
- Rename max_timeout to site_timeout in lib/ood_core/job/adapters/systemd.rb.
- Rename timeout and max_timeout to site_timeout in systemd.rst in OSC/ood-documentation.
The tests in test/job/adapters/systemd_test.rb and test/job/adapters/systemd_launcher_test.rb use empty configurations, so it doesn't look like any changes are needed there.
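The rename proposed for systemd.rb can be sketched as follows. This is only an illustration under stated assumptions: SketchLauncher and build_launcher_sketch are hypothetical stand-ins rather than the actual ood_core classes, and transform_keys(&:to_sym) is used in place of symbolize_keys so that the snippet runs without ActiveSupport.

```ruby
# Hypothetical stand-in for Adapters::LinuxSystemd::Launcher, keeping only
# the keywords needed to show the effect of the rename.
class SketchLauncher
  attr_reader :site_timeout

  def initialize(site_timeout: nil, submit_host:, **_)
    @site_timeout = site_timeout.to_i
    @submit_host = submit_host
  end
end

# Sketch of the renamed lookup: read :site_timeout from the configuration
# and pass it on under the name the constructor actually expects.
def build_launcher_sketch(config)
  c = config.to_h.transform_keys(&:to_sym)
  SketchLauncher.new(
    site_timeout: c.fetch(:site_timeout, nil),
    submit_host: c[:submit_host]
  )
end

launcher = build_launcher_sketch({ "submit_host" => "my-host", "site_timeout" => 5678 })
puts launcher.site_timeout
```

With the lookup and the keyword sharing one name, a site_timeout set in the cluster configuration reaches the launcher; a configuration that still uses max_timeout would continue to be ignored, so existing configurations would need the same rename.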