Skip to content

Conversation

benjirewis
Copy link
Member

@benjirewis benjirewis commented Sep 26, 2025

RSDK-11248:
Adds similar restart checking logic to the logic disabled in viamrobotics/rdk#5324; the logic now restarts viam-agent and all its other subsystems along with viam-server. Refactors hits of restart_status to be more generic (can query multiple restart "properties"). Adds a Property method to the Subsystem interface.

RSDK-11266:
Logs the reason that viam-agent is exiting (either a received signal or a requested restart from app).

RSDK-11900:
Stops logging a stack trace from viam-server on part restart. A stack trace will only be logged from viam-server when shutdown is hanging.

RSDK-11901:
Stops logging any ERROR-level logs for restarts (so no error logs appear in machine settings control card).

Manual testing

Note that I'm leveraging a does_not_handle_needs_restart JSON property exposed on the restart_status endpoint introduced in the associated RDK PR along with a VIAM_AGENT_HANDLES_NEEDS_RESTART environment variable set by agent. The expected behavior right now is:

Old viam-agent New viam-agent
Old viam-server viam-server should handle restart checking viam-server should handle restart checking
New viam-server viam-server should handle restart checking viam-agent should handle restart checking

I've tested the above matrix on both Linux and Windows, and the behavior is as-expected. I've also ensured that I haven't broken the existing "restart allowed" logic.


connMu sync.RWMutex
conn rpc.ClientConn
client pb.AgentDeviceServiceClient
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this was a pointless field to store on the manager; creating a gRPC client on top of conn above through which to call DeviceAgentConfig requires no actual, blocking work. It was confusing to store this variable on the struct and check its existence to see if we had dialed already.


if needRestart || needRestartConfigChange || m.viamServerNeedsRestart || m.viamAgentNeedsRestart {
if m.viamServer.(viamserver.RestartCheck).SafeToRestart(ctx) {
if m.viamServer.Property(ctx, viamserver.RestartPropertyRestartAllowed) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the RestartCheck interface, and instead added a Property method to the Subsystem interface. I think it's a bit easier to read. I also moved logging about the result of querying restart_allowed to this file (line below this).

// The minimal (and default) interval for checking for config updates via DeviceAgentConfig.
minimalDeviceAgentConfigCheckInterval = time.Second * 5
// The minimal (and default) interval for checking whether agent needs to be restarted.
minimalNeedsRestartCheckInterval = time.Second * 1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have two background check goroutines: one checks for a new config every 5s (existing), and another checks for a restart every 1s (new). Each check can receive a new, different interval from the app call, so they need to be running at different cadences in different goroutines. You'll also notice that I renamed some interval variable names in this file to be more specific as to which "interval" they were associated with.

slowWatcher, slowWatcherCancel := goutils.SlowGoroutineWatcher(
stopAllTimeout, "Agent is taking a while to shut down,", m.logger)
stopAllTimeout,
fmt.Sprintf("Viam agent subsystems and/or background workers failed to shut down within %v", stopAllTimeout),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[drive-by] This log was getting output after agent shutdown timed out, so the message was slightly inaccurate.

}
// As with the device agent config check interval, randomly fuzz the interval by
// +/- 5%.
timer.Reset(utils.FuzzTime(needsRestartCheckInterval, 0.05))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were doing this for the config check interval, too... I'm not sure why; anyone know?

HealthCheck(ctx context.Context) error

// Property gets an arbitrary property about the running subystem.
Property(ctx context.Context, property string) bool
Copy link
Member Author

@benjirewis benjirewis Sep 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the new method I added. I realize the interface here is meant to reveal a limited API from manager -> subsystems, but I found that the manager truly needed to know a couple "properties" of the running viamserver subsystem (whether restart was currently allowed and whether viamserver was already handling restart checking logic), so I thought this was worth adding despite it opening a pretty generic API to subsystems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think the interface duck typing that's already there was a mistake and we should just make viamserver.viamServer public, put it directly in the manager struct as a pointer, and remove all the interface casting. I also don't think it's worth holding up this PR though so I'm in favor of just continuing as-is and I can make a follow up PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it entirely is also fair; filed https://viam.atlassian.net/browse/RSDK-12263.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved most the code that attempted to find whether restart was allowed here, and I extended it to also be able to check whether viamserver handled restart checking logic. Some of the concurrent code could use an extra pair of eyes since it's slightly different than before; my manual testing seems to imply it works fine.

RestartAllowed bool `json:"restart_allowed"`
// DoesNotHandleNeedsRestart represents whether this instance of the viamserver does
// not check for the need to restart against app itself and, thus, needs agent to do so.
// Newer versions of viamserver (>= v0.9x.0) will report true for this value, while
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you're curious, I don't think we can check the semantic version string of viamserver to know this information. That string is not always a semantic version (could be stable, customURL..., etc.), so I felt that the best way to know was to directly query something on the running viam-server.

goodBytes = bytes.Equal(shasum, verData.UnpackedSHA)
} else {
c.logger.Warn(err)
} else if verData.UnpackedPath != "" { // custom file:// URLs with have an empty unpacked path; no need to warn
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[drive-by] Logging fix related to comment.

}
// Only log errors from Wait() or the exit code of the process state if subsystem
// exited unexpectly (was not stopped by agent and is therfore still marked as
// shouldRun).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were getting unnecessary error logs here when restarting. A non-zero exit code when we sent SIGQUIT to viam-server is expected.

fallthrough
case syscall.SIGTERM:
globalLogger.Info("exiting")
globalLogger.Infof("Signal %s was received. %s will now exit to be restarted by service manager",
Copy link
Member Author

@benjirewis benjirewis Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change, and the addition and usage of reason for the Exit method in manager.go are for RSDK-11266. Hopefully these messages will be useful for debugging.

@benjirewis benjirewis changed the title RSDK-11248 Add restart checking logic RSDK-11248 RSDK-11266 Add restart checking logic Sep 29, 2025
@benjirewis benjirewis requested review from cheukt and jmatth September 29, 2025 15:59
@benjirewis benjirewis marked this pull request as ready for review September 29, 2025 15:59
// set before starting viam-server to indicate that agent is a new enough version to
// have its own background loop that runs NeedsRestart against app.viam.com to determine
// if the system needs a restart. MUST be kept in line with the equivalent value in the
// rdk repo.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now continue to handle checking NeedsRestart in a background goroutine in the server if VIAM_AGENT_HANDLES_NEEDS_RESTART_CHECKING is unset (newer versions of agent will set it unconditionally when launching viam-server). It was determined offline that we do want to support old-agent/new-server setups for a while, and eventually completely remove needs restart checking functionality from the rdk with RSDK-12057.

Thanks to @jmatth for the env var idea 🎉 .

Copy link
Member Author

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some manual testing to ensure I didn't break "restart allowed" logic (turns out it wasn't even working on main for Windows) and that Windows behaved as expected for the feature (across old/new agent/server versions).

I'm also going to add some unit tests to this PR.

s.checkURL = matches[1]
s.checkURLAlt = strings.Replace(matches[2], "0.0.0.0", "localhost", 1)
s.logger.Infof("viam-server restart allowed check URLs: %s %s", s.checkURL, s.checkURLAlt)
s.checkURLAlt = strings.Replace(matches[2], "0.0.0.0", "127.0.0.1", 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to 127.0.0.1 so checks to checkURLAlt would work on Windows (localhost does not exist on Windows) as well as Linux. Restart allowed checks could already work with the checkURL (this was something like https://vader-main.ydbecmp2sz.local.viam.cloud:8080).

switch property {
case RestartPropertyRestartAllowed:
if !s.running {
// Assume agent can restart viamserver if the subsystem is not running.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to also return true unconditionally on Windows, meaning a maintenance sensor returning false or an active reconfiguration would not stop viam-agent from restarting viam-server for an update on Windows. That's no longer the case, and both of those things will prohibit viam-agent with these changes from restarting viam-server (the WARN log viam-server has NOT allowed a restart; will NOT restart will be output).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for always returning true on Windows? I assume it was working around some weird behavior and want to make sure we're not causing a regression

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coding true was introduced in this PR. I'll check to see if that's no longer relevant, but restart_allowed seems to function fine in my manual testing on Windows.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put the hard-coding back for now and filed https://viam.atlassian.net/browse/RSDK-12271 to re-enable at some point in the future. Re-enabling a feature that caused a lot of Windows issues previously as part of this already-large PR is probably too large of a risk.

@benjirewis benjirewis changed the title RSDK-11248 RSDK-11266 Add restart checking logic RSDK-11248 RSDK-11266 RSDK-11901 Add restart checking logic Oct 2, 2025
@benjirewis benjirewis changed the title RSDK-11248 RSDK-11266 RSDK-11901 Add restart checking logic RSDK-11248 RSDK-11266 RSDK-11900 RSDK-11901 Add restart checking logic Oct 2, 2025
Copy link
Member

@cheukt cheukt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending tests

@benjirewis benjirewis force-pushed the update-restart-button-rsdk-11248 branch from 5ea76ae to b5d1d6f Compare October 13, 2025 17:35
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some unit tests for checkRestartProperty, lmk what you think @jmatth .

Comment on lines 216 to 217
globalLogger.Infof("Signal %s was received. %s will now exit to be restarted by service manager",
sig, agent.SubsystemName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use structured logging instead of format strings for any parameterized logs. agent.SubsystemName is a bit weird here because it's just a constant so doesn't need to be a log field but formatting it into the log message with fmt.Sprintf and to then be passed to the logger method is annoying. If you'd prefer to just make it another parameter for convenience I'd be fine with it.

The following is a code block instead of a suggestion because Github's code review tool is bad and I can't figure out how to tell it to apply a suggestion to two lines instead of one.

				globalLogger.Infof("Signal %s was received. %s will now exit to be restarted by service manager",
					sig, agent.SubsystemName)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do. I believe you just copied the existing line into that code block but I get the gist.

HealthCheck(ctx context.Context) error

// Property gets an arbitrary property about the running subystem.
Property(ctx context.Context, property string) bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I think the interface duck typing that's already there was a mistake and we should just make viamserver.viamServer public, put it directly in the manager struct as a pointer, and remove all the interface casting. I also don't think it's worth holding up this PR though so I'm in favor of just continuing as-is and I can make a follow up PR.

switch property {
case RestartPropertyRestartAllowed:
if !s.running {
// Assume agent can restart viamserver if the subsystem is not running.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason for always returning true on Windows? I assume it was working around some weird behavior and want to make sure we're not causing a regression

wg.Wait()
})

s.mu.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] If we actually need to take the lock I'd just do

s.mu.Lock()
t.Cleanup(s.mu.Unlock)

to avoid having to keep track of the lock/unlock calls

@benjirewis benjirewis requested a review from jmatth October 15, 2025 16:27
Copy link
Member

@jmatth jmatth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the concurrency in restart_properties will do the right thing but dislike that it means every potential URL now has to produce an error or nil before we check for a successful response and think the double channels will make the code hard to maintain. I threw together a version that returns an error or result per URL over a single channel here: jmatth@2e8d8b5. I'd prefer that but don't want to hold up this PR any more so am approving now and am fine merging either version of the code.

Also thank you for writing tests for that code, I'm pretty sure I noticed a bug in the old version when I was comparing the two so they were definitely needed 😬

@benjirewis
Copy link
Member Author

I threw together a version that returns an error or result per URL over a single channel here: jmatth@2e8d8b5.

Discussed offline: I put that commit in and one commit on top of it with some "fixes." Thanks! It's a cool Rust-like pattern.

@benjirewis benjirewis merged commit 9f199dd into viamrobotics:main Oct 16, 2025
3 checks passed
@benjirewis benjirewis deleted the update-restart-button-rsdk-11248 branch October 16, 2025 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants