feat(dgw): basic network monitoring #1446


Open
philippelatulippe wants to merge 8 commits into master from network-monitor-config-endpoint

Conversation

@philippelatulippe philippelatulippe commented Jul 30, 2025

Introduces an endpoint monitor that checks the uptime of a list of hosts provided through a remotely-submitted configuration file. The monitor results are saved to a temporary in-memory buffer and can be fetched by means of a REST endpoint.

The monitor system is structured as an agent, meant to be driven by a third party (for example our DVLS), so the configuration is ephemeral and monitor results are deleted from the buffer after being fetched. Gateway is not itself the source of truth for the monitor configuration, and it does not persist the log entries.

Two authenticated endpoints are introduced:

  • POST /jet/monitoring/config
  • POST /jet/monitoring/log/drain

Example requests for testing:

curl -v http://127.0.0.1:7171/jet/monitoring/config \
    --json '{"monitors":[{"id":"monitor1","probe":"ping","address":"localhost","interval":10,"timeout":10}]}' \
    -H "Authorization: Bearer $dgwkey"
curl  http://127.0.0.1:7171/jet/monitoring/log/drain -X POST -H "Authorization: Bearer $dgwkey"

Remaining tasks for a basic scanner:

  • We should parse the ICMP responses
  • Better structuring of errors so that we can have error codes in the JSON, instead of arbitrary strings
  • Upper bound on the number of log entries stored in the VecDeque (see the sketch below)
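
A minimal sketch of such a bounded, drain-on-read buffer, assuming a VecDeque-backed queue (the real LogQueue in this PR may differ):

use std::collections::VecDeque;
use std::sync::Mutex;

pub struct LogQueue<T> {
    entries: Mutex<VecDeque<T>>,
    max_entries: usize,
}

impl<T> LogQueue<T> {
    pub fn new(max_entries: usize) -> Self {
        Self { entries: Mutex::new(VecDeque::new()), max_entries }
    }

    pub fn write(&self, entry: T) {
        let mut entries = self.entries.lock().expect("poisoned");
        if entries.len() == self.max_entries {
            // Drop the oldest entry once the upper bound is reached.
            entries.pop_front();
        }
        entries.push_back(entry);
    }

    /// Results are deleted from the buffer after being fetched.
    pub fn drain(&self) -> Vec<T> {
        let mut entries = self.entries.lock().expect("poisoned");
        entries.drain(..).collect()
    }
}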

Issue: DGW-302

@philippelatulippe philippelatulippe self-assigned this Jul 30, 2025

Let maintainers know that an action is required on their side

  • Add the label release-required when you request a maintainer to cut a new release (Devolutions Gateway, Devolutions Agent, Jetsocat, PowerShell module)

  • Add the label release-blocker if a follow-up is required before cutting a new release

  • Add the label publish-required when you request a maintainer to publish libraries (Devolutions.Gateway.Utils, OpenAPI clients, etc.)

  • Add the label publish-blocker if a follow-up is required before publishing libraries

@philippelatulippe philippelatulippe marked this pull request as ready for review July 30, 2025 10:38
@philippelatulippe philippelatulippe force-pushed the network-monitor-config-endpoint branch from ac51029 to ec20f65 Compare July 30, 2025 10:44
@@ -54,6 +57,7 @@ pub struct DgwState {
pub recordings: recording::RecordingMessageSender,
pub job_queue_handle: job_queue::JobQueueHandle,
pub credential_store: credential::CredentialStoreHandle,
pub monitoring_state: Arc<network_monitor::State>
Author

I avoided my reflex to implement all functionality inside of an object, and instead have an opaque state object that is passed to my functions, since that seemed to be what was done here. Does that make sense?

Contributor

It seems ok to me. Should the network monitor be feature gated?

Member
@CBenoit CBenoit Jul 30, 2025

If you mean adding a compile-time feature flag, then I would suggest we don’t. If you suggest to hide the feature behind the unstable flag, then I agree we could do that so we can tweak the API across the minor releases.

Contributor

I did mean compile-time feature flag. I agree with what you are saying now.

let cancellation_token = CancellationToken::new();
let cancellation_monitor = cancellation_token.clone();

let definition_clone = (*definition).clone(); // TODO: is there a nicer way to do this?
Author

If I remember right, I have to clone definition here because its owner will not outlive the end of the function. But I suspect there might be a way to write the map so that definition is already a clone?

Contributor

You can clone before collecting.

pub async fn set_config(config: MonitorsConfig, state: Arc<State>) -> Result<(), SetConfigError> {
    let file = fs::OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(get_monitor_config_path(&state.cache_path))?;

    let mut config_write = state.config.write().await;

    serde_json::to_writer_pretty(&file, &config)?;

    let old_config = mem::replace(&mut *config_write, config);

    // clone here
    let new_config_set = config_write.monitors.clone().into_iter().collect::<HashSet<_>>();
    let old_config_set = old_config.monitors.clone().into_iter().collect::<HashSet<_>>();

    // `Iterator::cloned` here
    let added = new_config_set.difference(&old_config_set).cloned();
    let deleted = old_config_set.difference(&new_config_set);

    let (new_cancellation_tokens, new_monitors): (Vec<_>, Vec<_>) = added
        .into_iter()
        .map(|definition| {
            let cancellation_token = CancellationToken::new();
            let cancellation_monitor = cancellation_token.clone();

            // my personal preference is to use `Arc::clone`
            // https://rust-lang.github.io/rust-clippy/stable/index.html#/clone_on_ref_ptr
            let state = Arc::clone(&state);
            // clone id before moving definition into the async block
            let definition_id = definition.id.clone();
            let monitor = async move {
                loop {
                    let start_time = Utc::now();

                    let monitor_result = match &definition.probe {
                        ProbeType::Ping => do_ping_monitor(&definition).await,
                        ProbeType::TcpOpen => do_tcpopen_monitor(&definition).await,
                        ProbeType::Unknown(_) => return, // TODO: shouldn't happen, they should be filtered out. Create a separate ProbeType enum without Unknown?
                    };

                    state.log.write(monitor_result);

                    let elapsed = Utc::now() - start_time;
                    let next_run_in = (definition.interval as f64 - elapsed.as_seconds_f64()).clamp(1.0, f64::INFINITY);
                    select! {
                        _ = cancellation_monitor.cancelled() => { () }
                        _ = tokio::time::sleep(Duration::from_secs_f64(next_run_in)) => { () }
                    }
                }
            };
            (
                (definition_id, cancellation_token),
                Box::pin(monitor) as Pin<Box<dyn Future<Output = ()> + Send>>,
            )
        })
        .unzip();

    let mut cancellation_tokens_write = state.cancellation_tokens.lock().await;

    for definition in deleted {
        cancellation_tokens_write[&definition.id].cancel();
        cancellation_tokens_write.remove(&definition.id);
    }

    for (monitor_id, cancellation_token) in new_cancellation_tokens {
        cancellation_tokens_write.insert(monitor_id, cancellation_token);
    }

    for monitoring_task in new_monitors {
        tokio::spawn(monitoring_task);
    }

    Ok(())
}

Author

Hm, what is the magic that would make this work? I've called cloned() (with a d), which I hadn't noticed existed, probably because it looks so much like plain clone(), but once I access definition in the async move closure I get a plethora of borrow errors. Is there a change in your code that I'm missing?

Contributor

The function in my last comment should work as is.

router.with_state(state)
}

#[debug_handler]
Member

issue: This macro is only for debugging purposes, please remove it and the macros feature

Author

The axum guides recommend always adding this macro to get proper compiler diagnostics. It seems convenient to keep those messages on?

Member
@CBenoit CBenoit Jul 30, 2025

I remember it being stated that this is only intended for debugging and should not be used in production, to avoid the compile-time overhead and the loss of other optimization opportunities.
It is nice to get better diagnostics, but it’s also a typical write-once thing, and we’ll be compiling again and again, so I still stand for removing it once you get the correct signature.

Member

I see in the doc that it now does not affect runtime performance when built with --release, so my info may be outdated, but I would still prefer we don’t make the builds heavier.

@@ -115,6 +116,7 @@ etherparse = "0.15"

# For KDC proxy
portpicker = "0.1"
network-monitor = { version = "0.1.0", path = "../crates/network-monitor" }
Member

style: Move to the "In-house" section (this is unrelated to the KDC proxy)

@@ -86,7 +87,7 @@ ngrok = "0.13"
# HTTP
hyper = "1.6"
hyper-util = { version = "0.1", features = ["tokio", "server", "server-auto"] }
axum = { version = "0.8", default-features = false, features = ["http1", "json", "ws", "query", "tracing", "tower-log"] }
axum = { version = "0.8", default-features = false, features = ["http1", "json", "ws", "query", "tracing", "tower-log", "macros"] }
Member

issue: This feature is only for development purposes, let’s not include it in prod.

Suggested change
axum = { version = "0.8", default-features = false, features = ["http1", "json", "ws", "query", "tracing", "tower-log", "macros"] }
axum = { version = "0.8", default-features = false, features = ["http1", "json", "ws", "query", "tracing", "tower-log"] }

Comment on lines 68 to 77
let monitor_result = match definition_clone.probe {
ProbeType::Ping => do_ping_monitor(&definition_clone).await,
ProbeType::TcpOpen => do_tcpopen_monitor(&definition_clone).await,
ProbeType::Unknown(_) => return, // TODO: shouldn't happen, they should be filtered out. Create a separate ProbeType enum without Unknown?
};
Author

When parsing the config, I have an Unknown case in the enum for the sake of forward compatibility. I'd like to filter these out at the start of this function, in order to avoid dealing with the Unknown case here. Can I do that without having to manually write a duplicate of the ProbeType enum? Can I start with an enum with no Unknown case, and then derive a version that has it?

Contributor

Maybe remove the Unknown variant?

If it shouldn't happen, you can handle the error when deserializing.

Author
@philippelatulippe philippelatulippe Jul 30, 2025

It will happen as we introduce new monitor types. In the long term I'd like Gateway to return some JSON-structured warning telling the client that some of its monitor types weren't supported, so I do want them to be represented in the array of monitors. But after filtering them out, I feel like the Unknown case should be gone so that I don't have to handle it in my match. Maybe I can compose my ProbeType enum somehow?

Member
@CBenoit CBenoit Jul 31, 2025

Yes, I think you could have two different enums for that…!

#[derive(Deserialize)]
#[serde(untagged)]
enum ProbeType {
    Known(KnownProbeType),
    Unknown(String),
}

#[derive(Deserialize)]
#[serde(rename_all = "snake_case")]
enum KnownProbeType {
    Ping,
    TcpOpen,
}

This cleanly separates the supported values (KnownProbeType) from fallbacks (Unknown(String)), letting serde match any unrecognized string to the Unknown variant.

I’ve used a similar pattern in the token module:

#[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq)]
#[serde(untagged)]
pub enum ApplicationProtocol {
    Known(Protocol),
    Unknown(SmolStr),
}

But, for maximum flexibility, I think you should defer the parsing:

  • Deserialize into a more permissive intermediate type (e.g. String or serde_json::Value)
  • Convert to a strongly typed value using serde_json::from_value (or a manual match)
  • Gracefully handle invalid or unknown values (log them at debug! level and proceed)

This pattern helps keep your data models lenient at the edges while maintaining strong typing internally, which seems to be exactly what you need.

In fact, I’ve preferred this approach in similar contexts; I think it often results in cleaner separation between raw input and trusted data.
You might find some inspiration in how the preflight API module handles this:
https://github.com/Devolutions/devolutions-gateway/blob/master/devolutions-gateway/src/api/preflight.rs
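
For illustration, a minimal sketch of that deferred-parsing shape (type and field names below are hypothetical, not taken from this PR; the probe strings follow the camelCase renaming used by the existing ProbeType):

// Permissive wire type: any probe string survives deserialization.
#[derive(serde::Deserialize)]
struct RawMonitorDefinition {
    id: String,
    probe: String,
    address: String,
}

// Strict internal type: no Unknown variant, so the monitor loop never has to handle it.
enum KnownProbeType {
    Ping,
    TcpOpen,
}

fn parse_probe(raw: &RawMonitorDefinition) -> Option<KnownProbeType> {
    match raw.probe.as_str() {
        "ping" => Some(KnownProbeType::Ping),
        "tcpOpen" => Some(KnownProbeType::TcpOpen),
        // Unknown probe types are reported back to the caller (e.g. logged at debug! level)
        // and skipped instead of reaching the monitoring loop.
        _ => None,
    }
}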

Member
@CBenoit CBenoit Jul 31, 2025

It will happen as we introduce new monitor types. In the long term I'd like Gateway to return some JSON-structured warning telling the client that some of its monitor types weren't supported, so I do want them to be represented in the array of monitors.

By the way, the preflight API does that too.

Please have a look at the cookbook as well: https://github.com/Devolutions/devolutions-gateway/blob/master/docs/COOKBOOK.md#preflight-api

(I would appreciate if you could add a section for the monitor API there as well!)

@philippelatulippe philippelatulippe changed the title feat(dwg): Basic network monitoring feat(dgw): Basic network monitoring Jul 30, 2025
Comment on lines 85 to 91
let temp_dir = TempDir::new("dgw-network-monitor-test").expect("Could not create temp dir");
let temp_path: Utf8PathBuf = Utf8Path::from_path(temp_dir.path())
.expect("TempDir gave us a garbage path")
.to_path_buf();

let monitoring_state = Arc::new(network_monitor::State::mock(temp_path));
Member
@CBenoit CBenoit Jul 30, 2025

thought: I would prefer we do not include an extra dependency (tempdir) just for mocking (testing). I’ll come back on this later. (EDIT: it would be fine as a "dev-dependency", but it’s not one right now)

Comment on lines 38 to 43
) -> Result<Json<MonitoringLogResponse>, http::StatusCode> {
Ok(Json(
MonitoringLogResponse {
entries: network_monitor::drain_log(monitoring_state).into()
}
))
Member
@CBenoit CBenoit Jul 30, 2025

suggestion: Not fallible, so you don’t need to return a Result

SetConfigError::Io(_) => http::StatusCode::INTERNAL_SERVER_ERROR,
SetConfigError::Serde(_) => http::StatusCode::INTERNAL_SERVER_ERROR
}
})
Member
@CBenoit CBenoit Jul 30, 2025

suggestion: Use crate::http::HttpError instead.

) -> Result<(), HttpError> {
let conf = conf_handle.get_conf();
let jrl_json = serde_json::to_string_pretty(&claims)
.map_err(HttpError::internal().with_msg("failed to serialize JRL").err())?;

This will solve your TODO as well; the returned error is already logged automatically in the IntoResponse impl of HttpError.
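
Applied to the monitoring handler, that could look roughly like this (a sketch that reuses the builder calls from the quoted example; the exact message and variable names are placeholders):

network_monitor::set_config(config, monitoring_state)
    .await
    .map_err(HttpError::internal().with_msg("failed to apply monitoring config").err())?;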

}

#[debug_handler]
async fn handle_set_monitoring_config(
Member

suggestion: Please, add the openapi/utoipa documentation. You can check the crate::api::session module or another one, for reference.
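
As a very rough, hypothetical sketch of the annotation style (the operation id, tag, and responses below are made up; follow the conventions in crate::api::session for the real thing):

#[utoipa::path(
    post,
    operation_id = "SetMonitoringConfig",
    tag = "Monitoring",
    path = "/jet/monitoring/config",
    responses(
        (status = 200, description = "Monitoring configuration applied"),
        (status = 401, description = "Invalid or missing authorization token"),
    ),
)]
async fn handle_set_monitoring_config() {}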

Author
@philippelatulippe philippelatulippe Jul 31, 2025

That's very cool! Some of my structs are in my network-monitor crate, so they need to implement the ToSchema trait. How should I proceed: should my crate also depend on utoipa? Should that dependency be gated behind a feature? If yes, how is the flag passed from the main application to the crates? (I've just read some of your other comments, which render this issue moot.)

}

#[derive(Debug, Clone, Serialize)]
pub struct MonitoringLogResponse {
Member
@CBenoit CBenoit Jul 30, 2025

issue: Should not be part of the public API.

Suggested change
pub struct MonitoringLogResponse {
pub(crate) struct MonitoringLogResponse {

(Or private, but you may need that in the openapi declaration.)

Member
@CBenoit CBenoit left a comment

Good job! I just did a very high-level review; I’ll look at the details of your crate later 🙂

Contributor
@allan2 allan2 left a comment

Nice PR! I left some notes.

I'd also suggest running cargo clippy -p network-monitor and fixing the lints.

}
}

pub fn mock(cache_path: Utf8PathBuf) -> State {
Contributor

Add #[cfg(test)].

Author

This mock is created by the mock function in DgwState, which is not gated in this way. I'm not sure why though, maybe it should be? @CBenoit

Member

The one in Devolutions Gateway is #[doc(hidden)] to express a similar intention, but if we gate with #[cfg(test)], then the function is not available for the integration tests…
It’s not ideal though, I don’t really like that. But for the purpose of getting stuff done, I would say that it’s not a big deal, and we can surely do something about that later

MonitorsConfig { monitors: Vec::new() }
}

fn mock() -> MonitorsConfig {
Contributor

Add #[cfg(test)].


}

#[cfg(test)]
mod test {
Contributor

Typically this module is named tests.

Comment on lines 3 to 4
version = "0.1.0"
edition = "2021"
Contributor

You can copy the Cargo.toml of other crates.

Version 0.0.0, publish = false, edition = 2024, and other fields.

let start_time = Utc::now();

let ping_result = async || -> anyhow::Result<TimeDelta> {
let runtime = network_scanner_net::runtime::Socket2Runtime::new(None)?;
Contributor
@irvingoujAtDevolution irvingoujAtDevolution Jul 30, 2025

Starting this runtime is a bit expensive. I think it's better to create it outside of a single request, keep it alive while monitoring is active, and reuse it from place to place. It's an Arc, so you can safely pass it around as a parameter with Arc::clone if necessary.
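
A rough sketch of that reuse, assuming the runtime is stored in the shared monitoring state (the runtime field and accessor are hypothetical; only Socket2Runtime::new(None) appears in this PR):

use std::sync::Arc;

use network_scanner_net::runtime::Socket2Runtime;

pub struct State {
    // Created once when monitoring starts, instead of once per probe.
    runtime: Arc<Socket2Runtime>,
    // ...other fields from the PR (config, log, cancellation tokens)
}

impl State {
    pub fn runtime(&self) -> Arc<Socket2Runtime> {
        Arc::clone(&self.runtime)
    }
}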

},
}

impl Into<Vec<u8>> for Icmpv6Message {
Member

suggestion: Prefer implementing From<Icmpv6Message> for Vec<u8> when feasible (as in this case). This provides a more idiomatic interface and enables .into() automatically via the blanket Into implementation.

issue/style: However, using From/Into for binary encoding or parsing is considered unidiomatic. These traits are meant to express type conversions, not serialization or protocol-specific encodings. Encoding logic belongs in explicitly-named methods or dedicated traits (e.g., encode() or to_bytes()), which make the semantics clearer.
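
A small sketch of both points (the Icmpv6Message fields below are made up for illustration):

struct Icmpv6Message {
    msg_type: u8,
    code: u8,
    payload: Vec<u8>,
}

// Idiomatic direction for a plain conversion: From<T> for Vec<u8>, which also provides .into().
impl From<Icmpv6Message> for Vec<u8> {
    fn from(msg: Icmpv6Message) -> Self {
        let mut bytes = vec![msg.msg_type, msg.code];
        bytes.extend_from_slice(&msg.payload);
        bytes
    }
}

// Preferred for protocol encoding: an explicitly named method that states the intent.
impl Icmpv6Message {
    fn to_bytes(&self) -> Vec<u8> {
        let mut bytes = vec![self.msg_type, self.code];
        bytes.extend_from_slice(&self.payload);
        bytes
    }
}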

Member
@CBenoit CBenoit left a comment

note: While .into() is concise, I strongly discourage its use in non-generic, concrete code because it obscures intent. Prefer TargetType::from(source_value):

  • The destination type is explicit at the call site.
  • It immediately signals that a conversion is taking place and to which type.
  • It removes the need for type inference and reduces coupling to surrounding context, aiding readability and future refactoring.

.into() is better reserved for generic contexts or builder-style code where the target type is already constrained. Outside of those, the clarity benefit of from() is worth the extra characters.
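
A tiny illustration of the call-site difference (types chosen arbitrarily):

fn takes_u64(n: u64) {
    let _ = n;
}

fn example(n: u32) {
    takes_u64(u64::from(n)); // the destination type is visible at the call site
    takes_u64(n.into()); // relies on inference; fine in generic or builder-style code, less clear here
}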

Member
@CBenoit CBenoit left a comment

Highlighting this comment: https://github.com/Devolutions/devolutions-gateway/pull/1446/files#r2242524595

suggestion: Look in the config module, there is a enable_unstable option. For now, I suggest we gate this feature behind this option, so you can freely change the API as necessary until the 2025.3 release.

Comment on lines 12 to 14
use camino::*;
use chrono::{DateTime, TimeDelta, Utc};
use serde::*;
Member
@CBenoit CBenoit Jul 31, 2025

issue: I highly discourage using glob imports. When done without precise conventions it leads to code harder to review and explore. There are some exceptions, but we should not decide in an ad-hoc way.

Comment on lines 189 to 213
#[derive(Eq, PartialEq, Hash, Clone, Serialize, Deserialize, Debug)]
#[serde(rename_all = "camelCase")]
pub enum ProbeType {
    Ping,
    TcpOpen,
    #[serde(untagged)]
    Unknown(String),
}

#[derive(Eq, PartialEq, Hash, Clone, Serialize, Deserialize, Debug)]
pub struct MonitorDefinition {
    id: String,
    probe: ProbeType,
    address: String,
    interval: u64,
    timeout: u64,
    port: Option<i16>,
}

#[derive(PartialEq, Clone, Serialize, Deserialize, Debug)]
pub struct MonitorResult {
    monitor_id: String,
    request_start_time: DateTime<Utc>,
    response_success: bool,
    response_messages: Option<String>,
    response_time: f64,
}
Member
@CBenoit CBenoit Jul 31, 2025

note: So far, we’ve been separating serialization (models) from the actual business logic. In practice, this means that we don’t use serde in the "library" crates, and we have separate types for the internal representations and the "Gateway API boundary representation" ("DTO"). The idea is that the crate should be used like any other library, and should not take decisions on how the data is represented between the systems (e.g. DVLS <-> Gateway) at this level.
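
A rough illustration of that split, reusing two fields of MonitorResult from this PR (the Dto suffix and module placement are hypothetical):

// In the network-monitor crate: plain internal type, no serde.
pub struct MonitorResult {
    pub monitor_id: String,
    pub response_success: bool,
}

// In the Gateway API layer: the wire representation, converted from the internal type.
#[derive(serde::Serialize)]
#[serde(rename_all = "camelCase")]
struct MonitorResultDto {
    monitor_id: String,
    response_success: bool,
}

impl From<MonitorResult> for MonitorResultDto {
    fn from(r: MonitorResult) -> Self {
        Self {
            monitor_id: r.monitor_id,
            response_success: r.response_success,
        }
    }
}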

Author

That sounds good. My thought process was to keep everything tightly contained in my monitoring crate, but I see how keeping the DTOs in my crate goes against the current architecture. However, I'm not sure how to proceed other than duplicating the structs and enums; would that be okay?

Member
@CBenoit CBenoit Jul 31, 2025

I think that’s okay! They may or may not diverge at some point in the future based on our needs; the benefit is that you can refactor the internal structs without affecting the system API by mistake, and you can modify the system API without altering the internal details. IIRC, the network scanner went through something like this.

}

pub(crate) fn write(&self, data: T) {
let mut entries = self.entries.lock().expect("Couldn't get LogQueue lock");
Member

nitpick: Rust error messages should not be capitalized. Also, I suggest staying concise for this specific failure point, explaining that we got a poisoned mutex (highly unlikely, since we should not panic while holding the lock as far as I can see).

Suggested change
let mut entries = self.entries.lock().expect("Couldn't get LogQueue lock");
let mut entries = self.entries.lock().expect("poisoned");

Comment on lines 161 to 163
fn get_monitor_config_path(cache_path: &Utf8PathBuf) -> Utf8PathBuf {
cache_path.join("monitors_cache.json")
}
Member

issue: The current logic assumes that cache_path is a directory and appends a hardcoded filename internally. This creates ambiguity and limits flexibility, as the consumer cannot control the final file path or name. It also introduces hidden behavior that is not obvious to users of the API.

suggestion: This logic should be delegated to the consumer. The library should treat cache_path as the full path to the cache file, allowing consumers to specify both the location and the name.

rationale: Since the library only manages a single cache file, there's no strong reason to impose a fixed internal filename. This change would improve API clarity and make the library more adaptable to various use cases.
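
A brief sketch of the suggested shape, where the consumer owns the full path (the function name and call site below are illustrative):

// In the library: take the full path to the cache file.
fn open_cache_file(cache_file: &camino::Utf8Path) -> std::io::Result<std::fs::File> {
    std::fs::OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(cache_file)
}

// In the Gateway (consumer): both the location and the file name are decided here.
// let cache_file = cache_dir.join("monitors_cache.json");
// let file = open_cache_file(&cache_file)?;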

Author

I agree completely! My original code was reading the config dir path from the existing env variable, but I didn't rethink the implementation when I switched to passing the path through the state struct.

Comment on lines 15 to 16
tokio = { version = "1.45", features= ["full"] }
tokio-util = { version = "0.7", features = ["full"] }
Member
@CBenoit CBenoit Jul 31, 2025

issue: Do not use "full", especially in libraries; this brings in new dependencies and code we don’t need.

Author

I did notice that "full" is avoided in the codebase, but I wanted to ask: it feels tedious to toggle features with such granularity; why can't the compiler figure out what we're not using? I'm genuinely asking; probably I'm about to find out that the compiler is not that smart and that by doing this I'm going to murder our compile time due to monomorphisation.

Author
@philippelatulippe philippelatulippe Jul 31, 2025

But in the case of tokio, maybe we can enable "full"? Across our crates we currently enable:

fs, io-std, io-util, macros, net, parking_lot, process, rt, rt-multi-thread, signal, sync, time

I think that's every feature enabled by "full"!

Contributor

Consumers won't be including every crate in the workspace though.

I think you only need macros for network-monitor.

Member
@CBenoit CBenoit Jul 31, 2025

That’s a fair point for tokio, but as a good practice I still encourage you to specify only what is needed.
Allan's point also holds true: all the features are enabled for Jetsocat I think, but not necessarily for all the other binaries.

Also, for tokio-util you are adding extra features which bring extra crates into the dependency graph.

Of course, the compiler, and more specifically the linker, is able to remove most of the dead code.
Typically, it’s not really a problem for what ends up in the final artifacts we deliver.

However:

  • I’m pretty sure the code still gets compiled first, before it's removed by the linker
  • For each build from scratch, we need to download more and more crates
  • From-scratch builds are done way more often than you may think (especially in CI)
  • Our from-scratch build time is indeed already not ideal today, and adding a few seconds to the mess may seem like not much, but that’s part of the reason why the builds are getting significantly longer over the long term
  • I still have hopes to improve that someday, so this is saving time for future me
  • As an aside, we also track ALL the dependencies for each release of the Gateway (SBOM), because we care about security issues found in the dependencies, and even though the situation can already be considered not ideal (many, many dependencies in there), I still prefer minimizing as much as possible the risk of being alerted about a vulnerability or a supply-chain attack in the one crate we didn’t actually need

I do think that the feature "full" is very useful though, especially when you’re prototyping or exploring and you don’t know exactly the feature set you need yet.

Member

Something seems wrong here. Let’s update the OpenAPI stuff in a separate PR.

@@ -42,5 +41,9 @@ pub fn make_router<S>(state: crate::DgwState) -> axum::Router<S> {
);
}

if state.conf_handle.get_conf().enable_network_monitoring {
Member
@CBenoit CBenoit Aug 19, 2025

suggestion: Instead of adding a new configuration key, please use get_conf().debug.enable_unstable. This is the generic flag we use for unstable stuff. For stabilizing, we just drop the related if blocks.
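
As a sketch, mirroring the existing block but with the generic flag (the field path is taken from this comment, not verified against the config module):

if state.conf_handle.get_conf().debug.enable_unstable {
    router = router.nest("/jet/monitoring", monitoring::make_router(state.clone()));
}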

@@ -42,5 +41,9 @@ pub fn make_router<S>(state: crate::DgwState) -> axum::Router<S> {
);
}

if state.conf_handle.get_conf().enable_network_monitoring {
router = router.nest("/jet/monitoring", monitoring::make_router(state.clone()));
Member
@CBenoit CBenoit Aug 19, 2025

naming: For the API naming, what about /jet/monitor? This seems more aligned with the rest of the routes.

question: Would it make sense to put that under the /jet/net endpoint? E.g.: /jet/net/monitor? Or is it also intended to do something else outside of network monitoring?

Member

note: I’m okay with adjusting the small details with OpenAPI in a follow-up PR. It’s fine to just brush up the documentation. It’s something hard to get right on the first try.

@CBenoit CBenoit changed the title feat(dgw): Basic network monitoring feat(dgw): basic network monitoring Aug 19, 2025
Member

CBenoit commented Aug 19, 2025

note: If you have a Jira ticket, add an Issue: DGW-XXX footer in the main comment, which we use for changelog generation 🙂

irvingoujAtDevolution and others added 8 commits August 20, 2025 17:35
babby's first professional rust commit
Place network monitoring behind a config flag, add swagger docs, return partial errors, more idomatic from/into usages, separte DTO from internal datastructures, clean up Cargo.toml changes,
@philippelatulippe philippelatulippe force-pushed the network-monitor-config-endpoint branch from a0829fd to 9782e2b Compare August 20, 2025 15:35