@bouwkast
Contributor

@bouwkast bouwkast commented Oct 17, 2025

What does this PR do?

THIS IS FROM CLAUDE CODE - REVIEW ACCORDINGLY

In #4985 we removed pii_safe so that error logs comply with the requirement that every message we send to intake comes from a set of known strings.

In this PR we implement a two-tier custom exception strategy for profiler C code errors to prevent PII in telemetry while preserving full local debugging information.

Exception Types:

  • ProfilingError: Constant error messages → included in telemetry (safe for fingerprinting)
  • ProfilingInternalError: Dynamic content (libdatadog errors, system state) → excluded from telemetry (prevents fingerprinting issues)

Changes:

  • Converted 48+ C error sites from RuntimeError to ProfilingError
  • Converted 13 C error sites with dynamic content to ProfilingInternalError
  • Updated telemetry logging to selectively include only ProfilingError messages
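As a rough illustration of the selective inclusion described above, here is a minimal Ruby sketch. It is not the PR's actual code: the parent class of both exceptions, the `telemetry_message_for` helper, and the `REDACTED_MESSAGE` placeholder are all assumptions made for the example.

```ruby
# Hypothetical stand-ins for the two exception classes named in this PR.
# Modeling them as siblings under StandardError is an assumption.
class ProfilingError < StandardError; end
class ProfilingInternalError < StandardError; end

# Hypothetical placeholder sent instead of a message that may hold dynamic content.
REDACTED_MESSAGE = "(message omitted)"

# Returns the text considered safe to send to telemetry.
# Only ProfilingError messages are constant strings, so only those pass through.
def telemetry_message_for(exception)
  case exception
  when ProfilingError
    exception.message
  else
    REDACTED_MESSAGE
  end
end

constant = ProfilingError.new("Failed to start sampling")
dynamic = ProfilingInternalError.new("libdatadog error: /home/alice/app.rb")

puts telemetry_message_for(constant) # constant message is forwarded
puts telemetry_message_for(dynamic)  # dynamic message is replaced
puts dynamic.message                 # full detail is still available locally
```

In this sketch the allow-list is expressed by exception class rather than by inspecting the message, so any exception type the telemetry code does not recognize falls through to the redacted branch.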

Motivation:

Following PR #4985, which removed the pii_safe parameter, we needed a way to safely include known-constant profiler error messages in telemetry without risking PII leaks from dynamic content.

This approach:

  • ✅ Enables telemetry fingerprinting and aggregation (constant messages only)
  • ✅ Maintains full debugging context locally (dynamic details preserved)
  • ✅ Prevents deduplication issues at both client and backend
  • ✅ Provides semantic clarity through exception type hierarchy

This was the initial recommendation that led to this PR:

  • Remove the pii_safe option. Only invoke telemetry methods with PII safe arguments.
  • The Ruby profiler is the current user of pii_safe. We will ensure all its messages contain only known values.
  • In today's implementation, the profiler (which is in C, to ensure it can execute in memory-unsafe contexts such as during GCs or between Ruby thread context switches) can only safely communicate its errors to our telemetry code (which is in Ruby) through Ruby exceptions. This means that we populate a Ruby exception message with a string containing the profiler error, then use that exception message when reporting it to telemetry.
  • To ensure we are only reporting exception messages that we know for sure were created by us, we will introduce a custom exception class (e.g. DatadogProfilingException), and only report messages from that exception. For example, these exceptions will go from rb_raise(rb_eRuntimeError, ... to rb_raise(rb_eDatadogProfilingError, ...

Change log entry

None.

Additional Notes:

Builds on PR #4985. All profiler errors now use typed exceptions instead of generic RuntimeError, making error handling more explicit and safer for telemetry ingestion.

How to test the change?

I tried to update the tests

@github-actions github-actions bot added the core and profiling labels Oct 17, 2025
@bouwkast bouwkast changed the base branch from master to steven/error-logs-remediation October 17, 2025 18:07
@datadog-datadog-prod-us1
Contributor

datadog-datadog-prod-us1 bot commented Oct 17, 2025

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
Patch Coverage: 100.00%
Total Coverage: 98.53% (+0.14%)


This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 9e8e039

@lloeki lloeki force-pushed the steven/error-logs-remediation branch from f40e845 to c8670d6 Compare October 21, 2025 13:49
Base automatically changed from steven/error-logs-remediation to master October 21, 2025 14:40
@bouwkast bouwkast force-pushed the steven/error-logs-remediation-custom-profiler-code branch from 8550883 to 3d908d7 Compare November 7, 2025 20:10
@github-actions

github-actions bot commented Nov 7, 2025

Typing analysis

Note: Ignored files are excluded from the next sections.

Untyped methods

This PR introduces 1 partially typed method, and clears 1 partially typed method. It increases the percentage of typed methods from 53.12% to 53.15% (+0.03%).

Partially typed methods (+1-1)
Introduced:
sig/datadog/profiling.rbs:20
└── def self.try_reading_skipped_reason_file: (?untyped file_api) -> ::String?
Cleared:
sig/datadog/profiling.rbs:14
└── def self.try_reading_skipped_reason_file: (?untyped file_api) -> ::String?

If you believe a method or an attribute is rightfully untyped or partially typed, you can add # untyped:accept to the end of the line to remove it from the stats.

Add ruby_helpers.h include to 8 C files that use datadog_profiling_error_class
and datadog_profiling_internal_error_class but were missing the header declaration.

This fixes the compilation error:
  error: 'datadog_profiling_error_class' undeclared

Files fixed:
- clock_id_from_pthread.c
- collectors_gc_profiling_helper.c
- collectors_stack.c
- collectors_thread_context.c
- encoded_profile.c
- libdatadog_helpers.c
- private_vm_api_access.c
- unsafe_api_calls_check.c

Move ruby_helpers.h include after private VM headers to avoid conflicts.
This file requires private VM headers to be included first before any
public Ruby headers, but ruby_helpers.h includes datadog_ruby_common.h
which includes ruby.h, causing header ordering conflicts.

Fixes compilation error: 'expected ')' before '==' token in RHASH_EMPTY_P'

Cannot include ruby_helpers.h in this file as it pulls in public Ruby headers
(via datadog_ruby_common.h) that conflict with private VM headers.

Instead, declare the exception class globals as extern, following the pattern
already established in this file for other declarations.

This fully resolves the header ordering compilation error.

Method was renamed from safe_exception_message to constant_exception_message
but the RBS signature file was not updated, causing Steep type errors.
@pr-commenter

pr-commenter bot commented Nov 7, 2025

Benchmarks

Benchmark execution time: 2025-11-07 21:32:47

Comparing candidate commit f04f01d in PR branch steven/error-logs-remediation-custom-profiler-code with baseline commit 9c11f64 in branch master.

Found 0 performance improvements and 2 performance regressions! Performance is the same for 42 metrics, 2 unstable metrics.

scenario:profiling - Allocations (baseline)

  • 🟥 throughput [-323589.970op/s; -311052.662op/s] or [-6.146%; -5.908%]

scenario:tracing - Propagation - Datadog

  • 🟥 throughput [-2946.528op/s; -2871.924op/s] or [-9.301%; -9.065%]

The error method must be public but was accidentally made private when
constant_exception_message was added. Moving it before the private keyword
restores its public visibility.

Fixes test failure: NoMethodError: private method 'error' called
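The visibility fix described in this commit can be sketched as follows. This is an illustrative reduction, not the actual telemetry class: `TelemetryReporterSketch`, the method bodies, and the redaction placeholder are hypothetical, and only the method names `error` and `constant_exception_message` come from the commit messages.

```ruby
# Hypothetical stand-in for the profiler exception whose messages are constant.
class ProfilingError < StandardError; end

# Hypothetical reporter class; the real class and method bodies are not shown in this PR page.
class TelemetryReporterSketch
  # Defined before `private`, so it stays part of the public API.
  def error(description, exception)
    "#{description}: #{constant_exception_message(exception)}"
  end

  private

  # Every method defined below `private` is callable only from inside the object.
  def constant_exception_message(exception)
    exception.is_a?(ProfilingError) ? exception.message : "(message omitted)"
  end
end

reporter = TelemetryReporterSketch.new
puts reporter.error("Profiler failed", ProfilingError.new("boom"))

begin
  reporter.constant_exception_message(ProfilingError.new("boom"))
rescue NoMethodError => e
  puts "private helper rejected: #{e.class}"
end
```

Had `error` been defined after the `private` keyword, the first call would have raised the same NoMethodError that the failing test reported.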
Serialization errors contain dynamic libdatadog content, so they should
raise ProfilingInternalError (not ProfilingError or RuntimeError).

Updated both the Ruby wrapper code and the test expectation to use
ProfilingInternalError consistently.

Fixes test failure expecting ProfilingError but getting RuntimeError.
@bouwkast bouwkast added the AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos label Nov 10, 2025
@bouwkast bouwkast changed the title [DO NOT MERGE] Implement custom exceptions raised from profiler code [WIP] Implement custom exceptions raised from profiler code Nov 10, 2025
