-
Notifications
You must be signed in to change notification settings - Fork 11
Add remote symbolication support with build-id and PC offset #324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add remote symbolication support with build-id and PC offset #324
Conversation
b41a6eb to
74c5410
Compare
Benchmarks [x86_64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 1 performance regressions! Performance is the same for 14 metrics, 23 unstable metrics.
|
Benchmarks [x86_64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [x86_64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics. |
Benchmarks [x86_64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 cpu,wall]Parameters
See matching parameters
SummaryFound 1 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 23 unstable metrics.
|
Benchmarks [aarch64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [aarch64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 alloc]Parameters
See matching parameters
SummaryFound 1 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 22 unstable metrics.
|
Benchmarks [aarch64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 cpu,wall]Parameters
See matching parameters
SummaryFound 2 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 21 unstable metrics.
|
Benchmarks [aarch64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [aarch64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
1788096 to
55578ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Consolidates all documentation under the doc/ directory instead of having a separate docs/ directory. This matches the existing pattern where documentation files are in doc/. - Renamed docs/architecture/CallTraceStorage.md to doc/architecture/CallTraceStorage.md - Renamed docs/architecture/TLSContext.md to doc/architecture/TLSContext.md - Renamed docs/architecture/TlsPriming.md to doc/architecture/TlsPriming.md - Removed empty docs/ directory Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated references in ddprof-stresstest/README.md to point to the new doc/architecture/ location instead of doc/ directly. - Changed doc/CallTraceStorage.md to doc/architecture/CallTraceStorage.md This completes the consolidation of all documentation under the doc/ directory with proper subdirectory organization. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Rename all documentation files under doc/ to use consistent PascalCase naming convention for improved consistency across the codebase. Changes: - event-type-system.md → EventTypeSystem.md - MODIFIER_ALLOCATION.md → ModifierAllocation.md - profiler-memory-requirements.md → ProfilerMemoryRequirements.md - REMOTE_SYMBOLICATION.md → RemoteSymbolication.md Update references: - README.md: Update link to RemoteSymbolication.md Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update remotesym argument parsing to match the robust pattern used by other BOOL flags (like mcleanup): - Explicitly handle false values: 'n', 'no', 'f', 'false', '0' - Explicitly handle true values: 'y', 'yes', 't', 'true', '1' - Handle no-value case: 'remotesym' alone enables the feature - Any other value defaults to enabled (consistent with mcleanup) This prevents accepting invalid values like "remotesym=yikes" and makes the behavior consistent with other boolean flags in the codebase. Added comprehensive test coverage: - Test no-value case (should enable) - Test numeric values (0 and 1) - Test "no" and "n" variants (should disable) - Test invalid values (document default behavior) Addresses review comments: - #324 (comment) - #324 (comment) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Address security and robustness issues in ELF build-ID extraction: 1. Integer overflow vulnerability (symbols_linux_dd.cpp) - Replace addition-based bounds check with subtraction pattern - Prevents integer wrapping with malicious n_namesz/n_descsz values - Progressive checks: header size, name size, desc size independently - Uses safe pattern: remaining = note_size - offset - sizeof(header) - Each component verified against remaining before subtraction 2. Test race condition (remotesymbolication_ut.cpp) - Replace fixed path "/tmp/not_an_elf" with mkstemp() - Use unique temporary file: "/tmp/not_an_elf_XXXXXX" - Eliminates race conditions in concurrent test environments - Avoids conflicts when tests run in parallel - Follows best practices for temporary file creation Addresses review comments: - #324 (comment) - #324 (comment) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Use portable PRIxPTR macro instead of %lx for formatting uintptr_t values to ensure correct behavior across all platforms. Changes: - Added #include <inttypes.h> for PRIxPTR macro - Changed format string from "0x%lx" to "0x%" PRIxPTR - Ensures correct formatting on Windows x64 where uintptr_t is unsigned long long, not unsigned long - Maintains compatibility across all platforms (Linux, macOS, Windows) The %lx format specifier assumes uintptr_t == unsigned long, which is not portable. On Windows x64, uintptr_t is unsigned long long, which would cause format string warnings or incorrect output. Addresses review comment: - #324 (comment) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Root Cause Analysis: OpenJ9 uses JVMTI-based CPU profiling when shouldUseAsgct() returns false, falling back to the j9_engine (profiler.cpp:1148-1157). JVMTI's GetAllStackTracesExtended() API only captures Java stack frames and does not include native frames from dynamically loaded libraries. Test Requirement: RemoteSymbolicationTest requires native frames from libddproftest.so to validate remote symbolication. The test calls RemoteSymHelper.burnCpu() and computeFibonacci() which invoke native methods, expecting these frames to appear in samples with build-ID and PC offset information. Observed Failure: Test consistently failed after 10 retry attempts with: - 48-52 samples collected per attempt - 0 frames from libddproftest found - Log shows: [TEST::INFO] J9[cpu]=jvmti - Library loaded successfully with valid build-ID This is a fundamental limitation of JVMTI-based profiling on OpenJ9, not a bug in remote symbolication. The feature works correctly on HotSpot JVMs which use signal-based profiling (perf_events/itimer) that captures both Java and native frames. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- flightRecorder.cpp: Fix incorrect "2025, 2026" to "2026" - arguments.cpp: Add Datadog copyright for modifications Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…dd.cpp Added detailed comments with links to official specifications: - ELF Specification from Linux Foundation - LSB ELF Note Section format - GNU build-id feature documentation - GNU binutils ld --build-id option - readelf(1) manual page Also added inline documentation of ELF note structure (Elf64_Nhdr) with field-by-field explanation and 4-byte alignment requirements per spec. This makes it easier for future developers to understand the build-id extraction implementation and verify correctness against specifications. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
rkennke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Looks mostly good, except I have question in one place.
| strcpy(_build_id, other._build_id); | ||
| } | ||
| } else { | ||
| _build_id = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That assignment is redundant, b/c build_id is initialized with nullptr. You could save the whole else-branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. This is definitely not needed.
|
|
||
| _plt_offset = other._plt_offset; | ||
| _plt_size = other._plt_size; | ||
| _debug_symbols = other._debug_symbols; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the exact same code block as a few lines above? If so, it would be better to extract this into a helper method to avoid duplication.
| } | ||
| } | ||
|
|
||
| void CodeCache::setBuildId(const char* build_id, size_t build_id_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I have the feeling here comes the same sequence of code, yet again (3rd time). Can this be unified (maybe the other 2 instances can be made to call this method)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a shared function to copy the relevant parts.
Remote symbolication was calling NativeFunc::mark() on RemoteFrameInfo* pointers, causing garbage memory reads. Added patch to check frame_bci type before calling mark(), ensuring it's only called on BCI_NATIVE_FRAME. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
/merge -m squash |
|
View all feedbacks in Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
The expected merge time in
jaroslav.bachorik@datadoghq.com cancelled this merge request build |
rkennke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, much better. We need to coordinate on the patching stuff because of #333.
| ], | ||
|
|
||
| // Stack walker patches for ASan compatibility | ||
| // Stack walker patches for ASan compatibility and remote symbolication |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that I'm currently working on removing the patching stuff (#333 ), we need to be careful to avoid a mess here.
|
/remove |
|
View all feedbacks in Devflow UI.
|
What does this PR do?:
Adds remote symbolication support to the Java profiler, enabling native frames to be symbolicated remotely by backend services instead of locally by the agent.
Motivation:
How it works:
When remote symbolication is enabled (
remotesym=true), the profiler stores build-ID and PC offset information for native frames instead of resolving symbols locally. This raw addressing information is serialized into JFR format and sent to backend services for symbol resolution.Key Components:
Build-ID Extraction (
symbols_linux_dd.h/cpp)_build_id_processedset to prevent redundant extractionPacked Frame Data (
profiler.h/cpp,vmEntry.h)pc_offset (44 bits) | mark (3 bits) | lib_index (17 bits)Stack Walker Integration (
stackWalker_dd.h,gradle/patching.gradle)convertNativeTrace()→populateRemoteFrame()resolveNativeFrameForWalkVM()called during stack walkJFR Serialization (
flightRecorder.cpp/h)<remote>marker0x<offset>)Library Management (
libraries.h/cpp,codeCache.h/cpp)Configuration:
# Enable remote symbolication java -agentpath:libjavaProfiler.so=start,cpu,remotesym=true,file=profile.jfr MyAppObservability:
Three new counters track feature usage:
REMOTE_SYMBOLICATION_FRAMES: Frames using remote symbolicationREMOTE_SYMBOLICATION_LIBS_WITH_BUILD_ID: Libraries with extracted build-IDsREMOTE_SYMBOLICATION_BUILD_ID_CACHE_HITS: Build-ID cache efficiencyThread Safety:
_build_id_lockmutexlockAll()holdPerformance:
Platform Support:
Testing:
Test Coverage:
remotesymbolication_ut.cpp,remoteargs_ut.cppRemoteSymbolicationTest.java(all cstack modes: vm, vmx, fp, dwarf)libddproftest.sowith guaranteed build-idDocumentation:
Backward Compatibility:
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.