Reference for the two Python entry points. For a guided tour with full workloads, see Python examples. For a 60-second introduction, see the Quickstart.
- `codegreen.Session` — manual span-based measurement, imported and used directly in your code.
- CLI auto-instrumenter — runs `codegreen measure ...` over a script, injects checkpoints automatically.
Both share the same NEMB C++ backend, the same JSON output envelope, and the same libcodegreen-nemb.so ABI (v2+). They can coexist in one process.
For end-to-end examples, see Python examples → Manual measurement with codegreen.Session.
```python
import codegreen

with codegreen.Session("training-run") as s:
    with s.task("data_load"):
        load_data()
    with s.task("train"):
        train_model()
```

By default, results are written to `codegreen_<pid>.json` in the working directory. CSV is opt-in (pass `output_file="x.csv"` or `output_format="csv"`). Pass `save_to_file=False` to suppress file output.
Three usage forms are supported — context manager, explicit start_task / stop_task, and @codegreen.task decorator. Full code for each is in Python examples.
| Param | Default | Notes |
|---|---|---|
| `name` | `"default"` | Session name written to output |
| `output_file` | `codegreen_<pid>.json` | Output path; CSV chosen automatically when the path ends in `.csv` |
| `output_format` | `"auto"` | `"auto"` \| `"json"` \| `"csv"` \| `"none"`; `"auto"` sniffs from the extension and defaults to JSON |
| `save_to_file` | `True` | Set `False` to suppress file writes entirely |
| `warn_on_concurrent` | `True` | Warn at construction if another codegreen process is active on the same host (RAPL is system-wide) |
| `record_time_series` | `False` | Capture sampled (timestamp, power, energy, per-domain) tuples for each task |
| `buffer_samples` | `None` | Power-user override of the C++ ring-buffer size; usually unnecessary because the Python drain is adaptive |
| `sample_interval_ms` | `None` (uses config.json) | Per-session override of the sampler's measurement interval; routes to the existing `coordinator.measurement_interval_ms` field via `nemb_set_measurement_interval_ms` — no parallel state |
| `sampling_mode` | `"fixed"` | `"adaptive"` is reserved for a future runtime-rate-control mode; today only `"fixed"` is implemented |
Top-level keys: `meta`, `tasks` (a list of task dicts), and `totals`. Every numeric field carries an explicit unit suffix (`_j`, `_s`, `_w`, `_ns`). Field names are identical between the Session API and `codegreen run` CLI output.
```json
{
  "meta": {
    "schema_version": "1",
    "codegreen_version": "0.4.8",
    "run_id": "b7856b409d72",
    "session_name": "training-run",
    "started_at": "2026-05-10T18:16:56.209074+00:00",
    "ended_at": "2026-05-10T18:17:01.345702+00:00",
    "started_at_local": "2026-05-10T11:16:56.209074-07:00",
    "ended_at_local": "2026-05-10T11:17:01.345702-07:00",
    "host_timezone": "PDT",
    "duration_total_s": 5.137,
    "hostname": "amd-epyc-9554p",
    "pid": 12345,
    "platform": "linux",
    "python_version": "3.13.0",
    "cpu_model": "AMD EPYC 9554P 64-Core Processor",
    "kernel": "Linux-5.15.0-...",
    "cwd": "/home/user/work",
    "argv": ["script.py"],
    "codegreen_env": {"CODEGREEN_LIB_PATH": "..."},
    "measurement_quality": "ok",
    "domain_support": "full",
    "outlier_method": "iqr_1.5",
    "iso_timestamp_format": "rfc3339_utc",
    "nemb_abi_version": 3,
    "domain_topology": {
      "package-0": {"top_level": true, "kind": "cpu_package", "includes": ["core"]},
      "core": {"top_level": false, "kind": "nested", "includes": []},
      "gpu0": {"top_level": true, "kind": "gpu", "includes": []}
    },
    "timeseries": {"enabled": true, "schema_version": "1",
                   "sample_keys": ["t_ns", "energy_j", "power_w", "domains"],
                   "t_ns_clock": "clock_monotonic",
                   "inclusive_of_children": true}
  },
  "tasks": [
    {"name": "data_load", "depth": 0, "parent": null,
     "energy_j": 12.4, "avg_power_w": 4.0, "duration_s": 3.1,
     "started_at": 1714155600.123, "ended_at": 1714155603.234,
     "started_at_mono_ns": 20364878312447553, "ended_at_mono_ns": 20364881412447553,
     "domains": {"package-0": 10.2, "core": 0.8, "gpu0": 1.4},
     "domains_power_w": {"package-0": 3.29, "core": 0.26, "gpu0": 0.45},
     "timeseries": [/* {t_ns, energy_j, power_w, domains}, ... */]}
  ],
  "totals": {
    "energy_j": 857.4,
    "duration_s": 123.1,
    "wall_duration_s": 125.5,
    "task_duration_s": 123.1,
    "gap_duration_s": 2.4,
    "concurrent_overlap_s": 0.0,
    "n_tasks": 2,
    "n_top_level_tasks": 2,
    "domains": {"package-0": 705.1, "core": 56.2, "gpu0": 96.1},
    "domains_power_w": {"package-0": 5.73, "core": 0.46, "gpu0": 0.78},
    "sample_interval_ms": 10,
    "worst_within_task_power_cv_percent": 7.25,
    "noise_warnings": []
  }
}
```

| Field | Meaning |
|---|---|
| `schema_version` | output-schema version; a bump indicates a breaking field rename or removal |
| `codegreen_version` | installed library version |
| `run_id` | 12-hex-char UUID4 prefix; unique per process invocation, for log correlation |
| `session_name` | the `Session(name=…)` argument; `null` for CLI runs |
| `started_at` / `ended_at` | RFC 3339 UTC timestamps with `+00:00` offset, microsecond precision. The canonical correlation key — use these for joins, sorts, and cross-machine comparisons |
| `started_at_local` / `ended_at_local` | (v0.4.8+) same instant rendered in the host's local timezone with its offset (e.g. `-07:00`). Display-only companion; never use for joins. UTC and local always describe the same instant within microseconds |
| `host_timezone` | (v0.4.8+) local timezone label at measurement time (e.g. `PDT`, `ADT`, `+05:30` for non-DST regions) |
| `duration_total_s` | monotonic-clock delta from session start to report build (NTP-immune) |
| `hostname`, `pid`, `platform`, `python_version` | process and host identity |
| `cpu_model`, `kernel` | hardware/OS reproducibility metadata |
| `cwd`, `argv` | working directory and argv at measurement time |
| `codegreen_env` | snapshot of all `CODEGREEN_*` environment variables |
| `measurement_quality` | `ok` \| `no_tasks` \| `no_backend` \| `energy_zero` \| `failed` \| `checkpoints_only` |
| `domain_support` | `full` (per-domain breakdown) \| `scalar_only` (overall energy only) \| `none` (no backend) |
| `outlier_method` | which outlier filter was applied to multi-run statistics (default `"iqr_1.5"`) |
| `iso_timestamp_format` | format contract for `started_at`/`ended_at`; pin in case future versions change it |
| `nemb_abi_version` | C++ NEMB backend ABI version actually loaded |
| `domain_topology` | machine-readable domain nesting (so consumers know which keys are top-level vs. nested) |
| `timeseries` | block describing whether a time series was recorded, plus its sample schema |
`task_duration_s` is the sum of depth-0 task durations (matching `energy_j`'s window); `wall_duration_s` is `s.start()` → `s.stop()` on the monotonic clock; `gap_duration_s` = wall − union(task intervals), i.e. uninstrumented work between tasks; `concurrent_overlap_s` is positive when tasks ran in parallel threads. `domains_power_w[d]` is the energy-weighted average power: Σ energy_d / Σ duration over the tasks where `d` was reported (so a domain present on only some tasks is not diluted).
Per task, `avg_power_w = energy_j / duration_s` and `domains_power_w[d] = domains[d] / duration_s`. `started_at_mono_ns`/`ended_at_mono_ns` (added v0.4.7) let consumers align task windows exactly with `timeseries[].t_ns`. The `parent` field is the immediately-enclosing task name when nested.
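These identities can be checked offline against an envelope's task list. A minimal sketch, operating on dicts shaped like the `tasks` entries above (the helper name is illustrative, not part of the codegreen API):

```python
# Sketch of the totals bookkeeping described above, applied to task dicts
# shaped like the "tasks" entries in the output envelope.

def totals_from_tasks(tasks, wall_duration_s):
    top = [t for t in tasks if t["depth"] == 0]
    task_duration_s = sum(t["duration_s"] for t in top)

    # Union of top-level task intervals (merge overlaps), on the monotonic clock.
    intervals = sorted((t["started_at_mono_ns"], t["ended_at_mono_ns"]) for t in top)
    union_ns, cur_start, cur_end = 0, None, None
    for s, e in intervals:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                union_ns += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        union_ns += cur_end - cur_start
    union_s = union_ns / 1e9

    gap_duration_s = wall_duration_s - union_s        # uninstrumented time
    concurrent_overlap_s = task_duration_s - union_s  # > 0 when tasks ran in parallel

    # Energy-weighted per-domain average power: only tasks that reported the
    # domain contribute to its denominator, so partial domains are not diluted.
    domains_power_w = {}
    all_domains = {d for t in top for d in t["domains"]}
    for d in all_domains:
        e = sum(t["domains"][d] for t in top if d in t["domains"])
        dur = sum(t["duration_s"] for t in top if d in t["domains"])
        domains_power_w[d] = e / dur

    return {
        "task_duration_s": task_duration_s,
        "gap_duration_s": gap_duration_s,
        "concurrent_overlap_s": concurrent_overlap_s,
        "domains_power_w": domains_power_w,
    }
```

For two sequential tasks of 2 s and 3 s inside a 7 s session, this yields `task_duration_s = 5.0` and `gap_duration_s = 2.0`, and a GPU domain reported on only the first task is averaged over 2 s, not 5 s.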
- `domains` — per-domain RAPL/NVML energy (J) for the task, computed atomically with the session stop (ABI v2 — race-free under concurrent threads).
- `domains_power_w` — per-domain average power (W), computed as `domains[d] / duration_s`. Same time-base as `avg_power_w`, so the two are directly comparable.
- Domain nesting caveat: domain energies are NOT disjoint. On Intel, `package` already includes `pp0`/`core` and `pp1` (uncore/igpu); `dram` measures a physically separate counter (Intel SDM Vol 4 §14.9 — MSR 0x611 vs MSR 0x619); `gpu*` (NVML) is fully independent. On AMD EPYC, only `package-0` is exposed. So `sum(domains.values()) ≠ energy_j` by design — `energy_j` aggregates package + dram + gpu and excludes the `pp0`/`pp1`/`core`/`uncore` subsets. Use `meta.domain_topology` to programmatically distinguish top-level from nested domains.
- DRAM is always included (v0.4.6+): Linux exposes DRAM at `intel-rapl:0/intel-rapl:0:0/name=dram` on Skylake-SP+ Xeons (sub-zone) and at `intel-rapl:1/name=dram-0` on older Xeons (zone-level). v0.4.6 promotes both layouts equivalently into the `energy_j` total — earlier versions undercounted by 10-15% on memory-bound workloads on Skylake-SP+ chips.
- `timeseries` — present only when `record_time_series=True` (ABI v3+). Each sample is self-describing:
| Key | Type | Unit | Meaning |
|---|---|---|---|
| `t_ns` | int | nanoseconds | `CLOCK_MONOTONIC` timestamp at sample (Linux); `mach_continuous_time` on macOS; `QueryPerformanceCounter` on Windows — all converted to ns |
| `energy_j` | float | joules | system-wide cumulative energy from session start (sum across all providers) |
| `power_w` | float | watts | system-wide instantaneous power at this sample (sum across all domains) |
| `domain_j` | Dict[str, float] | joules | per-domain cumulative energy from session start (e.g. `package-0`, `core`, `dram`, `gpu0`) |
| `domain_w` | Dict[str, float] | watts | per-domain average power since the previous sample. Domains whose provider does not expose per-domain power (Darwin IOReport, Windows EMI, AMD RAPL) are absent rather than reported as 0, so callers can distinguish "0 W" from "not measured" |
So to get only GPU watts directly: `[s["domain_w"].get("gpu0", 0.0) for s in ts]`.
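Given the domain-nesting caveat above, `meta.domain_topology` lets a consumer sum only the disjoint (top-level) domains without hard-coding the Intel/AMD layout. A minimal sketch (the helper name is illustrative, not part of the codegreen API):

```python
# Sketch: use meta.domain_topology to sum only top-level (disjoint) domains,
# so nested subsets like "core" are not double-counted.

def top_level_energy(task_domains, domain_topology):
    return sum(
        joules
        for name, joules in task_domains.items()
        if domain_topology.get(name, {}).get("top_level", False)
    )

topology = {
    "package-0": {"top_level": True, "kind": "cpu_package", "includes": ["core"]},
    "core":      {"top_level": False, "kind": "nested", "includes": []},
    "gpu0":      {"top_level": True, "kind": "gpu", "includes": []},
}
domains = {"package-0": 10.2, "core": 0.8, "gpu0": 1.4}
top_level_energy(domains, topology)  # ≈ 11.6 — "core" excluded, already inside package-0
```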
| Field | Type | Meaning |
|---|---|---|
| `name` | str | task name passed to `start_task` / `task()` |
| `energy_j` | float | total joules during the task (atomic via `nemb_stop_session_v2`) |
| `avg_power_w` | float | average watts over the task window (= `energy_j / duration_s`) |
| `duration_s` | float | task wall-clock seconds (monotonic-derived) |
| `started_at`, `ended_at` | float | wall-clock POSIX seconds (display only) |
| `started_at_mono_ns`, `ended_at_mono_ns` | int | monotonic-clock stamps for aligning with `timeseries[].t_ns` (v0.4.7+) |
| `depth`, `parent` | int, Optional[str] | nesting info; `parent` is the immediately-enclosing task name |
| `domains` | Dict[str, float] | per-RAPL/NVML-domain energy (J) for the task |
| `domains_power_w` | Dict[str, float] | per-domain average power (W) = `domains[d] / duration_s`; same time-base as `avg_power_w` |
| `timeseries` | Optional[List[Dict]] | sorted, deduplicated samples within [`started_at_mono_ns`, `ended_at_mono_ns`]. `None` when `record_time_series=False`; empty list when enabled but the task was shorter than one sample interval. Inclusive of children (a parent's timeseries contains its children's samples — see `meta.timeseries.inclusive_of_children`) |
| `noise` | Optional[Dict] | quality summary computed from the timeseries |
See the timeseries-sample schema table above for sample keys (t_ns, energy_j, power_w, domain_j, domain_w).
When record_time_series=True, every task carries a noise dict and totals carry a roll-up:
```json
"noise": {
  "samples_captured": 2847,
  "samples_expected": 3000,
  "samples_expected_method": "observed_median",
  "drop_ratio": 0.0510,
  "power_mean_w": 102.3,
  "power_std_w": 7.4,
  "power_cv_percent": 7.25,
  "sample_interval_ms": 1,
  "quality": "moderate"
},
"totals": {
  ...,
  "worst_within_task_power_cv_percent": 7.25,
  "noise_warnings": [
    {"task": "data_load", "depth": 0,
     "within_task_power_cv_percent": 17.8, "drop_ratio": 0.003,
     "quality": "high-noise",
     "reasons": ["within_task_power_cv_above_10pct"]}
  ]
}
```

`samples_expected_method` is `"observed_median"` (interval inferred from captured samples; the default when n ≥ 3) or `"configured"` (falls back to `sample_interval_ms`). `quality` is bucketed by `power_cv_percent`: excellent < 2 %, good < 5 %, moderate < 10 %, high-noise ≥ 10 %. A `RuntimeWarning` is emitted (and the task is appended as a structured record to `totals.noise_warnings`) when CV ≥ 10 % or `drop_ratio` ≥ 20 %. All thresholds live in config.json under `measurement.report.noise_warning` so they can be overridden without code changes. Computation runs once at `stop()` time and adds ~0.05 % bias vs `record_time_series=False`.
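The bucketing and warning rules can be sketched as two small functions. The CV thresholds are the documented ones; the `drop_ratio_above_20pct` reason string is an assumption for illustration (only the CV reason appears in the example output):

```python
# Sketch of the quality bucketing and warning logic described above.
# Real thresholds live in config.json under measurement.report.noise_warning.

def quality_bucket(power_cv_percent):
    if power_cv_percent < 2:
        return "excellent"
    if power_cv_percent < 5:
        return "good"
    if power_cv_percent < 10:
        return "moderate"
    return "high-noise"

def warning_reasons(power_cv_percent, drop_ratio):
    reasons = []
    if power_cv_percent >= 10:
        reasons.append("within_task_power_cv_above_10pct")
    if drop_ratio >= 0.20:
        reasons.append("drop_ratio_above_20pct")  # assumed reason string
    return reasons
```

With the example above, `quality_bucket(7.25)` gives `"moderate"` and `quality_bucket(17.8)` gives `"high-noise"` with the CV reason attached.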
Note — slight overhead when record_time_series=True.
The drain thread that pulls samples out of the C++ ring buffer is cheap but not free. On reproducibility benchmarks (3 fresh subprocesses each, identical workload):
- The mean energy/duration is unchanged: `record_time_series=True` vs `=False` agreed to ≤ 0.3 % (within run-to-run jitter).
- The run-to-run spread is slightly wider with sampling on (CV of total energy ~5 % vs ~1 % off) because the drain wakes up at irregular intervals and competes briefly with the workload for CPU.
So enabling time-series gives you per-sample power, plot export and the noise/quality summary, at the cost of a marginally noisier individual total. Best-of-both-worlds: use it during development to inspect power traces and pick the right code regions, then turn it off for production benchmark runs where you want the tightest possible run-to-run CV.
record_time_series=True collects samples at the coordinator's configured rate (config.json's coordinator.measurement_interval_ms, default 1 ms on this build). The Session.export_plot(path) helper renders a power-vs-time chart per task; area under the curve equals the task's energy.
```python
with codegreen.Session("training", record_time_series=True) as s:
    with s.task("epoch1"): train_one_epoch()
    with s.task("epoch2"): train_one_epoch()
s.export_plot("training.html")  # Plotly (interactive)
s.export_plot("training.png")   # Matplotlib (static image)
```

Numerically, integrating w(t) over a task's window with the trapezoidal rule recovers the NEMB-reported `energy_j` to within ~0.2% (verified on a 5 s task with ~4,800 samples).
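That trapezoid check can be reproduced offline from a task's timeseries. A minimal sketch (the helper name is illustrative, not part of the codegreen API):

```python
# Sketch: recover a task's energy by trapezoidal integration of its
# (t_ns, power_w) samples, as pulled from the task's timeseries.

def energy_from_trace(samples):
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0) / 1e9  # ns → s
    return total

# A constant 4 W held for 3 s, sampled every 100 ms, integrates to ~12 J:
trace = [(i * 100_000_000, 4.0) for i in range(31)]
energy_from_trace(trace)  # ≈ 12.0 J (4 W × 3 s)
```

The result should track the atomically measured `energy_j`; a large gap suggests dropped samples or a task shorter than a couple of sample intervals.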
The C++ sampling ring buffer is fixed-size (default 1000 samples — at the default 1 ms interval that's a ~1 s window; with sample_interval_ms=10 it's a ~10 s window, etc.). To prevent silent loss on long tasks, the Session runs a Python drain thread that pulls samples out faster than the buffer rotates. Drain is adaptive:
- starts at 0.5 s,
- halves to a 50 ms floor when buffer >50% saturated on a single drain pass,
- doubles to a 2 s ceiling when <10% for three consecutive drains,
- emits a warning at >90% saturation suggesting a `buffer_samples` override.
Verified on a 30-second task with defaults only: 28,460 samples, full span, zero gaps >50 ms.
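The adaptation rules above can be sketched as a tiny controller (names and structure are illustrative; the real logic lives inside the Session's drain thread):

```python
# Sketch of the adaptive drain schedule: halve toward a 50 ms floor under
# pressure, double toward a 2 s ceiling after three calm passes.

FLOOR_S, CEIL_S, START_S = 0.05, 2.0, 0.5

def next_drain_interval(interval_s, saturation, calm_streak):
    """saturation: ring-buffer fill ratio seen on this drain pass.
    calm_streak: consecutive passes so far with saturation < 10%."""
    if saturation > 0.90:
        print("warning: ring buffer >90% saturated; consider a buffer_samples override")
    if saturation > 0.50:
        return max(FLOOR_S, interval_s / 2), 0   # back off fast
    calm_streak = calm_streak + 1 if saturation < 0.10 else 0
    if calm_streak >= 3:
        return min(CEIL_S, interval_s * 2), 0    # relax slowly
    return interval_s, calm_streak
```

Starting from the 0.5 s default, one pass at 60% saturation drops the interval to 0.25 s, while three consecutive passes under 10% double it to 1 s.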
Pre-existing: config.json's `coordinator.measurement_interval_ms` is the startup default (loaded by `nemb::ConfigLoader::load_config()`).
Per-session override: pass `sample_interval_ms=N` to `Session(...)` — it calls `nemb_set_measurement_interval_ms`, which writes the same `config_.measurement_interval` field the sample loop reads. No parallel sampling-rate state, no duplicate config parsing.
- Single session per process. Constructing a second `Session` while one is active raises `RuntimeError`.
- Mismatched stops raise `RuntimeError` with the actual innermost task name.
- A forgotten `.stop()` is recovered by an `atexit` hook — the file is still written, the JSON envelope still emitted.
- Concurrent threads can each maintain their own task stack (per-thread). `nemb_stop_session_v2` makes the domain breakdown race-free.
- Forked children become no-ops automatically; only the parent process reports.
- No NEMB lib loaded (CodeGreen built without the C++ backend) → the Session degrades to a warning + zero-energy results; your program still runs.
RAPL counters are system-wide, not per-process. If two CodeGreen sessions overlap in wall time on the same socket, both readings include the other's energy (double-counting). The Session constructor warns when it detects another live CodeGreen pid via $XDG_RUNTIME_DIR/codegreen-<uid>.pids. For benchmarks, run sequentially or accept "system energy during this window" semantics.
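A consumer-side sketch of that liveness check, assuming one pid per line in the pids file (the actual on-disk format is an implementation detail of codegreen):

```python
# Sketch: detect other live CodeGreen processes from a shared pids file,
# skipping our own pid and pruning stale entries.
import os

def other_live_codegreen_pids(pids_path):
    try:
        with open(pids_path) as f:
            pids = [int(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []
    alive = []
    for pid in pids:
        if pid == os.getpid():
            continue
        try:
            os.kill(pid, 0)        # signal 0: existence check, delivers nothing
            alive.append(pid)
        except ProcessLookupError:
            pass                   # stale entry, process is gone
        except PermissionError:
            alive.append(pid)      # exists but owned by another user
    return alive
```

A non-empty result is exactly the situation where two overlapping sessions would double-count RAPL energy.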
codegreen/instrumentation/language_runtimes/python/codegreen_runtime.py
This module is injected into instrumented code automatically. It uses ctypes to call libcodegreen-nemb.so.
```python
def checkpoint(checkpoint_id: str, name: str, checkpoint_type: str):
    """Mark a checkpoint in the energy measurement stream."""
```

Called by instrumented code at function boundaries:

```python
from codegreen_runtime import checkpoint

checkpoint(checkpoint_id="1", name="my_function", checkpoint_type="enter")
# ... function body ...
checkpoint(checkpoint_id="2", name="my_function", checkpoint_type="exit")
```

Each call records a ~100 ns timestamp signal. The NEMB backend tracks invocations automatically (`#inv_N` suffix).
```python
def measure_checkpoint(checkpoint_id: str, checkpoint_type: str,
                       name: str, line_number: int, context: str):
    """Record a checkpoint marker with full metadata."""
```

Lower-level function with additional context; `checkpoint()` delegates to this.
At process exit (atexit), the runtime prints checkpoint data to stdout:
```
--- CODEGREEN_RESULT_START ---
{"measurements": [
  {"checkpoint_id": "enter:main:1#inv_1_t...", "timestamp": 13973..., "joules": 6.80, "watts": 0.76},
  {"checkpoint_id": "exit:main:2#inv_1_t...", "timestamp": 13973..., "joules": 8.91, "watts": 71.94}
]}
--- CODEGREEN_RESULT_END ---
```
The CLI parses this output to extract measurement results.
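A minimal consumer-side sketch of that parse, relying only on the sentinel lines shown above:

```python
# Sketch: extract the measurements JSON from a program's captured stdout
# by slicing between the CODEGREEN_RESULT sentinel lines.
import json

START = "--- CODEGREEN_RESULT_START ---"
END = "--- CODEGREEN_RESULT_END ---"

def parse_result(stdout_text):
    start = stdout_text.index(START) + len(START)
    end = stdout_text.index(END, start)
    return json.loads(stdout_text[start:end])

out = """normal program output...
--- CODEGREEN_RESULT_START ---
{"measurements": [{"checkpoint_id": "enter:main:1#inv_1", "joules": 6.8}]}
--- CODEGREEN_RESULT_END ---
"""
parse_result(out)["measurements"][0]["joules"]  # → 6.8
```

Slicing between sentinels keeps the parse robust to any other text the workload prints on stdout.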
These commands drive the auto-instrumenter; the Quickstart and CLI reference cover them in full:
```shell
codegreen measure python script.py                                  # basic
codegreen measure python script.py -g fine --export-plot energy.html
codegreen measure python script.py --json
codegreen analyze python script.py --save-instrumented --output-dir ./out
```

```
codegreen/
  cli/cli.py                           # Typer CLI
  instrumentation/
    engine.py                          # MeasurementEngine
    language_engine.py                 # Tree-sitter parsing + query matching
    ast_processor.py                   # Checkpoint injection
    configs/*.json                     # Language-specific instrumentation configs
    language_runtimes/
      python/codegreen_runtime.py      # Python ctypes bridge to NEMB + Session
      java/CodeGreenRuntime.java       # Java JNI bridge to NEMB
  analyzer/plot.py                     # Plotly / matplotlib visualization
measurement/src/nemb/
  codegreen_energy.cpp                 # C API + EnergyMeter implementation
```