Profiling large jax scripts #32581
Hi, I have previously tried a couple of the profilers documented on the website. As you said, for large scripts the output can get very hard to digest. Although it is not the most trivial option, I found that the NVIDIA Nsight profiler works well for my use case (on NVIDIA GPUs). There are a couple of ways to add custom annotations to a function or a portion of the script using `import nvtx`:

```python
import nvtx

# as a decorator
@nvtx.annotate("my_func range", color="red")
def my_func():
    do_something()

# or as a context manager
with nvtx.annotate("for_loop", color="green"):
    do_something_else()

# or with explicit start/end markers
rng = nvtx.start_range(message="my_message", color="blue")
# ... do something ... #
nvtx.end_range(rng)
```

There are a number of traces you can enable optionally (extra traces may cause some slowdown); you specify them when you call your script on the command line.
The official documentation has an exhaustive list of options. This command will create a report file. Note: this also works for MPI jobs, and you can merge multiple reports into one. As noted in the JAX docs, jitted functions are opaque to any type of profiler, so you cannot add an annotation to a subset of a jitted function, but you can still see what the hardware is doing (resource usage, memory transfers, device copies, etc.). Jitted functions usually show up in the XLA traces under their names, but that part gets messy really quickly, since internal JAX functions are also jitted. If you have a rough idea of which part of the code is the bottleneck, you can start putting more annotations around it. I had shared multiple Nsight screenshots in #29470 before. I would also be interested in other people's experiences with other profilers for large scripts. Hope this helps!
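The exact command line was elided above; as a hedged sketch of what a typical Nsight Systems invocation looks like (the trace list, report name, and script name here are assumptions, not from the original post; see `nsys profile --help` for the full option list):

```shell
# Hypothetical invocation: -t/--trace selects which traces to record
# (more traces means more overhead); -o names the output report file.
nsys profile -t nvtx,cuda,osrt -o my_report python my_script.py
```

The resulting report can then be opened in the Nsight Systems GUI, where the nvtx annotations show up as named ranges on the timeline.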
Hi,
Thank you, maintainers, for this awesome project.
I have developed my own library of reinforcement learning agents in JAX; it works incredibly well and lets me run experiments end-to-end on the GPU. However, given my rather messy understanding of JAX's inner workings, I am sure I am still underperforming relative to what JAX could achieve. As a result, I would like to profile my whole codebase to identify the performance bottlenecks.
For a Python program I would use cProfile; for JAX programs, however, I am a bit puzzled. I tried the https://docs.jax.dev/en/latest/profiling.html tutorial but ran into several issues. TensorBoard seems to crash for large traces, so I rescoped my analysis to a single sub-function:
I do get a trace for this, but I do not really know what to do with it, or whether it worked in the first place:
Is this the expected output? If so, how should I proceed to identify bottlenecks from there? The trace seems to show mostly JAX-internal functions rather than my own; is it not possible to see mine?
I should also mention that, for debugging/profiling, I run entirely on CPU. Should I switch to a machine with a GPU to profile properly? Sorry if the question is too broad; overall, I would really love to learn the best practices for writing and profiling JAX programs.
If the JAX profiler is not the right tool for this, could someone point me towards the proper tool for this type of analysis?
PS: I have skimmed similar discussions on this topic, but most seemed to run into the same problems I have without reaching a resolution, so I really think some insight on this subject would benefit the community.