Device agnostic for DCP #19

Chao1Han · 2025-07-14T07:59:41Z

Enable a device-agnostic implementation of the DCP-related functionality so that the new DCP features are also supported on XPU.
Rename `use_cuda_non_blocking_copy` to `use_non_blocking_copy`, because non-blocking copy is supported by most GPUs and is not exclusive to CUDA devices.

Test plan: test cases have not yet been updated to be fully device agnostic; this will be addressed in future work.

zhangxiaoli73 · 2025-07-15T06:32:31Z

torch/distributed/checkpoint/_experimental/staging.py

                # Note: stream needs to be initialized on the main thread after default cuda
                # stream is setup/used to avoid the risk of accidentally reusing the main
                # compute stream or in other cases kernels actually launching from the
                # main thread.
-                self._staging_stream = torch.cuda.Stream()
+                self._staging_stream = torch.Stream()


What happens when we synchronize with a stream like this?

When `non_blocking=True` is set, this stream is used to ensure the copy has finished. It isn’t used anywhere else.

zhangxiaoli73 · 2025-07-15T06:33:27Z

torch/distributed/checkpoint/staging.py

-            assert torch.cuda.is_available(), "Non-blocking copy requires CUDA"
+        if self._config.use_non_blocking_copy:
+            assert torch.accelerator.is_available(), (
+                "Non-blocking copy requires CUDA/XPU"


What constraint makes this a requirement for non-blocking copy? Does the device have to be an accelerator?

Non-blocking is only used for copies between CPU and GPU; I assume all accelerators support it.

This reverts commit 94d7f0c. Reverted pytorch#158475 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#158475 (comment)))

I think the main one that was missing is dynamo_wrapped. There are also slow and inductor, but the later filter for workflows stops TD from running on those anyway. dynamo_wrapped is the second-longest job for pull right now. <img width="1265" height="311" alt="image" src="https://github.com/user-attachments/assets/d4ca034c-a8f0-4b31-a80f-0f4f21fce32a" /> Pull Request resolved: pytorch#158163 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi

…ytorch#158492) This might help some legacy models that still have inline_inbuilt_nn_modules False for some reason. Pull Request resolved: pytorch#158492 Approved by: https://github.com/StrongerXi

… FakeTensorMode (pytorch#157931) We already test the `_get_offset` functionality with that TORCH_SERIALIZATION_DEBUG flag that is set in CI, so I didn't add more testing specifically for FakeTensor Pull Request resolved: pytorch#157931 Approved by: https://github.com/albanD

Fixes
```
/var/lib/jenkins/workspace/torch/csrc/dynamo/guards.cpp:5320:10: error: compound assignment to object of volatile-qualified type 'volatile char' is deprecated [-Werror,-Wdeprecated-volatile]
```
Pull Request resolved: pytorch#158435 Approved by: https://github.com/janeyx99

It seems that `#include <vector>` is being pulled in indirectly, but it is being used directly, so it is best to explicitly include it. Pull Request resolved: pytorch#158354 Approved by: https://github.com/janeyx99

…ations.wrapper_set_seed (pytorch#158548) Test modules that depend on the original definition of `wrapper_set_seed` will inadvertently be affected if they import from test_torchinductor_opinfo.py. Additionally, using pytest `test_torchinductor_opinfo.py test_other_module.py` when run in the same process may affect the test behaviour of `test_other_module.py` if the tests depend on `wrapper_set_seed`. Pull Request resolved: pytorch#158548 Approved by: https://github.com/janeyx99

Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#158596 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>

Summary: The shims for aten ops are now generated by torchgen, but there are still some old APIs in `aoti_torch/c/shim.h`. This diff moves the old to-be-deprecated APIs for aten ops to a separate header file, `shim_deprecated.h`. The to-be-deprecated APIs are determined by comparing the APIs in `shim.h` with the ops in `fallback_ops.py`. Test Plan: CI Rollback Plan: Differential Revision: D78378373 Pull Request resolved: pytorch#158400 Approved by: https://github.com/jingsh, https://github.com/desertfire

…#158509) As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo.

This PR adds strict typing support to an important file in dynamo, `decorators.py`.

NOTE: Untyped fns are because there is a conflict with `__init__.py` in compiler, so we can't type these at this time.

Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```

| | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta | +661 | +35 | +76.98% | +27 | 0 | +76.92% |

Pull Request resolved: pytorch#158509 Approved by: https://github.com/williamwen42

- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`
- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS

Pull Request resolved: pytorch#158582 Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales

If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [[email protected] ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package            Version
------------------ -----------
attrs              25.3.0
build              1.2.2.post1
certifi            2025.7.14
charset-normalizer 3.4.2
cmake              4.0.3
exceptiongroup     1.3.0
expecttest         0.3.0
filelock           3.18.0
fsspec             2025.5.1
hypothesis         6.135.32
idna               3.10
importlib_metadata 8.7.0
Jinja2             3.1.6
lintrunner         0.12.7
MarkupSafe         2.1.5
mpmath             1.3.0
networkx           3.2.1
ninja              1.11.1.4
opt-einsum         3.3.0
optree             0.16.0
packaging          25.0
pip                25.1
psutil             7.0.0
pyproject_hooks    1.2.0
python-dateutil    2.9.0.post0
pytz               2025.2
PyYAML             6.0.2
requests           2.32.4
setuptools         78.1.1
six                1.17.0
sortedcontainers   2.4.0
sympy              1.14.0
tomli              2.2.1
typing_extensions  4.14.0
tzdata             2025.2
urllib3            2.5.0
uv                 0.7.21
wheel              0.45.1
zipp               3.23.0
(numpy_pandas) [[email protected] ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [[email protected] ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
  Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [[email protected] ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
  Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [[email protected] ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun 5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
Pull Request resolved: pytorch#158584 Approved by: https://github.com/huydhn

Fixes pytorch#158054 Pull Request resolved: pytorch#158312 Approved by: https://github.com/albanD

Fixes pytorch#158374 Pull Request resolved: pytorch#158424 Approved by: https://github.com/Valentine233, https://github.com/drisspg, https://github.com/atalman

Pull Request resolved: pytorch#158495 Approved by: https://github.com/zpcore, https://github.com/XilunWu

…l_options_cuda` (pytorch#158494) Otherwise fails with
```
torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused__to_copy_ones_sort_sum_zeros_2 Required: 264224 Hardware limit: 232448 Reducing block sizes or `num_stages` may help.
```
Pull Request resolved: pytorch#158494 Approved by: https://github.com/drisspg

…ert_frame (pytorch#158379) As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo.

This PR adds strict typing support to a critical tracing point for dynamo, primarily for `comptime.py` but also `cache_size.py` and `convert_frame.py`.

Running
```
mypy torch/_dynamo/comptime.py torch/_dynamo/cache_size.py torch/_dynamo/convert_frame.py --linecount-report /tmp/coverage_log
```

| | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 1837 | 2215 | 82.93% | 45 | 82 | 54.88% |
| This PR | 2230 | 2230 | 100.00% | 82 | 82 | 100.00% |
| Delta | +393 | +15 | +17.07% | +37 | 0 | +45.12% |

Pull Request resolved: pytorch#158379 Approved by: https://github.com/mlazos

…#158469) Include both the error stacktrace and the graphmodule in a new structured trace artifact. Log the shortened version to the console, and also log a hint to look at the tlparse for more. Pull Request resolved: pytorch#158469 Approved by: https://github.com/ezyang

Pull Request resolved: pytorch#158481 Approved by: https://github.com/d4l3k ghstack dependencies: pytorch#158469

…ch#158587) Summary: makeProxyExecutor shouldn't be exposed to ModelRunner Interface. Test Plan: CI Rollback Plan: Differential Revision: D78501011 Pull Request resolved: pytorch#158587 Approved by: https://github.com/yiming0416, https://github.com/henryoier

…ance_tracing=True` (pytorch#158576) Summary:
- Split `create_mapping` to `create_mapping_pre_post_grad_nodes` and `create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post_grad graph nodes. If a node has `node.stack_trace`, we store the stack_trace directly. This is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace` (such as `linear` -> `addmm` nodes), we use the stack trace of the pre_grad graph nodes that it maps to.
- A post grad graph node might not have a stack trace if it corresponds to multiple pre grad graph nodes, e.g. [GroupLinearFusion](https://github.com/pytorch/pytorch/blob/a00442421a14448f95fc28790325f941662d97f2/torch/_inductor/fx_passes/group_batch_fusion.py#L299)

Example:
```
scheduling ExternKernelOut(
  python_kernel_name='extern_kernels.mm',
  name=buf0,
  layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
  inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
    StorageBox(
      ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
    ),
    FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
    origins=OrderedSet([mm_default_1]),
    stack_traces = {,
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
        x = self.fc1(x),
      File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
        return F.linear(input, self.weight, self.bias),
    }
  )],
  constant_args=(),
  kwargs={},
  output_view=None,
  python_kernel_name=extern_kernels.mm,
  cpp_kernel_name=at::mm_out,
  ordered_kwargs_for_cpp_kernel=(),
  op_overload=None,
  arg_properties=[{}, {}],
  allarg_properties={},
  kwarg_properties=None,
  unbacked_bindings={},
  mutation_outputs=[],
  origin_node=mm_default_1,
  origins=OrderedSet([mm_default_1]),
  stack_traces = {,
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
      x = self.fc1(x),
    File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
      return F.linear(input, self.weight, self.bias),
  }
)
```

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```

Rollback Plan:

Differential Revision: D78365534

Pull Request resolved: pytorch#158576 Approved by: https://github.com/angelayi

Fixes several bugs in the original.
- Foremost, fixes a serious bug where we returned incorrect strategies by mixing input_specs that were frozen from select_strategy.strategies[0] with output_specs that varied across select_strategy.strategies[0..N] (e.g. we could create a nonsense strategy like input: Shard(0), output: Replicate for an op like clone).
- Fixes the redistribute costs: they should not actually be 0; they should be the cost of redistributing our single input from another strategy to the current strategy, in our list of output strategies.
- Adds a note, wondering if we should have just literally returned the input strategy instead of creating this new object.
- Currently, using default_strategy is incorrect because it maps the 'self' tensor's strategies directly onto the 'src' tensor without accounting for the fact that copy_ supports broadcasting a smaller-rank tensor into a larger one. Separates out the copy_ op from the default strategy and adds the missing test case, but does not fix the underlying issue with copy_; that is left for a future PR.

Renames to `propagate_single_input_strategy` since that's more descriptive.

Pull Request resolved: pytorch#158490 Approved by: https://github.com/wanchaol, https://github.com/XilunWu ghstack dependencies: pytorch#158495
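The first two fixes above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyStrategy`, `toy_redistribute_cost`, and `propagate_single_input` are hypothetical names, placements are plain strings, and the cost function is a placeholder (0 for identical placements, 1 otherwise) rather than DTensor's actual cost model. It shows each output strategy pairing its output spec with the matching input spec, and carrying one redistribute cost per candidate input placement.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyStrategy:
    # Hypothetical stand-in for one strategy entry; not the DTensor class.
    input_spec: str
    output_spec: str
    redistribute_cost: Tuple[float, ...]


def toy_redistribute_cost(src: str, dst: str) -> float:
    # Placeholder cost: free when placements already match, 1.0 otherwise.
    return 0.0 if src == dst else 1.0


def propagate_single_input(input_placements: List[str]) -> List[ToyStrategy]:
    strategies = []
    for dst in input_placements:
        # Cost of moving the single input from every candidate placement
        # to the placement this strategy expects.
        costs = tuple(toy_redistribute_cost(src, dst) for src in input_placements)
        # Input and output specs come from the same candidate, so they
        # can never be mixed across different candidates.
        strategies.append(ToyStrategy(dst, dst, costs))
    return strategies


if __name__ == "__main__":
    for s in propagate_single_input(["Shard(0)", "Replicate"]):
        print(s)
```

In this toy, no strategy can pair a `Shard(0)` input with a `Replicate` output, and the cost row for each strategy is zero only at its own position.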

This reverts commit 2ecf083. Reverted pytorch#158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](pytorch#158072 (comment)))

) fbgemm_gpu build started failing with asmjit errors. Moving to latest tip of fbgemm for inductor tests resolves the build failures. Pull Request resolved: pytorch#158602 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <[email protected]>

…#158427) This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.

For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
-    // fast case: tensor is live in python
-    std::optional<PyObject*> mb_obj =
-        t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
-    if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
-        return *mb_obj;
-    }
-    return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+  // fast case: tensor is live in python
+  std::optional<PyObject*> mb_obj =
+      t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+          /*ignore_hermetic_tls=*/false);
+  if (mb_obj.has_value() &&
+      !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+    return *mb_obj;
+  }
+  return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: pytorch#158427 Approved by: https://github.com/albanD

…8534) Finally got around to doing this, this flag lets us do:
```Python
#!/usr/bin/env python3
"""
FlexAttention Debug: Using breakpoints and unwrap
"""

import torch
import torch.nn.attention.flex_attention as fa

unwrap = torch._C._functorch.get_unwrapped


def score_mod(score, batch, head, q_idx, kv_idx):
    # Set breakpoint here to debug
    breakpoint()

    # In debugger, unwrap to see actual tensor values:
    # >>> actual_score = unwrap(unwrap(unwrap(unwrap(score))))
    # >>> actual_batch = unwrap(batch)
    # >>> actual_head = unwrap(head)
    # >>> actual_q_idx = unwrap(q_idx)
    # >>> actual_kv_idx = unwrap(kv_idx)
    # >>> print(actual_score)
    # >>> print(f"q_idx: {actual_q_idx}, kv_idx: {actual_kv_idx}")

    return torch.where(q_idx >= kv_idx, score, torch.tensor(float('-inf')))


def main():
    # Enable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = True

    # Small example
    B, H, S, D = 1, 2, 4, 8
    q = torch.randn(B, H, S, D)
    k = torch.randn(B, H, S, D)
    v = torch.randn(B, H, S, D)

    # Run - will hit breakpoint
    output = fa.flex_attention(q, k, v, score_mod=score_mod)

    # Disable debug mode
    fa._FLEX_ATTENTION_DISABLE_COMPILE_DEBUG = False


if __name__ == "__main__":
    main()
```
Pull Request resolved: pytorch#158534 Approved by: https://github.com/Chillee, https://github.com/zou3519

Address the second part of pytorch#158366, where torch.tensor(0) is treated as a constant tensor and its .item() gets specialized to 0, which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant. Pull Request resolved: pytorch#158381 Approved by: https://github.com/zou3519

This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#158806 Approved by: https://github.com/pytorchbot

### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise, it will DDE and we can't take the safe/slower path.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbackeds would be ≥2 and that falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`).

### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0) Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0) Caused by: (_inductor/ir.py:2797 in create)
```
Pull Request resolved: pytorch#155267 Approved by: https://github.com/eellison
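The trade-off described above can be sketched with a toy model that does not use any PyTorch API. All names here (`ToySize`, `guard_equals`, `known_equals`, `choose_path`, `DataDependentError`) are hypothetical, and "value is known" stands in, as a simplifying assumption, for "provable without installing a guard". The point is only that a check which returns `False` when it cannot decide lets the caller fall back to the safe path, whereas a guard-style check raises.

```python
class DataDependentError(RuntimeError):
    """Toy stand-in for a data-dependent error (DDE)."""


class ToySize:
    """Toy symbolic size; `value` is None when nothing is known statically."""

    def __init__(self, value=None):
        self.value = value


def guard_equals(size, target):
    # Guard-style check: it must produce an answer, so it raises
    # when the size is unknown.
    if size.value is None:
        raise DataDependentError(f"cannot decide size == {target}")
    return size.value == target


def known_equals(size, target):
    # "Known true"-style check: False simply means "not provable".
    return size.value is not None and size.value == target


def choose_path(size):
    # Take the optimized path only when the condition is provably true;
    # an undecidable condition falls through to the safe/slower path.
    return "fast" if known_equals(size, 1) else "safe"


if __name__ == "__main__":
    print(choose_path(ToySize(1)), choose_path(ToySize()))
```

With the guard-style check in `choose_path`, the unknown-size case would raise before the safe path could be selected.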

# description
Add a base docker image for vllm. It seems that we use the base docker image for both the pytorch build and tests. Configure a base image for vllm against pytorch CI.

# Others
Added a readme on how the base docker images are used and how to add one; it also explains which file is the right one to modify.

Pull Request resolved: pytorch#158755 Approved by: https://github.com/seemethere, https://github.com/huydhn

pytorch#158852)

Summary:
# Why
capture relevant data for offline lookup table generation

# What
report the hinted sizes not just the symbolic sizes

Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```
This only validates that this change does not break anything, as the schema is not on scuba yet (not actualized)

Rollback Plan:

Reviewed By: stashuk-olek

Differential Revision: D77837548

Pull Request resolved: pytorch#158852 Approved by: https://github.com/jingsh

Changes:
1. Rename the zip index filename, and keep it out of path normalization.
2. Normalize the output path for extracted files.

Files extracted successfully: <img width="683" height="247" alt="image" src="https://github.com/user-attachments/assets/72dff7b9-5ec0-4523-a6ee-7768b37bbe63" />

Pull Request resolved: pytorch#158702 Approved by: https://github.com/angelayi

…ytorch#158913) MSVC cannot implicitly convert a const iterator to a const pointer. Pull Request resolved: pytorch#158913 Approved by: https://github.com/desertfire Co-authored-by: Xu Han <[email protected]>

Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`. In particular, it will not try to start the tests (causing forks into subprocesses) just to stop them (killing the subprocesses), which is done in the test setup. Using `unittest.skip` decorators avoids starting the test in the first place. Pull Request resolved: pytorch#158846 Approved by: https://github.com/Skylion007
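A minimal standard-library sketch of that behaviour follows. The class, the `BACKEND_AVAILABLE` flag, and the `events` list are hypothetical and not taken from the PR. With a `unittest.skipIf` decorator, neither `setUp` nor the test body runs for the skipped test, and the runner records a skip instead of having to handle a `SystemExit`.

```python
import io
import unittest

# Hypothetical flag standing in for "is the required backend present?".
BACKEND_AVAILABLE = False

# Records which setUp calls and test bodies actually executed.
events = []


class ExampleTest(unittest.TestCase):
    def setUp(self):
        # In a real distributed test this is where subprocesses would start.
        events.append("setUp:" + self.id().rsplit(".", 1)[-1])

    @unittest.skipIf(not BACKEND_AVAILABLE, "backend not available")
    def test_needs_backend(self):
        events.append("body:test_needs_backend")

    def test_always_runs(self):
        events.append("body:test_always_runs")


def run_example():
    events.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


if __name__ == "__main__":
    result = run_example()
    print(len(result.skipped), events)
```

Because the skip is decided before `setUp`, no setup work is done for a test that will not run.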

By using [`fillBuffer:range:value:`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/fillbuffer:range:value:?language=objc) rather than an MPSGraph op, which should be faster and also does not have the INT_MAX limit. This in turn fixes the `test_index_put_accumulate_large_tensor_mps` test. Pull Request resolved: pytorch#158874 Approved by: https://github.com/dcci

…nch (pytorch#158847) This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.

The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- Log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.

## Test Plan
After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Warm Start (~1 s): https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.

Pull Request resolved: pytorch#158847 Approved by: https://github.com/zhxchen17

Summary: as title, seems to be a typo Test Plan: CI Rollback Plan: Differential Revision: D78758466 Pull Request resolved: pytorch#158855 Approved by: https://github.com/henryoier

Pull Request resolved: pytorch#158703 Approved by: https://github.com/malfet, https://github.com/desertfire ghstack dependencies: pytorch#158349, pytorch#158350, pytorch#158351

…ytorch#158254) An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key. Pull Request resolved: pytorch#158254 Approved by: https://github.com/eellison
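Why the custom config must feed into the cache key can be sketched as follows. This is an illustration only, not Inductor's actual key computation: `config_items`, `cache_key`, and the option names are hypothetical, and `SimpleNamespace` objects stand in for config modules. The key hashes the source together with both the default options and, when present, the backend's own options.

```python
import hashlib
import json
from types import SimpleNamespace


def config_items(config_module):
    # Collect the public options of a config object in a stable order.
    return {
        name: value
        for name, value in sorted(vars(config_module).items())
        if not name.startswith("_")
    }


def cache_key(source, default_config, custom_config=None):
    # Hash the source plus every option that can influence codegen.
    payload = {"source": source, "default": config_items(default_config)}
    if custom_config is not None:
        payload["custom"] = config_items(custom_config)
    blob = json.dumps(payload, sort_keys=True, default=repr).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


if __name__ == "__main__":
    default_cfg = SimpleNamespace(example_flag=False)
    backend_cfg = SimpleNamespace(example_backend_option=True)
    print(cache_key("kernel source", default_cfg, backend_cfg))
```

If the backend options were left out of the payload, flipping `example_backend_option` would produce the same key and a stale artifact could be reused.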

…156975) Pull Request resolved: pytorch#156975 Approved by: https://github.com/zou3519

) Pull Request resolved: pytorch#156977 Approved by: https://github.com/XuehaiPan, https://github.com/jansel ghstack dependencies: pytorch#156975

Pull Request resolved: pytorch#156976 Approved by: https://github.com/zou3519 ghstack dependencies: pytorch#156975, pytorch#156977

This pull request adds a new CI workflow for Windows Arm64, named win-arm64-build-test.yml. It can be triggered on any pull request by including the ciflow/win-arm64 tag. Pull Request resolved: pytorch#148753 Approved by: https://github.com/malfet

@angelayi

With many PRs landed, we can run the first aot inductor example on Windows. <img width="640" height="427" alt="image" src="https://github.com/user-attachments/assets/131db159-ce17-4857-a3d5-a4b03638f01d" /> Let's remove the Windows check on `AotCodeCompiler`. CC: @angelayi , @desertfire , @jansel Pull Request resolved: pytorch#158915 Approved by: https://github.com/desertfire

…pytorch#158765) tlparse looks like this <img width="1165" height="226" alt="image" src="https://github.com/user-attachments/assets/04c4e6b1-34a3-4d9d-8304-6eb6d9a94980" /> This will aid in reading guards. Pull Request resolved: pytorch#158765 Approved by: https://github.com/Lucaskabela, https://github.com/StrongerXi

This reverts commit 9df0f56. Reverted pytorch#158650 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78805560 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#158650 (comment)))

This reverts commit 5702491. Reverted pytorch#158846 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking trunk. See distributed/_composable/fsdp/test_fully_shard_logging.py::LoggingTests::test_fsdp_logging [GH job link](https://github.com/pytorch/pytorch/actions/runs/16472103496/job/46564570609) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/57024913c409764f129d6a7792625f5b05462e31) ([comment](pytorch#158846 (comment)))

Fix up decomposeK autotuning, by removing condition to return more than `k_splits_limit` and setting default to 10 instead of 5. Allow `k_splits_limit` to be configurable to the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` and also allow user to configure threshold in which to use decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD` Pull Request resolved: pytorch#158745 Approved by: https://github.com/eellison
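As a rough sketch of how such an environment override can be read (the helper `read_int_env` is hypothetical and Inductor's own config parsing may differ), the snippet below falls back to a default when the variable is unset. The default of 10 for the split limit comes from the description above; no default is assumed for the threshold, so the caller has to supply one.

```python
import os


def read_int_env(name, default, env=None):
    # Read an integer setting from the environment, using `default`
    # when the variable is unset or empty.
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


if __name__ == "__main__":
    # Default stated in the description: 10 splits.
    k_splits_limit = read_int_env("TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS", 10)
    print(k_splits_limit)
```

A user would then export `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS` before launching the process to change the limit.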

… torchbench (pytorch#158847)" This reverts commit d898d0d. Reverted pytorch#158847 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI jobs on MI200 and MI300 ([comment](pytorch#158847 (comment)))

) Pull Request resolved: pytorch#158866 Approved by: https://github.com/janeyx99

Co-authored-by: Yu, Guangye <[email protected]>

Chao1Han force-pushed the xpu_stage branch from 8a6ebbb to 2fb74ce on July 15, 2025 05:35

zhangxiaoli73 reviewed Jul 15, 2025


pytorchmergebot and others added 27 commits July 17, 2025 20:58

Revert "Cleanup old caffe2 scripts (pytorch#158475)"

ced5cf0


[dynamo] Skip training flag check id already guarding on nn modules (p…

af66240

…ytorch#158492) This might help some legacy models that still have inline_inbuilt_nn_modules False for some reason. Pull Request resolved: pytorch#158492 Approved by: https://github.com/StrongerXi

Add missing <vector> in c10/util/WaitCounter.h (pytorch#158354)

74f4cf4


[ROCm][CI] Last known good HIP patch (pytorch#158596)

2df2e3b


Fix test linalg for MKL upgrading (pytorch#158312)

6673ac7


Add stride check for attn_mask on non-cpu device (pytorch#158424)

ef38edb


[DTensor] Document redistribute_costs (pytorch#158495)

ddbecdf

Pull Request resolved: pytorch#158495 Approved by: https://github.com/zpcore, https://github.com/XilunWu

Make torch.distributed.breakpoint() set a long timeout (pytorch#158481)

89d842f

Pull Request resolved: pytorch#158481 Approved by: https://github.com/d4l3k ghstack dependencies: pytorch#158469

Revert "Add torch compile force disable caches alias (pytorch#158072)"

9a7c2f1

This reverts commit 2ecf083. Reverted pytorch#158072 on behalf of https://github.com/jeffdaily due to fails on rocm, signal ignored while rocm was unstable ([comment](pytorch#158072 (comment)))

pytorchupdatebot and others added 29 commits July 23, 2025 04:41

[Inductor] MSVC use pointer when generating temporary array pointer (p…

ee72338

…ytorch#158913) MSVC cannot implicitly convert a const iterator to a const pointer. Pull Request resolved: pytorch#158913 Approved by: https://github.com/desertfire Co-authored-by: Xu Han <[email protected]>

[export][ez] Fix packaging (pytorch#158855)

2a60b8f

Summary: as title, seems ytpo Test Plan: CI Rollback Plan: Differential Revision: D78758466 Pull Request resolved: pytorch#158855 Approved by: https://github.com/henryoier

[aoti][mps] Enable more tests (pytorch#158703)

7d296d5

Pull Request resolved: pytorch#158703 Approved by: https://github.com/malfet, https://github.com/desertfire ghstack dependencies: pytorch#158349, pytorch#158350, pytorch#158351

[math] Raise exception in Dynamo if constant fold call fail (pytorch#…

671e22a

…156975) Pull Request resolved: pytorch#156975 Approved by: https://github.com/zou3519

[struct] Add struct.pack and struct.unpack polyfills (pytorch#156977

f5314f8

) Pull Request resolved: pytorch#156977 Approved by: https://github.com/XuehaiPan, https://github.com/jansel ghstack dependencies: pytorch#156975
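
For context, the standard-library behavior that `struct.pack` and `struct.unpack` polyfills would need to reproduce looks like the following. This is plain CPython `struct` usage only. It does not show how the polyfills in this commit are implemented.

```python
import struct

# Pack a little-endian 32-bit int followed by a little-endian 16-bit int.
packed = struct.pack("<ih", 1, 2)

# Unpack with the same format string to recover the original values as a tuple.
values = struct.unpack("<ih", packed)
```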

[math] Trace float.fromhex (pytorch#156976)

576253c

Pull Request resolved: pytorch#156976 Approved by: https://github.com/zou3519 ghstack dependencies: pytorch#156975, pytorch#156977
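
As a reminder of what `float.fromhex` does in eager Python (the behavior that tracing would need to preserve), here is a minimal sketch using only the built-in `float` type. It is not taken from the commit itself.

```python
# "0x1.8p1" is hexadecimal 1.8 (decimal 1.5) scaled by 2**1.
value = float.fromhex("0x1.8p1")

# float.hex() produces a string that float.fromhex() parses back to the same value.
round_trip = float.fromhex(value.hex())
```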

CI for Windows Arm64 (pytorch#148753)

00da8e6

This pull request adds a new CI workflow for Windows Arm64, named win-arm64-build-test.yml. It can be triggered on any pull request by including the ciflow/win-arm64 tag. Pull Request resolved: pytorch#148753 Approved by: https://github.com/malfet

Add zero_() and empty_like(t) to torch/csrc/stable/ops.h (pytorch#158866

fef236d

) Pull Request resolved: pytorch#158866 Approved by: https://github.com/janeyx99

Device agnostic for DCP

5fe1f5f

Commit suggestion

cecca5e

Update test/distributed/checkpoint/_experimental/test_staging.py

b804495

Co-authored-by: Yu, Guangye <[email protected]>

Update test/distributed/checkpoint/_experimental/test_staging.py

12e06c2

Co-authored-by: Yu, Guangye <[email protected]>

Update test/distributed/checkpoint/_experimental/test_builder.py

adb5261

Co-authored-by: Yu, Guangye <[email protected]>

acc comment

615fb77

pytorchmergebot force-pushed the xpu_stage branch from 5298813 to 615fb77 Compare July 24, 2025 02:51
