@xmfan xmfan commented Nov 11, 2025

Stacked PRs:


Currently, the forward outputs match per microbatch (no batch invariance).

For the backward, however, all grads come back as None (see the `None` entries in the diff below).
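
A hedged sketch of the check behind those `None` lines (`stage_module` is a placeholder name, not an identifier from this PR): after the schedule's backward, walk the stage's parameters and log whichever `.grad` is still unset.

```python
# Illustrative only: log gradient status per parameter after backward.
# `stage_module` is a hypothetical handle to this rank's stage module.
def report_grads(stage_module):
    for name, p in stage_module.named_parameters():
        if p.grad is None:
            print(f"[grad {name}] None")
        else:
            print(f"[grad {name}] norm={p.grad.norm().item()}")
```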

Intended usage:

> torchrun --standalone --nproc-per-node 4 examples/example_ds3_local_map.py --rng-seed 42; torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 42

> diff out/0/diff.log out/1/diff.log
--- out/0/diff.log      2025-11-13 14:38:31.018089358 -0800
+++ out/1/diff.log      2025-11-13 14:39:34.621585827 -0800
@@ -15,62 +15,62 @@
 [mb13 fwd out] hash=18431966637432242176, norm=2024.0
 [mb14 fwd out] hash=18356249868697075712, norm=2016.0
 [mb15 fwd out] hash=9121302173425074176, norm=2024.0
-[grad tok_embeddings.weight] hash=9152616264584134656, norm=782336.0
-[grad layers.0.attention.wq.weight] hash=4509440233137242112, norm=4521984.0
-[grad layers.0.attention.wkv_a.weight] hash=9191424626998116352, norm=6389760.0
-[grad layers.0.attention.kv_norm.weight] hash=9340817470887297024, norm=577536.0
-[grad layers.0.attention.wkv_b.weight] hash=63648529108697088, norm=6684672.0
-[grad layers.0.attention.wo.weight] hash=18415711457527201792, norm=5275648.0
-[grad layers.0.attention_norm.weight] hash=9332549143446421504, norm=1040384.0
-[grad layers.0.ffn_norm.weight] hash=29554872554618880, norm=172032.0
-[grad layers.0.moe.experts.w1] hash=4516723398159630336, norm=405504.0
-[grad layers.0.moe.experts.w2] hash=4651444358887768064, norm=798720.0
-[grad layers.0.moe.experts.w3] hash=9277591154243665920, norm=456704.0
-[grad layers.0.moe.router.gate.weight] hash=13836676536398249984, norm=444416.0
-[grad layers.0.moe.shared_experts.w1.weight] hash=4634309569680506880, norm=1966080.0
-[grad layers.0.moe.shared_experts.w2.weight] hash=18442557133430980608, norm=2555904.0
-[grad layers.0.moe.shared_experts.w3.weight] hash=18363075636882309120, norm=2621440.0
-[grad layers.1.attention.wq.weight] hash=18434148068501749760, norm=2088960.0
-[grad layers.1.attention.wkv_a.weight] hash=13734325197991837696, norm=3719168.0
-[grad layers.1.attention.kv_norm.weight] hash=9253525043734904832, norm=194560.0
-[grad layers.1.attention.wkv_b.weight] hash=9254369468665036800, norm=4161536.0
-[grad layers.1.attention.wo.weight] hash=13764267098639433728, norm=3375104.0
-[grad layers.1.attention_norm.weight] hash=96229257662955520, norm=438272.0
-[grad layers.1.ffn_norm.weight] hash=26036435345735680, norm=91648.0
-[grad layers.1.moe.experts.w1] hash=13808106826262118400, norm=143360.0
-[grad layers.1.moe.experts.w2] hash=18375073507764600832, norm=231424.0
-[grad layers.1.moe.experts.w3] hash=4683075109395628032, norm=136192.0
-[grad layers.1.moe.router.gate.weight] hash=71248353479884800, norm=140288.0
-[grad layers.1.moe.shared_experts.w1.weight] hash=9272770895267495936, norm=917504.0
-[grad layers.1.moe.shared_experts.w2.weight] hash=9259084174524940288, norm=1581056.0
-[grad layers.1.moe.shared_experts.w3.weight] hash=13739356563200540672, norm=1441792.0
-[grad layers.2.attention.wq.weight] hash=4623754258053857280, norm=1073152.0
-[grad layers.2.attention.wkv_a.weight] hash=4683638059349049344, norm=2736128.0
-[grad layers.2.attention.kv_norm.weight] hash=9277239310522777600, norm=144384.0
-[grad layers.2.attention.wkv_b.weight] hash=4701089507905110016, norm=3112960.0
-[grad layers.2.attention.wo.weight] hash=9260878577501470720, norm=2490368.0
-[grad layers.2.attention_norm.weight] hash=70931694131085312, norm=226304.0
-[grad layers.2.ffn_norm.weight] hash=9248775153502912512, norm=67584.0
-[grad layers.2.moe.experts.w1] hash=38597256181448704, norm=65536.0
-[grad layers.2.moe.experts.w2] hash=4636702106982547456, norm=80384.0
-[grad layers.2.moe.experts.w3] hash=18401039574366158848, norm=98816.0
-[grad layers.2.moe.router.gate.weight] hash=53761720551735296, norm=39680.0
-[grad layers.2.moe.shared_experts.w1.weight] hash=9279455925964374016, norm=573440.0
-[grad layers.2.moe.shared_experts.w2.weight] hash=18333204104978890752, norm=901120.0
-[grad layers.2.moe.shared_experts.w3.weight] hash=13692420610834038784, norm=1236992.0
-[grad layers.3.attention.wq.weight] hash=13692490979578216448, norm=782336.0
-[grad layers.3.attention.wkv_a.weight] hash=4507716198904889344, norm=2621440.0
-[grad layers.3.attention.kv_norm.weight] hash=9268267295640125440, norm=126464.0
-[grad layers.3.attention.wkv_b.weight] hash=4751473528736317440, norm=2752512.0
-[grad layers.3.attention.wo.weight] hash=9285683559824097280, norm=2441216.0
-[grad layers.3.attention_norm.weight] hash=9234771773411557376, norm=182272.0
-[grad layers.3.ffn_norm.weight] hash=6896136929411072, norm=36096.0
-[grad layers.3.moe.experts.w1] hash=18395339706087768064, norm=35072.0
-[grad layers.3.moe.experts.w2] hash=13882029192020754432, norm=26240.0
-[grad layers.3.moe.experts.w3] hash=4743768151248863232, norm=48896.0
-[grad layers.3.moe.router.gate.weight] hash=18414057792039026688, norm=27392.0
-[grad layers.3.moe.shared_experts.w1.weight] hash=4598562247638253568, norm=471040.0
-[grad layers.3.moe.shared_experts.w2.weight] hash=18396852634087587840, norm=643072.0
-[grad layers.3.moe.shared_experts.w3.weight] hash=9344863673677512704, norm=802816.0
-[grad norm.weight] hash=9309010798518992896, norm=319488.0
-[grad output.weight] hash=0, norm=8650752.0
+[grad tok_embeddings.weight] None
+[grad layers.0.attention.wq.weight] None
+[grad layers.0.attention.wkv_a.weight] None
+[grad layers.0.attention.kv_norm.weight] None
+[grad layers.0.attention.wkv_b.weight] None
+[grad layers.0.attention.wo.weight] None
+[grad layers.0.attention_norm.weight] None
+[grad layers.0.ffn_norm.weight] None
+[grad layers.0.moe.experts.w1] None
+[grad layers.0.moe.experts.w2] None
+[grad layers.0.moe.experts.w3] None
+[grad layers.0.moe.router.gate.weight] None
+[grad layers.0.moe.shared_experts.w1.weight] None
+[grad layers.0.moe.shared_experts.w2.weight] None
+[grad layers.0.moe.shared_experts.w3.weight] None
+[grad layers.1.attention.wq.weight] None
+[grad layers.1.attention.wkv_a.weight] None
+[grad layers.1.attention.kv_norm.weight] None
+[grad layers.1.attention.wkv_b.weight] None
+[grad layers.1.attention.wo.weight] None
+[grad layers.1.attention_norm.weight] None
+[grad layers.1.ffn_norm.weight] None
+[grad layers.1.moe.experts.w1] None
+[grad layers.1.moe.experts.w2] None
+[grad layers.1.moe.experts.w3] None
+[grad layers.1.moe.router.gate.weight] None
+[grad layers.1.moe.shared_experts.w1.weight] None
+[grad layers.1.moe.shared_experts.w2.weight] None
+[grad layers.1.moe.shared_experts.w3.weight] None
+[grad layers.2.attention.wq.weight] None
+[grad layers.2.attention.wkv_a.weight] None
+[grad layers.2.attention.kv_norm.weight] None
+[grad layers.2.attention.wkv_b.weight] None
+[grad layers.2.attention.wo.weight] None
+[grad layers.2.attention_norm.weight] None
+[grad layers.2.ffn_norm.weight] None
+[grad layers.2.moe.experts.w1] None
+[grad layers.2.moe.experts.w2] None
+[grad layers.2.moe.experts.w3] None
+[grad layers.2.moe.router.gate.weight] None
+[grad layers.2.moe.shared_experts.w1.weight] None
+[grad layers.2.moe.shared_experts.w2.weight] None
+[grad layers.2.moe.shared_experts.w3.weight] None
+[grad layers.3.attention.wq.weight] None
+[grad layers.3.attention.wkv_a.weight] None
+[grad layers.3.attention.kv_norm.weight] None
+[grad layers.3.attention.wkv_b.weight] None
+[grad layers.3.attention.wo.weight] None
+[grad layers.3.attention_norm.weight] None
+[grad layers.3.ffn_norm.weight] None
+[grad layers.3.moe.experts.w1] None
+[grad layers.3.moe.experts.w2] None
+[grad layers.3.moe.experts.w3] None
+[grad layers.3.moe.router.gate.weight] None
+[grad layers.3.moe.shared_experts.w1.weight] None
+[grad layers.3.moe.shared_experts.w2.weight] None
+[grad layers.3.moe.shared_experts.w3.weight] None
+[grad norm.weight] None
+[grad output.weight] None

Currently, the forward inputs are the same, but the forward runs with a different RNG state between the two setups, so there are some numerical differences.
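
For context, a minimal sketch of how the per-microbatch fingerprints above could be produced; this is an illustration under assumptions (the helper name, the SHA-256 byte hash, and the commented logging loop are mine, not the code in this PR):

```python
import hashlib

import torch


def fingerprint(t: torch.Tensor) -> str:
    # Hash the raw bytes so any elementwise difference flips the hash; keep the
    # norm as a coarse, human-readable summary (matching the log format above).
    data = t.detach().to(torch.float32).contiguous().cpu()
    h = int.from_bytes(hashlib.sha256(data.numpy().tobytes()).digest()[:8], "little")
    return f"hash={h}, norm={data.norm().item()}"


# Hypothetical per-microbatch logging, written to out/<run>/diff.log so the
# two runs can be compared with a plain `diff` as shown above:
# for i, mb in enumerate(microbatches):
#     out = model(mb)
#     log(f"[mb{i} fwd out] {fingerprint(out)}")
```

Diffing the two log files then surfaces exactly what the output above shows: matching forward lines, differing gradient lines.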

xmfan added a commit that referenced this pull request Nov 11, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@meta-cla meta-cla bot added the "CLA Signed" label Nov 11, 2025
@xmfan xmfan changed the title from "Log forward intermediates hashes w/pp vs w/o pp" to "Log forward intermediates/output hashes w/o pp" Nov 11, 2025
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 00:04
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the title from "Log forward intermediates/output hashes w/o pp" to "Log forward intermediates hashes w/pp vs w/o pp" Nov 12, 2025
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 00:05
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 05:02
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 05:02
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 05:09
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 05:09
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 06:50
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 06:50
@xmfan xmfan marked this pull request as ready for review November 12, 2025 07:18
@xmfan xmfan marked this pull request as draft November 13, 2025 20:09
xmfan added a commit that referenced this pull request Nov 13, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 13, 2025 22:55
@xmfan xmfan changed the title from "Log forward intermediates hashes w/pp vs w/o pp" to "Compare microbatch forward outputs and gradients" Nov 13, 2025
@xmfan xmfan marked this pull request as ready for review November 13, 2025 22:57