@xmfan xmfan commented Nov 11, 2025

Stacked PRs:


Currently, the forward outputs match per microbatch (no batch invariance).

For the backward, however, all grads come back as None (see the `None` entries in the diff below).
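
A hedged sketch of the check behind those `None` lines (`stage_module` is a placeholder name, not an identifier from this PR): after the schedule's backward, walk the stage's parameters and log whichever `.grad` is still unset.

```python
# Illustrative only: log gradient status per parameter after backward.
# `stage_module` is a hypothetical handle to this rank's stage module.
def report_grads(stage_module):
    for name, p in stage_module.named_parameters():
        if p.grad is None:
            print(f"[grad {name}] None")
        else:
            print(f"[grad {name}] norm={p.grad.norm().item()}")
```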

Intended usage:

> torchrun --standalone --nproc-per-node 4 examples/example_ds3_local_map.py --rng-seed 42; torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 42

> diff out/0/diff.log out/1/diff.log
--- out/0/diff.log      2025-11-13 14:38:31.018089358 -0800
+++ out/1/diff.log      2025-11-13 14:39:34.621585827 -0800
@@ -15,62 +15,62 @@
 [mb13 fwd out] hash=18431966637432242176, norm=2024.0
 [mb14 fwd out] hash=18356249868697075712, norm=2016.0
 [mb15 fwd out] hash=9121302173425074176, norm=2024.0
-[grad tok_embeddings.weight] hash=9152616264584134656, norm=782336.0
-[grad layers.0.attention.wq.weight] hash=4509440233137242112, norm=4521984.0
-[grad layers.0.attention.wkv_a.weight] hash=9191424626998116352, norm=6389760.0
-[grad layers.0.attention.kv_norm.weight] hash=9340817470887297024, norm=577536.0
-[grad layers.0.attention.wkv_b.weight] hash=63648529108697088, norm=6684672.0
-[grad layers.0.attention.wo.weight] hash=18415711457527201792, norm=5275648.0
-[grad layers.0.attention_norm.weight] hash=9332549143446421504, norm=1040384.0
-[grad layers.0.ffn_norm.weight] hash=29554872554618880, norm=172032.0
-[grad layers.0.moe.experts.w1] hash=4516723398159630336, norm=405504.0
-[grad layers.0.moe.experts.w2] hash=4651444358887768064, norm=798720.0
-[grad layers.0.moe.experts.w3] hash=9277591154243665920, norm=456704.0
-[grad layers.0.moe.router.gate.weight] hash=13836676536398249984, norm=444416.0
-[grad layers.0.moe.shared_experts.w1.weight] hash=4634309569680506880, norm=1966080.0
-[grad layers.0.moe.shared_experts.w2.weight] hash=18442557133430980608, norm=2555904.0
-[grad layers.0.moe.shared_experts.w3.weight] hash=18363075636882309120, norm=2621440.0
-[grad layers.1.attention.wq.weight] hash=18434148068501749760, norm=2088960.0
-[grad layers.1.attention.wkv_a.weight] hash=13734325197991837696, norm=3719168.0
-[grad layers.1.attention.kv_norm.weight] hash=9253525043734904832, norm=194560.0
-[grad layers.1.attention.wkv_b.weight] hash=9254369468665036800, norm=4161536.0
-[grad layers.1.attention.wo.weight] hash=13764267098639433728, norm=3375104.0
-[grad layers.1.attention_norm.weight] hash=96229257662955520, norm=438272.0
-[grad layers.1.ffn_norm.weight] hash=26036435345735680, norm=91648.0
-[grad layers.1.moe.experts.w1] hash=13808106826262118400, norm=143360.0
-[grad layers.1.moe.experts.w2] hash=18375073507764600832, norm=231424.0
-[grad layers.1.moe.experts.w3] hash=4683075109395628032, norm=136192.0
-[grad layers.1.moe.router.gate.weight] hash=71248353479884800, norm=140288.0
-[grad layers.1.moe.shared_experts.w1.weight] hash=9272770895267495936, norm=917504.0
-[grad layers.1.moe.shared_experts.w2.weight] hash=9259084174524940288, norm=1581056.0
-[grad layers.1.moe.shared_experts.w3.weight] hash=13739356563200540672, norm=1441792.0
-[grad layers.2.attention.wq.weight] hash=4623754258053857280, norm=1073152.0
-[grad layers.2.attention.wkv_a.weight] hash=4683638059349049344, norm=2736128.0
-[grad layers.2.attention.kv_norm.weight] hash=9277239310522777600, norm=144384.0
-[grad layers.2.attention.wkv_b.weight] hash=4701089507905110016, norm=3112960.0
-[grad layers.2.attention.wo.weight] hash=9260878577501470720, norm=2490368.0
-[grad layers.2.attention_norm.weight] hash=70931694131085312, norm=226304.0
-[grad layers.2.ffn_norm.weight] hash=9248775153502912512, norm=67584.0
-[grad layers.2.moe.experts.w1] hash=38597256181448704, norm=65536.0
-[grad layers.2.moe.experts.w2] hash=4636702106982547456, norm=80384.0
-[grad layers.2.moe.experts.w3] hash=18401039574366158848, norm=98816.0
-[grad layers.2.moe.router.gate.weight] hash=53761720551735296, norm=39680.0
-[grad layers.2.moe.shared_experts.w1.weight] hash=9279455925964374016, norm=573440.0
-[grad layers.2.moe.shared_experts.w2.weight] hash=18333204104978890752, norm=901120.0
-[grad layers.2.moe.shared_experts.w3.weight] hash=13692420610834038784, norm=1236992.0
-[grad layers.3.attention.wq.weight] hash=13692490979578216448, norm=782336.0
-[grad layers.3.attention.wkv_a.weight] hash=4507716198904889344, norm=2621440.0
-[grad layers.3.attention.kv_norm.weight] hash=9268267295640125440, norm=126464.0
-[grad layers.3.attention.wkv_b.weight] hash=4751473528736317440, norm=2752512.0
-[grad layers.3.attention.wo.weight] hash=9285683559824097280, norm=2441216.0
-[grad layers.3.attention_norm.weight] hash=9234771773411557376, norm=182272.0
-[grad layers.3.ffn_norm.weight] hash=6896136929411072, norm=36096.0
-[grad layers.3.moe.experts.w1] hash=18395339706087768064, norm=35072.0
-[grad layers.3.moe.experts.w2] hash=13882029192020754432, norm=26240.0
-[grad layers.3.moe.experts.w3] hash=4743768151248863232, norm=48896.0
-[grad layers.3.moe.router.gate.weight] hash=18414057792039026688, norm=27392.0
-[grad layers.3.moe.shared_experts.w1.weight] hash=4598562247638253568, norm=471040.0
-[grad layers.3.moe.shared_experts.w2.weight] hash=18396852634087587840, norm=643072.0
-[grad layers.3.moe.shared_experts.w3.weight] hash=9344863673677512704, norm=802816.0
-[grad norm.weight] hash=9309010798518992896, norm=319488.0
-[grad output.weight] hash=0, norm=8650752.0
+[grad tok_embeddings.weight] None
+[grad layers.0.attention.wq.weight] None
+[grad layers.0.attention.wkv_a.weight] None
+[grad layers.0.attention.kv_norm.weight] None
+[grad layers.0.attention.wkv_b.weight] None
+[grad layers.0.attention.wo.weight] None
+[grad layers.0.attention_norm.weight] None
+[grad layers.0.ffn_norm.weight] None
+[grad layers.0.moe.experts.w1] None
+[grad layers.0.moe.experts.w2] None
+[grad layers.0.moe.experts.w3] None
+[grad layers.0.moe.router.gate.weight] None
+[grad layers.0.moe.shared_experts.w1.weight] None
+[grad layers.0.moe.shared_experts.w2.weight] None
+[grad layers.0.moe.shared_experts.w3.weight] None
+[grad layers.1.attention.wq.weight] None
+[grad layers.1.attention.wkv_a.weight] None
+[grad layers.1.attention.kv_norm.weight] None
+[grad layers.1.attention.wkv_b.weight] None
+[grad layers.1.attention.wo.weight] None
+[grad layers.1.attention_norm.weight] None
+[grad layers.1.ffn_norm.weight] None
+[grad layers.1.moe.experts.w1] None
+[grad layers.1.moe.experts.w2] None
+[grad layers.1.moe.experts.w3] None
+[grad layers.1.moe.router.gate.weight] None
+[grad layers.1.moe.shared_experts.w1.weight] None
+[grad layers.1.moe.shared_experts.w2.weight] None
+[grad layers.1.moe.shared_experts.w3.weight] None
+[grad layers.2.attention.wq.weight] None
+[grad layers.2.attention.wkv_a.weight] None
+[grad layers.2.attention.kv_norm.weight] None
+[grad layers.2.attention.wkv_b.weight] None
+[grad layers.2.attention.wo.weight] None
+[grad layers.2.attention_norm.weight] None
+[grad layers.2.ffn_norm.weight] None
+[grad layers.2.moe.experts.w1] None
+[grad layers.2.moe.experts.w2] None
+[grad layers.2.moe.experts.w3] None
+[grad layers.2.moe.router.gate.weight] None
+[grad layers.2.moe.shared_experts.w1.weight] None
+[grad layers.2.moe.shared_experts.w2.weight] None
+[grad layers.2.moe.shared_experts.w3.weight] None
+[grad layers.3.attention.wq.weight] None
+[grad layers.3.attention.wkv_a.weight] None
+[grad layers.3.attention.kv_norm.weight] None
+[grad layers.3.attention.wkv_b.weight] None
+[grad layers.3.attention.wo.weight] None
+[grad layers.3.attention_norm.weight] None
+[grad layers.3.ffn_norm.weight] None
+[grad layers.3.moe.experts.w1] None
+[grad layers.3.moe.experts.w2] None
+[grad layers.3.moe.experts.w3] None
+[grad layers.3.moe.router.gate.weight] None
+[grad layers.3.moe.shared_experts.w1.weight] None
+[grad layers.3.moe.shared_experts.w2.weight] None
+[grad layers.3.moe.shared_experts.w3.weight] None
+[grad norm.weight] None
+[grad output.weight] None

Currently, the forward inputs are the same, but the forward runs with a different RNG state between the two setups, so there are some numerical differences.
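
For context, a minimal sketch of how the per-microbatch fingerprints above could be produced; this is an illustration under assumptions (the helper name, the SHA-256 byte hash, and the commented logging loop are mine, not the code in this PR):

```python
import hashlib

import torch


def fingerprint(t: torch.Tensor) -> str:
    # Hash the raw bytes so any elementwise difference flips the hash; keep the
    # norm as a coarse, human-readable summary (matching the log format above).
    data = t.detach().to(torch.float32).contiguous().cpu()
    h = int.from_bytes(hashlib.sha256(data.numpy().tobytes()).digest()[:8], "little")
    return f"hash={h}, norm={data.norm().item()}"


# Hypothetical per-microbatch logging, written to out/<run>/diff.log so the
# two runs can be compared with a plain `diff` as shown above:
# for i, mb in enumerate(microbatches):
#     out = model(mb)
#     log(f"[mb{i} fwd out] {fingerprint(out)}")
```

Diffing the two log files then surfaces exactly what the output above shows: matching forward lines, differing gradient lines.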

xmfan added a commit that referenced this pull request Nov 11, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@meta-cla meta-cla bot added the "CLA Signed" label Nov 11, 2025
@xmfan xmfan changed the title from "Log forward intermediates hashes w/pp vs w/o pp" to "Log forward intermediates/output hashes w/o pp" Nov 11, 2025
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 00:04
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the title from "Log forward intermediates/output hashes w/o pp" to "Log forward intermediates hashes w/pp vs w/o pp" Nov 12, 2025
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 00:05
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 05:02
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 05:02
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 05:09
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 05:09
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 12, 2025 06:50
xmfan added a commit that referenced this pull request Nov 12, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from main to xmfan/stack/19 November 12, 2025 06:50
@xmfan xmfan marked this pull request as ready for review November 12, 2025 07:18
@xmfan xmfan marked this pull request as draft November 13, 2025 20:09
xmfan added a commit that referenced this pull request Nov 13, 2025
stack-info: PR: #246, branch: xmfan/stack/20
@xmfan xmfan changed the base branch from xmfan/stack/19 to main November 13, 2025 22:55
@xmfan xmfan changed the title from "Log forward intermediates hashes w/pp vs w/o pp" to "Compare microbatch forward outputs and gradients" Nov 13, 2025
@xmfan xmfan marked this pull request as ready for review November 13, 2025 22:57