Commit 8c7a58c
Support get/set the whole row of metaheader+weight+optimizer from backend for checkpoint saving/loading (pytorch#4429)
Summary:
Pull Request resolved: pytorch#4429
X-link: facebookresearch/FBGEMM#1495
X-link: meta-pytorch/torchrec#3148
# Context
In the current KVZCH checkpoint loading flow, we hold the weight_id, weight, and optimizer tensors for the whole checkpoint loading lifecycle. Only at the end, once all of these tensors have been downloaded, do we explicitly call "apply_state_dict" to write them to the backend chunk by chunk, which ensures that id->weight and id->opt are mapped correctly. The problem is that with a large number of weights we run short of memory, because all 3 tensors must be held at once (the double memory issue). To solve this, we save the whole row of (metaheader + weight + opt) as a single "weight" tensor during checkpoint saving. When downloading the checkpoint, we can then extract the id from the header and write the weight+opt part directly to the backend by id. For the optimizer checkpoint load, we added a no-op KVTensor, so optimizer states are not written to the backend a second time.
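The whole-row idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 8-byte id-only metaheader, the float32 payload, and the names `pack_row`, `unpack_row`, `ToyBackend`, and `load_chunk` are hypothetical and do not reflect the actual FBGEMM row layout or API.

```python
import struct
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout, for illustration only: the metaheader is a single
# 8-byte little-endian id, followed by float32 weight values and then
# float32 optimizer values.
HEADER_FMT = "<q"
HEADER_BYTES = struct.calcsize(HEADER_FMT)


def pack_row(weight_id: int, weight: List[float], opt: List[float]) -> bytes:
    """Serialize one whole row: metaheader + weight + optimizer state."""
    payload = weight + opt
    return struct.pack(HEADER_FMT, weight_id) + struct.pack(f"<{len(payload)}f", *payload)


def unpack_row(row: bytes) -> Tuple[int, bytes]:
    """Read the id from the header and return it with the weight+opt bytes."""
    (weight_id,) = struct.unpack_from(HEADER_FMT, row, 0)
    return weight_id, row[HEADER_BYTES:]


class ToyBackend:
    """Stand-in for the KV backend: maps id -> weight+opt bytes."""

    def __init__(self) -> None:
        self.rows: Dict[int, bytes] = {}

    def set_row(self, weight_id: int, payload: bytes) -> None:
        self.rows[weight_id] = payload


def load_chunk(backend: ToyBackend, rows: Iterable[bytes]) -> int:
    """Write each downloaded row straight to the backend, keyed by its header id.

    No separate id, weight, or optimizer tensors are retained after the write.
    """
    count = 0
    for row in rows:
        weight_id, payload = unpack_row(row)
        backend.set_row(weight_id, payload)
        count += 1
    return count
```

Because each row carries its own id, a chunk can be written as soon as it arrives, rather than being kept until a final "apply_state_dict" step.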
# This diff
* Added a `backend_return_whole_row` flag to the KVZCH params, with validation that it is only True when opt_offloading is used
* Added a `read_only_` flag to KVTensorWrapper for use in checkpoint calls. When `read_only_` is True, all write operations to this KVT are no-ops
* Added metadata recalculation for the optimizer state dict: because we now return a read-only KVT for the opt state dict, model store needs to correct the global metadata before creating the save plan for KVZCH opt tensors
* Updated the dram backend and mem pool so they can return the metaheader + weight + optimizer_state together, and set them back to the backend (using pointers to skip the metaheader part when writing weight+opt to the backend)
* By default, opt offloading and return whole row are both False on trunk, so this should not break existing KVZCH runs
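The flag validation and the read-only behavior described in the list above can be sketched as follows. This is a simplified Python model under stated assumptions, not the actual implementation: `ToyKVZCHParams` and `ToyKVTensor` are hypothetical classes, and the field name `opt_offloading` simply follows the wording of this summary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyKVZCHParams:
    # Hypothetical stand-in for the KVZCH params; both flags default to
    # False, matching the stated trunk default.
    opt_offloading: bool = False
    backend_return_whole_row: bool = False

    def __post_init__(self) -> None:
        # Whole-row return is only valid together with optimizer offloading.
        if self.backend_return_whole_row and not self.opt_offloading:
            raise ValueError("backend_return_whole_row requires opt_offloading")


class ToyKVTensor:
    """Hypothetical stand-in for a KV tensor wrapper with a read-only mode."""

    def __init__(self, store: Dict[int, List[float]], read_only: bool = False) -> None:
        self._store = store
        self._read_only = read_only

    def set_row(self, row_id: int, values: List[float]) -> None:
        if self._read_only:
            return  # writes are silently ignored in read-only mode
        self._store[row_id] = list(values)

    def get_row(self, row_id: int) -> List[float]:
        return list(self._store[row_id])
```

In this model, an optimizer state dict backed by a read-only tensor can go through the normal load path without overwriting rows that the weight load has already written.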
Differential Revision: D776041581
Parent commit: f6100fc
File tree
13 files changed, +866 -54 lines changed
- fbgemm_gpu
  - fbgemm_gpu
    - tbe/ssd
  - src
    - dram_kv_embedding_cache
    - ssd_split_embeddings_cache
  - test/tbe/ssd