
Commit f1e7737

yupadhyay authored and facebook-github-bot committed
FAQ file for TorchRec (#3222)
Summary: docs: Add FAQ for TorchRec

This commit introduces a new FAQ.md file to address common questions regarding TorchRec for large model and embedding training. The FAQ covers:

- General concepts and use cases for TorchRec and FSDP.
- Sharding strategies and distributed training in TorchRec.
- Memory management and performance optimization for large embedding tables.
- Integration with existing systems.
- Common technical challenges encountered by users.
- Best practices for model design and evaluation.

The goal is to provide a comprehensive resource for users facing challenges with large-scale recommendation systems and distributed training, improving clarity and reducing common pain points.

Differential Revision: D78769752
1 parent e3d5e36 commit f1e7737

File tree

1 file changed (+122, -0 lines)


docs/FAQ.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@

# TorchRec FAQ

Frequently asked questions about TorchRec

## Table of Contents

- [General Concepts](#general-concepts)
- [Sharding and Distributed Training](#sharding-and-distributed-training)
- [Memory Management and Performance](#memory-management-and-performance)
- [Integrating with Existing Systems](#integrating-with-existing-systems)
- [Technical Challenges](#technical-challenges)
- [Model Design and Evaluation](#model-design-and-evaluation)

## General Concepts

### What are TorchRec and FSDP, and when should they be used?

**TorchRec** is a PyTorch domain library with primitives for large-scale distributed embeddings, usually used in recommendation systems. Use it when dealing with models containing massive embedding tables that exceed single-GPU memory.

**FSDP (Fully Sharded Data Parallel)** is a PyTorch distributed training technique that shards dense model parameters, gradients, and optimizer states across GPUs, reducing memory footprint for large models. Use it for training large language models or other general deep learning architectures that require scaling across multiple GPUs.
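
As a rough illustration of the split, here is a minimal sketch, assuming an already initialized process group (e.g. launched via torchrun) and a GPU; the table and feature names are invented. TorchRec shards an `EmbeddingBagCollection` with `DistributedModelParallel`, while FSDP shards the parameters of an ordinary dense module.

```python
import torch
import torchrec
from torchrec.distributed.model_parallel import DistributedModelParallel
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Hypothetical embedding table, created on the meta device so parameters are
# only materialized once sharding decisions have been made.
ebc = torchrec.EmbeddingBagCollection(
    device=torch.device("meta"),
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4_096_000,
            feature_names=["product"],
        )
    ],
)

# TorchRec: shard the embedding tables themselves across GPUs.
sharded_ebc = DistributedModelParallel(ebc, device=torch.device("cuda"))

# FSDP: shard the parameters of a dense model across GPUs instead.
dense = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1)
).cuda()
dense = FSDP(dense)
```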

### Can TorchRec do everything FSDP can do for sparse embeddings, and vice versa?

- **TorchRec** offers specialized sharding strategies and optimized kernels designed for sparse embeddings, making it more efficient for this specific task.
- **FSDP** can work with models containing sparse embeddings, but it may not be as optimized or feature-rich as TorchRec for them. For recommendation systems, TorchRec's methods are often more memory efficient because they are designed around the characteristics of sparse data.
- For optimal results in recommendation systems with large sparse embeddings, combine TorchRec for the embeddings with FSDP for the dense parts of the model.

### What improvements does FSDP2 offer?

FSDP2 builds on FSDP1 with:

- DTensor-based sharding
- Per-parameter sharding for greater flexibility (e.g., partial freezing)
- Enhanced memory management
- Faster checkpointing
- Support for mixed precision and FP8
- Better composability with other parallelisms
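
A minimal usage sketch of the FSDP2 per-parameter API. This is an assumption-laden illustration: the entry point is `fully_shard`, but its import path has moved between PyTorch releases (older versions expose it under `torch.distributed._composable.fsdp`), and it requires an initialized process group with a CUDA backend.

```python
import torch
import torch.nn as nn
# FSDP2 entry point in recent PyTorch releases; older releases expose it as
# torch.distributed._composable.fsdp.fully_shard instead.
from torch.distributed.fsdp import fully_shard

# A plain dense model; assumes torch.distributed is already initialized.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1)).cuda()

# Shard each parameter-holding submodule, then the root module. Parameters
# become DTensors sharded across ranks.
for layer in model:
    if isinstance(layer, nn.Linear):
        fully_shard(layer)
fully_shard(model)
```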

### Does TorchRec support DTensor?

Yes, TorchRec models can benefit from DTensor support in PyTorch distributed components, like FSDP2. This improves distributed training performance, efficiency, and interoperability between TorchRec and other DTensor-based components.

## Sharding and Distributed Training

### How do you choose the best sharding strategy for embedding tables?

TorchRec offers multiple sharding strategies:

- Table-Wise (TW)
- Row-Wise (RW)
- Column-Wise (CW)
- Table-Wise-Row-Wise (TWRW)
- Grid-Shard (GS)
- Data Parallel (DP)

Consider factors like embedding table size, memory constraints, communication patterns, and load balancing when selecting a strategy.

The TorchRec Planner can automatically find an optimal sharding plan based on your hardware and settings.

### How does the TorchRec planner work, and can it be customized?

The Planner aims to balance memory and computation across devices. You can influence it with ParameterConstraints, providing information such as pooling factors. TorchRec also features AutoShard, automated sharding based on cost modeling and deep reinforcement learning.
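
A minimal sketch of steering the planner with `ParameterConstraints`, assuming an 8-GPU topology and a hypothetical table named `product_table`; the constraint values are illustrative rather than recommendations.

```python
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

# Restrict the candidate sharding types for one table and tell the planner the
# average number of ids per sample (the pooling factor) so its cost model is
# closer to reality.
constraints = {
    "product_table": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value, ShardingType.TABLE_WISE.value],
        pooling_factors=[3.0],
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=8, compute_device="cuda"),
    constraints=constraints,
)
# The plan is produced by planner.plan(module=..., sharders=...) and can then
# be passed to DistributedModelParallel via its `plan` argument.
```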

## Memory Management and Performance

### How do you manage the memory footprint of large embedding tables?

- Choose an optimal sharding strategy
- If GPU memory is not sufficient, TorchRec provides options to offload embeddings to CPU (UVM) and SSD memory (see the sketch after this list)
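
One way to request offloading is through the planner's per-table constraints, sketched here under the same assumptions as the planner example above (hypothetical table name; the exact set of available compute kernels depends on your TorchRec/FBGEMM build).

```python
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.planner.types import ParameterConstraints

# Ask the planner to place this table in unified virtual memory (host memory)
# with a GPU-side cache for hot rows, instead of keeping it fully in HBM.
# Passed to EmbeddingShardingPlanner(constraints=...) as in the planner sketch.
constraints = {
    "product_table": ParameterConstraints(
        compute_kernels=[EmbeddingComputeKernel.FUSED_UVM_CACHING.value],
    ),
}
```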

## Integrating with Existing Systems

### Can TorchRec modules be easily converted to TorchScript for deployment and inference in C++ environments?

Yes, TorchRec modules can be traced and scripted for TorchScript inference in C++ environments. However, it's recommended to script only the non-embedding layers for better performance and to handle potential limitations with sharded embedding modules in TorchScript.
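
A minimal illustration of the "script only the dense layers" advice, in plain PyTorch; the module and file names here are made up, and the embedding lookups are assumed to be served by a separate (e.g. quantized/sharded) inference path.

```python
import torch

class DenseOverArch(torch.nn.Module):
    """Hypothetical dense part of a recommendation model: consumes already
    pooled embeddings and produces a score."""

    def __init__(self, in_features: int = 128) -> None:
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_features, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 1),
        )

    def forward(self, pooled_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(pooled_embeddings)

scripted = torch.jit.script(DenseOverArch())
scripted.save("over_arch.pt")  # loadable from C++ via torch::jit::load
```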

## Technical Challenges

### Why are you getting row-wise alltoall errors when combining different pooling types?

This can occur due to incompatible sharding and pooling types, resulting in communication mismatches during data aggregation. Ensure your sharding and pooling choices align with the communication patterns required.

### How do you handle floating point exceptions when using quantized embeddings with float32 data types?

- Implement gradient clipping (see the sketch after this list)
- Monitor gradients and weights for numerical issues
- Consider using different scaling strategies such as AMP (automatic mixed precision)
- Accumulate gradients over mini-batches
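
These mitigations are generic PyTorch techniques rather than TorchRec-specific APIs. A small self-contained sketch of gradient clipping combined with AMP loss scaling, using a tiny dense model as a stand-in for the real one:

```python
import torch

model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

features = torch.randn(32, 16, device="cuda")
labels = torch.rand(32, 1, device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda"):
    loss = torch.nn.functional.binary_cross_entropy_with_logits(model(features), labels)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # unscale first so clipping sees true gradient magnitudes
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
```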

### What are best practices for handling scenarios with empty batches for EmbeddingCollection?

Handle empty batches by filtering them out, skipping lookups, using default values, or padding and masking them accordingly.
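
As one concrete reference point, an unsharded `EmbeddingCollection` accepts a `KeyedJaggedTensor` whose lengths are all zero; a minimal sketch on CPU (table and feature names are invented, and behavior in a sharded setup should be verified for your pipeline):

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingConfig
from torchrec.modules.embedding_modules import EmbeddingCollection
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

ec = EmbeddingCollection(
    tables=[
        EmbeddingConfig(
            name="product_table",
            embedding_dim=16,
            num_embeddings=128,
            feature_names=["product"],
        )
    ],
    device=torch.device("cpu"),
)

batch_size = 4
# A batch where no sample has any ids: all lengths are zero and values is empty.
empty_kjt = KeyedJaggedTensor(
    keys=["product"],
    values=torch.tensor([], dtype=torch.int64),
    lengths=torch.zeros(batch_size, dtype=torch.int64),
)

out = ec(empty_kjt)  # dict of JaggedTensor; "product" contains zero embedding rows
```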

### What are common causes of issues during the forward() graph and optimizer step()?

- Incorrect input data format, type, or device
- Invalid embedding lookups (out-of-range indices, mismatched feature names); a small validation sketch follows this list
- Issues in the computational graph preventing gradient flow
- Incorrect optimizer setup, learning rate, or fusion settings
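
A small, hypothetical pre-forward validation helper (not a TorchRec API) that catches the first two causes above, wrong id dtype and out-of-range indices, before they surface as opaque device-side errors:

```python
from typing import Dict

import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

def validate_batch(kjt: KeyedJaggedTensor, table_sizes: Dict[str, int]) -> None:
    """Cheap sanity checks; table_sizes maps feature name -> num_embeddings."""
    for feature, jt in kjt.to_dict().items():
        values = jt.values()
        if values.numel() == 0:
            continue  # an empty feature in this batch is fine
        assert values.dtype in (torch.int32, torch.int64), f"{feature}: ids must be integer"
        assert int(values.min()) >= 0, f"{feature}: negative id"
        assert int(values.max()) < table_sizes[feature], f"{feature}: id out of range"
```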

### What is the role of fused optimizers in TorchRec?

TorchRec uses fused optimizers, often with DistributedModelParallel, where the optimizer update is integrated into the backward pass. This prevents the materialization of embedding gradients, leading to significant memory savings. You can also opt for a dense optimizer for more control.
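
A sketch of registering an in-backward (fused) optimizer on the embedding parameters before wrapping the module in DistributedModelParallel; `ebc` stands for a previously constructed EmbeddingBagCollection, and the exact import path of `apply_optimizer_in_backward` has moved between PyTorch/TorchRec releases.

```python
import torch
# Import path varies by release; some TorchRec examples import it from
# torchrec.optim.apply_optimizer_in_backward instead.
from torch.distributed.optim import _apply_optimizer_in_backward as apply_optimizer_in_backward

# Registering before sharding lets the update run as part of backward, so the
# embedding gradients are never materialized as full tensors.
apply_optimizer_in_backward(
    torch.optim.SGD,      # optimizer class applied per embedding parameter
    ebc.parameters(),     # hypothetical EmbeddingBagCollection from earlier
    {"lr": 0.02},         # optimizer kwargs
)
# model = DistributedModelParallel(ebc, device=torch.device("cuda"))
```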

## Model Design and Evaluation

### What are best practices for designing recommendation models with TorchRec?

- Carefully select and preprocess features
- Choose suitable model architectures for your recommendation task
- Leverage TorchRec components like EmbeddingBagCollection and optimized kernels (a toy example follows this list)
- Design the model with distributed training in mind, considering sharding and communication patterns
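
A toy sparse-plus-dense pattern along these lines; all names and sizes are illustrative. Pooled embeddings from an `EmbeddingBagCollection` are concatenated with dense features and fed through an MLP.

```python
import torch
import torchrec
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

class SimpleRecModel(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Sparse arch: one hypothetical categorical feature.
        self.ebc = torchrec.EmbeddingBagCollection(
            tables=[
                torchrec.EmbeddingBagConfig(
                    name="product_table",
                    embedding_dim=16,
                    num_embeddings=128,
                    feature_names=["product"],
                )
            ],
            device=torch.device("cpu"),
        )
        # Dense arch: 4 dense features concatenated with the 16-dim pooled embedding.
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(16 + 4, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
        )

    def forward(self, dense: torch.Tensor, sparse: KeyedJaggedTensor) -> torch.Tensor:
        pooled = self.ebc(sparse).values()  # KeyedTensor -> (batch, sum of embedding dims)
        return self.mlp(torch.cat([dense, pooled], dim=1))
```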

### What are the most effective methods for evaluating recommendation systems built with TorchRec?

**Offline Evaluation**:

- Use metrics like AUC, Recall@K, Precision@K, and NDCG@K (see the Recall@K sketch below)
- Employ train-test splits, cross-validation, and negative sampling
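
For reference, a small self-contained Recall@K computation in plain PyTorch (not a TorchRec API); the tensors at the bottom are made-up examples.

```python
import torch

def recall_at_k(scores: torch.Tensor, relevant: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Mean Recall@K over a batch of users.

    scores:   (num_users, num_items) predicted scores.
    relevant: (num_users, num_items) binary ground-truth relevance.
    """
    topk = scores.topk(k, dim=1).indices        # (num_users, k) recommended items
    hits = relevant.gather(1, topk).sum(dim=1)  # relevant items that were retrieved
    total = relevant.sum(dim=1).clamp(min=1)    # avoid division by zero
    return (hits / total).mean()

scores = torch.tensor([[0.9, 0.1, 0.8, 0.3, 0.2],
                       [0.2, 0.7, 0.1, 0.6, 0.9]])
relevant = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0, 0.0, 1.0]])
print(recall_at_k(scores, relevant, k=2))  # tensor(1.); both users' relevant items are in the top-2
```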

**Online Evaluation**:

- Conduct A/B tests in production
- Measure metrics like click-through rate, conversion rate, and user engagement
- Gather user feedback
