
Commit 01d66e9

yupadhyay authored and facebook-github-bot committed
FAQ file for TorchRec
Summary:
docs: Add FAQ for TorchRec

This commit introduces a new FAQ.md file to address common questions regarding TorchRec for large model and embedding training. The FAQ covers:
- General concepts and use cases for TorchRec and FSDP.
- Sharding strategies and distributed training in TorchRec.
- Memory management and performance optimization for large embedding tables.
- Integration with existing systems.
- Common technical challenges encountered by users.
- Best practices for model design and evaluation.

The goal is to provide a comprehensive resource for users facing challenges with large-scale recommendation systems and distributed training, improving clarity and reducing common pain points.

Differential Revision: D78769752
1 parent 01f8654 commit 01d66e9

File tree

1 file changed: +136 -0 lines changed


docs/FAQ.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# TorchRec FAQ

Frequently asked questions about TorchRec

## Table of Contents

- [General Concepts](#general-concepts)
- [Sharding and Distributed Training](#sharding-and-distributed-training)
- [Memory Management and Performance](#memory-management-and-performance)
- [Integrating with Existing Systems](#integrating-with-existing-systems)
- [Technical Challenges](#technical-challenges)
- [Model Design and Evaluation](#model-design-and-evaluation)

## General Concepts

### What are TorchRec and FSDP, and when should they be used?

**TorchRec** is a PyTorch domain library with primitives for large-scale distributed embeddings, particularly for recommendation systems. Use it when dealing with models containing massive embedding tables that exceed single-GPU memory.

**FSDP (Fully Sharded Data Parallel)** is a PyTorch distributed training technique that shards model parameters, gradients, and optimizer states across GPUs, reducing memory footprint for large models. Use it for training large language models or other general deep learning architectures that require scaling across multiple GPUs.
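
For concreteness, a minimal sketch of each; the table name, sizes, and dense layers below are made up, and an initialized process group is assumed for the FSDP part:

```python
import torch
import torchrec
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# TorchRec: a collection of (potentially enormous) pooled embedding tables.
ebc = torchrec.EmbeddingBagCollection(
    device="meta",  # materialize later, when sharding
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["product"],
        )
    ],
)

# FSDP: shard the parameters of a dense model across ranks.
dense = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1))
dense = FSDP(dense.cuda())
```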

### Can TorchRec do everything FSDP can do for sparse embeddings, and vice versa?

- **TorchRec** offers specialized sharding strategies and optimized kernels designed for sparse embeddings, making it more efficient for this specific task.
- **FSDP** can work with models containing sparse embeddings, but it is not as optimized or feature-rich for them as TorchRec. For recommendation systems, TorchRec's sharding and lookup methods are often more memory efficient because they are designed around the characteristics of sparse data.
- For optimal results in recommendation systems with large sparse embeddings, combine TorchRec for the embeddings with FSDP for the dense parts of the model, as sketched below.
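
A hedged sketch of that composition, assuming a model with hypothetical `sparse_arch` (an `EmbeddingBagCollection`) and `dense_arch` submodules and an already-initialized process group:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torchrec.distributed.model_parallel import DistributedModelParallel

# `model` is a user-defined nn.Module exposing the two submodules above.
# Shard the embedding tables with TorchRec's DistributedModelParallel...
model.sparse_arch = DistributedModelParallel(
    model.sparse_arch,
    device=torch.device("cuda"),
)

# ...and shard the dense tower with FSDP.
model.dense_arch = FSDP(model.dense_arch.cuda())
```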

### What improvements does FSDP2 offer?

FSDP2 builds on FSDP1 with:
- DTensor-based sharding
- Per-parameter sharding for greater flexibility (e.g., partial freezing)
- Enhanced memory management
- Faster checkpointing
- Support for mixed precision and FP8
- Better composability with other parallelisms
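
As a rough illustration of the per-parameter, DTensor-based style, assuming a recent PyTorch release where `fully_shard` is exported from `torch.distributed.fsdp` and a process group is already set up:

```python
import torch
from torch.distributed.fsdp import fully_shard

dense = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)).cuda()
fully_shard(dense)  # each parameter is now a DTensor, sharded across ranks
```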

### Does TorchRec support DTensor?

Yes, TorchRec models can benefit from DTensor support in PyTorch distributed components, like FSDP2. This improves distributed training performance, efficiency, and interoperability between TorchRec and other DTensor-based components.

## Sharding and Distributed Training

### How do you choose the best sharding strategy for embedding tables?

TorchRec offers multiple sharding strategies:
- Table-Wise (TW)
- Row-Wise (RW)
- Column-Wise (CW)
- Table-Wise-Row-Wise (TWRW)
- Grid-Shard (GS)
- Data Parallel (DP)

Consider factors like embedding table size, memory constraints, communication patterns, and load balancing when selecting a strategy.

The TorchRec Planner can automatically find an optimal sharding plan based on your hardware and settings.
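
A minimal sketch of running the planner and applying the resulting plan, assuming `ebc` is an `EmbeddingBagCollection` built on the `"meta"` device and the process group is already initialized:

```python
import torch
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=2, compute_device="cuda"),
)
plan = planner.plan(ebc, [EmbeddingBagCollectionSharder()])

model = DistributedModelParallel(ebc, plan=plan, device=torch.device("cuda"))
print(plan)  # shows which shard of each table lands on which rank
```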

### How does the TorchRec planner work, and can it be customized?

The Planner aims to balance memory and computation across devices. You can influence it with `ParameterConstraints`, providing information such as pooling factors (see the example below). TorchRec also features AutoShard, automated sharding based on cost modeling and deep reinforcement learning.
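
A hedged example of constraining the planner for a single table; the table name is hypothetical, and exact fields can differ across TorchRec versions:

```python
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    "product_table": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value],  # only consider row-wise sharding
        pooling_factors=[20.0],  # average IDs per example, used for cost modeling
    ),
}

planner = EmbeddingShardingPlanner(
    topology=Topology(world_size=4, compute_device="cuda"),
    constraints=constraints,
)
```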

### How do you effectively debug and optimize distributed training with large embedding tables?

- Use memory and communication profiling tools such as `torch.cuda.memory_summary` and the PyTorch profiler (`torch.profiler`)
- Debug with PyTorch's distributed debugging tools
- Start with smaller scale testing
- Leverage TorchRec features like Table Batched Embedding (TBE) and optimizer fusion
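
For example, a basic profiling pass over a single training step; `model` and `batch` here are placeholders:

```python
import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(batch).sum()  # placeholder forward pass and loss
    loss.backward()

# Kernel/collective timings, followed by a breakdown of GPU memory usage.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
print(torch.cuda.memory_summary())
```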

### What are some limitations of TorchRec for very large embedding tables?

- **Extremely Dynamic Embedding Tables (Dynamic IDs)**: TorchRec might struggle with frequently adding or removing a very large number of new IDs dynamically.
- **Automated Large-Scale Table Merging**: Automation here is limited, so manually configuring large numbers of embedding tables can be labor-intensive.
- **Cross-Node Communication Overhead**: Scaling to many GPUs across multiple nodes can increase communication overhead.

## Memory Management and Performance

### How do you manage the memory footprint of large embedding tables?

- Choose an optimal sharding strategy
- Offload embeddings to CPU memory if GPU memory is limited
- Reduce precision using quantization (e.g., float16, int8)
- Adjust embedding dimensions or remove unused embeddings
- Utilize Caching and Unified Virtual Memory (UVM) to manage data between GPU and CPU
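
One way to request a UVM-caching kernel for an oversized table is through planner constraints; the table name is hypothetical, and kernel availability depends on the TorchRec/FBGEMM build:

```python
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.planner.types import ParameterConstraints

constraints = {
    "product_table": ParameterConstraints(
        # Keep the full table in host memory (UVM) with a GPU-side cache for hot rows.
        compute_kernels=[EmbeddingComputeKernel.FUSED_UVM_CACHING.value],
    ),
}
# Pass `constraints` to EmbeddingShardingPlanner, as in the planner example above.
```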

## Integrating with Existing Systems

### Can TorchRec modules be easily converted to TorchScript for deployment and inference in C++ environments?

Yes, TorchRec supports converting trained models to TorchScript for efficient inference in C++ environments. However, it's recommended to script only the non-embedding layers for better performance and to handle potential limitations with sharded embedding modules in TorchScript.
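
For instance, scripting just the dense scoring layers; `over_arch` is a hypothetical submodule name:

```python
import torch

# Script only the dense layers; keep the (quantized) sharded embeddings in eager mode.
scripted_over_arch = torch.jit.script(model.over_arch)
scripted_over_arch.save("over_arch.pt")  # loadable from C++ via torch::jit::load
```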

## Technical Challenges

### Why are you getting row-wise alltoall errors when combining different pooling types?

This can occur due to incompatible sharding and pooling types, resulting in communication mismatches during data aggregation. Ensure your sharding and pooling choices align with the communication patterns required.

### How do you handle floating point exceptions when using quantized embeddings with float32 data types?

- Implement gradient clipping
- Monitor gradients and weights for numerical issues
- Consider different scaling strategies, such as automatic mixed precision (AMP) with gradient scaling
- Accumulate gradients over mini-batches
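
A rough sketch for the dense portion of a model; TorchRec's fused sparse optimizers update during the backward pass, so the explicit scaling and clipping below apply to the dense parameters (`model`, `batch`, and `dense_optimizer` are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type="cuda"):
    loss = model(batch).sum()  # placeholder loss

scaler.scale(loss).backward()
scaler.unscale_(dense_optimizer)
torch.nn.utils.clip_grad_norm_(
    [p for p in model.parameters() if p.grad is not None], max_norm=1.0
)
scaler.step(dense_optimizer)
scaler.update()
```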

### What are best practices for handling scenarios with empty batches for EmbeddingCollection?

Handle empty batches by filtering them out, skipping lookups, using default values, or padding and masking them accordingly.
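
For illustration, an "empty" batch can still be expressed as a well-formed `KeyedJaggedTensor` whose lengths are all zero; the feature name and batch size are made up:

```python
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# A batch of 2 examples in which the "product" feature has no IDs at all.
empty_kjt = KeyedJaggedTensor(
    keys=["product"],
    values=torch.tensor([], dtype=torch.int64),
    lengths=torch.tensor([0, 0], dtype=torch.int64),
)
# An EmbeddingBagCollection lookup pools this to zero vectors; an
# EmbeddingCollection returns empty jagged outputs per feature.
```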

### What are common causes of issues during the forward() graph and optimizer step()?

- Incorrect input data format, type, or device
- Invalid embedding lookups (out-of-range indices, mismatched names)
- Issues in the computational graph preventing gradient flow
- Incorrect optimizer setup, learning rate, or fusion settings
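
A quick sanity check on the sparse input before the forward pass can catch the first two causes; the feature name and `num_embeddings` value are assumptions:

```python
import torch

# `kjt` is the KeyedJaggedTensor fed to the embedding module; "product" is one of its keys.
product_ids = kjt["product"].values()
assert product_ids.dtype in (torch.int32, torch.int64), "indices must be integral"
assert product_ids.numel() == 0 or int(product_ids.max()) < num_embeddings, "out-of-range index"
assert "product" in kjt.keys(), "missing expected feature name"
```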

### What is the role of fused optimizers in TorchRec?

TorchRec uses fused optimizers, often with DistributedModelParallel, where the optimizer update is integrated into the backward pass. This prevents the materialization of embedding gradients, leading to significant memory savings. You can also opt for a dense optimizer for more control.
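
One way to enable this is to pass fused optimizer settings to the sharder used by `DistributedModelParallel`; `ebc` is a placeholder module, and the exact `fused_params` keys depend on the TorchRec/FBGEMM versions in use:

```python
import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.model_parallel import DistributedModelParallel

# Request a fused rowwise-Adagrad update that runs during the backward pass,
# so full embedding gradients are never materialized.
sharder = EmbeddingBagCollectionSharder(
    fused_params={"optimizer": EmbOptimType.EXACT_ROWWISE_ADAGRAD, "learning_rate": 0.01},
)
model = DistributedModelParallel(ebc, sharders=[sharder], device=torch.device("cuda"))
```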

## Model Design and Evaluation

### What are best practices for designing recommendation models with TorchRec?

- Carefully select and preprocess features
- Choose suitable model architectures for your recommendation task
- Leverage TorchRec components like EmbeddingBagCollection and optimized kernels
- Design the model with distributed training in mind, considering sharding and communication patterns

### What are the most effective methods for evaluating recommendation systems built with TorchRec?

**Offline Evaluation**:
- Use metrics like AUC, Recall@K, Precision@K, and NDCG@K
- Employ train-test splits, cross-validation, and negative sampling
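
As a reference point, a minimal Recall@K implementation in plain PyTorch; the shapes and 0/1 relevance encoding are assumptions:

```python
import torch

def recall_at_k(scores: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """scores, labels: [num_users, num_items]; labels hold 0/1 relevance."""
    topk = scores.topk(k, dim=1).indices      # indices of the top-K scored items per user
    hits = labels.gather(1, topk).sum(dim=1)  # relevant items retrieved in the top K
    total = labels.sum(dim=1).clamp(min=1)    # relevant items per user (avoid divide-by-zero)
    return (hits / total).mean().item()
```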

**Online Evaluation**:
- Conduct A/B tests in production
- Measure metrics like click-through rate, conversion rate, and user engagement
- Gather user feedback