Training Big Sparse Recommendation Models on Commodity Servers
Deep learning based recommendation models are one of the larget DNN workload with several TeraBytes in size and trillions of parameters. Training such big sparse recommendation models on low end GPUs is an open problem. Given the limited capacity of GPU HBM, number of GPU’s scale with embedding table size. Other option is to use commodity servers with limited number of GPU’s where embedding tables are placed on CPU main memory. But placing embedding table on CPU main memory faces system bottlenecks and slows down the overall training process.
Goal
The goal of this tutorial is to investigate the system bottlenecks associated with placing embedding on CPU main memory and optimizing the embedding placement on commodity servers without scaling the number of GPU devices. We aim to use our work at VLDB 2022.
- Understanding the challenges associated with training big sparse recommendation models on commodity servers using real-world data?
- How to utilize the limited GPU HBM in an efficient way for training big sparse recommendation models.
- Investigate the popularity within training data and how to exploit the skewness in access patterns into embedding tables.
- Traininig big sparse recommendation models on commodity servers with limited GPU devices.
Audience
Anyone looking for understanding the basics of deep learning based recommendation models and how to train such models.
Requirements
Pre-requisites
- Knowledge of basic concepts of deep learning training.
- Familiarity with PyTorch.
Hardware Resources
- For the tutorial, we will use Google colab so only a Google account is required.
Schedule
Time (EST) | Session Details | Speaker | Slides |
---|---|---|---|
8:00 am | Introduction to Recommender Systems | Muhammad Adnan | Slides |
8:20 am | Deep Learning based Recommendation Models (DLRM) | Prashant J. Nair | Slides |
8:45 am | Challenges Associated with Training Recommendation Models | Muhammad Adnan | Slides |
9:15 am | Setting Up resources for training | Muhammad Adnan | |
10:00 am | Coffee Break | ||
10:20 am | Skewness in Embedding Access Pattern | Prashant J. Nair | Slides |
10:40 am | Baseline DLRM Training Profiling | Muhammad Adnan | Slides |
11:15 am | FAE Training | Muhammad Adnan | Slides |
12:10 am | Conclusion | Prashant J. Nair |