Training Big Sparse Recommendation Models on Commodity Servers

Deep learning based recommendation models are among the largest DNN workloads, reaching several terabytes in size and trillions of parameters. Training such big sparse recommendation models on low-end GPUs is an open problem. Because GPU HBM capacity is limited, the number of GPUs required scales with embedding table size. The alternative is to use commodity servers with a limited number of GPUs and place the embedding tables in CPU main memory. However, placing embedding tables in CPU main memory introduces system bottlenecks that slow down the overall training process.
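
To make the scaling argument concrete, here is a back-of-the-envelope sketch of how embedding table size translates into a GPU count when every table must live in HBM. The table row counts, embedding dimension, and 16 GiB HBM capacity are hypothetical values chosen for illustration, not figures from the tutorial. The estimate is a lower bound, since it ignores optimizer state, activations, and the dense MLP layers.

```python
import math


def embedding_table_bytes(num_rows, embedding_dim, bytes_per_element=4):
    """Memory footprint of one embedding table (fp32 by default)."""
    return num_rows * embedding_dim * bytes_per_element


def gpus_needed(table_rows, embedding_dim, hbm_bytes_per_gpu, bytes_per_element=4):
    """Total embedding bytes and the minimum GPU count needed to hold them in HBM."""
    total = sum(
        embedding_table_bytes(rows, embedding_dim, bytes_per_element)
        for rows in table_rows
    )
    return total, math.ceil(total / hbm_bytes_per_gpu)


# Hypothetical model: three sparse features with very different cardinalities.
HYPOTHETICAL_TABLE_ROWS = [400_000_000, 250_000_000, 100_000_000]
HYPOTHETICAL_HBM_BYTES = 16 * 2**30  # assumed 16 GiB of HBM per GPU

total_bytes, num_gpus = gpus_needed(
    HYPOTHETICAL_TABLE_ROWS,
    embedding_dim=64,
    hbm_bytes_per_gpu=HYPOTHETICAL_HBM_BYTES,
)
print(f"embedding tables: {total_bytes / 1e9:.0f} GB, GPUs needed: {num_gpus}")
```

Under these assumptions the GPU count is driven entirely by embedding capacity rather than by compute, which is the motivation for keeping the tables in CPU main memory instead.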

Goal

The goal of this tutorial is to investigate the system bottlenecks associated with placing embeddings in CPU main memory and to optimize embedding placement on commodity servers without scaling the number of GPU devices. The tutorial is based on our work at VLDB 2022.

Understand the challenges of training big sparse recommendation models on commodity servers using real-world data.
Learn how to use the limited GPU HBM efficiently when training big sparse recommendation models.
Investigate the popularity of inputs within the training data and how to exploit the skew in embedding table access patterns.
Train big sparse recommendation models on commodity servers with limited GPU devices.
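
The idea of exploiting skewed access patterns can be sketched in a few lines: count how often each embedding row is looked up, then check how much of the traffic a small set of popular rows would serve if only those rows were kept in GPU HBM. This is an illustrative sketch, not the FAE implementation covered in the tutorial. The trace is synthetic (Zipf-like weights), and the function names, table size, and HBM budget are all hypothetical.

```python
import random
from collections import Counter


def access_coverage(trace, top_k):
    """Fraction of all lookups that hit the top_k most frequently accessed rows."""
    counts = Counter(trace)
    hot_hits = sum(count for _, count in counts.most_common(top_k))
    return hot_hits / len(trace)


def pick_hot_rows(trace, hbm_budget_bytes, embedding_dim, bytes_per_element=4):
    """Most popular row ids whose embeddings fit within hbm_budget_bytes."""
    row_bytes = embedding_dim * bytes_per_element
    max_rows = hbm_budget_bytes // row_bytes
    return [row for row, _ in Counter(trace).most_common(max_rows)]


# Synthetic lookup trace: row at popularity rank r is drawn with weight 1 / (r + 1).
rng = random.Random(0)
NUM_ROWS = 1000
weights = [1.0 / (rank + 1) for rank in range(NUM_ROWS)]
trace = rng.choices(range(NUM_ROWS), weights=weights, k=20_000)

coverage = access_coverage(trace, NUM_ROWS // 10)
print(f"top 10% of rows serve {coverage:.1%} of lookups")

# Hypothetical HBM budget sized to hold exactly 50 rows of dimension 64 in fp32.
hot = pick_hot_rows(trace, hbm_budget_bytes=50 * 64 * 4, embedding_dim=64)
print(f"{len(hot)} hot rows fit in the budget")
```

On a real dataset the trace would come from the sparse feature indices in the training data. The more skewed the distribution, the larger the share of lookups that a small HBM-resident hot set can absorb.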

Audience

Anyone who wants to understand the basics of deep learning based recommendation models and how to train them.

Requirements

Pre-requisites

Knowledge of basic concepts of deep learning training.
Familiarity with PyTorch.

Hardware Resources

The tutorial uses Google Colab, so only a Google account is required.

Schedule

Time (EST)	Session Details	Speaker	Slides
8:00 am	Introduction to Recommender Systems	Muhammad Adnan	Slides
8:20 am	Deep Learning based Recommendation Models (DLRM)	Prashant J. Nair	Slides
8:45 am	Challenges Associated with Training Recommendation Models	Muhammad Adnan	Slides
9:15 am	Setting Up Resources for Training	Muhammad Adnan
10:00 am	Coffee Break
10:20 am	Skewness in Embedding Access Pattern	Prashant J. Nair	Slides
10:40 am	Baseline DLRM Training Profiling	Muhammad Adnan	Slides
11:15 am	FAE Training	Muhammad Adnan	Slides
12:10 pm	Conclusion	Prashant J. Nair