[TETC] Near-Memory Computing with Compressed Embedding Table for Personalized Recommendation


Jeongmin Lim, Young Geun Kim, Sung Woo Chung, Farinaz Koushanfar, and Joonho Kong, "Near-Memory Computing with Compressed Embedding Table for Personalized Recommendation", IEEE Transactions on Emerging Topics in Computing, Accepted.

 

Abstract

Deep learning (DL)-based recommendation models play an important role in many real-world applications. However, an embedding layer, a key component of DL-based recommendation models, requires sparse memory accesses to a very large memory space followed by pooling operations (i.e., reduction operations). This forces the system to overprovision memory capacity for model deployment. Moreover, with conventional CPU-based architectures, it is difficult to exploit locality, imposing a huge burden on data transfer between the CPU and memory. To resolve this problem, we propose an embedding vector element quantization and compression method to reduce the memory footprint (capacity) required by the embedding tables. In addition, to reduce the amount of data transfer and memory accesses, we propose near-memory acceleration hardware with an SRAM buffer that stores frequently accessed embedding vectors. Our quantization and compression method results in compression ratios of 3.95–4.14 for embedding tables in widely used datasets while negligibly affecting inference accuracy. Our acceleration technique with 3D stacked DRAM memories, which facilitates near-memory processing in the logic die with high DRAM bandwidth, leads to a 4.9×–5.4× embedding layer speedup compared to 8-core CPU-based execution while reducing memory energy consumption by 5.9×–12.1× on average.
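For readers unfamiliar with the bottleneck the abstract describes, the sketch below shows the embedding lookup-plus-pooling pattern in plain NumPy: a handful of rows are gathered from a huge table at near-random offsets and immediately reduced. All sizes, names, and the sum-pooling choice here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical sizes for illustration; production recommendation models
# use tables with millions of rows, which is what drives the memory pressure.
NUM_ROWS, EMB_DIM, POOL_SIZE = 1_000_000, 64, 40

rng = np.random.default_rng(0)
table = rng.standard_normal((NUM_ROWS, EMB_DIM), dtype=np.float32)

def embedding_pooling(table: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Gather sparse rows from a large table, then reduce (sum-pool) them.

    Each gather touches a near-random DRAM location, so a CPU cache sees
    little reuse; the small pooled vector is all the model actually needs,
    which is why computing the reduction near memory saves data transfer.
    """
    return table[indices].sum(axis=0)

indices = rng.integers(0, NUM_ROWS, size=POOL_SIZE)
pooled = embedding_pooling(table, indices)  # shape: (EMB_DIM,)
print(pooled.shape)
```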
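The abstract does not spell out the quantization and compression scheme, so the following is only a minimal sketch of the general idea using per-row uniform min-max quantization to 8-bit codes; the function names and the int8 format are assumptions for illustration, not the authors' method.

```python
import numpy as np

def quantize_rows(table: np.ndarray, bits: int = 8):
    """Per-row uniform min-max quantization of a float32 embedding table.

    Storing uint8 codes plus a per-row (scale, offset) pair gives about
    3.6x compression for 64-dim rows, in the same ballpark as the
    3.95-4.14 ratios the paper reports (its exact scheme differs).
    """
    lo = table.min(axis=1, keepdims=True)
    hi = table.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # guard against constant rows
    codes = np.round((table - lo) / scale).astype(np.uint8)
    return codes, scale.astype(np.float32), lo.astype(np.float32)

def dequantize_rows(codes, scale, lo):
    """Reconstruct approximate float32 rows before pooling."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64), dtype=np.float32)
codes, scale, lo = quantize_rows(table)
err = np.abs(dequantize_rows(codes, scale, lo) - table).max()
print(f"max reconstruction error: {err:.4f}")
```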
