Before GPUDirect Storage, reading NVMe data into GPU memory required three CPU-mediated steps:
1. GPU-CPU synchronization, in which the GPU tells the CPU which data to fetch
2. The CPU reads the data from the NVMe SSD into DRAM
3. The CPU copies the data from DRAM to GPU memory
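The three-step bounce-buffer path above can be sketched as follows. This is a minimal illustration in Python, not real GPU code: a file stands in for the NVMe SSD, a `bytearray` stands in for GPU memory, and the final slice assignment plays the role of `cudaMemcpy`.

```python
def bounce_buffer_read(path: str, offset: int, length: int) -> bytearray:
    """Pre-GPUDirect data path: the CPU stages NVMe data in DRAM,
    then copies it to the GPU (both stages are stand-ins here)."""
    # Step 2: CPU reads from "NVMe" (a file here) into a DRAM buffer.
    with open(path, "rb") as f:
        f.seek(offset)
        dram_buffer = f.read(length)
    # Step 3: CPU copies DRAM -> "GPU" buffer (stand-in for cudaMemcpy).
    gpu_buffer = bytearray(length)
    gpu_buffer[:] = dram_buffer
    return gpu_buffer
```

Every byte crosses the PCIe bus by way of a CPU-managed DRAM buffer, which is exactly the staging overhead that GPU-initiated I/O removes.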
AGILE is the first system to enable asynchronous, GPU-initiated communication with NVMe SSDs. AGILE overlaps data transfers with computation to maximize parallelism and utilization. It incorporates a lightweight GPU-resident runtime, a deadlock-free asynchronous execution model, and a flexible HBM-based software cache with customizable policies. Together, these features minimize latency and overhead, delivering substantial performance gains for data-intensive workloads.
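The overlap principle AGILE exploits can be illustrated with a double-buffered producer/consumer sketch. This is a CPU-side Python analogy, not AGILE's GPU-resident runtime: `load` stands in for an asynchronous NVMe read and `compute` for kernel work on the previous chunk, and all names are illustrative.

```python
import queue
import threading

def overlapped_pipeline(chunks, load, compute):
    """Double-buffered overlap of data transfer and computation:
    a loader thread prefetches chunk i+1 while the consumer
    computes on chunk i. `load`/`compute` are caller-supplied."""
    buffers = queue.Queue(maxsize=2)  # at most two buffers in flight

    def loader():
        for c in chunks:
            buffers.put(load(c))      # runs concurrently with compute below
        buffers.put(None)             # end-of-stream sentinel

    threading.Thread(target=loader, daemon=True).start()
    results = []
    while (buf := buffers.get()) is not None:
        results.append(compute(buf))
    return results
```

With bounded buffering, load latency is hidden behind compute whenever the compute time per chunk dominates, which is the effect the CTC-ratio experiments below quantify.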
Modern recommendation and foundation models scale much faster than accelerator memory capacity. The result is a persistent memory wall: the model footprint keeps growing, but each GPU can only hold a small working set locally.
That mismatch pushes large deployments toward aggressive model sharding, host staging, and storage offload. Once memory becomes the bottleneck, adding more GPUs does not automatically translate to higher compute utilization.
Demonstrate AGILE's ability to overlap computation and NVMe I/O by varying the computation-to-communication (CTC) ratio.
Evaluate AGILE on memory-bounded graph workloads with irregular memory access patterns, showing reduced software cache and NVMe I/O overhead compared to BaM.
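For irregular workloads like these, the software cache's replacement policy is central. The sketch below is a minimal LRU block cache in Python, a stand-in for an HBM-resident cache with a customizable policy; `fetch` plays the role of an NVMe read on a miss, and none of the names come from AGILE's actual implementation.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Minimal software cache keyed by block id with LRU eviction.
    Illustrative only: `fetch` stands in for an NVMe read on a miss."""
    def __init__(self, capacity: int, fetch):
        self.capacity = capacity
        self.fetch = fetch            # miss handler, e.g. an NVMe read
        self.blocks = OrderedDict()   # block id -> data, in LRU order
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)    # mark most-recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict least-recently used
            self.blocks[block_id] = self.fetch(block_id)
        return self.blocks[block_id]
```

Irregular access patterns stress exactly this path: a low hit rate turns every lookup into an NVMe round trip, which is why cache policy flexibility matters for graph workloads.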
Analyze per-thread GPU register consumption across CUDA kernels, showing that AGILE's kernels are more lightweight and use GPU resources more efficiently than BaM's.
| Label | Kernel | Registers per thread | Constant memory (B) |
|---|---|---|---|
Use the Criteo dataset to assess large-scale recommendation inference, where AGILE optimizes embedding access from SSDs and reduces end-to-end execution time.