AGILE Demo

AGILE is the first system to enable asynchronous, GPU-initiated communication with NVMe SSDs. AGILE overlaps data transfers with computation to maximize parallelism and utilization. It incorporates a lightweight GPU-resident runtime, a deadlock-free asynchronous execution model, and a flexible HBM-based software cache with customizable policies. Together, these features minimize latency and overhead, delivering substantial performance gains for data-intensive workloads.

The Memory Wall Problem

AI model growth is outpacing on-package GPU memory

Modern recommendation and foundation models scale much faster than accelerator memory capacity. The result is a persistent memory wall: the model footprint keeps growing, but each GPU can only hold a small working set locally.

That mismatch pushes large deployments toward aggressive model sharding, host staging, and storage offload. Once memory becomes the bottleneck, adding more GPUs does not automatically translate to higher compute utilization.

[Figure: parameter count (billions, log scale) vs. year, 2016-2022. Model sizes climb from ResNet50, Inception V4, ResNext101, and DenseNet through Transformer, GPT-1, BERT, GPT-2, Megatron LM, Microsoft T-NLG, GPT-3, GShard, Switch Transformer, and Megatron-Turing, up to the 10TB Baidu RecSys, while accelerator memory grows from P100 (12GB), TPUv2 (16GB), V100 (32GB), TPUv3 (32GB), and A100 (40GB/80GB) to H100 (80GB). Transformer size: 410x / 2 yrs; AI HW memory: 2x / 2 yrs.]
Note: The data are approximated from Gholami, Amir, et al., “AI and the Memory Wall,” IEEE Micro, vol. 44, no. 3, 2024, pp. 33–39.
GPU-Centric I/O

Data Path and Control Path Optimizations

Case 1

Before GPUDirect Storage

  1. GPU-CPU sync, telling the CPU which data to fetch
  2. CPU reads data from NVMe to DRAM
  3. CPU copies data from DRAM to GPU
CPU and DRAM both stay in the critical path
Case 2

With GPUDirect Storage

  1. GPU-CPU sync, telling the CPU which data to fetch
  2. CPU issues the NVMe read command
  3. NVMe DMAs data directly to the GPU
DRAM bypassed, but CPU still controls I/O
Case 3

GPU-Centric Storage

  1. GPU issues the NVMe read command
  2. NVMe DMAs data directly to the GPU
  3. No CPU or DRAM involvement
GPU owns both control and data path
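A toy latency model makes the contrast between the three cases concrete. All stage latencies below are illustrative assumptions for a sketch, not measurements from AGILE or any real device:

```python
# Toy critical-path model of the three I/O cases above.
# All latencies (microseconds) are illustrative assumptions, not measurements.
GPU_CPU_SYNC = 10   # GPU tells the CPU which data to fetch
NVME_READ    = 80   # NVMe read, whether it lands in DRAM or GPU memory
DRAM_TO_GPU  = 30   # CPU-driven bounce copy from DRAM to the GPU

case1 = GPU_CPU_SYNC + NVME_READ + DRAM_TO_GPU  # CPU and DRAM on the path
case2 = GPU_CPU_SYNC + NVME_READ                # GPUDirect: DRAM bypassed
case3 = NVME_READ                               # GPU-centric: no CPU handoff

print(case1, case2, case3)  # → 120 90 80
```

Each removed hop shortens the per-request critical path; case 3 also removes the GPU-CPU synchronization that cases 1 and 2 pay on every request.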
Programming Model -- Synchronous I/O

Synchronous I/O

Synchronous I/O keeps the request path simple, but it stalls compute on every batch

for batch in loader:
    buf = agile.read_sync(batch)  # blocking I/O
    # the thread must wait for the data transfer to finish
    logits = run_kernel(buf)
    commit(logits)
Sync I/O Timeline
[Timeline diagram: Batch 0, Batch 1, and Batch 2 each run read then compute back-to-back, with no overlap.]
Requests and kernels form a staircase: each batch blocks on transfer before compute can begin.
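Under a toy cost model (per-batch read and compute times are illustrative assumptions, not measurements), the staircase cost is simply additive:

```python
def sync_total(read_us, compute_us, n_batches):
    """Total time of the synchronous staircase in a toy cost model.

    Every batch blocks on its read before compute starts, so nothing
    overlaps and per-batch costs simply add up. Numbers are illustrative.
    """
    return n_batches * (read_us + compute_us)

print(sync_total(100, 60, 8))  # → 1280
```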
Programming Model -- Async I/O

Asynchronous I/O

Asynchronous I/O allows compute to continue while data is being transferred, improving overall throughput

buffers = [slot0, slot1]
agile.read_sync(batch0, buffers[0])  # fill the first slot before the loop
for step, batch in enumerate(loader[1:]):
    nxt = (step + 1) & 1
    ticket = agile.read_async(batch, buffers[nxt])  # prefetch next batch
    logits = run_kernel(buffers[step & 1])          # compute on current slot
    ticket.wait()  # make sure the prefetch has landed
    commit(logits)
Async I/O Timeline
[Timeline diagram: Batch 0 runs read then compute; Batches 1-3 prefetch while the previous batch computes.]
Double buffering keeps one slot busy with compute while the other slot is already filling with the next batch.
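In the same toy cost model, double buffering exposes only the first read, the last compute, and max(read, compute) per steady-state step (all numbers are illustrative assumptions):

```python
def async_total(read_us, compute_us, n_batches):
    """Total time with double-buffered prefetch in a toy cost model.

    The first read cannot be hidden; each middle step costs
    max(read, compute) because the prefetch overlaps compute; the
    final compute has no prefetch paired with it. Illustrative only.
    """
    steady = (n_batches - 1) * max(read_us, compute_us)
    return read_us + steady + compute_us

print(async_total(100, 60, 8))  # → 860, vs. 8 * (100 + 60) = 1280 serialized
```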
01 / CTC Sweep

Demonstrate AGILE's ability to overlap computation and NVMe I/O by varying the computation-to-communication (CTC) ratio.

Note: Data are collected from ASUS Pro WS WRX90E-SAGE SE with NVIDIA RTX 5000 Ada & Samsung 9100 PRO 1TB SSD
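The sweep can be previewed analytically with a toy overlap model: fixing the read time and scaling compute by the CTC ratio, the speedup from overlap peaks when compute and communication are balanced. This is a sketch with made-up numbers, not the demo's measured data:

```python
def overlap_speedup(ctc, n_batches=64, read_us=100.0):
    """Sync/async time ratio in a toy model; ctc = compute / communication."""
    compute_us = ctc * read_us
    sync_t = n_batches * (read_us + compute_us)           # fully serialized
    async_t = (read_us                                    # exposed first read
               + (n_batches - 1) * max(read_us, compute_us)
               + compute_us)                              # exposed last compute
    return sync_t / async_t

for ctc in (0.25, 1.0, 4.0):
    print(f"CTC {ctc:4}: {overlap_speedup(ctc):.2f}x")
```

The model predicts the best case near CTC = 1, where every step fully hides one transfer behind one kernel; far from 1, one side dominates and the hidden fraction shrinks.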
02 / Graph Applications

Evaluate AGILE on memory-bound graph workloads with irregular memory access patterns, showing reduced software-cache and NVMe I/O overhead compared to BaM.

Note: Data are collected from ASUS Pro WS WRX90E-SAGE SE with NVIDIA RTX 5000 Ada & Samsung 9100 PRO 1TB SSD
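The software cache that irregular graph accesses stress can be sketched in miniature. The direct-mapped, page-granular cache below is an illustrative stand-in for the concept only; AGILE's actual HBM-based cache and its customizable policies are more sophisticated:

```python
class ToyPageCache:
    """Direct-mapped software cache over fixed-size pages (illustrative)."""

    def __init__(self, n_slots):
        self.tags = [None] * n_slots  # page id resident in each slot
        self.hits = 0
        self.misses = 0

    def access(self, page_id):
        slot = page_id % len(self.tags)
        if self.tags[slot] == page_id:
            self.hits += 1            # served from the on-GPU cache
            return True
        self.misses += 1              # would trigger an NVMe read
        self.tags[slot] = page_id     # evict whatever occupied the slot
        return False

cache = ToyPageCache(4)
# A graph-like access pattern: neighborhood reuse plus an irregular jump.
for page in [0, 1, 2, 0, 1, 2, 9, 1, 0]:
    cache.access(page)
print(cache.hits, cache.misses)  # → 4 5
```

Note how page 9 evicts page 1 and turns the next access to page 1 into a miss; conflict behavior like this is exactly where a customizable policy can cut NVMe traffic.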
03 / Register Report

Analyze per-thread GPU register consumption across CUDA kernels, showing that AGILE is more lightweight and improves resource efficiency compared to BaM.

Label | Kernel | Regs | CMEM (B)
Run the register report to populate this table.
Note: Kernels are compiled for NVIDIA RTX 5000 Ada with CUDA 12.0
04 / DLRM Microbenchmark

Use the Criteo dataset to assess large-scale recommendation inference, where AGILE optimizes embedding access from SSDs and reduces end-to-end execution time.

Note: Data are collected from Dell R750 with NVIDIA RTX 5000 Ada & Dell Ent NVMe AGN MU AIC 1.6TB SSD
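At its core, a DLRM-style sparse lookup gathers a handful of embedding rows per sample and sum-pools them; it is this row-gather that turns into SSD reads once the table exceeds HBM. A minimal sketch (the table contents and indices are made up for illustration):

```python
DIM = 4  # embedding width; real DLRM tables use larger dimensions
# Stand-in for an SSD-resident embedding table: row id -> vector.
table = {row: [float(row)] * DIM for row in range(100)}

def gather_pool(indices):
    """Sum-pool the embedding rows for one sample's sparse features."""
    pooled = [0.0] * DIM
    for idx in indices:
        for d, v in enumerate(table[idx]):
            pooled[d] += v
    return pooled

print(gather_pool([3, 5, 7]))  # → [15.0, 15.0, 15.0, 15.0]
```

Each sample touches only a few of the table's rows, so the access pattern is sparse and irregular, which is why caching and asynchronous prefetch of rows pay off.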