AGILE Demo

AGILE is the first system to enable asynchronous, GPU-initiated communication with NVMe SSDs. AGILE overlaps data transfers with computation to maximize parallelism and utilization. It incorporates a lightweight GPU-resident runtime, a deadlock-free asynchronous execution model, and a flexible HBM-based software cache with customizable policies. Together, these features minimize latency and overhead, delivering substantial performance gains for data-intensive workloads.

The Memory Wall Problem

AI model growth is outpacing on-package GPU memory

Modern recommendation and foundation models scale much faster than accelerator memory capacity. The result is a persistent memory wall: the model footprint keeps growing, but each GPU can only hold a small working set locally.

That mismatch pushes large deployments toward aggressive model sharding, host staging, and storage offload. Once memory becomes the bottleneck, adding more GPUs does not automatically translate to higher compute utilization.

[Figure: parameter count (billions, log scale) vs. year, 2016-2022. Model sizes climb from ResNet50, Inception V4, ResNext101, and DenseNet through Transformer, GPT-1, BERT, GPT-2, Megatron LM, Microsoft T-NLG, GPT-3, GShard, Switch Transformer, and Megatron-Turing, up to the 10TB Baidu RecSys, while accelerator memory grows from P100 (12GB), TPUv2 (16GB), V100 (32GB), TPUv3 (32GB), and A100 (40GB/80GB) to H100 (80GB). Transformer size: 410x / 2 yrs; AI HW memory: 2x / 2 yrs.]
Note: The data are approximated from Gholami, Amir, et al., “AI and the Memory Wall,” IEEE Micro, vol. 44, no. 3, 2024, pp. 33–39.
GPU-Centric I/O

Data Path and Control Path Optimizations

Case 1

Before GPUDirect Storage

  1. GPU-CPU sync, telling the CPU which data to fetch
  2. CPU reads data from NVMe to DRAM
  3. CPU copies data from DRAM to GPU
CPU and DRAM both stay in the critical path
Case 2

With GPUDirect Storage

  1. GPU-CPU sync, telling the CPU which data to fetch
  2. CPU issues the NVMe read command
  3. NVMe DMAs data directly to the GPU
DRAM bypassed, but CPU still controls I/O
Case 3

GPU-Centric Storage

  1. GPU issues the NVMe read command
  2. NVMe DMAs data directly to the GPU
  3. No CPU or DRAM involvement
GPU owns both control and data path
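A toy latency model makes the contrast between the three cases concrete. All stage latencies below are illustrative assumptions for a sketch, not measurements from AGILE or any real device:

```python
# Toy critical-path model of the three I/O cases above.
# All latencies (microseconds) are illustrative assumptions, not measurements.
GPU_CPU_SYNC = 10   # GPU tells the CPU which data to fetch
NVME_READ    = 80   # NVMe read, whether it lands in DRAM or GPU memory
DRAM_TO_GPU  = 30   # CPU-driven bounce copy from DRAM to the GPU

case1 = GPU_CPU_SYNC + NVME_READ + DRAM_TO_GPU  # CPU and DRAM on the path
case2 = GPU_CPU_SYNC + NVME_READ                # GPUDirect: DRAM bypassed
case3 = NVME_READ                               # GPU-centric: no CPU handoff

print(case1, case2, case3)  # → 120 90 80
```

Each removed hop shortens the per-request critical path; case 3 also removes the GPU-CPU synchronization that cases 1 and 2 pay on every request.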
Programming Model -- Synchronous I/O

Synchronous I/O

Synchronous I/O keeps the request path simple, but it stalls compute on every batch

for batch in loader:
    buf = agile.read_sync(batch)  # blocking I/O
    # the thread must wait for the data transfer to finish
    logits = run_kernel(buf)
    commit(logits)
Sync I/O Timeline
[Timeline diagram: Batch 0, Batch 1, and Batch 2 each run read then compute back-to-back, with no overlap.]
Requests and kernels form a staircase: each batch blocks on transfer before compute can begin.
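Under a toy cost model (per-batch read and compute times are illustrative assumptions, not measurements), the staircase cost is simply additive:

```python
def sync_total(read_us, compute_us, n_batches):
    """Total time of the synchronous staircase in a toy cost model.

    Every batch blocks on its read before compute starts, so nothing
    overlaps and per-batch costs simply add up. Numbers are illustrative.
    """
    return n_batches * (read_us + compute_us)

print(sync_total(100, 60, 8))  # → 1280
```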
Programming Model -- Async I/O

Asynchronous I/O

Asynchronous I/O allows compute to continue while data is being transferred, improving overall throughput

buffers = [slot0, slot1]
agile.read_sync(batch0, buffers[0])  # fill the first slot before the loop
for step, batch in enumerate(loader[1:]):
    nxt = (step + 1) & 1
    ticket = agile.read_async(batch, buffers[nxt])  # prefetch next batch
    logits = run_kernel(buffers[step & 1])          # compute on current slot
    ticket.wait()  # make sure the prefetch has landed
    commit(logits)
Async I/O Timeline
[Timeline diagram: Batch 0 runs read then compute; Batches 1-3 prefetch while the previous batch computes.]
Double buffering keeps one slot busy with compute while the other slot is already filling with the next batch.
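In the same toy cost model, double buffering exposes only the first read, the last compute, and max(read, compute) per steady-state step (all numbers are illustrative assumptions):

```python
def async_total(read_us, compute_us, n_batches):
    """Total time with double-buffered prefetch in a toy cost model.

    The first read cannot be hidden; each middle step costs
    max(read, compute) because the prefetch overlaps compute; the
    final compute has no prefetch paired with it. Illustrative only.
    """
    steady = (n_batches - 1) * max(read_us, compute_us)
    return read_us + steady + compute_us

print(async_total(100, 60, 8))  # → 860, vs. 8 * (100 + 60) = 1280 serialized
```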
01 / CTC Sweep

Demonstrate AGILE's ability to overlap computation and NVMe I/O by varying the computation-to-communication (CTC) ratio.

Note: Data are collected from ASUS Pro WS WRX90E-SAGE SE with NVIDIA RTX 5000 Ada & Samsung 9100 PRO 1TB SSD
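The sweep can be previewed analytically with a toy overlap model: fixing the read time and scaling compute by the CTC ratio, the speedup from overlap peaks when compute and communication are balanced. This is a sketch with made-up numbers, not the demo's measured data:

```python
def overlap_speedup(ctc, n_batches=64, read_us=100.0):
    """Sync/async time ratio in a toy model; ctc = compute / communication."""
    compute_us = ctc * read_us
    sync_t = n_batches * (read_us + compute_us)           # fully serialized
    async_t = (read_us                                    # exposed first read
               + (n_batches - 1) * max(read_us, compute_us)
               + compute_us)                              # exposed last compute
    return sync_t / async_t

for ctc in (0.25, 1.0, 4.0):
    print(f"CTC {ctc:4}: {overlap_speedup(ctc):.2f}x")
```

The model predicts the best case near CTC = 1, where every step fully hides one transfer behind one kernel; far from 1, one side dominates and the hidden fraction shrinks.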
02 / Graph Applications

Evaluate AGILE on memory-bound graph workloads with irregular memory access patterns, showing reduced software-cache and NVMe I/O overhead compared to BaM.

Note: Data are collected from ASUS Pro WS WRX90E-SAGE SE with NVIDIA RTX 5000 Ada & Samsung 9100 PRO 1TB SSD
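The software cache that irregular graph accesses stress can be sketched in miniature. The direct-mapped, page-granular cache below is an illustrative stand-in for the concept only; AGILE's actual HBM-based cache and its customizable policies are more sophisticated:

```python
class ToyPageCache:
    """Direct-mapped software cache over fixed-size pages (illustrative)."""

    def __init__(self, n_slots):
        self.tags = [None] * n_slots  # page id resident in each slot
        self.hits = 0
        self.misses = 0

    def access(self, page_id):
        slot = page_id % len(self.tags)
        if self.tags[slot] == page_id:
            self.hits += 1            # served from the on-GPU cache
            return True
        self.misses += 1              # would trigger an NVMe read
        self.tags[slot] = page_id     # evict whatever occupied the slot
        return False

cache = ToyPageCache(4)
# A graph-like access pattern: neighborhood reuse plus an irregular jump.
for page in [0, 1, 2, 0, 1, 2, 9, 1, 0]:
    cache.access(page)
print(cache.hits, cache.misses)  # → 4 5
```

Note how page 9 evicts page 1 and turns the next access to page 1 into a miss; conflict behavior like this is exactly where a customizable policy can cut NVMe traffic.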
03 / Register Report

Analyze per-thread GPU register consumption across CUDA kernels, showing that AGILE is more lightweight and improves resource efficiency compared to BaM.

Label | Kernel | Regs | CMEM (B)
Run the register report to populate this table.
Note: Kernels are compiled for NVIDIA RTX 5000 Ada with CUDA 12.0
04 / DLRM Microbenchmark

Use the Criteo dataset to assess large-scale recommendation inference, where AGILE optimizes embedding access from SSDs and reduces end-to-end execution time.

Note: Data are collected from Dell R750 with NVIDIA RTX 5000 Ada & Dell Ent NVMe AGN MU AIC 1.6TB SSD
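At its core, a DLRM-style sparse lookup gathers a handful of embedding rows per sample and sum-pools them; it is this row-gather that turns into SSD reads once the table exceeds HBM. A minimal sketch (the table contents and indices are made up for illustration):

```python
DIM = 4  # embedding width; real DLRM tables use larger dimensions
# Stand-in for an SSD-resident embedding table: row id -> vector.
table = {row: [float(row)] * DIM for row in range(100)}

def gather_pool(indices):
    """Sum-pool the embedding rows for one sample's sparse features."""
    pooled = [0.0] * DIM
    for idx in indices:
        for d, v in enumerate(table[idx]):
            pooled[d] += v
    return pooled

print(gather_pool([3, 5, 7]))  # → [15.0, 15.0, 15.0, 15.0]
```

Each sample touches only a few of the table's rows, so the access pattern is sparse and irregular, which is why caching and asynchronous prefetch of rows pay off.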