Docker GPU Computing: NVIDIA Container Toolkit and CUDA Setup
GPU-accelerated containerized workloads are becoming the standard for machine learning training, inference serving, video transcoding, and scientific computing. Docker's GPU support, powered by the NVIDIA Container Toolkit, allows containers to access GPU hardware with near-native performance while maintaining the portability and isolation benefits of containerization. This guide covers the complete setup process from driver installation through multi-GPU sharing strategies.
Prerequisites
| Component | Minimum Version | Recommended |
|---|---|---|
| NVIDIA GPU | Maxwell (GTX 900) or newer | Ampere (RTX 3000+) or newer |
| NVIDIA Driver | 470.x | 550.x or latest |
| Docker Engine | 19.03+ | 27.x+ |
| Linux Kernel | 3.10+ | 6.1+ |
| NVIDIA Container Toolkit | 1.13.0+ | Latest |
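The minimums in the table can be verified in one pass. Below is a rough sketch with a hypothetical `meets_minimum` helper; the installed-version strings are illustrative, and in practice you would capture them from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, `docker --version`, and `uname -r`.

```python
# Sketch: compare installed component versions against the table's minimums.
# Installed versions below are illustrative, not read from a real machine.

MINIMUMS = {
    "driver": "470.0",
    "docker": "19.03",
    "kernel": "3.10",
    "toolkit": "1.13.0",
}

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, field by field."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    inst, mini = to_tuple(installed), to_tuple(minimum)
    # Pad the shorter tuple with zeros so "550" compares cleanly to "470.0"
    width = max(len(inst), len(mini))
    pad = lambda t: t + (0,) * (width - len(t))
    return pad(inst) >= pad(mini)

installed = {"driver": "550.54.15", "docker": "27.1.1",
             "kernel": "6.1.0", "toolkit": "1.16.2"}
for name, version in installed.items():
    status = "OK" if meets_minimum(version, MINIMUMS[name]) else "TOO OLD"
    print(f"{name}: {version} (min {MINIMUMS[name]}) -> {status}")
```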
Installing the NVIDIA Container Toolkit
Step 1: Install NVIDIA Drivers
# Ubuntu/Debian
sudo apt update
sudo apt install -y nvidia-driver-550
# The driver package pulls in nvidia-utils-550 (nvidia-smi and friends).
# For the newest driver releases, add the NVIDIA CUDA repository instead.
# Verify driver installation
nvidia-smi
# Should display GPU info, driver version, and CUDA version
# Example output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
# | 0 NVIDIA RTX 4090 Off | 00000000:01:00.0 On | Off |
# | 30% 35C P8 18W / 450W | 256MiB / 24564MiB | 0% Default |
# +-------------------------------+----------------------+----------------------+
Step 2: Install NVIDIA Container Toolkit
# Add the NVIDIA container toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit
# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
# Restart Docker
sudo systemctl restart docker
Step 3: Verify GPU Access in Docker
# Test GPU access in a container
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# Should display the same GPU info as running nvidia-smi on the host
# If this works, GPU support is properly configured
Using GPUs in Docker Containers
GPU Selection Flags
# All GPUs
docker run --gpus all myapp
# Specific number of GPUs
docker run --gpus 2 myapp
# Specific GPU by index
docker run --gpus '"device=0"' myapp
docker run --gpus '"device=0,2"' myapp
# Specific GPU by UUID
docker run --gpus '"device=GPU-12345678-abcd-efgh-ijkl-123456789012"' myapp
# Request specific capabilities
docker run --gpus '"capabilities=compute,utility"' myapp
# Available capabilities: compute, compat32, graphics, utility, video, display
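The indices and UUIDs used with `device=` come from `nvidia-smi -L`. As a hedged sketch, the listing can be parsed offline with the standard library; the sample output embedded below is illustrative rather than captured from a real machine.

```python
import re

# Illustrative `nvidia-smi -L` output (not from a real machine); on a live
# host you would capture it with: subprocess.run(["nvidia-smi", "-L"], ...)
SAMPLE = """\
GPU 0: NVIDIA RTX 4090 (UUID: GPU-12345678-abcd-4321-8765-123456789012)
GPU 1: NVIDIA RTX 4090 (UUID: GPU-87654321-dcba-1234-5678-210987654321)
"""

def list_gpus(listing: str) -> list[dict]:
    """Extract index, model name, and UUID from `nvidia-smi -L` style lines."""
    pattern = re.compile(r"GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)")
    return [
        {"index": int(m.group(1)), "name": m.group(2), "uuid": m.group(3)}
        for m in pattern.finditer(listing)
    ]

for gpu in list_gpus(SAMPLE):
    # Either form works after --gpus: device=<index> or device=<uuid>
    print(f'--gpus \'"device={gpu["index"]}"\'  or  --gpus \'"device={gpu["uuid"]}"\'')
```

UUIDs are the safer choice in automation, since indices can shift across reboots or driver reloads.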
CUDA Containers
NVIDIA provides official CUDA container images with different variants:
| Image Variant | Contents | Use Case | Size |
|---|---|---|---|
| base | CUDA runtime only | Running pre-built CUDA apps | ~150 MB |
| runtime | CUDA runtime + CUDA math libraries (cuDNN in the -cudnn variants) | Running ML inference | ~1.5 GB |
| devel | Runtime + headers + nvcc compiler | Building CUDA applications | ~3.5 GB |
# Multi-stage build for CUDA application
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
WORKDIR /app
COPY . .
RUN nvcc -O3 -o myapp main.cu
FROM nvidia/cuda:12.4.0-base-ubuntu22.04
COPY --from=builder /app/myapp /usr/local/bin/myapp
CMD ["myapp"]
TensorFlow with GPU
# TensorFlow GPU container
docker run --gpus all -it --rm \
-v $(pwd)/notebooks:/tf/notebooks \
-p 8888:8888 \
tensorflow/tensorflow:latest-gpu-jupyter
# Verify GPU is detected
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu \
python -c "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU'))"
# Dockerfile for custom TensorFlow GPU application
FROM tensorflow/tensorflow:2.16.1-gpu
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "train.py"]
# docker-compose.yml for TensorFlow training
services:
training:
build: .
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./data:/app/data
- ./models:/app/models
environment:
TF_FORCE_GPU_ALLOW_GROWTH: "true"
CUDA_VISIBLE_DEVICES: "0"
PyTorch with GPU
# PyTorch GPU container
docker run --gpus all -it --rm \
pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Devices:', torch.cuda.device_count())"
# Dockerfile for PyTorch application
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Set environment for optimal GPU performance
ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0"
ENV CUDA_LAUNCH_BLOCKING=0
CMD ["python", "train.py"]
GPU Sharing Strategies
GPUs are expensive resources. Sharing them across multiple containers maximizes utilization and reduces costs.
Multi-Instance GPU (MIG)
Available on A100, A30, and H100 GPUs, MIG partitions a single GPU into up to 7 isolated instances, each with dedicated compute, memory, and cache resources.
# Enable MIG mode on an A100
sudo nvidia-smi -i 0 -mig 1
# Create GPU instances
# Profile 9 = 3g.20gb (3 compute slices, 20 GB memory); an A100-40GB
# has 7 compute slices, so at most two 3g.20gb instances fit
sudo nvidia-smi mig -i 0 -cgi 9,9 -C
# List MIG instances
nvidia-smi mig -i 0 -lgi
# Assign a specific MIG instance to a container
docker run --gpus '"device=0:0"' myapp # First MIG instance
docker run --gpus '"device=0:1"' myapp # Second MIG instance
Multi-Process Service (MPS)
MPS allows multiple CUDA processes to share a single GPU concurrently. Unlike MIG, MPS does not provide hardware isolation but enables finer-grained sharing on any NVIDIA GPU.
# Start MPS daemon on the host
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
# Containers using the GPU will automatically share via MPS
docker run --gpus all -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
-v /tmp/nvidia-mps:/tmp/nvidia-mps \
myapp
Time-Slicing
Time-slicing is the simplest sharing method: multiple containers access the same GPU, and the GPU scheduler time-slices between them. No special configuration is needed.
# Configure time-slicing in the NVIDIA device plugin (Kubernetes)
# For Docker, simply assign the same GPU to multiple containers:
docker run --gpus '"device=0"' -d container-a
docker run --gpus '"device=0"' -d container-b
docker run --gpus '"device=0"' -d container-c
# All three share GPU 0 via time-slicing
| Method | GPU Required | Isolation | Overhead | Max Instances |
|---|---|---|---|---|
| MIG | A100/A30/H100 | Hardware | None | 7 per GPU |
| MPS | Any NVIDIA | Partial (process) | Low | 48 clients |
| Time-slicing | Any NVIDIA | None | Context switch | Unlimited |
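The comparison table reduces to a simple decision rule. The sketch below is purely illustrative logic mirroring the table rows; `pick_sharing_strategy` is a hypothetical helper, and the model list simply reflects the GPUs named above.

```python
# Sketch: choose a sharing strategy per the comparison table above.
MIG_CAPABLE = {"A100", "A30", "H100"}  # per the table; newer data-center GPUs may also qualify

def pick_sharing_strategy(gpu_model: str, need_hw_isolation: bool, clients: int) -> str:
    """MIG when hardware isolation is required and supported, MPS for
    low-overhead concurrent sharing up to its client limit, and
    time-slicing as the universal fallback."""
    if need_hw_isolation:
        if gpu_model in MIG_CAPABLE and clients <= 7:
            return "MIG"
        raise ValueError("hardware isolation needs a MIG-capable GPU and <= 7 instances")
    if clients <= 48:
        return "MPS"
    return "time-slicing"

print(pick_sharing_strategy("A100", need_hw_isolation=True, clients=3))        # MIG
print(pick_sharing_strategy("RTX 4090", need_hw_isolation=False, clients=10))  # MPS
print(pick_sharing_strategy("RTX 4090", need_hw_isolation=False, clients=100)) # time-slicing
```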
Docker Compose GPU Configuration
Docker Compose supports GPU allocation through the deploy.resources.reservations.devices syntax:
services:
# ML training service with all GPUs
trainer:
image: pytorch/pytorch:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
- ./data:/data
- ./checkpoints:/checkpoints
# Inference service with a single GPU
inference:
image: mymodel-serve:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
limits:
memory: 8G
ports:
- "8501:8501"
# Jupyter notebook for development
jupyter:
image: jupyter/tensorflow-notebook:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu, compute, utility]
ports:
- "8888:8888"
volumes:
- ./notebooks:/home/jovyan/work
Monitoring GPU Usage
# Basic GPU monitoring
nvidia-smi dmon -d 1
# Displays GPU utilization, memory, temperature, power every second
# Watch GPU usage continuously
watch -n 1 nvidia-smi
# Inside a container
docker exec mygpu-container nvidia-smi
# Export GPU metrics to Prometheus
# Use dcgm-exporter (NVIDIA Data Center GPU Manager)
docker run -d --gpus all --rm \
-p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:latest
# docker-compose.yml for GPU monitoring stack
services:
dcgm-exporter:
image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- "9400:9400"
restart: unless-stopped
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
# prometheus.yml
scrape_configs:
- job_name: 'gpu-metrics'
static_configs:
- targets: ['dcgm-exporter:9400']
Key GPU Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| GPU Utilization | Percentage of time GPU cores are active | < 20% (underutilized) or > 95% (saturated) |
| Memory Utilization | GPU memory usage vs. total | > 90% (OOM risk) |
| Temperature | GPU die temperature | > 85 degrees C |
| Power Usage | Current power draw vs. TDP | > 95% of TDP sustained |
| ECC Errors | Memory error corrections | Any uncorrectable errors |
| PCIe Throughput | Data transfer rate host-GPU | Bottleneck if sustained near max |
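The thresholds in the table can be applied directly to `nvidia-smi --query-gpu` output. The sketch below uses only the standard library; the query field names in the comment are real nvidia-smi options, but the CSV sample line is illustrative, not captured from a live GPU.

```python
# Parse one line of:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,\
#              temperature.gpu,power.draw,power.limit \
#              --format=csv,noheader,nounits
# and apply the alert thresholds from the table above.
SAMPLE_LINE = "97, 22800, 24564, 88, 440.2, 450.0"  # illustrative values

def check_gpu(csv_line: str) -> list[str]:
    util, mem_used, mem_total, temp, power, power_cap = (
        float(field) for field in csv_line.split(",")
    )
    alerts = []
    if util < 20:
        alerts.append("underutilized (< 20% GPU utilization)")
    elif util > 95:
        alerts.append("saturated (> 95% GPU utilization)")
    if mem_used / mem_total > 0.90:
        alerts.append("OOM risk (> 90% memory)")
    if temp > 85:
        alerts.append("thermal (> 85 C)")
    if power / power_cap > 0.95:
        alerts.append("power (> 95% of TDP)")
    return alerts

for alert in check_gpu(SAMPLE_LINE):
    print("ALERT:", alert)
```

Run in a loop (or scraped by a sidecar), this is the poor man's version of what dcgm-exporter feeds to Prometheus.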
Multi-GPU and Distributed Training
# Run a container with multiple specific GPUs
docker run --gpus '"device=0,1,2,3"' \
-e NCCL_DEBUG=INFO \
-e NCCL_SOCKET_IFNAME=eth0 \
--shm-size=16g \
pytorch/pytorch:latest \
torchrun --nproc_per_node=4 train.py
Set --shm-size to at least 1 GB, preferably 8-16 GB for multi-GPU training workloads: PyTorch DataLoader workers exchange batches through shared memory, and Docker's default /dev/shm is only 64 MB.
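Each of the four processes the launcher spawns is identified by a few environment variables. The arithmetic is simple, and the sketch below (plain Python, no torch required; `worker_env` is a hypothetical helper) shows how local rank, global rank, and world size relate:

```python
# Sketch: how a distributed launcher derives each worker's identity.
# With --nproc_per_node=4 on every node, worker i on node n gets:
#   LOCAL_RANK = i            (which GPU on this machine)
#   RANK       = n * 4 + i    (global rank across the whole job)
#   WORLD_SIZE = nodes * 4    (total number of workers)

def worker_env(node_rank: int, local_rank: int, nproc_per_node: int, nnodes: int) -> dict:
    return {
        "LOCAL_RANK": local_rank,
        "RANK": node_rank * nproc_per_node + local_rank,
        "WORLD_SIZE": nnodes * nproc_per_node,
    }

# Single node, 4 GPUs, matching the docker run example above
for gpu in range(4):
    print(worker_env(node_rank=0, local_rank=gpu, nproc_per_node=4, nnodes=1))
```

Inside the container, training code typically binds each process to `cuda:LOCAL_RANK`, which is why the container must see all four GPUs.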
Cloud GPU Instances
| Provider | Instance Type | GPU | Approximate Cost |
|---|---|---|---|
| AWS | p4d.24xlarge | 8x A100 (40 GB) | $32/hr |
| AWS | g5.xlarge | 1x A10G (24 GB) | $1.01/hr |
| GCP | a2-highgpu-1g | 1x A100 (40 GB) | $3.67/hr |
| Lambda Labs | 1x A100 | 1x A100 (80 GB) | $1.10/hr |
| Vast.ai | Community | RTX 4090 (24 GB) | $0.20-0.40/hr |
Looking ahead: The GPU computing landscape in Docker is evolving rapidly. NVIDIA's upcoming multi-tenant GPU features, improved MIG flexibility, and container-native GPU scheduling will continue to make GPU sharing more efficient and accessible for containerized workloads.