GPU-accelerated containerized workloads are becoming the standard for machine learning training, inference serving, video transcoding, and scientific computing. Docker's GPU support, powered by the NVIDIA Container Toolkit, allows containers to access GPU hardware with near-native performance while maintaining the portability and isolation benefits of containerization. This guide covers the complete setup process from driver installation through multi-GPU sharing strategies.

Prerequisites

| Component | Minimum Version | Recommended |
|---|---|---|
| NVIDIA GPU | Maxwell (GTX 900) or newer | Ampere (RTX 3000) or newer |
| NVIDIA Driver | 470.x | 550.x or latest |
| Docker Engine | 19.03+ | 27.x+ |
| Linux Kernel | 3.10+ | 6.1+ |
| NVIDIA Container Toolkit | 1.13.0+ | Latest |
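The minimums above can be sanity-checked programmatically. A small sketch that compares dotted version strings numerically (`ver_tuple` is an illustrative helper, not part of any NVIDIA tooling):

```python
import platform
import re

def ver_tuple(v: str) -> tuple[int, ...]:
    """Turn '6.1.0' or '550.54.15' into a tuple of ints for comparison."""
    return tuple(int(x) for x in re.findall(r"\d+", v))

# Kernel check against the table's 3.10 minimum
kernel = platform.release().split("-")[0]  # e.g. "6.1.0"
print("kernel OK" if ver_tuple(kernel) >= ver_tuple("3.10") else "kernel too old")

# The same comparison works for driver versions
print(ver_tuple("550.54.15") >= ver_tuple("470"))  # True
```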

Installing the NVIDIA Container Toolkit

Step 1: Install NVIDIA Drivers

# Ubuntu/Debian
sudo apt update
sudo apt install -y nvidia-driver-550

# nvidia-utils provides command-line tools such as nvidia-smi
sudo apt install -y nvidia-utils-550
# (For the newest drivers, use the NVIDIA CUDA apt repository instead of the Ubuntu archive)

# Verify driver installation
nvidia-smi
# Should display GPU info, driver version, and CUDA version

# Example output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |   0  NVIDIA RTX 4090    Off   | 00000000:01:00.0  On |                  Off |
# | 30%   35C    P8    18W / 450W |    256MiB / 24564MiB |      0%      Default |
# +-------------------------------+----------------------+----------------------+

Step 2: Install NVIDIA Container Toolkit

# Add the NVIDIA container toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

Step 3: Verify GPU Access in Docker

# Test GPU access in a container
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

# Should display the same GPU info as running nvidia-smi on the host
# If this works, GPU support is properly configured
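Under the hood, `nvidia-ctk runtime configure` registers an `nvidia` runtime in `/etc/docker/daemon.json`. A minimal sketch of what the resulting entry looks like and how you might verify it (the sample JSON mirrors what the toolkit writes; exact contents vary by version):

```python
import json

# Example of the runtime entry nvidia-ctk adds to /etc/docker/daemon.json
sample_daemon_json = """
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime"
        }
    }
}
"""

config = json.loads(sample_daemon_json)
runtimes = config.get("runtimes", {})
if "nvidia" in runtimes:
    print("nvidia runtime registered:", runtimes["nvidia"]["path"])
else:
    print("missing - rerun: sudo nvidia-ctk runtime configure --runtime=docker")
```

In practice you would read the real file with `open("/etc/docker/daemon.json")` and restart Docker after any change.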

Using GPUs in Docker Containers

GPU Selection Flags

# All GPUs
docker run --gpus all myapp

# Specific number of GPUs
docker run --gpus 2 myapp

# Specific GPU by index
docker run --gpus '"device=0"' myapp
docker run --gpus '"device=0,2"' myapp

# Specific GPU by UUID
docker run --gpus '"device=GPU-12345678-abcd-efgh-ijkl-123456789012"' myapp

# Request specific capabilities
docker run --gpus '"capabilities=compute,utility"' myapp
# Available capabilities: compute, compat32, graphics, utility, video, display
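When several workloads compete for GPUs, it can help to pick the least-loaded device programmatically before building the `--gpus` flag. A sketch that parses `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits` output (shown here as a canned sample; in real use you would capture it with `subprocess.run`):

```python
# Sample output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
sample = """0, 85, 20100
1, 10, 512
2, 40, 8000"""

def least_loaded_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the lowest utilization."""
    rows = [line.split(",") for line in csv_text.strip().splitlines()]
    gpus = [(int(idx), int(util)) for idx, util, _mem in rows]
    return min(gpus, key=lambda g: g[1])[0]

idx = least_loaded_gpu(sample)
print(f'--gpus \'"device={idx}"\'')  # -> --gpus '"device=1"'
```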

CUDA Containers

NVIDIA provides official CUDA container images with different variants:

| Image Variant | Contents | Use Case | Size |
|---|---|---|---|
| base | Minimal CUDA runtime | Running pre-built CUDA apps | ~150 MB |
| runtime | Full CUDA runtime libraries (the `cudnn` tags add cuDNN) | Running ML inference | ~1.5 GB |
| devel | Runtime + headers + nvcc compiler | Building CUDA applications | ~3.5 GB |

# Multi-stage build for CUDA application
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
WORKDIR /app
COPY . .
RUN nvcc -O3 -o myapp main.cu

FROM nvidia/cuda:12.4.0-base-ubuntu22.04
COPY --from=builder /app/myapp /usr/local/bin/myapp
CMD ["myapp"]

TensorFlow with GPU

# TensorFlow GPU container
docker run --gpus all -it --rm \
  -v $(pwd)/notebooks:/notebooks \
  -p 8888:8888 \
  tensorflow/tensorflow:latest-gpu-jupyter

# Verify GPU is detected
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu \
  python -c "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU'))"

# Dockerfile for custom TensorFlow GPU application
FROM tensorflow/tensorflow:2.16.1-gpu

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "train.py"]

# docker-compose.yml for TensorFlow training
services:
  training:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./models:/app/models
    environment:
      TF_FORCE_GPU_ALLOW_GROWTH: "true"
      CUDA_VISIBLE_DEVICES: "0"

PyTorch with GPU

# PyTorch GPU container
docker run --gpus all -it --rm \
  pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Devices:', torch.cuda.device_count())"

# Dockerfile for PyTorch application
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Set environment for optimal GPU performance
ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0"
ENV CUDA_LAUNCH_BLOCKING=0

CMD ["python", "train.py"]

GPU Sharing Strategies

GPUs are expensive resources. Sharing them across multiple containers maximizes utilization and reduces costs.

Multi-Instance GPU (MIG)

Available on A100, A30, and H100 GPUs, MIG partitions a single GPU into up to 7 isolated instances, each with dedicated compute, memory, and cache resources.

# Enable MIG mode on an A100
sudo nvidia-smi -i 0 -mig 1

# Create three GPU instances
# 2g.10gb = 2 compute slices, 10 GB memory each (three fit within one A100)
sudo nvidia-smi mig -i 0 -cgi 2g.10gb,2g.10gb,2g.10gb -C

# List MIG instances
nvidia-smi mig -i 0 -lgi

# Assign a specific MIG instance to a container
docker run --gpus '"device=0:0"' myapp   # First MIG instance
docker run --gpus '"device=0:1"' myapp   # Second MIG instance
docker run --gpus '"device=0:2"' myapp   # Third MIG instance
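MIG profiles are sized in compute slices out of the 7 available on an A100. A hypothetical helper to check whether a requested mix of profiles fits (slice counts are from NVIDIA's published A100-40GB profile list; `fits_on_a100` is an illustrative name, not a toolkit API):

```python
# Compute slices consumed by common A100-40GB MIG profiles
MIG_SLICES = {
    "1g.5gb": 1,
    "2g.10gb": 2,
    "3g.20gb": 3,
    "4g.20gb": 4,
    "7g.40gb": 7,
}

def fits_on_a100(profiles: list[str]) -> bool:
    """True if the requested profiles fit within one A100's 7 compute slices."""
    return sum(MIG_SLICES[p] for p in profiles) <= 7

print(fits_on_a100(["2g.10gb"] * 3))         # 6 slices -> True
print(fits_on_a100(["3g.20gb", "4g.20gb"]))  # 3 + 4 = 7 slices -> True
print(fits_on_a100(["3g.20gb"] * 3))         # 9 slices -> False
```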

Multi-Process Service (MPS)

MPS allows multiple CUDA processes to share a single GPU concurrently. Unlike MIG, MPS does not provide hardware isolation but enables finer-grained sharing on any NVIDIA GPU.

# Start MPS daemon on the host
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

# Containers using the GPU will automatically share via MPS
docker run --gpus all -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  myapp

Time-Slicing

Time-slicing is the simplest sharing method: multiple containers access the same GPU, and the GPU scheduler time-slices between them. No special configuration is needed.

# Configure time-slicing in the NVIDIA device plugin (Kubernetes)
# For Docker, simply assign the same GPU to multiple containers:
docker run --gpus '"device=0"' -d container-a
docker run --gpus '"device=0"' -d container-b
docker run --gpus '"device=0"' -d container-c
# All three share GPU 0 via time-slicing

| Method | GPU Required | Isolation | Overhead | Max Instances |
|---|---|---|---|---|
| MIG | A100/A30/H100 | Hardware | None | 7 per GPU |
| MPS | Any NVIDIA | Partial (process-level) | Low | 48 clients (Volta and newer) |
| Time-slicing | Any NVIDIA | None | Context switch | Limited by GPU memory |
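The Kubernetes note above refers to the NVIDIA device plugin's sharing configuration; a sketch of such a config, following the plugin's documented `timeSlicing` schema (`replicas` is how many pods may share each physical GPU):

```yaml
# Config for the NVIDIA Kubernetes device plugin: advertise each
# physical GPU as 4 schedulable replicas, shared via time-slicing.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```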

Docker Compose GPU Configuration

Docker Compose supports GPU allocation through the deploy.resources.reservations.devices syntax:

services:
  # ML training service with all GPUs
  trainer:
    image: pytorch/pytorch:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/data
      - ./checkpoints:/checkpoints

  # Inference service with a single GPU
  inference:
    image: mymodel-serve:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 8G
    ports:
      - "8501:8501"

  # Jupyter notebook for development
  jupyter:
    image: jupyter/tensorflow-notebook:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu, compute, utility]
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work

Monitoring GPU Usage

# Basic GPU monitoring
nvidia-smi dmon -d 1
# Displays GPU utilization, memory, temperature, power every second

# Watch GPU usage continuously
watch -n 1 nvidia-smi

# Inside a container
docker exec mygpu-container nvidia-smi

# Export GPU metrics to Prometheus
# Use dcgm-exporter (NVIDIA Data Center GPU Manager)
docker run -d --gpus all --rm \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest

# docker-compose.yml for GPU monitoring stack
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

# prometheus.yml
scrape_configs:
  - job_name: 'gpu-metrics'
    static_configs:
      - targets: ['dcgm-exporter:9400']
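dcgm-exporter serves plain Prometheus exposition text on `:9400/metrics`. A small sketch that parses a few such lines (`DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_FB_USED` are real DCGM field names; the sample values are invented):

```python
import re

# Sample lines as served by dcgm-exporter at http://localhost:9400/metrics
sample = """\
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 83
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa"} 20100
"""

def parse_metric(text: str, name: str) -> dict[str, float]:
    """Map GPU index -> value for one metric in Prometheus exposition format."""
    pattern = re.compile(rf'^{name}{{[^}}]*gpu="(\d+)"[^}}]*}} ([\d.]+)$', re.M)
    return {gpu: float(val) for gpu, val in pattern.findall(text)}

print(parse_metric(sample, "DCGM_FI_DEV_GPU_UTIL"))  # {'0': 83.0, '1': 12.0}
```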

Key GPU Metrics to Monitor

| Metric | Description | Alert Threshold |
|---|---|---|
| GPU Utilization | Percentage of time GPU cores are active | < 20% (underutilized) or > 95% (saturated) |
| Memory Utilization | GPU memory usage vs. total | > 90% (OOM risk) |
| Temperature | GPU die temperature | > 85 °C |
| Power Usage | Current power draw vs. TDP | > 95% of TDP sustained |
| ECC Errors | Memory error corrections | Any uncorrectable errors |
| PCIe Throughput | Host-GPU data transfer rate | Bottleneck if sustained near link maximum |
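The thresholds above translate directly into alerting logic. A hedged sketch (threshold values come from the table; the function name and metric keys are illustrative, not from any monitoring API):

```python
def gpu_alerts(metrics: dict) -> list[str]:
    """Evaluate one GPU's readings against the alert thresholds in the table above."""
    alerts = []
    util = metrics["utilization_pct"]
    if util < 20:
        alerts.append("underutilized (< 20% utilization)")
    elif util > 95:
        alerts.append("saturated (> 95% utilization)")
    if metrics["memory_used_pct"] > 90:
        alerts.append("OOM risk (> 90% memory)")
    if metrics["temperature_c"] > 85:
        alerts.append("overheating (> 85 C)")
    if metrics["uncorrectable_ecc_errors"] > 0:
        alerts.append("uncorrectable ECC errors detected")
    return alerts

print(gpu_alerts({
    "utilization_pct": 98,
    "memory_used_pct": 93,
    "temperature_c": 79,
    "uncorrectable_ecc_errors": 0,
}))  # ['saturated (> 95% utilization)', 'OOM risk (> 90% memory)']
```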

Multi-GPU and Distributed Training

# Run a container with multiple specific GPUs
docker run --gpus '"device=0,1,2,3"' \
  -e NCCL_DEBUG=INFO \
  -e NCCL_SOCKET_IFNAME=eth0 \
  --shm-size=16g \
  pytorch/pytorch:latest \
  torchrun --nproc_per_node=4 train.py
  # torchrun supersedes the deprecated python -m torch.distributed.launch
Warning: Multi-GPU training with NCCL (NVIDIA Collective Communications Library) requires adequate shared memory. The default Docker shared memory size (64 MB) is insufficient. Always set --shm-size to at least 1 GB, or preferably 8-16 GB for multi-GPU training workloads.
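The same warning applies under Compose, where the equivalent of `--shm-size` is the service-level `shm_size` key. A sketch, reusing the training image from above:

```yaml
services:
  trainer:
    image: pytorch/pytorch:latest
    shm_size: "16gb"   # Compose equivalent of docker run --shm-size=16g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
```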

Cloud GPU Instances

| Provider | Instance Type | GPU | Approximate Cost |
|---|---|---|---|
| AWS | p4d.24xlarge | 8x A100 (40 GB) | $32/hr |
| AWS | g5.xlarge | 1x A10G (24 GB) | $1.01/hr |
| GCP | a2-highgpu-1g | 1x A100 (40 GB) | $3.67/hr |
| Lambda Labs | 1x A100 | 1x A100 (80 GB) | $1.10/hr |
| Vast.ai | Community | RTX 4090 (24 GB) | $0.20-0.40/hr |

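The hourly rates above make total-job costs easy to compare; a quick arithmetic sketch using the table's figures (rates are the approximate listed values and will drift over time):

```python
# Approximate hourly rates from the table above
rates = {
    "GCP a2-highgpu-1g (1x A100)": 3.67,
    "Lambda Labs (1x A100)": 1.10,
}

def job_cost(rate_per_hr: float, hours: float) -> float:
    """Total cost of a job at a flat hourly rate, rounded to cents."""
    return round(rate_per_hr * hours, 2)

# Cost of a 24-hour single-GPU training run on each provider
for name, rate in rates.items():
    print(f"{name}: ${job_cost(rate, 24):.2f}")
```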
Tip: Containerizing GPU workloads makes it easy to move between local development (with a consumer GPU) and cloud training (with data center GPUs) without changing your code. Build your Docker image once, and run it anywhere there is an NVIDIA GPU. Management platforms like usulnet help track GPU resource usage across multiple nodes in your infrastructure.

Looking ahead: The GPU computing landscape in Docker is evolving rapidly. NVIDIA's upcoming multi-tenant GPU features, improved MIG flexibility, and container-native GPU scheduling will continue to make GPU sharing more efficient and accessible for containerized workloads.