GPU-accelerated containerized workloads are becoming the standard for machine learning training, inference serving, video transcoding, and scientific computing. Docker's GPU support, powered by the NVIDIA Container Toolkit, allows containers to access GPU hardware with near-native performance while maintaining the portability and isolation benefits of containerization. This guide covers the complete setup process from driver installation through multi-GPU sharing strategies.

Prerequisites

| Component | Minimum Version | Recommended |
|---|---|---|
| NVIDIA GPU | Maxwell (GTX 900) or newer | Ampere (RTX 3000) or newer |
| NVIDIA Driver | 470.x | 550.x or latest |
| Docker Engine | 19.03+ | 27.x+ |
| Linux Kernel | 3.10+ | 6.1+ |
| NVIDIA Container Toolkit | 1.13.0+ | Latest |
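The minimums above can be sanity-checked programmatically. A small sketch that compares dotted version strings numerically (`ver_tuple` is an illustrative helper, not part of any NVIDIA tooling):

```python
import platform
import re

def ver_tuple(v: str) -> tuple[int, ...]:
    """Turn '6.1.0' or '550.54.15' into a tuple of ints for comparison."""
    return tuple(int(x) for x in re.findall(r"\d+", v))

# Kernel check against the table's 3.10 minimum
kernel = platform.release().split("-")[0]  # e.g. "6.1.0"
print("kernel OK" if ver_tuple(kernel) >= ver_tuple("3.10") else "kernel too old")

# The same comparison works for driver versions
print(ver_tuple("550.54.15") >= ver_tuple("470"))  # True
```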

Installing the NVIDIA Container Toolkit

Step 1: Install NVIDIA Drivers

# Ubuntu/Debian
sudo apt update
sudo apt install -y nvidia-driver-550

# nvidia-utils provides command-line tools such as nvidia-smi
sudo apt install -y nvidia-utils-550
# (For the newest drivers, use the NVIDIA CUDA apt repository instead of the Ubuntu archive)

# Verify driver installation
nvidia-smi
# Should display GPU info, driver version, and CUDA version

# Example output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4     |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
# |   0  NVIDIA RTX 4090    Off   | 00000000:01:00.0  On |                  Off |
# | 30%   35C    P8    18W / 450W |    256MiB / 24564MiB |      0%      Default |
# +-------------------------------+----------------------+----------------------+

Step 2: Install NVIDIA Container Toolkit

# Add the NVIDIA container toolkit repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker
sudo systemctl restart docker

Step 3: Verify GPU Access in Docker

# Test GPU access in a container
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

# Should display the same GPU info as running nvidia-smi on the host
# If this works, GPU support is properly configured
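Under the hood, `nvidia-ctk runtime configure` registers an `nvidia` runtime in `/etc/docker/daemon.json`. A minimal sketch of what the resulting entry looks like and how you might verify it (the sample JSON mirrors what the toolkit writes; exact contents vary by version):

```python
import json

# Example of the runtime entry nvidia-ctk adds to /etc/docker/daemon.json
sample_daemon_json = """
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime"
        }
    }
}
"""

config = json.loads(sample_daemon_json)
runtimes = config.get("runtimes", {})
if "nvidia" in runtimes:
    print("nvidia runtime registered:", runtimes["nvidia"]["path"])
else:
    print("missing - rerun: sudo nvidia-ctk runtime configure --runtime=docker")
```

In practice you would read the real file with `open("/etc/docker/daemon.json")` and restart Docker after any change.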

Using GPUs in Docker Containers

GPU Selection Flags

# All GPUs
docker run --gpus all myapp

# Specific number of GPUs
docker run --gpus 2 myapp

# Specific GPU by index
docker run --gpus '"device=0"' myapp
docker run --gpus '"device=0,2"' myapp

# Specific GPU by UUID
docker run --gpus '"device=GPU-12345678-abcd-efgh-ijkl-123456789012"' myapp

# Request specific capabilities
docker run --gpus '"capabilities=compute,utility"' myapp
# Available capabilities: compute, compat32, graphics, utility, video, display
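When several workloads compete for GPUs, it can help to pick the least-loaded device programmatically before building the `--gpus` flag. A sketch that parses `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits` output (shown here as a canned sample; in real use you would capture it with `subprocess.run`):

```python
# Sample output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits
sample = """0, 85, 20100
1, 10, 512
2, 40, 8000"""

def least_loaded_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the lowest utilization."""
    rows = [line.split(",") for line in csv_text.strip().splitlines()]
    gpus = [(int(idx), int(util)) for idx, util, _mem in rows]
    return min(gpus, key=lambda g: g[1])[0]

idx = least_loaded_gpu(sample)
print(f'--gpus \'"device={idx}"\'')  # -> --gpus '"device=1"'
```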

CUDA Containers

NVIDIA provides official CUDA container images with different variants:

| Image Variant | Contents | Use Case | Size |
|---|---|---|---|
| base | Minimal CUDA runtime | Running pre-built CUDA apps | ~150 MB |
| runtime | Full CUDA runtime libraries (the `cudnn` tags add cuDNN) | Running ML inference | ~1.5 GB |
| devel | Runtime + headers + nvcc compiler | Building CUDA applications | ~3.5 GB |

# Multi-stage build for CUDA application
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
WORKDIR /app
COPY . .
RUN nvcc -O3 -o myapp main.cu

FROM nvidia/cuda:12.4.0-base-ubuntu22.04
COPY --from=builder /app/myapp /usr/local/bin/myapp
CMD ["myapp"]

TensorFlow with GPU

# TensorFlow GPU container
docker run --gpus all -it --rm \
  -v $(pwd)/notebooks:/notebooks \
  -p 8888:8888 \
  tensorflow/tensorflow:latest-gpu-jupyter

# Verify GPU is detected
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu \
  python -c "import tensorflow as tf; print('GPUs:', tf.config.list_physical_devices('GPU'))"

# Dockerfile for custom TensorFlow GPU application
FROM tensorflow/tensorflow:2.16.1-gpu

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "train.py"]

# docker-compose.yml for TensorFlow training
services:
  training:
    build: .
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./data:/app/data
      - ./models:/app/models
    environment:
      TF_FORCE_GPU_ALLOW_GROWTH: "true"
      CUDA_VISIBLE_DEVICES: "0"

PyTorch with GPU

# PyTorch GPU container
docker run --gpus all -it --rm \
  pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime \
  python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Devices:', torch.cuda.device_count())"

# Dockerfile for PyTorch application
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Set environment for optimal GPU performance
ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0"
ENV CUDA_LAUNCH_BLOCKING=0

CMD ["python", "train.py"]

GPU Sharing Strategies

GPUs are expensive resources. Sharing them across multiple containers maximizes utilization and reduces costs.

Multi-Instance GPU (MIG)

Available on A100, A30, and H100 GPUs, MIG partitions a single GPU into up to 7 isolated instances, each with dedicated compute, memory, and cache resources.

# Enable MIG mode on an A100
sudo nvidia-smi -i 0 -mig 1

# Create three GPU instances
# 2g.10gb = 2 compute slices, 10 GB memory each (three fit within one A100)
sudo nvidia-smi mig -i 0 -cgi 2g.10gb,2g.10gb,2g.10gb -C

# List MIG instances
nvidia-smi mig -i 0 -lgi

# Assign a specific MIG instance to a container
docker run --gpus '"device=0:0"' myapp   # First MIG instance
docker run --gpus '"device=0:1"' myapp   # Second MIG instance
docker run --gpus '"device=0:2"' myapp   # Third MIG instance
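MIG profiles are sized in compute slices out of the 7 available on an A100. A hypothetical helper to check whether a requested mix of profiles fits (slice counts are from NVIDIA's published A100-40GB profile list; `fits_on_a100` is an illustrative name, not a toolkit API):

```python
# Compute slices consumed by common A100-40GB MIG profiles
MIG_SLICES = {
    "1g.5gb": 1,
    "2g.10gb": 2,
    "3g.20gb": 3,
    "4g.20gb": 4,
    "7g.40gb": 7,
}

def fits_on_a100(profiles: list[str]) -> bool:
    """True if the requested profiles fit within one A100's 7 compute slices."""
    return sum(MIG_SLICES[p] for p in profiles) <= 7

print(fits_on_a100(["2g.10gb"] * 3))         # 6 slices -> True
print(fits_on_a100(["3g.20gb", "4g.20gb"]))  # 3 + 4 = 7 slices -> True
print(fits_on_a100(["3g.20gb"] * 3))         # 9 slices -> False
```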

Multi-Process Service (MPS)

MPS allows multiple CUDA processes to share a single GPU concurrently. Unlike MIG, MPS does not provide hardware isolation but enables finer-grained sharing on any NVIDIA GPU.

# Start MPS daemon on the host
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

# Containers using the GPU will automatically share via MPS
docker run --gpus all -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  myapp

Time-Slicing

Time-slicing is the simplest sharing method: multiple containers access the same GPU, and the GPU scheduler time-slices between them. No special configuration is needed.

# Configure time-slicing in the NVIDIA device plugin (Kubernetes)
# For Docker, simply assign the same GPU to multiple containers:
docker run --gpus '"device=0"' -d container-a
docker run --gpus '"device=0"' -d container-b
docker run --gpus '"device=0"' -d container-c
# All three share GPU 0 via time-slicing

| Method | GPU Required | Isolation | Overhead | Max Instances |
|---|---|---|---|---|
| MIG | A100/A30/H100 | Hardware | None | 7 per GPU |
| MPS | Any NVIDIA | Partial (process-level) | Low | 48 clients (Volta and newer) |
| Time-slicing | Any NVIDIA | None | Context switch | Limited by GPU memory |
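The Kubernetes note above refers to the NVIDIA device plugin's sharing configuration; a sketch of such a config, following the plugin's documented `timeSlicing` schema (`replicas` is how many pods may share each physical GPU):

```yaml
# Config for the NVIDIA Kubernetes device plugin: advertise each
# physical GPU as 4 schedulable replicas, shared via time-slicing.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```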

Docker Compose GPU Configuration

Docker Compose supports GPU allocation through the deploy.resources.reservations.devices syntax:

services:
  # ML training service with all GPUs
  trainer:
    image: pytorch/pytorch:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./data:/data
      - ./checkpoints:/checkpoints

  # Inference service with a single GPU
  inference:
    image: mymodel-serve:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 8G
    ports:
      - "8501:8501"

  # Jupyter notebook for development
  jupyter:
    image: jupyter/tensorflow-notebook:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu, compute, utility]
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work

Monitoring GPU Usage

# Basic GPU monitoring
nvidia-smi dmon -d 1
# Displays GPU utilization, memory, temperature, power every second

# Watch GPU usage continuously
watch -n 1 nvidia-smi

# Inside a container
docker exec mygpu-container nvidia-smi

# Export GPU metrics to Prometheus
# Use dcgm-exporter (NVIDIA Data Center GPU Manager)
docker run -d --gpus all --rm \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest

# docker-compose.yml for GPU monitoring stack
services:
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "9400:9400"
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

# prometheus.yml
scrape_configs:
  - job_name: 'gpu-metrics'
    static_configs:
      - targets: ['dcgm-exporter:9400']
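dcgm-exporter serves plain Prometheus exposition text on `:9400/metrics`. A small sketch that parses a few such lines (`DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_DEV_FB_USED` are real DCGM field names; the sample values are invented):

```python
import re

# Sample lines as served by dcgm-exporter at http://localhost:9400/metrics
sample = """\
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 83
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa"} 20100
"""

def parse_metric(text: str, name: str) -> dict[str, float]:
    """Map GPU index -> value for one metric in Prometheus exposition format."""
    pattern = re.compile(rf'^{name}{{[^}}]*gpu="(\d+)"[^}}]*}} ([\d.]+)$', re.M)
    return {gpu: float(val) for gpu, val in pattern.findall(text)}

print(parse_metric(sample, "DCGM_FI_DEV_GPU_UTIL"))  # {'0': 83.0, '1': 12.0}
```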

Key GPU Metrics to Monitor

| Metric | Description | Alert Threshold |
|---|---|---|
| GPU Utilization | Percentage of time GPU cores are active | < 20% (underutilized) or > 95% (saturated) |
| Memory Utilization | GPU memory usage vs. total | > 90% (OOM risk) |
| Temperature | GPU die temperature | > 85 °C |
| Power Usage | Current power draw vs. TDP | > 95% of TDP sustained |
| ECC Errors | Memory error corrections | Any uncorrectable errors |
| PCIe Throughput | Host-GPU data transfer rate | Bottleneck if sustained near link maximum |
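The thresholds above translate directly into alerting logic. A hedged sketch (threshold values come from the table; the function name and metric keys are illustrative, not from any monitoring API):

```python
def gpu_alerts(metrics: dict) -> list[str]:
    """Evaluate one GPU's readings against the alert thresholds in the table above."""
    alerts = []
    util = metrics["utilization_pct"]
    if util < 20:
        alerts.append("underutilized (< 20% utilization)")
    elif util > 95:
        alerts.append("saturated (> 95% utilization)")
    if metrics["memory_used_pct"] > 90:
        alerts.append("OOM risk (> 90% memory)")
    if metrics["temperature_c"] > 85:
        alerts.append("overheating (> 85 C)")
    if metrics["uncorrectable_ecc_errors"] > 0:
        alerts.append("uncorrectable ECC errors detected")
    return alerts

print(gpu_alerts({
    "utilization_pct": 98,
    "memory_used_pct": 93,
    "temperature_c": 79,
    "uncorrectable_ecc_errors": 0,
}))  # ['saturated (> 95% utilization)', 'OOM risk (> 90% memory)']
```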

Multi-GPU and Distributed Training

# Run a container with multiple specific GPUs
docker run --gpus '"device=0,1,2,3"' \
  -e NCCL_DEBUG=INFO \
  -e NCCL_SOCKET_IFNAME=eth0 \
  --shm-size=16g \
  pytorch/pytorch:latest \
  torchrun --nproc_per_node=4 train.py
  # torchrun supersedes the deprecated python -m torch.distributed.launch
Warning: Multi-GPU training with NCCL (NVIDIA Collective Communications Library) requires adequate shared memory. The default Docker shared memory size (64 MB) is insufficient. Always set --shm-size to at least 1 GB, or preferably 8-16 GB for multi-GPU training workloads.
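The same warning applies under Compose, where the equivalent of `--shm-size` is the service-level `shm_size` key. A sketch, reusing the training image from above:

```yaml
services:
  trainer:
    image: pytorch/pytorch:latest
    shm_size: "16gb"   # Compose equivalent of docker run --shm-size=16g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
```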

Cloud GPU Instances

| Provider | Instance Type | GPU | Approximate Cost |
|---|---|---|---|
| AWS | p4d.24xlarge | 8x A100 (40 GB) | $32/hr |
| AWS | g5.xlarge | 1x A10G (24 GB) | $1.01/hr |
| GCP | a2-highgpu-1g | 1x A100 (40 GB) | $3.67/hr |
| Lambda Labs | 1x A100 | 1x A100 (80 GB) | $1.10/hr |
| Vast.ai | Community | RTX 4090 (24 GB) | $0.20-0.40/hr |

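The hourly rates above make total-job costs easy to compare; a quick arithmetic sketch using the table's figures (rates are the approximate listed values and will drift over time):

```python
# Approximate hourly rates from the table above
rates = {
    "GCP a2-highgpu-1g (1x A100)": 3.67,
    "Lambda Labs (1x A100)": 1.10,
}

def job_cost(rate_per_hr: float, hours: float) -> float:
    """Total cost of a job at a flat hourly rate, rounded to cents."""
    return round(rate_per_hr * hours, 2)

# Cost of a 24-hour single-GPU training run on each provider
for name, rate in rates.items():
    print(f"{name}: ${job_cost(rate, 24):.2f}")
```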
Tip: Containerizing GPU workloads makes it easy to move between local development (with a consumer GPU) and cloud training (with data center GPUs) without changing your code. Build your Docker image once, and run it anywhere there is an NVIDIA GPU. Management platforms like usulnet help track GPU resource usage across multiple nodes in your infrastructure.

Looking ahead: The GPU computing landscape in Docker is evolving rapidly. NVIDIA's upcoming multi-tenant GPU features, improved MIG flexibility, and container-native GPU scheduling will continue to make GPU sharing more efficient and accessible for containerized workloads.