Containers are not virtual machines. They share the host kernel, and that shared kernel is both the source of their efficiency and their greatest attack surface. A container escape exploiting a kernel vulnerability gives an attacker root access to every container on the host and the host itself. Understanding the layers of security between a containerized process and the host kernel is not optional knowledge for anyone running production workloads.

This guide examines container security from the bottom up: starting at the kernel primitives that make containers possible, moving through mandatory access control systems, and ending with runtime alternatives that fundamentally change the isolation model. Each layer reduces the blast radius if the layer above it fails.

Kernel Namespaces: The Foundation of Isolation

Linux namespaces are the fundamental building block of container isolation. They partition kernel resources so that one set of processes sees one set of resources while another set sees a different one. Docker can apply seven namespace types (six are enabled by default; user namespaces are opt-in via userns-remap, covered later):

| Namespace | Isolates | Security Impact |
|-----------|----------|-----------------|
| pid | Process IDs | Container cannot see or signal host processes |
| net | Network stack | Separate interfaces, routing tables, iptables rules |
| mnt | Mount points | Container has its own filesystem view |
| uts | Hostname, domain | Container gets its own hostname |
| ipc | Shared memory, semaphores | Prevents cross-container IPC attacks |
| user | User/group IDs | Root in container maps to unprivileged user on host |
| cgroup | Cgroup root directory | Container cannot see host cgroup hierarchy |
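Namespace membership is exposed as symlinks under /proc/&lt;pid&gt;/ns; two processes are in the same namespace exactly when those links resolve to the same inode. A minimal host-only sketch of the comparison (no container required):

```shell
# Each ns entry is a symlink whose target names the namespace inode,
# e.g. "net:[4026531840]"; matching targets mean a shared namespace.
a=$(readlink /proc/self/ns/net)
b=$(readlink /proc/$$/ns/net)
echo "$a"
[ "$a" = "$b" ] && echo "same net namespace"
```

This is exactly the comparison that inspecting /proc/&lt;container-pid&gt;/ns/ against /proc/1/ns/ lets you make between a container's init process and the host.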

You can inspect a container's namespace assignments directly:

# List namespaces for a container's init process
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' my_container)
ls -la /proc/$CONTAINER_PID/ns/

# Compare with host namespaces
ls -la /proc/1/ns/

# Enter a container's namespaces manually (useful for debugging)
nsenter --target $CONTAINER_PID --mount --uts --ipc --net --pid -- /bin/sh
Warning: Running containers with --pid=host, --network=host, or --privileged disables the corresponding namespace isolation. A container with --pid=host can see and signal every process on the system, including other containers. Never use these flags in production without understanding the full implications.

Cgroups: Resource Limits as Security Boundaries

Control groups (cgroups) limit the resources a container can consume. While typically discussed in terms of performance, cgroups are a critical security mechanism. Without them, a single compromised container can starve the host and all other containers of CPU, memory, or I/O bandwidth — a denial-of-service attack from within.

# Run a container with strict resource limits
docker run -d \
  --name secured_app \
  --memory=512m \
  --memory-swap=512m \
  --memory-reservation=256m \
  --cpus=1.0 \
  --cpu-shares=512 \
  --pids-limit=256 \
  --ulimit nofile=1024:2048 \
  --ulimit nproc=512:512 \
  my_app:latest

The --pids-limit flag is particularly important for security. Without it, a fork bomb inside a container can exhaust the kernel's PID space, leaving the host unable to spawn new processes. Docker sets no PID limit by default (though the daemon's default-pids-limit option can change that); always set one explicitly.
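These limits land as plain files in the cgroup filesystem, which is the quickest way to confirm what a process is actually constrained by. A sketch for a cgroup v2 host (assumes the unified hierarchy is mounted at /sys/fs/cgroup; prints a fallback message elsewhere):

```shell
# Locate this process's cgroup ("0::/..." lines are cgroup v2),
# then read its memory and PID limits; "max" means unlimited.
cg=$(awk -F'::' '/^0::/{print $2}' /proc/self/cgroup)
cat "/sys/fs/cgroup${cg}/memory.max" 2>/dev/null \
  || echo "memory.max not readable (cgroup v1 host?)"
cat "/sys/fs/cgroup${cg}/pids.max" 2>/dev/null \
  || echo "pids.max not readable"
```

Inside a container started with --memory=512m and --pids-limit=256, the same two files should read 536870912 and 256.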

In Docker Compose, resource limits are specified under the deploy section:

services:
  app:
    image: my_app:latest
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
          pids: 256
        reservations:
          cpus: "0.25"
          memory: 128M
Tip: With usulnet, you can monitor container resource usage in real-time and receive alerts when containers approach their cgroup limits, making it easier to tune these values based on actual workload behavior rather than guesswork.

Linux Capabilities: Dropping Privileges

Traditional Unix security has two categories: root (UID 0) with full power, and everyone else. Linux capabilities break root's power into approximately 40 distinct privileges. Docker drops most capabilities by default, keeping only those needed for typical container operation.

The default capabilities Docker retains are:

# View default capabilities
docker run --rm alpine cat /proc/1/status | grep Cap

# Decode capability bitmask
capsh --decode=00000000a80425fb

# Default Docker capabilities (as of Docker 25+):
# AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID,
# KILL, MKNOD, NET_BIND_SERVICE, NET_RAW, SETFCAP,
# SETGID, SETPCAP, SETUID, SYS_CHROOT
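Those Cap* values are hex bitmasks in which bit n corresponds to capability number n from linux/capability.h. Plain shell arithmetic can answer "is capability X in this mask?" — here checked for CAP_NET_BIND_SERVICE (bit 10) against the default mask above; capsh --decode remains the authoritative decoder:

```shell
# Docker's default effective-capability mask, and the bit index of
# CAP_NET_BIND_SERVICE (10, from linux/capability.h)
mask=00000000a80425fb
bit=10
if [ $(( (0x$mask >> bit) & 1 )) -eq 1 ]; then
  echo "NET_BIND_SERVICE present"
else
  echo "NET_BIND_SERVICE absent"
fi
# → NET_BIND_SERVICE present
```

The same check with bit 21 (CAP_SYS_ADMIN) reports absent, confirming that the most dangerous capability is not in the default set.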

For maximum security, drop all capabilities and add back only what your application specifically needs:

# Drop ALL capabilities, add back only what's needed
docker run -d \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --name web_server \
  nginx:alpine

# For a container that needs no special privileges at all
docker run -d \
  --cap-drop=ALL \
  --name worker \
  my_worker:latest
| Capability | What It Allows | When to Keep |
|------------|----------------|--------------|
| NET_BIND_SERVICE | Bind to ports below 1024 | Web servers on port 80/443 |
| CHOWN | Change file ownership | Init scripts that fix permissions |
| SETUID/SETGID | Change process UID/GID | Processes that drop privileges at startup |
| NET_RAW | Use RAW/PACKET sockets | Rarely needed; drop this for most workloads |
| SYS_ADMIN | Broad system administration | Almost never; this is effectively root |
| SYS_PTRACE | Trace processes | Debugging only; never in production |
Warning: SYS_ADMIN is the single most dangerous capability. It allows mounting filesystems, using clone() with new namespaces, performing BPF operations, and more. It is so broad that granting it is nearly equivalent to running as --privileged. If your application "needs" SYS_ADMIN, you almost certainly need to redesign your approach.

Seccomp Profiles: Syscall Filtering

Seccomp (Secure Computing) filters restrict which system calls a containerized process can make to the kernel. Since container escapes typically exploit kernel vulnerabilities triggered via specific syscalls, reducing the available syscall surface directly reduces your attack surface.

Docker applies a default seccomp profile that blocks roughly 44 of the 300+ Linux syscalls. This denies dangerous calls like mount, reboot, kexec_load, and open_by_handle_at, while allowing the vast majority of normal application syscalls. (ptrace was historically blocked too, but has been allowed by default since Docker 19.03 on modern kernels.)

# The default seccomp profile is applied automatically; no flag is needed
docker run my_app

# Disable seccomp filtering entirely (not recommended)
docker run --security-opt seccomp=unconfined my_app

# Run with a custom seccomp profile
docker run --security-opt seccomp=/path/to/profile.json my_app

# Approximate your app's syscall usage with strace — trace the application
# itself; strace-ing `docker run` would only trace the Docker CLI
strace -c -f -S name /path/to/my_app 2>&1 | tail -20

A custom seccomp profile is a JSON file that specifies the default action and per-syscall overrides:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept", "accept4", "access", "bind", "brk",
        "clone", "close", "connect", "dup", "dup2", "dup3",
        "epoll_create", "epoll_create1", "epoll_ctl", "epoll_wait",
        "execve", "exit", "exit_group", "fcntl", "fstat",
        "futex", "getdents64", "getpid", "getsockopt",
        "ioctl", "listen", "lseek", "madvise", "mmap",
        "mprotect", "munmap", "nanosleep", "open", "openat",
        "pipe", "pipe2", "poll", "prctl", "read", "readlink",
        "recvfrom", "recvmsg", "rt_sigaction", "rt_sigprocmask",
        "sendmsg", "sendto", "setsockopt", "socket",
        "stat", "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
Tip: Use tools like oci-seccomp-bpf-hook or seccomp-profiler to automatically generate a minimal seccomp profile by observing your application's actual syscall usage during normal operation. This gives you a tight-fitting profile without manual guesswork.
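Before pointing Docker at a hand-written profile, validate the JSON and confirm the default action really is a deny. A small sketch (the here-doc stands in for your real profile file):

```shell
# A stand-in profile; replace with your real /path/to/profile.json
cat > /tmp/profile.json <<'EOF'
{"defaultAction": "SCMP_ACT_ERRNO", "defaultErrnoRet": 1,
 "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}]}
EOF

# Fail fast on malformed JSON, then check the default-deny posture
python3 -m json.tool /tmp/profile.json >/dev/null \
  && grep -q '"defaultAction": *"SCMP_ACT_ERRNO"' /tmp/profile.json \
  && echo "profile OK: default-deny"
# → profile OK: default-deny
```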

AppArmor: Mandatory Access Control

AppArmor confines programs by restricting their access to files, network, and capabilities beyond what DAC (discretionary access control) allows. Docker loads a default AppArmor profile called docker-default for every container unless overridden.
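Whether AppArmor is available at all is decided by the kernel, not by Docker. A quick host-side check, with a fallback for kernels built without it:

```shell
# Prints "Y" when the AppArmor LSM is enabled in the running kernel
if [ -r /sys/module/apparmor/parameters/enabled ]; then
  cat /sys/module/apparmor/parameters/enabled
else
  echo "AppArmor not built into this kernel"
fi
```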

# Check if AppArmor is loaded for a container
docker inspect --format '{{.AppArmorProfile}}' my_container

# Run with a custom AppArmor profile
docker run --security-opt apparmor=my-custom-profile my_app

# Run without AppArmor (not recommended)
docker run --security-opt apparmor=unconfined my_app

A custom AppArmor profile for a web application might look like:

#include <tunables/global>

profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  # Network access
  network inet tcp,
  network inet udp,
  network inet6 tcp,
  network inet6 udp,

  # Read-only access to web content
  /usr/share/nginx/** r,
  /etc/nginx/** r,
  /var/log/nginx/** w,
  /var/cache/nginx/** rw,
  /run/nginx.pid rw,

  # Deny access to sensitive paths
  deny /proc/*/mem rwklx,
  deny /sys/firmware/** rwklx,
  deny /proc/sysrq-trigger rwklx,
  deny /proc/kcore rwklx,

  # Deny mounting filesystems
  deny mount,

  # Deny raw socket access
  deny network raw,
  deny network packet,
}
# Load the profile
sudo apparmor_parser -r -W /etc/apparmor.d/docker-nginx

# Use it with a container
docker run -d --security-opt apparmor=docker-nginx nginx:alpine

SELinux: Type Enforcement for Containers

On RHEL, CentOS, Fedora, and their derivatives, SELinux provides mandatory access control using type enforcement. Docker containers run with the container_t SELinux type by default, which restricts what files they can access on the host.

# Check SELinux status
getenforce
sestatus

# View the SELinux label on a container process
ps -eZ | grep docker

# Run a container with a specific SELinux label
docker run --security-opt label=type:svirt_apache_t my_app

# Disable SELinux for a container (not recommended)
docker run --security-opt label=disable my_app

# Relabel a volume for container access
docker run -v /host/data:/data:Z my_app   # Private label
docker run -v /host/data:/data:z my_app   # Shared label

AppArmor vs SELinux: Both are MAC systems that achieve similar goals. AppArmor uses path-based rules (easier to write), while SELinux uses label-based enforcement (more granular but more complex). Use whichever your distribution ships and supports. Do not disable either to "fix" container permission issues; instead, write proper policies.

Read-Only Root Filesystem

By default, containers have a writable filesystem layer. An attacker who gains code execution inside a container can write malicious binaries, modify configuration files, or install tools for lateral movement. A read-only root filesystem prevents all of this.

# Run with read-only root filesystem
docker run -d \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --tmpfs /run:rw,noexec,nosuid,size=32m \
  --name secured_nginx \
  nginx:alpine

Most applications need a few writable directories for temporary files, PID files, or caches. Use tmpfs mounts for these, with noexec to prevent execution of written files:

services:
  app:
    image: my_app:latest
    read_only: true
    tmpfs:
      - /tmp:rw,noexec,nosuid,size=64m
      - /run:rw,noexec,nosuid,size=32m
      - /var/cache/nginx:rw,noexec,nosuid,size=128m
    volumes:
      - app_data:/data  # Only this volume is writable
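Whether a tmpfs actually carries noexec and nosuid is recorded in /proc/mounts, so the options can be verified from inside the running container; the same check works on any Linux host:

```shell
# Print mount point and flags for every tmpfs mount; the flags column
# should include "noexec,nosuid" for the mounts configured above
awk '$3 == "tmpfs" {print $2, $4}' /proc/mounts
```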

No-New-Privileges Flag

The no-new-privileges security option prevents processes inside the container from gaining additional privileges through setuid/setgid binaries, capability transitions, or other escalation mechanisms.

# Enable no-new-privileges
docker run --security-opt no-new-privileges:true my_app

# In Docker Compose
services:
  app:
    image: my_app:latest
    security_opt:
      - no-new-privileges:true

This is one of the simplest and most effective hardening options. It prevents a common attack pattern where an attacker exploits a vulnerability in a non-root process, then escalates to root via a setuid binary inside the container. With no-new-privileges, even if the container has setuid binaries, they cannot grant additional privileges.
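Under the hood this option sets the kernel's no_new_privs flag (prctl PR_SET_NO_NEW_PRIVS), which is inherited by every child process and exposed in /proc/&lt;pid&gt;/status, so you can confirm from inside a container that it took effect:

```shell
# Prints 0 or 1; a container started with no-new-privileges shows 1
grep NoNewPrivs /proc/self/status
```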

User Namespaces: Remapping Root

User namespaces are the most powerful isolation feature available, yet the most underused. They remap the UID inside the container to a different, unprivileged UID on the host. A process running as root (UID 0) inside the container actually runs as, say, UID 100000 on the host. If the process escapes the container, it lands on the host as an unprivileged user.

# Enable user namespace remapping in Docker daemon
# /etc/docker/daemon.json
{
  "userns-remap": "default"
}

# Docker creates a dockremap user and configures subordinate IDs
# Check the mappings:
cat /etc/subuid
# dockremap:100000:65536

cat /etc/subgid
# dockremap:100000:65536

# Restart Docker to apply
sudo systemctl restart docker

# Verify: container root maps to an unprivileged host UID
docker run -d --rm --name userns_test alpine sleep 60
docker exec userns_test id
# uid=0(root) gid=0(root)

# On the host, the same process runs as the remapped UID:
ps -eo uid,pid,cmd | grep 'sleep 60'
# 100000  12345  sleep 60
Warning: Enabling user namespaces affects volume permissions. Files owned by UID 0 inside the container are owned by UID 100000 on the host. Existing volumes may need permission changes. Some containers that require true host root access (like Docker-in-Docker) are incompatible with user namespace remapping.

Putting It All Together: A Hardened Container

Here is a Docker Compose configuration that applies every security mechanism discussed:

services:
  web:
    image: my_web_app:latest
    read_only: true
    tmpfs:
      - /tmp:rw,noexec,nosuid,size=64m
      - /run:rw,noexec,nosuid,size=32m
    security_opt:
      - no-new-privileges:true
      - apparmor=docker-web-app
      - seccomp=/etc/docker/seccomp/web-app.json
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
          pids: 512
    user: "1000:1000"
    networks:
      - frontend
    volumes:
      - app_data:/data:rw

networks:
  frontend:
    internal: false
  backend:
    internal: true  # No external access

Container Runtime Comparison

The OCI runtime is the component that actually creates and runs the container. Different runtimes offer fundamentally different security properties:

| Runtime | Isolation | Performance | Compatibility | Use Case |
|---------|-----------|-------------|---------------|----------|
| runc | Namespaces + cgroups (kernel shared) | Native | Full | Default, general purpose |
| crun | Same as runc (C implementation) | Faster startup | Full | Performance-sensitive; Podman default |
| gVisor (runsc) | User-space kernel (syscall interception) | ~20-30% overhead | Most workloads | Untrusted code, multi-tenant |
| Kata Containers | Lightweight VM per container | ~10-15% overhead | Most workloads | Maximum isolation, compliance |

runc: The Default Runtime

runc is the reference OCI runtime. It uses standard Linux namespaces and cgroups. All security depends on the host kernel's correct implementation of these features. A kernel vulnerability in namespace handling can compromise every container on the host.

# Check your current runtime
docker info --format '{{.DefaultRuntime}}'

# runc version
runc --version

gVisor: User-Space Kernel Isolation

gVisor intercepts application syscalls and handles them in a user-space kernel called Sentry, written in Go. The application never directly interacts with the host kernel. Even if an attacker finds a "kernel" vulnerability, they are exploiting gVisor's user-space implementation, not the real kernel.

# Install gVisor
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | sudo tee /etc/apt/sources.list.d/gvisor.list
sudo apt update && sudo apt install -y runsc

# Configure Docker to use gVisor
# /etc/docker/daemon.json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/bin/runsc"
    }
  }
}

# Run a container with gVisor
docker run --runtime=runsc -d nginx:alpine

# Verify isolation - the container sees gVisor's kernel, not the host kernel
docker run --runtime=runsc --rm alpine uname -r
# 4.4.0  (gVisor's emulated kernel version)

Kata Containers: VM-Level Isolation

Kata Containers run each container inside a lightweight virtual machine with its own kernel. This provides true hardware-level isolation using Intel VT-x or AMD-V. The attack surface shrinks to the hypervisor (QEMU/Cloud Hypervisor), which has a much smaller and better-audited codebase than the Linux kernel's container subsystems.

# Install Kata Containers
bash -c "$(curl -fsSL https://raw.githubusercontent.com/kata-containers/kata-containers/main/utils/kata-manager.sh)" -- install-packages

# Configure Docker
# /etc/docker/daemon.json
{
  "runtimes": {
    "kata": {
      "path": "/usr/bin/kata-runtime"
    }
  }
}

# Run a container with Kata
docker run --runtime=kata -d nginx:alpine

# The container runs in its own VM with its own kernel
docker run --runtime=kata --rm alpine uname -r
# 6.1.62  (Kata's guest kernel, different from host)

When to use which runtime: Use runc/crun for trusted workloads where performance is critical. Use gVisor for untrusted code execution or multi-tenant environments where you need strong isolation without VM overhead. Use Kata Containers when compliance requires VM-level isolation or when running the most sensitive workloads.
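Which alternative runtimes a daemon actually has registered can be read straight out of daemon.json rather than eyeballed from docker info. A sketch (assumes python3 is available; falls back when no daemon.json exists):

```shell
# List runtime names from the daemon config; runc is always built in
f=/etc/docker/daemon.json
if [ -r "$f" ]; then
  python3 -c 'import json, sys
d = json.load(open(sys.argv[1]))
print(" ".join(d.get("runtimes", {})) or "no extra runtimes configured")' "$f"
else
  echo "no daemon.json: only the built-in runc runtime is available"
fi
```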

Security Monitoring and Auditing

Hardening is only half the equation. You need to detect when security boundaries are being tested or violated. Tools like usulnet provide real-time visibility into container security posture, including which containers are running as root, which have excessive capabilities, and which lack resource limits.

# Quick audit: find containers running as root
docker ps -q | xargs -I {} docker inspect --format \
  '{{.Name}}: User={{.Config.User}} Privileged={{.HostConfig.Privileged}}' {}

# Find containers with dangerous capabilities
docker ps -q | xargs -I {} docker inspect --format \
  '{{.Name}}: CapAdd={{.HostConfig.CapAdd}} CapDrop={{.HostConfig.CapDrop}}' {}

# Find containers without resource limits
docker ps -q | xargs -I {} docker inspect --format \
  '{{.Name}}: Memory={{.HostConfig.Memory}} CPUs={{.HostConfig.NanoCpus}}' {} \
  | grep "Memory=0"
Tip: Make security auditing part of your CI/CD pipeline. Tools like Docker Bench for Security (docker/docker-bench-security) can automatically check your Docker daemon and container configurations against CIS benchmarks. Integrate this with your container management platform to maintain continuous visibility.

Defense in Depth Checklist

Container runtime security is about layering defenses so that no single failure is catastrophic. Here is a prioritized checklist:

  1. Drop all capabilities and add back only what is needed (--cap-drop=ALL)
  2. Enable no-new-privileges (--security-opt no-new-privileges:true)
  3. Set resource limits (memory, CPU, PIDs) on every container
  4. Use read-only root filesystem with tmpfs for writable directories
  5. Run as non-root user inside the container (USER directive in Dockerfile)
  6. Apply a custom seccomp profile tailored to your application's syscall needs
  7. Enable AppArmor or SELinux with per-application profiles
  8. Enable user namespace remapping for defense against container escapes
  9. Use network segmentation with internal networks for backend services
  10. Consider gVisor or Kata for untrusted or highly sensitive workloads

No single measure provides complete security. The goal is that when one layer fails, the next layer catches the attacker. A compromised application inside a container with dropped capabilities, a read-only filesystem, a custom seccomp profile, and user namespace remapping gives an attacker almost nothing to work with, even if they achieve arbitrary code execution.