Container Runtime Security: Protecting Docker from Kernel to Application
Containers are not virtual machines. They share the host kernel, and that shared kernel is both the source of their efficiency and their greatest attack surface. A container escape exploiting a kernel vulnerability gives an attacker root access to every container on the host and the host itself. Understanding the layers of security between a containerized process and the host kernel is not optional knowledge for anyone running production workloads.
This guide examines container security from the bottom up: starting at the kernel primitives that make containers possible, moving through mandatory access control systems, and ending with runtime alternatives that fundamentally change the isolation model. Each layer reduces the blast radius if the layer above it fails.
Kernel Namespaces: The Foundation of Isolation
Linux namespaces are the fundamental building block of container isolation. They partition kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. Docker can use seven namespace types (the user namespace requires explicit opt-in via userns-remap, covered below):
| Namespace | Isolates | Security Impact |
|---|---|---|
| `pid` | Process IDs | Container cannot see or signal host processes |
| `net` | Network stack | Separate interfaces, routing tables, iptables rules |
| `mnt` | Mount points | Container has its own filesystem view |
| `uts` | Hostname, domain | Container gets its own hostname |
| `ipc` | Shared memory, semaphores | Prevents cross-container IPC attacks |
| `user` | User/group IDs | Root in container maps to unprivileged user on host |
| `cgroup` | Cgroup root directory | Container cannot see host cgroup hierarchy |
You can inspect a container's namespace assignments directly:
# List namespaces for a container's init process
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' my_container)
ls -la /proc/$CONTAINER_PID/ns/
# Compare with host namespaces
ls -la /proc/1/ns/
# Enter a container's namespaces manually (useful for debugging)
nsenter --target $CONTAINER_PID --mount --uts --ipc --net --pid -- /bin/sh
Warning: --pid=host, --network=host, and --privileged each disable the corresponding namespace isolation. A container with --pid=host can see and signal every process on the system, including other containers. Never use these flags in production without understanding the full implications.
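The namespace listings above can be compared programmatically: two processes share a namespace exactly when their /proc/<pid>/ns symlinks resolve to the same inode. A minimal sketch (same_ns is a helper defined here for illustration, not a standard tool):

```shell
# same_ns PID1 PID2 NSTYPE -> prints "same" or "different"
# Two processes are in the same namespace exactly when their
# /proc/<pid>/ns/<type> links point at the same inode.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 1
    b=$(readlink "/proc/$2/ns/$3") || return 1
    if [ "$a" = "$b" ]; then echo same; else echo different; fi
}

# A shell trivially shares every namespace with itself:
same_ns $$ $$ pid    # -> same
# Compare a container's init (via its host PID) against host PID 1;
# for a bridged container this reports "different":
# same_ns "$CONTAINER_PID" 1 net
```

With --pid=host or --network=host, the corresponding comparison against PID 1 flips back to "same", which makes this a quick way to verify whether a flag actually weakened isolation.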
Cgroups: Resource Limits as Security Boundaries
Control groups (cgroups) limit the resources a container can consume. While typically discussed in terms of performance, cgroups are a critical security mechanism. Without them, a single compromised container can starve the host and all other containers of CPU, memory, or I/O bandwidth — a denial-of-service attack from within.
# Run a container with strict resource limits
docker run -d \
--name secured_app \
--memory=512m \
--memory-swap=512m \
--memory-reservation=256m \
--cpus=1.0 \
--cpu-shares=512 \
--pids-limit=256 \
--ulimit nofile=1024:2048 \
--ulimit nproc=512:512 \
my_app:latest
The --pids-limit flag is particularly important for security. Without it, a fork bomb inside a container can exhaust the kernel's PID space and crash the entire host. Docker imposes no PID limit by default (unless the daemon is configured with one), so always set it explicitly.
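The --ulimit flags map onto the same rlimits you can set from any shell; lowering a soft limit in a subshell mirrors what --ulimit nofile does for the container's init process. A quick local demonstration:

```shell
# Lower the open-file soft limit in a subshell, then read it back.
# --ulimit nofile=1024:2048 performs the equivalent setrlimit()
# inside the container before the entrypoint runs.
( ulimit -S -n 256; ulimit -S -n )
# -> 256
```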
In Docker Compose, resource limits are specified under the deploy section:
services:
app:
image: my_app:latest
deploy:
resources:
limits:
cpus: "1.0"
memory: 512M
pids: 256
reservations:
cpus: "0.25"
memory: 128M
Linux Capabilities: Dropping Privileges
Traditional Unix security has two categories: root (UID 0) with full power, and everyone else. Linux capabilities break root's power into approximately 40 distinct privileges. Docker drops most capabilities by default, keeping only those needed for typical container operation.
The default capabilities Docker retains are:
# View default capabilities
docker run --rm alpine cat /proc/1/status | grep Cap
# Decode capability bitmask
capsh --decode=00000000a80425fb
# Default Docker capabilities (as of Docker 25+):
# AUDIT_WRITE, CHOWN, DAC_OVERRIDE, FOWNER, FSETID,
# KILL, MKNOD, NET_BIND_SERVICE, NET_RAW, SETFCAP,
# SETGID, SETPCAP, SETUID, SYS_CHROOT
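When capsh is not installed, the bitmask can be decoded by hand: bit N of the mask corresponds to the capability numbered N in capabilities(7). A minimal sketch (cap_is_set is a hypothetical helper, not a standard tool):

```shell
# cap_is_set MASK_HEX BIT -> prints "set" or "clear"
# e.g. CAP_NET_BIND_SERVICE is capability number 10,
#      CAP_SYS_CHROOT is 18, CAP_SYS_ADMIN is 21.
cap_is_set() {
    mask=$((0x$1))
    if [ $(( (mask >> $2) & 1 )) -eq 1 ]; then echo set; else echo clear; fi
}

cap_is_set 00000000a80425fb 10   # NET_BIND_SERVICE -> set
cap_is_set 00000000a80425fb 21   # SYS_ADMIN        -> clear
```

The mask 00000000a80425fb has exactly 14 bits set, matching the 14 default capabilities listed above; checking bit 21 confirms that SYS_ADMIN is not among them.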
For maximum security, drop all capabilities and add back only what your application specifically needs:
# Drop ALL capabilities, add back only what's needed
docker run -d \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--name web_server \
nginx:alpine
# For a container that needs no special privileges at all
docker run -d \
--cap-drop=ALL \
--name worker \
my_worker:latest
| Capability | What It Allows | When to Keep |
|---|---|---|
| `NET_BIND_SERVICE` | Bind to ports below 1024 | Web servers on port 80/443 |
| `CHOWN` | Change file ownership | Init scripts that fix permissions |
| `SETUID`/`SETGID` | Change process UID/GID | Processes that drop privileges at startup |
| `NET_RAW` | Use RAW/PACKET sockets | Rarely needed; drop this for most workloads |
| `SYS_ADMIN` | Broad system administration | Almost never; this is effectively root |
| `SYS_PTRACE` | Trace processes | Debugging only; never in production |
SYS_ADMIN is the single most dangerous capability. It allows mounting filesystems, using clone() with new namespaces, performing BPF operations, and more. It is so broad that granting it is nearly equivalent to running as --privileged. If your application "needs" SYS_ADMIN, you almost certainly need to redesign your approach.
Seccomp Profiles: Syscall Filtering
Seccomp (Secure Computing) filters restrict which system calls a containerized process can make to the kernel. Since container escapes typically exploit kernel vulnerabilities triggered via specific syscalls, reducing the available syscall surface directly reduces your attack surface.
Docker applies a default seccomp profile that blocks roughly 44 of the 300+ Linux syscalls. This blocks dangerous calls like mount, reboot, kexec_load, and open_by_handle_at while allowing the vast majority of normal application syscalls. (ptrace, historically blocked, has been allowed by default since Docker 19.03 on kernels 4.8 and newer, where it can no longer be used to bypass seccomp.)
# The default seccomp profile is applied automatically; no flag is needed
docker run my_app
# Disable seccomp filtering entirely (not recommended)
docker run --security-opt seccomp=unconfined my_app
# Run with a custom seccomp profile
docker run --security-opt seccomp=/path/to/profile.json my_app
# Generate a candidate syscall list by tracing the application directly
# (tracing "docker run" would profile the Docker CLI client, not your app)
strace -c -f -S name ./my_app 2>&1 | tail -20
A custom seccomp profile is a JSON file that specifies the default action and per-syscall overrides:
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "bind", "brk",
"clone", "close", "connect", "dup", "dup2", "dup3",
"epoll_create", "epoll_create1", "epoll_ctl", "epoll_wait",
"execve", "exit", "exit_group", "fcntl", "fstat",
"futex", "getdents64", "getpid", "getsockopt",
"ioctl", "listen", "lseek", "madvise", "mmap",
"mprotect", "munmap", "nanosleep", "open", "openat",
"pipe", "pipe2", "poll", "prctl", "read", "readlink",
"recvfrom", "recvmsg", "rt_sigaction", "rt_sigprocmask",
"sendmsg", "sendto", "setsockopt", "socket",
"stat", "write", "writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
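Before pointing --security-opt seccomp= at a hand-written profile, sanity-check it: a malformed file causes container startup to fail. A quick validation sketch (assumes python3 on the host; the path and the stand-in profile are illustrative):

```shell
profile=/tmp/web-app-seccomp.json

# A minimal stand-in profile, written here only so the check below
# has something to validate
cat > "$profile" <<'EOF'
{ "defaultAction": "SCMP_ACT_ERRNO", "defaultErrnoRet": 1, "syscalls": [] }
EOF

# json.load fails loudly on malformed JSON; the assert catches a
# structurally valid file that forgot its defaultAction
python3 - "$profile" <<'EOF'
import json, sys
p = json.load(open(sys.argv[1]))
assert "defaultAction" in p, "profile missing defaultAction"
print("profile OK: default =", p["defaultAction"])
EOF
```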
Tools such as oci-seccomp-bpf-hook or seccomp-profiler can automatically generate a minimal seccomp profile by observing your application's actual syscall usage during normal operation. This gives you a tight-fitting profile without manual guesswork.
AppArmor: Mandatory Access Control
AppArmor confines programs by restricting their access to files, network, and capabilities beyond what DAC (discretionary access control) allows. Docker loads a default AppArmor profile called docker-default for every container unless overridden.
# Check if AppArmor is loaded for a container
docker inspect --format '{{.AppArmorProfile}}' my_container
# Run with a custom AppArmor profile
docker run --security-opt apparmor=my-custom-profile my_app
# Run without AppArmor (not recommended)
docker run --security-opt apparmor=unconfined my_app
A custom AppArmor profile for a web application might look like:
#include <tunables/global>
profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include <abstractions/nameservice>
# Network access
network inet tcp,
network inet udp,
network inet6 tcp,
network inet6 udp,
# Read-only access to web content
/usr/share/nginx/** r,
/etc/nginx/** r,
/var/log/nginx/** w,
/var/cache/nginx/** rw,
/run/nginx.pid rw,
# Deny access to sensitive paths
deny /proc/*/mem rwklx,
deny /sys/firmware/** rwklx,
deny /proc/sysrq-trigger rwklx,
deny /proc/kcore rwklx,
# Deny mounting filesystems
deny mount,
# Deny raw socket access
deny network raw,
deny network packet,
}
# Load the profile
sudo apparmor_parser -r -W /etc/apparmor.d/docker-nginx
# Use it with a container
docker run -d --security-opt apparmor=docker-nginx nginx:alpine
SELinux: Type Enforcement for Containers
On RHEL, CentOS, Fedora, and their derivatives, SELinux provides mandatory access control using type enforcement. Docker containers run with the container_t SELinux type by default, which restricts what files they can access on the host.
# Check SELinux status
getenforce
sestatus
# View the SELinux label on a container process
ps -eZ | grep docker
# Run a container with a specific SELinux label
docker run --security-opt label=type:svirt_apache_t my_app
# Disable SELinux for a container (not recommended)
docker run --security-opt label=disable my_app
# Relabel a volume for container access
docker run -v /host/data:/data:Z my_app # Private label
docker run -v /host/data:/data:z my_app # Shared label
AppArmor vs SELinux: Both are MAC systems that achieve similar goals. AppArmor uses path-based rules (easier to write), while SELinux uses label-based enforcement (more granular but more complex). Use whichever your distribution ships and supports. Do not disable either to "fix" container permission issues; instead, write proper policies.
Read-Only Root Filesystem
By default, containers have a writable filesystem layer. An attacker who gains code execution inside a container can write malicious binaries, modify configuration files, or install tools for lateral movement. A read-only root filesystem prevents all of this.
# Run with read-only root filesystem
docker run -d \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--tmpfs /run:rw,noexec,nosuid,size=32m \
--tmpfs /var/cache/nginx:rw,noexec,nosuid,size=64m \
--name secured_nginx \
nginx:alpine
Most applications need a few writable directories for temporary files, PID files, or caches. Use tmpfs mounts for these, with noexec to prevent execution of written files:
services:
app:
image: my_app:latest
read_only: true
tmpfs:
- /tmp:rw,noexec,nosuid,size=64m
- /run:rw,noexec,nosuid,size=32m
- /var/cache/nginx:rw,noexec,nosuid,size=128m
volumes:
- app_data:/data # Only this volume is writable
No-New-Privileges Flag
The no-new-privileges security option prevents processes inside the container from gaining additional privileges through setuid/setgid binaries, capability transitions, or other escalation mechanisms.
# Enable no-new-privileges
docker run --security-opt no-new-privileges:true my_app
# In Docker Compose
services:
app:
image: my_app:latest
security_opt:
- no-new-privileges:true
This is one of the simplest and most effective hardening options. It prevents a common attack pattern where an attacker exploits a vulnerability in a non-root process, then escalates to root via a setuid binary inside the container. With no-new-privileges, even if the container has setuid binaries, they cannot grant additional privileges.
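Under the hood, the flag sets the kernel's per-process no_new_privs bit, which is visible in /proc and can also be set outside Docker with setpriv from util-linux. An illustrative sketch (the docker flag does the same thing for the container's init process):

```shell
# Every Linux process reports the bit in its status file; 1 = enabled
grep NoNewPrivs /proc/self/status

# If util-linux's setpriv is available, set the bit for a child process
# and observe that it flips to 1:
if command -v setpriv >/dev/null 2>&1; then
    setpriv --no-new-privs sh -c 'grep NoNewPrivs /proc/self/status'
fi
```

Because the bit is inherited by all descendants and can never be cleared, checking it inside a running container is a reliable way to audit that the option actually took effect.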
User Namespaces: Remapping Root
User namespaces are the most powerful isolation feature available, yet the most underused. They remap the UID inside the container to a different, unprivileged UID on the host. A process running as root (UID 0) inside the container actually runs as, say, UID 100000 on the host. If the process escapes the container, it lands on the host as an unprivileged user.
# Enable user namespace remapping in Docker daemon
# /etc/docker/daemon.json
{
"userns-remap": "default"
}
# Docker creates a dockremap user and configures subordinate IDs
# Check the mappings:
cat /etc/subuid
# dockremap:100000:65536
cat /etc/subgid
# dockremap:100000:65536
# Restart Docker to apply
sudo systemctl restart docker
# Verify: container root maps to an unprivileged host UID
docker run --rm -d --name userns_test alpine sleep 60
docker exec userns_test id
# uid=0(root) gid=0(root)
# On the host:
ps -eo uid,pid,cmd | grep 'sleep 60'
# 100000 12345 sleep 60
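The arithmetic behind the remapping is simple: container UID u maps to host UID start + u, provided u is below the count in the subordinate ID entry. A sketch (map_uid is a hypothetical helper for illustration):

```shell
# map_uid CONTAINER_UID "user:start:count" -> host UID, or "unmapped"
# when the container UID falls outside the subordinate range
map_uid() {
    start=$(printf '%s' "$2" | cut -d: -f2)
    count=$(printf '%s' "$2" | cut -d: -f3)
    if [ "$1" -lt "$count" ]; then
        echo $(( start + $1 ))
    else
        echo unmapped
    fi
}

map_uid 0     "dockremap:100000:65536"   # container root -> 100000
map_uid 1000  "dockremap:100000:65536"   # -> 101000
map_uid 70000 "dockremap:100000:65536"   # outside the range -> unmapped
```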
Putting It All Together: A Hardened Container
Here is a Docker Compose configuration that applies every security mechanism discussed:
services:
web:
image: my_web_app:latest
read_only: true
tmpfs:
- /tmp:rw,noexec,nosuid,size=64m
- /run:rw,noexec,nosuid,size=32m
security_opt:
- no-new-privileges:true
- apparmor=docker-web-app
- seccomp=/etc/docker/seccomp/web-app.json
cap_drop:
- ALL
cap_add:
- NET_BIND_SERVICE
deploy:
resources:
limits:
cpus: "2.0"
memory: 1G
pids: 512
user: "1000:1000"
networks:
- frontend
volumes:
- app_data:/data:rw
networks:
frontend:
internal: false
backend:
internal: true # No external access
Container Runtime Comparison
The OCI runtime is the component that actually creates and runs the container. Different runtimes offer fundamentally different security properties:
| Runtime | Isolation | Performance | Compatibility | Use Case |
|---|---|---|---|---|
| runc | Namespaces + cgroups (kernel shared) | Native | Full | Default, general purpose |
| crun | Same as runc (C implementation) | Faster startup | Full | Performance-sensitive, Podman default |
| gVisor (runsc) | User-space kernel (syscall interception) | ~20-30% overhead | Most workloads | Untrusted code, multi-tenant |
| Kata Containers | Lightweight VM per container | ~10-15% overhead | Most workloads | Maximum isolation, compliance |
runc: The Default Runtime
runc is the reference OCI runtime. It uses standard Linux namespaces and cgroups. All security depends on the host kernel's correct implementation of these features. A kernel vulnerability in namespace handling can compromise every container on the host.
# Check your current runtime
docker info --format '{{.DefaultRuntime}}'
# runc version
runc --version
gVisor: User-Space Kernel Isolation
gVisor intercepts application syscalls and handles them in a user-space kernel called Sentry, written in Go. The application never directly interacts with the host kernel. Even if an attacker finds a "kernel" vulnerability, they are exploiting gVisor's user-space implementation, not the real kernel.
# Install gVisor
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | sudo tee /etc/apt/sources.list.d/gvisor.list
sudo apt update && sudo apt install -y runsc
# Configure Docker to use gVisor
# /etc/docker/daemon.json
{
"runtimes": {
"runsc": {
"path": "/usr/bin/runsc"
}
}
}
# Run a container with gVisor
docker run --runtime=runsc -d nginx:alpine
# Verify isolation - the container sees gVisor's kernel, not the host kernel
docker run --runtime=runsc --rm alpine uname -r
# 4.4.0 (gVisor's emulated kernel version)
Kata Containers: VM-Level Isolation
Kata Containers run each container inside a lightweight virtual machine with its own kernel. This provides true hardware-level isolation using Intel VT-x or AMD-V. The attack surface shrinks to the hypervisor (QEMU/Cloud Hypervisor), which has a much smaller and better-audited codebase than the Linux kernel's container subsystems.
# Install Kata Containers
bash -c "$(curl -fsSL https://raw.githubusercontent.com/kata-containers/kata-containers/main/utils/kata-manager.sh)" -- install-packages
# Configure Docker
# /etc/docker/daemon.json
{
"runtimes": {
"kata": {
"path": "/usr/bin/kata-runtime"
}
}
}
# Run a container with Kata
docker run --runtime=kata -d nginx:alpine
# The container runs in its own VM with its own kernel
docker run --runtime=kata --rm alpine uname -r
# 6.1.62 (Kata's guest kernel, different from host)
When to use which runtime: Use runc/crun for trusted workloads where performance is critical. Use gVisor for untrusted code execution or multi-tenant environments where you need strong isolation without VM overhead. Use Kata Containers when compliance requires VM-level isolation or when running the most sensitive workloads.
Security Monitoring and Auditing
Hardening is only half the equation. You need to detect when security boundaries are being tested or violated. Tools like usulnet provide real-time visibility into container security posture, including which containers are running as root, which have excessive capabilities, and which lack resource limits.
# Quick audit: find containers running as root
docker ps -q | xargs -I {} docker inspect --format \
'{{.Name}}: User={{.Config.User}} Privileged={{.HostConfig.Privileged}}' {}
# Find containers with dangerous capabilities
docker ps -q | xargs -I {} docker inspect --format \
'{{.Name}}: CapAdd={{.HostConfig.CapAdd}} CapDrop={{.HostConfig.CapDrop}}' {}
# Find containers without resource limits
docker ps -q | xargs -I {} docker inspect --format \
'{{.Name}}: Memory={{.HostConfig.Memory}} CPUs={{.HostConfig.NanoCpus}}' {} \
| grep "Memory=0"
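The inspect output above can be fed into a small classifier that turns raw fields into findings. audit_line below is a sketch keyed to the template formats used in these commands; the severity labels are arbitrary choices, not a standard:

```shell
# audit_line "NAME: Memory=... Privileged=..." -> one finding per line
audit_line() {
    case "$1" in
        *Privileged=true*) echo "CRITICAL (privileged): $1" ;;
        *Memory=0*)        echo "WARNING (no memory limit): $1" ;;
        *)                 echo "OK: $1" ;;
    esac
}

audit_line "/web: Memory=0 CPUs=0"
# -> WARNING (no memory limit): /web: Memory=0 CPUs=0
audit_line "/db: User= Privileged=true"
# -> CRITICAL (privileged): /db: User= Privileged=true
```

Pipe the xargs/inspect output through a `while read` loop into audit_line to get a one-line-per-container report suitable for cron or CI.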
Docker Bench for Security (docker/docker-bench-security on GitHub) can automatically check your Docker daemon and container configurations against CIS benchmarks. Integrate it with your container management platform to maintain continuous visibility.
Defense in Depth Checklist
Container runtime security is about layering defenses so that no single failure is catastrophic. Here is a prioritized checklist:
- Drop all capabilities and add back only what is needed (`--cap-drop=ALL`)
- Enable no-new-privileges (`--security-opt no-new-privileges:true`)
- Set resource limits (memory, CPU, PIDs) on every container
- Use a read-only root filesystem with tmpfs for writable directories
- Run as a non-root user inside the container (`USER` directive in Dockerfile)
- Apply a custom seccomp profile tailored to your application's syscall needs
- Enable AppArmor or SELinux with per-application profiles
- Enable user namespace remapping for defense against container escapes
- Use network segmentation with internal networks for backend services
- Consider gVisor or Kata for untrusted or highly sensitive workloads
No single measure provides complete security. The goal is that when one layer fails, the next layer catches the attacker. A compromised application inside a container with dropped capabilities, a read-only filesystem, a custom seccomp profile, and user namespace remapping gives an attacker almost nothing to work with, even if they achieve arbitrary code execution.