Linux Kernel Tuning: sysctl Parameters Every Admin Should Know
The Linux kernel exposes thousands of tunable parameters through the /proc/sys filesystem and the sysctl interface. These parameters control how the kernel handles networking, memory, filesystem operations, and security. The defaults are designed for general-purpose compatibility, which often makes them suboptimal for any specific workload -- whether that is a high-throughput web server, a database host, or a Docker container runtime.
This guide organizes the most important sysctl parameters by category, explains what each one does, recommends values for common server workloads, and covers Docker-specific tuning that is often overlooked.
How /proc/sys Works
# The /proc/sys directory mirrors the sysctl namespace
# Each file corresponds to a kernel parameter
# Read a parameter
cat /proc/sys/net/ipv4/ip_forward
sysctl net.ipv4.ip_forward
# Set a parameter at runtime (non-persistent)
echo 1 > /proc/sys/net/ipv4/ip_forward
sysctl -w net.ipv4.ip_forward=1
# Make parameters persistent (survives reboot)
# Add to /etc/sysctl.conf or a file in /etc/sysctl.d/
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.d/99-custom.conf
# Load all sysctl files
sysctl --system
# Show all current parameters
sysctl -a | wc -l # Typically 1000+ parameters
Best practice is to create drop-in files in /etc/sysctl.d/ rather than editing /etc/sysctl.conf directly. Files are loaded in lexical order, so 99-custom.conf will override settings in 10-network.conf. This makes it easy to organize and override settings modularly.
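A minimal sketch of the lexical-order behavior, using a temporary directory so nothing on the real system is touched (the file names are illustrative):

```shell
# Simulate sysctl.d load order: files are read in sorted name order,
# and the last value read for a parameter wins.
dir=$(mktemp -d)
printf 'vm.swappiness = 60\n' > "$dir/10-defaults.conf"
printf 'vm.swappiness = 10\n' > "$dir/99-custom.conf"
value=""
for f in $(ls "$dir"/*.conf | sort); do
  value=$(awk -F' = ' '/vm.swappiness/ {print $2}' "$f")
done
echo "effective vm.swappiness = $value"   # 10, from 99-custom.conf
rm -r "$dir"
```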
Network Tuning
TCP Performance
# /etc/sysctl.d/10-network-performance.conf
# TCP buffer sizes (min, default, max) in bytes
# These control how much data TCP can buffer per connection
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
# Socket buffer sizes (global defaults and limits)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
# Connection backlog
# Maximum number of connections waiting to be accepted
net.core.somaxconn = 65535
# Maximum backlog of unprocessed packets
net.core.netdev_max_backlog = 65536
# SYN backlog (queue for half-open connections)
net.ipv4.tcp_max_syn_backlog = 65535
# TCP congestion control
# BBR provides better throughput and lower latency than CUBIC
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# TCP Fast Open (reduces connection setup latency)
# 3 = enabled for both client and server
net.ipv4.tcp_fastopen = 3
# TIME_WAIT optimization
# Allow reuse of TIME_WAIT sockets for new outgoing connections
net.ipv4.tcp_tw_reuse = 1
# Timeout for orphaned sockets in FIN-WAIT-2 (default: 60)
# Note: despite common belief, this does not shorten TIME_WAIT itself
net.ipv4.tcp_fin_timeout = 15
# Local port range for outgoing connections
net.ipv4.ip_local_port_range = 1024 65535
# Increase maximum number of orphaned sockets
net.ipv4.tcp_max_orphans = 65535
# TCP keepalive (detect dead connections faster)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
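The keepalive settings above determine how long a dead peer can go unnoticed; the arithmetic is worth making explicit:

```shell
# Worst-case time to detect a dead connection:
#   tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
time_s=600 intvl=30 probes=5
echo "detection after $(( time_s + intvl * probes )) seconds"   # 750 s
# Kernel defaults (7200 + 75 * 9 = 7875 s) take about 2.2 hours
```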
| Parameter | Default | Recommended | Impact |
|---|---|---|---|
| somaxconn | 4096 | 65535 | Larger connection accept queue |
| tcp_congestion_control | cubic | bbr | Better throughput, lower latency |
| tcp_fastopen | 1 | 3 | Faster connection establishment |
| tcp_tw_reuse | 2 | 1 | Reuse TIME_WAIT sockets (outgoing) |
| tcp_fin_timeout | 60 | 15 | Faster cleanup of closed connections |
Connection Tracking (conntrack)
Connection tracking is critical for firewalls and NAT (including Docker's port mapping). The conntrack table has a maximum size, and exceeding it causes dropped connections:
# /etc/sysctl.d/11-conntrack.conf
# Maximum tracked connections (default: 65536)
# For Docker hosts with many containers, increase significantly
net.netfilter.nf_conntrack_max = 1048576
# Conntrack timeout tuning (seconds)
# Reduce for servers with many short-lived connections
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 30
# UDP timeout (important for DNS servers)
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 60
# Check current conntrack usage
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# If count approaches max, you will see this in dmesg:
# nf_conntrack: table full, dropping packet
# Increase hash table size (must be done via modprobe, not sysctl)
echo "options nf_conntrack hashsize=262144" > /etc/modprobe.d/nf_conntrack.conf
# Monitor conntrack in real time
conntrack -L | wc -l
conntrack -S
On busy hosts, monitor /proc/sys/net/netfilter/nf_conntrack_count and increase the max proactively.
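A quick utilization check can be scripted from the two files above (the 80% alert threshold is an illustrative choice, not a kernel constant):

```shell
# Percent utilization of the conntrack table (requires nf_conntrack loaded;
# falls back to harmless defaults if the files are absent)
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 1)
pct=$(( count * 100 / max ))
echo "conntrack usage: $count / $max ($pct%)"
if (( pct >= 80 )); then
  echo "WARNING: conntrack table nearly full -- raise nf_conntrack_max"
fi
```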
Memory Tuning
Swappiness and Caching
# /etc/sysctl.d/20-memory.conf
# vm.swappiness (0-200)
# Lower = prefer keeping application data in RAM, evict cache instead
# Higher = more willing to swap application pages
# Default: 60
# Server recommendation: 10 (keep apps in RAM)
# Database server: 1 (almost never swap)
vm.swappiness = 10
# VFS cache pressure (0-1000)
# Controls tendency to reclaim inode/dentry cache
# Lower = keep filesystem cache longer
# Default: 100
# For file-heavy workloads (web servers, NFS): lower
vm.vfs_cache_pressure = 50
# Overcommit memory
# 0 = Heuristic overcommit (default, allows overcommit)
# 1 = Always overcommit (Redis requires this)
# 2 = Never overcommit (strictest, limit to swap + ratio*physical)
vm.overcommit_memory = 0
# Overcommit ratio (only used when overcommit_memory=2)
# Percentage of physical RAM that can be overcommitted
vm.overcommit_ratio = 80
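When overcommit_memory=2, the commit ceiling follows a simple formula. This sketch compares the computed value against the kernel's own figure in /proc/meminfo; it assumes vm.overcommit_kbytes is 0 (its default) and may differ slightly on hosts with huge pages reserved:

```shell
# CommitLimit = SwapTotal + overcommit_ratio% of MemTotal (values in kB)
ratio=$(cat /proc/sys/vm/overcommit_ratio)
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
echo "computed CommitLimit: $(( swap_kb + mem_kb * ratio / 100 )) kB"
awk '/^CommitLimit:/ {print "kernel   CommitLimit: " $2 " kB"}' /proc/meminfo
```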
Dirty Page Management
# Controls when the kernel flushes dirty (modified) pages to disk
# Percentage of total RAM that can be dirty before blocking writes
# Default: 20
vm.dirty_ratio = 15
# Percentage of total RAM that triggers background flushing
# Default: 10
vm.dirty_background_ratio = 5
# How long (centiseconds) dirty data can stay in memory before forced write
# Default: 3000 (30 seconds)
vm.dirty_expire_centisecs = 3000
# How often (centiseconds) the flush daemon wakes up
# Default: 500 (5 seconds)
vm.dirty_writeback_centisecs = 500
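To see how close the system is to those thresholds, the current dirty totals are exposed in /proc/meminfo:

```shell
# Current dirty and writeback totals, in kB
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Worked example: on a 16 GB host, dirty_background_ratio=5 starts
# background flushing at roughly 800 MB of dirty data
echo "background threshold on 16 GB: $(( 16 * 1024 * 5 / 100 )) MB"   # 819 MB
```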
OOM Killer Tuning
# The OOM (Out of Memory) Killer terminates processes when
# the system runs out of memory
# Panic on OOM (useful for servers that should reboot rather than limp along)
# 0 = kill processes (default)
# 1 = panic (reboot via kernel panic)
vm.panic_on_oom = 0
# Per-process OOM score adjustment
# Lower score = less likely to be killed
# -1000 to 1000
# Protect critical processes
# Note: pidof can return multiple PIDs, so loop rather than interpolate
for pid in $(pidof sshd); do echo -1000 > /proc/$pid/oom_score_adj; done      # Never kill sshd
for pid in $(pidof postgres); do echo -500 > /proc/$pid/oom_score_adj; done   # Protect database
# Make the OOM killer prefer a specific process
for pid in $(pidof memory-hungry-app); do echo 500 > /proc/$pid/oom_score_adj; done
# For Docker containers, set via compose:
# deploy:
# resources:
# limits:
# memory: 2G
# Or: docker run --oom-score-adj=-500 ...
Filesystem Parameters
# /etc/sysctl.d/30-filesystem.conf
# Maximum number of open file descriptors (system-wide)
# Default: ~100000 (depends on RAM)
# For servers running many connections: increase
fs.file-max = 2097152
# Maximum number of open file descriptors per process
# Default: 1048576
fs.nr_open = 2097152
# inotify limits (important for file watchers, IDEs, container monitoring)
# max_user_watches: maximum number of watches per user
# Default: 8192
fs.inotify.max_user_watches = 524288
# max_user_instances: maximum number of inotify instances per user
# Default: 128
fs.inotify.max_user_instances = 1024
# max_queued_events: maximum number of queued events per instance
# Default: 16384
fs.inotify.max_queued_events = 32768
# AIO (Async I/O) limits
# For databases that use async I/O (PostgreSQL, MySQL)
fs.aio-max-nr = 1048576
A frequent source of "Too many open files" errors is confusing the system-wide fs.file-max parameter with the per-process limits set in systemd unit files (LimitNOFILE) or /etc/security/limits.conf. Both can cause this error, and they are independent of each other.
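A quick way to check both layers when debugging this error (the PID in the last line is illustrative):

```shell
# System-wide ceiling vs. per-process limit -- either can be the culprit
echo "system-wide max (fs.file-max): $(cat /proc/sys/fs/file-max)"
echo "allocated/unused/max:          $(cat /proc/sys/fs/file-nr)"
echo "this shell's per-process cap:  $(ulimit -n)"
# For a running service, inspect its own limits (1234 is a placeholder PID):
# grep 'open files' /proc/1234/limits
```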
Security Parameters
# /etc/sysctl.d/40-security.conf
# ASLR (Address Space Layout Randomization)
# 0 = disabled, 1 = conservative, 2 = full (default)
# Always keep at 2 for security
kernel.randomize_va_space = 2
# Restrict access to kernel pointers in /proc
# 0 = visible to all, 1 = hidden for non-privileged, 2 = hidden for all
kernel.kptr_restrict = 2
# Restrict dmesg access to root
# 0 = all users can read, 1 = root only
kernel.dmesg_restrict = 1
# Restrict ptrace (debugging other processes)
# 0 = classic ptrace (any parent can trace children)
# 1 = restricted (only direct parent)
# 2 = admin only
# 3 = no ptrace at all
kernel.yama.ptrace_scope = 2
# Disable core dumps for SUID programs
fs.suid_dumpable = 0
# Restrict unprivileged BPF
kernel.unprivileged_bpf_disabled = 1
# Restrict userfaultfd to privileged users
vm.unprivileged_userfaultfd = 0
# SYN flood protection
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_synack_retries = 2
# IP spoofing protection
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
# Disable ICMP redirects (prevent MITM)
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0
# Disable source routing
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0
# Log martian packets (packets with impossible addresses)
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
# Ignore ICMP broadcast (Smurf attack prevention)
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
# RFC 1337 (prevent TIME_WAIT assassination)
net.ipv4.tcp_rfc1337 = 1
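One of these settings is easy to observe directly: with randomize_va_space=2, every new process gets a different stack base address. Each grep below inspects its own /proc/self/maps, so the two lines will normally differ (on a host with ASLR disabled they would match):

```shell
# Two independent processes, two (normally different) stack addresses
grep -F '[stack]' /proc/self/maps
grep -F '[stack]' /proc/self/maps
```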
Docker-Specific Tuning
Docker hosts have unique kernel tuning requirements. Container networking, port mapping, and the sheer number of processes running in containers all demand specific parameter adjustments:
# /etc/sysctl.d/50-docker.conf
# REQUIRED: Enable IP forwarding (Docker won't work without this)
net.ipv4.ip_forward = 1
# REQUIRED: Bridge netfilter (allows iptables to filter bridged traffic)
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
# Note: the br_netfilter module must be loaded first
# echo "br_netfilter" > /etc/modules-load.d/br_netfilter.conf
# Connection tracking (increase for many containers)
net.netfilter.nf_conntrack_max = 1048576
# File descriptors (containers share the host's limit)
fs.file-max = 2097152
# inotify watches (each container may need many)
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 1024
# Memory overcommit (Redis in containers requires this)
vm.overcommit_memory = 1
# Max memory map areas (Elasticsearch requires high value)
vm.max_map_count = 262144
# Network performance for container communication
net.core.somaxconn = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
# Reduce swappiness (containers should stay in RAM)
vm.swappiness = 10
# Disable IPv6 in containers if not needed
# net.ipv6.conf.all.disable_ipv6 = 1
# net.ipv6.conf.default.disable_ipv6 = 1
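Before starting containers, it is worth verifying that the two REQUIRED settings took effect. This sketch reads /proc directly, so it works even where the sysctl binary is not on the PATH:

```shell
# Verify Docker networking prerequisites
echo "ip_forward: $(cat /proc/sys/net/ipv4/ip_forward)"
if grep -q '^br_netfilter' /proc/modules 2>/dev/null; then
  echo "br_netfilter: loaded"
  echo "bridge-nf-call-iptables: $(cat /proc/sys/net/bridge/bridge-nf-call-iptables)"
else
  echo "br_netfilter: not loaded -- run: modprobe br_netfilter"
fi
```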
Application-Specific Requirements
| Application | Required Parameter | Value |
|---|---|---|
| Redis | vm.overcommit_memory | 1 |
| Elasticsearch | vm.max_map_count | 262144 |
| PostgreSQL | vm.swappiness | 1 |
| PostgreSQL | vm.dirty_background_ratio | 5 |
| Nginx (high traffic) | net.core.somaxconn | 65535 |
| HAProxy | net.ipv4.ip_local_port_range | 1024 65535 |
| MongoDB | Disable THP | N/A (not sysctl) |
Making Changes Persistent
# Method 1: Drop-in files in /etc/sysctl.d/ (recommended)
cat > /etc/sysctl.d/99-custom.conf << 'EOF'
net.core.somaxconn = 65535
vm.swappiness = 10
fs.file-max = 2097152
EOF
# Load immediately
sysctl --system
# Method 2: Kernel command line (for parameters needed at boot)
# Add to bootloader config (GRUB or systemd-boot)
# Example: transparent_hugepage=never
# Method 3: modprobe options (for module parameters)
echo "options nf_conntrack hashsize=262144" > /etc/modprobe.d/nf_conntrack.conf
# Method 4: systemd-sysctl (loaded by systemd at boot)
# Files in /etc/sysctl.d/ are loaded by systemd-sysctl.service
systemctl status systemd-sysctl
# Verify a parameter is set correctly after reboot
sysctl net.core.somaxconn
Monitoring and Validation
#!/bin/bash
# sysctl-audit.sh - Verify sysctl parameters match expected values
set -euo pipefail
declare -A EXPECTED=(
["net.ipv4.ip_forward"]="1"
["net.core.somaxconn"]="65535"
["vm.swappiness"]="10"
["fs.file-max"]="2097152"
["kernel.randomize_va_space"]="2"
["net.ipv4.tcp_syncookies"]="1"
["net.bridge.bridge-nf-call-iptables"]="1"
)
failed=0
for param in "${!EXPECTED[@]}"; do
actual=$(sysctl -n "$param" 2>/dev/null || echo "NOT SET")
expected="${EXPECTED[$param]}"
if [[ "$actual" != "$expected" ]]; then
echo "MISMATCH: $param = $actual (expected: $expected)"
failed=$((failed + 1))  # (( failed++ )) would trip set -e when failed is 0
fi
done
if (( failed > 0 )); then
echo "$failed parameters do not match expected values"
exit 1
else
echo "All parameters match expected values"
fi
Kernel tuning is especially important when running Docker containers in production. Each container shares the host kernel, so sysctl parameters affect all containers simultaneously. When managing Docker hosts with usulnet, you can monitor host-level metrics that reflect kernel performance parameters -- helping you identify when tuning adjustments are needed.
The golden rules of kernel tuning: First, understand what each parameter does before changing it. Second, change one parameter at a time and measure the impact. Third, make every change persistent and documented. A server that works perfectly until a reboot because someone forgot to persist a sysctl change is a time bomb waiting to go off.