Portunus

Performance Report

Benchmark methodology, current Portunus measurements, comparable forwarding stacks, and reproducible test flow.

This report captures the current v1.6.1 performance. Read the numbers as same-host evidence — measurements taken on one specific machine — rather than as universal throughput claims. Forwarding speed is dominated by the kernel version, the CPU frequency policy, the network path, socket buffer sizing, and whether traffic stays inside the kernel or has to cross into userspace (the application's own memory).

For TCP, the v1.6.1 data plane is unchanged from v1.3.0, the release that introduced the splice fast path. This revision re-measures the same host end-to-end and adds a direct portunus-standalone vs portunus-client comparison, because both binaries ship the same portunus-forwarder data plane. The earlier v1.3.0 and v0.11 baselines are preserved on the Performance History page.

Quick read. On a plain TCP rule with no bandwidth limit, Portunus keeps pace with the kernel's own iptables forwarding all the way to 20 Gbit/s, and the choice between the standalone and client builds makes no measurable difference to throughput.

Bench host

All v1.6.1 numbers below were captured on:

Linux host 6.12.38+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.38-1 x86_64
AMD EPYC 7B13 (4 vCPU), 7.8 GiB RAM
rustc 1.96.0 (ac68faa20 2026-05-25)
iperf 3.18 (cJSON 1.7.15)

This is the same 4-vCPU Debian 13 / kernel 6.12 machine that produced the v0.11 / v1.3.0 baselines. Read each result as a same-machine, back-to-back delta (iperf3 vs iptables vs Portunus in one run), not as an absolute number to compare across reports, since the toolchain and kernel point release move between sessions.

All tests run over the loopback interface (lo — traffic that never leaves the machine), so the network itself is never the bottleneck and we measure the forwarder's own cost.

v1.6.1 Headline (Linux TCP splice fast path)

The headline figures are single 30-second TCP runs with no bandwidth limit (--seconds 30 --omit-seconds 5, where --omit-seconds discards the first seconds so TCP slow-start does not skew the average), with splice ON (the default), measured back-to-back against the same iperf3 target. Each path has its own column, so you can compare it side by side with the iptables baseline:

| | iptables REDIRECT (baseline) | portunus-standalone | portunus-client | iperf3 |
| --- | --- | --- | --- | --- |
| Throughput | 26,987 Mbit/s | 31,936 Mbit/s | 26,644 Mbit/s | 30,890 Mbit/s |
| vs iptables | 100.0% | 118.3% | 98.7% | 114.5% |

We take iptables REDIRECT as the 100% baseline: it is the Linux kernel forwarding the connection entirely in-kernel (a static NAT redirect — no userspace process touches the bytes), which is the fastest a forwarder can realistically be on this host. The vs iptables row is each column's throughput divided by that baseline — 100% means on par with kernel forwarding, above 100% means faster than it. So portunus-standalone runs at 118.3% of kernel iptables and portunus-client at 98.7% (on par). The iperf3 column — no forwarder at all — is the raw ceiling, at 114.5%.
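
As a quick check on the arithmetic, the vs iptables row can be recomputed from the throughput row. This is a minimal sketch: the figures are the rounded table values, and the helper name pct_of_baseline is made up for illustration, not part of the Portunus tooling.

```python
# Throughput row of the headline table, in Mbit/s (rounded values from the report).
throughput_mbps = {
    "iptables REDIRECT": 26_987,
    "portunus-standalone": 31_936,
    "portunus-client": 26_644,
    "iperf3": 30_890,
}


def pct_of_baseline(mbps: float, baseline_mbps: float) -> float:
    """Return throughput as a percentage of the baseline, to one decimal place."""
    return round(100.0 * mbps / baseline_mbps, 1)


baseline = throughput_mbps["iptables REDIRECT"]
vs_iptables = {
    name: pct_of_baseline(mbps, baseline) for name, mbps in throughput_mbps.items()
}

for name, pct in vs_iptables.items():
    print(f"{name:22s} {pct:6.1f}%")
```

Anything above 100.0 in the output is a path that outran kernel forwarding in that particular run.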

With the splice fast path, Portunus keeps pace with — and in some runs exceeds — kernel iptables forwarding. splice is a Linux system call that moves bytes from one socket to another without copying them into the application ("zero copy"); Portunus pairs it with an internal kernel pipe so a forwarded TCP connection's data never passes through userspace. That avoids the read → memcpy → write round-trip that limits every ordinary userspace TCP program on this host — iperf3 included. The proxied path also runs as three processes (iperf client → Portunus → iperf server), which spread across the 4 vCPUs better than the direct two-thread iperf3 baseline can — which is why Portunus can come in slightly above 100%.
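
The socket → pipe → socket pattern described above can be sketched in a few lines. This is an illustration only, not the Portunus implementation (which lives in the Rust portunus-forwarder crate): it uses Python's os.splice (Python 3.10+, Linux only), and the helper names tcp_pair and relay_once are hypothetical. Where splice is unavailable or fails, the sketch falls back to the ordinary recv/send copy, which is the userspace path the paragraph contrasts it with.

```python
import os
import socket


def tcp_pair():
    """Return a connected (client, server) pair of loopback TCP sockets."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    listener.close()
    return client, server


def relay_once(src, dst, max_bytes=65536):
    """Move up to max_bytes from src to dst; return (bytes_moved, path_used)."""
    pipe_r, pipe_w = os.pipe()
    try:
        try:
            # socket -> pipe: the payload stays in kernel buffers.
            moved = os.splice(src.fileno(), pipe_w, max_bytes)
        except (AttributeError, OSError):
            # No splice here: copy through userspace instead.
            data = src.recv(max_bytes)
            dst.sendall(data)
            return len(data), "userspace"
        remaining = moved
        while remaining > 0:
            # pipe -> socket: drain exactly what was queued in the pipe.
            remaining -= os.splice(pipe_r, dst.fileno(), remaining)
        return moved, "splice"
    finally:
        os.close(pipe_r)
        os.close(pipe_w)


# Usage: client -> [inbound | relay | outbound] -> target, all on loopback.
client, inbound = tcp_pair()
outbound, target = tcp_pair()
client.sendall(b"hello portunus")
moved, path = relay_once(inbound, outbound)
received = target.recv(64)
print(moved, path, received)
for s in (client, inbound, outbound, target):
    s.close()
```

A real forwarder would keep one pipe per direction for the life of the connection and loop until EOF; the single call here only shows the data movement.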

A rate-limited client rule (1,048,576 B/s cap) landed at 8.388 Mbit/s against an 8.389 Mbit/s target — inside the +10% acceptance bound, confirming the bandwidth cap is accurate.
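
The 8.389 Mbit/s target follows directly from the byte cap. The sketch below reproduces the conversion and one plausible reading of the acceptance check; the exact rule that scripts/perf_compare.py applies for within_plus_10pct is not shown in this report, so treat the comparison as an assumption.

```python
def cap_to_mbps(cap_bytes_per_sec: int) -> float:
    """Convert a byte-per-second cap to Mbit/s (decimal megabits)."""
    return cap_bytes_per_sec * 8 / 1_000_000


def within_plus_10pct(measured_mbps: float, target_mbps: float) -> bool:
    # Assumption: the run passes when the measured rate does not exceed
    # the target by more than 10%.
    return measured_mbps <= target_mbps * 1.10


target_mbps = cap_to_mbps(1_048_576)  # 1 MiB/s cap
print(round(target_mbps, 3))
print(within_plus_10pct(8.388, target_mbps))
```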

The single-run uncapped maximum at ~30 Gbit/s is dominated by scheduler noise: at these speeds the result depends on which core the OS happens to place each process on, so it varies run to run. Read the columns above as "all three are in the same range", not as a fixed ranking — see the standalone vs client repeats below, where the order flips from one run to the next.

standalone vs client

portunus-standalone (driven by a static TOML file, with no control plane) and portunus-client (which adds a bidirectional gRPC control stream that receives pushed rules) consume the same portunus-forwarder data plane end-to-end — the same proxy.rs / splice hot path. Once a rule is active and bytes are flowing, the forwarding code is byte-for-byte identical.

To tell a real difference apart from loopback noise, we ran six back-to-back uncapped A/B repeats (--seconds 10 --omit-seconds 2), plus the 30-second headline run — seven pairs in total:

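| Run | portunus-standalone (Mbit/s) | portunus-client (Mbit/s) | standalone / client |
| --- | --- | --- | --- |
| headline (30s) | 31,936 | 26,644 | 119.9% |
| rep 1 | 32,528 | 35,507 | 91.6% |
| rep 2 | 26,250 | 34,782 | 75.5% |
| rep 3 | 14,616 | 27,311 | 53.5% |
| rep 4 | 35,801 | 24,801 | 144.4% |
| rep 5 | 35,025 | 32,886 | 106.5% |
| rep 6 | 27,525 | 27,086 | 101.6% |
| median (rep 1 to 6) | ~30,027 | ~30,099 | ~100% |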

The standalone-to-client ratio swings from 53.5% to 144.4% — sometimes standalone wins, sometimes client does — yet the two medians land within ~0.2% of each other. There is no systematic data-plane throughput difference between the two builds. The wide swing is simply the uncapped ceiling at ~30 Gbit/s being governed by scheduler and core-placement luck, not by which binary is under test.
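
The medians and the ratio swing can be recomputed from the table. Working backwards from the figures, the ~30,027 and ~30,099 medians correspond to the six 10-second repeats without the 30-second headline run, so the sketch uses those six pairs.

```python
from statistics import median

# Six 10-second A/B repeats, in Mbit/s (rep 1 .. rep 6 from the table).
standalone = [32_528, 26_250, 14_616, 35_801, 35_025, 27_525]
client = [35_507, 34_782, 27_311, 24_801, 32_886, 27_086]

median_standalone = median(standalone)
median_client = median(client)
median_ratio_pct = 100.0 * median_standalone / median_client

# Per-run standalone / client ratio, to show how far single runs swing.
ratios_pct = [100.0 * s / c for s, c in zip(standalone, client)]

print(f"median standalone: {median_standalone:,.1f} Mbit/s")
print(f"median client:     {median_client:,.1f} Mbit/s")
print(f"median ratio:      {median_ratio_pct:.2f}%")
print(f"per-run ratio:     {min(ratios_pct):.1f}% .. {max(ratios_pct):.1f}%")
```

Individual runs differ by tens of percent while the medians differ by a fraction of a percent, which is the pattern expected from placement noise rather than a real difference between the builds.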

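The clean, low-noise proof is the offered-load sweep below: at every fixed paced rate from 100 Mbit/s to 15 Gbit/s, standalone and client both deliver the requested rate to within 0.05% of each other and of the iperf3 / iptables baselines. At 18 and 20 Gbit/s the spread across the four paths widens to about 1% and 3.5% respectively, with no consistent winner.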

The only structural difference is that portunus-client also runs the gRPC control stream and reports stats periodically. On a CPU-saturated host that is tiny background overhead — below the measurement noise of the data plane. For raw forwarding throughput, pick the build that fits your deployment model (central control plane vs static TOML); it is not a performance decision.

Offered-load sweep

A raw maximum answers "how fast can it possibly go?" The more useful operator question is "at my link speed, does the forwarder get in the way?" To answer it we use iperf3 -b <rate>, which paces the sender to a fixed rate — simulating a real WAN/VPS link of that speed — and check whether each path delivers it (--seconds 5 --omit-seconds 1). Same 11-point sweep, all four paths back-to-back:

| Offered load | iperf3 (Mbit/s) | iptables REDIRECT (Mbit/s) | portunus-standalone (Mbit/s) | portunus-client (Mbit/s) |
| --- | --- | --- | --- | --- |
| 100 Mbit/s | 100.02 | 100.02 | 100.02 | 100.02 |
| 500 Mbit/s | 499.82 | 499.92 | 499.90 | 500.05 |
| 1 Gbit/s | 999.94 | 999.83 | 999.82 | 999.78 |
| 2.5 Gbit/s | 2,499.94 | 2,499.75 | 2,499.89 | 2,499.75 |
| 5 Gbit/s | 4,999.57 | 4,999.71 | 4,999.48 | 4,999.68 |
| 7.5 Gbit/s | 7,499.46 | 7,499.55 | 7,499.48 | 7,499.30 |
| 10 Gbit/s | 9,998.99 | 9,999.29 | 9,999.61 | 9,999.33 |
| 12.5 Gbit/s | 12,499.27 | 12,499.22 | 12,499.17 | 12,501.03 |
| 15 Gbit/s | 14,999.09 | 14,997.84 | 14,999.13 | 14,999.24 |
| 18 Gbit/s | 17,846.25 | 18,001.85 | 18,013.34 | 18,030.70 |
| 20 Gbit/s | 19,998.82 | 19,924.28 | 19,481.14 | 20,180.43 |

Across the entire 100 Mbit/s → 20 Gbit/s range, all four paths hit the offered rate within iperf3's short-run measurement noise. The splice fast path keeps both Portunus builds level with the in-kernel iptables REDIRECT baseline through the whole sweep — on this 4-vCPU host there is no link speed at which the richer control plane costs measurable throughput. The small variation in the 18–20 Gbit/s rows (e.g. standalone 19,481 vs client 20,180 at 20 Gbit/s) is short-run pacing noise, not a ranking — the same noise that drives the uncapped-max swing above.
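
To put a number on "within noise", the sketch below computes, for each row of the sweep, the largest deviation of any path from the offered rate. The data is copied from the table; the helper name worst_deviation_pct is illustrative only.

```python
# offered rate (Mbit/s) -> delivered (iperf3, iptables REDIRECT, standalone, client)
sweep = {
    100: (100.02, 100.02, 100.02, 100.02),
    500: (499.82, 499.92, 499.90, 500.05),
    1000: (999.94, 999.83, 999.82, 999.78),
    2500: (2499.94, 2499.75, 2499.89, 2499.75),
    5000: (4999.57, 4999.71, 4999.48, 4999.68),
    7500: (7499.46, 7499.55, 7499.48, 7499.30),
    10000: (9998.99, 9999.29, 9999.61, 9999.33),
    12500: (12499.27, 12499.22, 12499.17, 12501.03),
    15000: (14999.09, 14997.84, 14999.13, 14999.24),
    18000: (17846.25, 18001.85, 18013.34, 18030.70),
    20000: (19998.82, 19924.28, 19481.14, 20180.43),
}


def worst_deviation_pct(offered: float, delivered: tuple) -> float:
    """Largest |delivered - offered| across the four paths, as % of offered."""
    return max(abs(d - offered) for d in delivered) / offered * 100.0


deviation = {rate: worst_deviation_pct(rate, row) for rate, row in sweep.items()}
for rate, pct in deviation.items():
    print(f"{rate:>6} Mbit/s  worst deviation {pct:.3f}%")
```

Through 15 Gbit/s every path lands within a few hundredths of a percent of the offered rate; only the 18 and 20 Gbit/s rows move into the percent range.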

This is the band the pre-splice v0.11 baseline could not hold: there, Portunus dropped to ~33–49% of iptables above 12.5 Gbit/s. The splice fast path (introduced in v1.3.0) closed that gap, and v1.6.1 re-confirms it on the same host.

What does NOT change with splice

  • Per-connection setup latency, half-close semantics, byte counters, Prometheus metrics, audit events, RBAC, and rate limiting are all unaffected. Capped rules stay on the original userspace path, byte-identical to v1.2.0. splice eligibility is decided per connection: it requires Linux && TCP && !PORTUNUS_DISABLE_SPLICE && no bandwidth cap on the rule or its owner. The concurrent_connections and new_connections_per_sec limits are checked when the connection is accepted and do not disable splice. See Rate Limiting & QoS — Interaction with the splice fast path.
  • Cross-platform behaviour is unchanged: macOS and Windows builds never use splice (it is gated to Linux with #[cfg(target_os = "linux")]; nm confirms zero splice symbols in the macOS release binary).
  • The operator surface is unchanged: no new rule field, no wire-protocol field, no Web UI control, no CLI flag. The PORTUNUS_DISABLE_SPLICE=1 environment variable exists only for troubleshooting and bench A/B testing (see Disabling the Linux fast path for triage).
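
The per-connection eligibility rule in the first bullet can be written as a small predicate. This is a sketch of the stated condition only, not the Rust code: the ConnContext type and its field names are hypothetical, and treating PORTUNUS_DISABLE_SPLICE as "disabled when set to 1" is an assumption based on the documented PORTUNUS_DISABLE_SPLICE=1 usage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnContext:
    """Hypothetical view of the facts known when a connection is accepted."""
    os_name: str                      # "linux", "macos", "windows", ...
    protocol: str                     # "tcp" or "udp"
    rule_has_bandwidth_cap: bool = False
    owner_has_bandwidth_cap: bool = False
    disable_splice_env: Optional[str] = None  # value of PORTUNUS_DISABLE_SPLICE


def splice_eligible(ctx: ConnContext) -> bool:
    """Linux && TCP && !PORTUNUS_DISABLE_SPLICE && no bandwidth cap (rule or owner)."""
    if ctx.os_name != "linux" or ctx.protocol != "tcp":
        return False
    if ctx.disable_splice_env == "1":  # assumption: "1" is the disabling value
        return False
    # concurrent_connections / new_connections_per_sec are accept-time checks
    # and deliberately do not appear here.
    return not (ctx.rule_has_bandwidth_cap or ctx.owner_has_bandwidth_cap)


print(splice_eligible(ConnContext("linux", "tcp")))
print(splice_eligible(ConnContext("linux", "tcp", owner_has_bandwidth_cap=True)))
```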

Method

Mature proxy performance reports separate methodology from results:

  • Record the system under test: commit, build profile, CPU, OS, kernel, tool versions, and relevant sysctls.
  • Use release binaries. Disable debug logging on the hot path.
  • Measure a direct baseline first, then the proxy path on the same host.
  • Use warm-up / omitted seconds so TCP slow-start and process startup do not dominate the sample.
  • Report throughput, setup / RTT latency, connection behaviour, and any retransmits or rejects.
  • Report the uncapped loopback maximum as a range, not a single number — at tens of Gbit/s over loopback it is dominated by scheduler noise.
  • Keep regression gates separate from absolute marketing numbers. CI catches drift; dedicated hardware establishes product claims.

The repo uses three layers:

| Layer | Command | Purpose |
| --- | --- | --- |
| Criterion TCP data plane | cargo bench -p portunus-client --bench data_plane -- --quick | Stable loopback proxy regression signal. |
| Criterion UDP data plane | cargo bench -p portunus-client --bench udp_data_plane -- --quick | UDP steady-state and RTT regression signal. |
| Real-process compare | scripts/perf_compare.py | Real portunus-server + portunus-client + portunus-standalone + iptables + iperf3 on one host. |

scripts/perf_compare.py is the v1.6.1 test harness. It drives the current CLI end-to-end — server.toml operator_token bootstrap, the operator HTTP POST /v1/client-enrollments enrollment flow, portunus-client enroll + bundle + push-rule, and a TOML-driven portunus-standalone rule — measuring the iperf3, iptables REDIRECT, standalone, and client paths back-to-back. (The older scripts/perf_loopback.py predates the v1.6.1 one-time enrollment URI flow and the SQLite state.db file lock, so its provision-client / bundle path no longer runs.)

Interpretation

Portunus is a userspace L4 forwarder: it accepts a TCP connection, dials the target, and moves bytes between the two sockets. That makes it far richer than a plain kernel NAT rule, but connection setup always runs in userspace, and on the non-splice paths the payload crosses the userspace boundary too, which a pure kernel rule never does.

How much each connection costs depends on what the rule asks for:

  • Plain TCP, no bandwidth cap, on Linux: the byte copy runs through the kernel splice + pipe path, so the payload never enters userspace. On this host the offered-load sweep shows both Portunus builds tracking iperf3 and iptables REDIRECT within iperf3 noise all the way to 20 Gbit/s.
  • Any bandwidth cap, or UDP, or macOS / Windows, or splice disabled: the original userspace path runs. For these cases, kernel NAT and nftables flowtables remain the performance ceiling — the userspace read → memcpy → write cycle adds a real per-connection cost.

Either way, Portunus is the right tool when you need remote client enrollment, central rule push, RBAC, an audit trail, per-owner QoS, metrics, SNI routing, the PROXY protocol, and a managed rule lifecycle. A static DNAT rule (a fixed kernel destination rewrite) gives you none of these — so comparing raw throughput alone is the wrong question, unless raw throughput is genuinely all you need.

Comparable Forwarders

| Dimension | Portunus (v1.6.1) | iptables DNAT / MASQUERADE | nftables + flowtables | NGINX stream | HAProxy TCP mode | Envoy tcp_proxy | rinetd | socat | SSH -L / -R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Performance profile | Userspace L4 with a Linux TCP splice fast path for uncapped flows. On the bench host tracks iptables REDIRECT to 20 Gbit/s; capped / UDP / non-Linux paths stay on the standard userspace copy. | Highest. Kernel path, no userspace copy. | Highest for eligible flows; bypasses parts of the classic forwarding path. | High userspace proxy with mature event loop. | Very high userspace TCP proxy with excellent connection handling. | High but heavier userspace proxy, designed for service mesh / xDS. | Lightweight userspace redirector. | Flexible diagnostic pipe, not tuned as a managed production proxy. | Encrypted tunnel; throughput pays SSH crypto overhead. |
| TCP | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| UDP | Yes (first-packet enforced) | Yes | Yes | Yes | No generic UDP forwarding | Dedicated UDP filters / use cases | Yes | Yes | No native UDP |
| Dynamic remote rule push | First-class: central server pushes signed rule bundles to edge clients over pinned TLS; CLI, operator HTTP API, embedded Web UI, hot-reload. | No built-in control plane | No built-in control plane | Reload/API depends on edition/config | Runtime API supports many operations | xDS control plane | Config reload style | No | Per-session |
| RBAC / audit / metrics | Native per-user / per-client / per-protocol / per-port-range RBAC; structured audit trail; Prometheus metrics; embedded SQLite store. | External only | External only | Metrics via modules; no native tenant RBAC | Strong stats, ACLs, stick tables | Rich telemetry and policy | Minimal | Minimal | SSH auth/logs |
| QoS / rate limit | Per-rule and per-owner: bandwidth_in/out_bps, new_connections_per_sec, concurrent_connections. Token-bucket limiter; capped rules go through the standard userspace path. | Basic shaping via tc / nftables ecosystem | Via nftables / tc, not app-owner aware | limit_conn / limit_rate style controls | Rich connection/rate controls | Rich filters, overload manager | Minimal | Minimal | Minimal |
| Best fit | Centrally managed edge listeners with tenant-aware RBAC, per-owner QoS, observability, SNI dispatch, and PROXY protocol; TCP/UDP, single ports or ranges, IP or DNS targets. | Static local host/network NAT. | Modern Linux packet filtering and forwarding. | Generic TCP/UDP proxying. | L4/L7 load balancing where TCP is enough. | Service mesh or xDS-managed environments. | Simple port redirection baseline. | Experiments and debugging. | Operationally convenient encrypted tunnels. |

A fair benchmark is not "compare published numbers." It is:

  1. The same host, or the same two-host topology.
  2. The same target application (iperf3 TCP, and UDP where supported).
  3. The same duration, warm-up, socket buffers, CPU governor, MTU, and kernel.
  4. A direct baseline first.
  5. One forwarding implementation at a time.

Reproduction

Local / VPS compare

# Debian/Ubuntu
sudo apt-get update
sudo apt-get install -y build-essential cmake pkg-config protobuf-compiler \
  git curl iperf3 iptables

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

git clone https://github.com/ZingerLittleBee/Portunus.git
cd Portunus
PORTUNUS_SKIP_WEBUI=1 cargo build --release \
  -p portunus-standalone -p portunus-server -p portunus-client

# /tmp may be tmpfs; portunus refuses a tmpfs-backed state.db. Use $HOME.
export TMPDIR="$HOME"

# Headline: direct / iptables / standalone / client + capped convergence.
# --with-iptables is Linux-only and needs root or passwordless sudo.
python3 scripts/perf_compare.py --seconds 30 --omit-seconds 5 \
  --with-iptables --cap-bytes-per-sec 1048576 \
  --json-out perf-headline.json

# Offered-load sweep across all four paths.
python3 scripts/perf_compare.py --seconds 5 --omit-seconds 1 \
  --with-iptables --cap-bytes-per-sec 0 \
  --offered-mbps 100,500,1000,2500,5000,7500,10000,12500,15000,18000,20000 \
  --json-out perf-sweep.json

Useful flags: --skip-client (standalone only), --skip-standalone (client only), and --server-bin / --client-bin / --standalone-bin to point at prebuilt binaries. The headline JSON looks like:

{
  "direct": { "mbps": 30889.985, "retransmits": 147 },
  "iptables_redirect": { "mbps": 26987.079, "retransmits": 16 },
  "iptables_vs_direct_pct": 87.37,
  "standalone_uncapped": { "mbps": 31935.997, "retransmits": 9 },
  "standalone_vs_direct_pct": 103.39,
  "standalone_vs_iptables_pct": 118.34,
  "client_uncapped": { "mbps": 26644.098, "retransmits": 4 },
  "client_vs_direct_pct": 86.25,
  "client_vs_iptables_pct": 98.73,
  "standalone_vs_client_pct": 119.86,
  "client_capped": {
    "cap_bytes_per_sec": 1048576, "mbps": 8.388,
    "target_mbps": 8.389, "within_plus_10pct": true
  }
}

For a two-host test, run portunus-server on the control host, run portunus-client (or portunus-standalone) and iperf3 -s on the edge/target host, then drive iperf3 -c <edge-listen-ip> -p <listen_port> -t 30 -O 5 -J from a third host. Keep a direct iperf3 to the target as the baseline.
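
For the two-host flow, the -J output of iperf3 can be reduced to the same mbps / retransmits shape used in the headline JSON. The sketch below assumes a TCP run and reads the receiver-side rate from end.sum_received and the retransmit count from end.sum_sent; whether scripts/perf_compare.py picks the same fields is an assumption, and the sample document is a trimmed, made-up stand-in for real iperf3 output.

```python
import json


def summarize_iperf3(raw_json: str) -> dict:
    """Reduce an `iperf3 -J` TCP result to throughput (Mbit/s) and retransmits."""
    end = json.loads(raw_json)["end"]
    received = end["sum_received"]   # what the receiver actually got
    sent = end["sum_sent"]           # retransmits are counted on the sender
    return {
        "mbps": round(received["bits_per_second"] / 1e6, 3),
        "retransmits": sent.get("retransmits"),
    }


# Trimmed, hypothetical sample; a real run carries many more fields.
sample = json.dumps({
    "end": {
        "sum_sent": {"bits_per_second": 30_891_000_000.0, "retransmits": 147},
        "sum_received": {"bits_per_second": 30_889_985_000.0},
    }
})

summary = summarize_iperf3(sample)
print(summary)
```

In practice the input would come from a file, for example the output of iperf3 -c <edge-listen-ip> -p <listen_port> -t 30 -O 5 -J redirected to disk, once for the direct baseline and once through the forwarder.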

Criterion Regression

cargo bench -p portunus-client --bench data_plane
python3 scripts/bench_regression_gate.py --max-regression-pct 50

cargo bench -p portunus-client --bench udp_data_plane -- --quick
cargo bench -p portunus-server --bench operator_api -- --quick

Historical baselines

The pre-v1.6.1 reference numbers — the v1.3.0 splice-introduction tables (measured on the original Debian 13 host) and the v0.11 pre-splice Linux iptables comparison — live on their own page: Performance History. Consult it only if you need the prior reference for traceability.
