“The latency floor of an IPC channel is set by the number of user/kernel boundary crossings per round trip, each carrying a minimum tax of 101 ns on the hardware, a tax that shared memory reduces to exactly zero.”
Abstract
The lock-free queue eliminates the mutex. It does not eliminate the kernel. This distinction is not semantic. It is the difference between 56 nanoseconds and 6 microseconds.
When two processes exchange data over a pipe or a Unix domain socket, every read and every write is a privilege
level transition: CPL3 to CPL0 and back. On an Intel i7-12650H with KPTI active, I isolated the minimum cost of a single
such transition at 101 ns using a getpid loop as a zero-payload baseline. A pipe round-trip requires four of these
transitions. The kernel charges that tax unconditionally, on every message, at every throughput level, regardless of how
lock-free the surrounding code is.
I benchmarked 1,000,000 ping-pong round-trips at 32 bytes across two transports under identical conditions: cores 8 and
9 pinned, SCHED_FIFO priority 99, RDTSCP-serialized measurement. perf stat confirms the crossing arithmetic exactly:
2,200,000 sys_enter_write and 2,200,000 sys_enter_read for pipe, 2,200,000 sys_enter_sendto and 2,200,000
sys_enter_recvfrom for Unix domain socket. Four crossings per round-trip, without exception. The measured cost follows
directly: pipe p50 at 6,236 ns, Unix socket p50 at 4,611 ns.
Tachyon, a shared-memory IPC library I authored, reaches 56.5 ns p50 on the same hardware. Its hot path contains no
syscall instruction. The kernel participates exactly once per session: a Unix domain socket handshake that transfers
an anonymous memfd file descriptor via SCM_RIGHTS, after which the socket is permanently discarded. All subsequent
I/O operates directly in the shared memory segment via acquire/release atomics on a lock-free SPSC ring. The crossing
count on the hot path is not reduced. It is zero.
This article dissects the mechanical cost across two failure modes: the pipe boundary tax and the socket boundary
tax. Correlating RTT measurements with exact syscall counts from perf stat and the per-crossing floor from the
getpid baseline, I map the precise relationship between crossing count and observed latency. The 110x delta
between pipe and Tachyon and the 82x delta between Unix socket and Tachyon are not caused by a faster algorithm or a
smarter buffer. They are caused by the absence of the syscall instruction from the hot path.
The kernel is an extraordinary piece of engineering. It should not be on your data path.
1. The Tax Notice
When a developer profiles an IPC-heavy system and finds latency above acceptable bounds, the standard response is to reach for a lock-free queue. Remove the mutex, eliminate the contention, recover the throughput. The logic is sound. The scope is wrong.
A mutex is an OS-level synchronization primitive. Removing it eliminates one category of kernel involvement. It does not eliminate the others.
The Pipe
The pipe is the oldest IPC primitive on Unix. Its interface is two file descriptors, its semantics are a byte stream, and its cost is invisible until measured.
I passed 32-byte payloads through a ping-pong loop: producer writes, consumer reads and echoes, producer reads the echo. One million round-trips, cores 8 and 9 pinned, SCHED_FIFO priority 99.
Pipe RTT (1,000,000 samples, 32 bytes)
Min 3,426 ns
p50 6,236 ns
p90 6,651 ns
p99 7,411 ns
p99.9 10,298 ns
6.2 microseconds at p50 for 32 bytes between two processes on the same die. On a 2.7 GHz P-core, that is roughly 16,800 cycles spent moving data across a file descriptor boundary. Before examining the mechanism, the number alone should provoke a question: where did the cycles go?
The Unix Domain Socket
The Unix domain socket is the conventional upgrade from pipe for structured IPC. Same machine, same kernel, different abstraction.
Unix Socket RTT (1,000,000 samples, 32 bytes)
Min 3,573 ns
p50 4,611 ns
p90 4,834 ns
p99 8,730 ns
p99.9 9,842 ns
4.6 microseconds. Faster than pipe at the median, but the same fundamental cost structure: the data crosses a kernel boundary on every send and again on every receive.
The Actual Floor
Both transports were benchmarked against Tachyon, a shared-memory IPC library I built, on the same machine under identical pinning conditions.
Tachyon RTT (1,000,000 samples, 32 bytes)
p50 56.5 ns
Pipe is 110x slower. Unix socket is 82x slower. The payload is identical. The cores are the same. The ring buffer semantics are equivalent. The only structural difference is whether the data path crosses the user/kernel boundary or not.
The rest of this article is the proof of that single claim.
2. The Cost of a Privilege Transition
The syscall instruction is not expensive because it is complex. It is expensive because it is a contract between two
worlds that were designed to be isolated from each other.
When a process executes syscall, the CPU does not simply jump to a kernel function. It executes a state transition
enforced in silicon:
- The privilege level switches from CPL3 to CPL0. The CPU loads the kernel entry point from the IA32_LSTAR MSR and the kernel code segment selector from IA32_STAR. SWAPGS exchanges the user-space GS base with the kernel's GS base, giving the kernel access to its per-CPU data structures.
- On kernels with KPTI active (the Meltdown mitigation present on virtually every production deployment since 2018), the CPU switches CR3 to the kernel page table, flushing a portion of the TLB in the process.
- The kernel handler executes.
- The return path mirrors all of this in reverse: CR3 switches back to the user page table, SWAPGS restores the user GS base, and privilege returns to CPL3.
To isolate the cost of this sequence without payload interference, I ran getpid in a tight loop: one syscall per
iteration, no data copy, no scheduling, no blocking.
constexpr int N = 10'000'000;
uint64_t t0 = __rdtsc();
for (int i = 0; i < N; ++i) {
    getpid();
}
uint64_t t1 = __rdtsc();
// 0.3721 ns per TSC tick on this machine
printf("getpid avg: %.1f ns\n", (t1 - t0) * 0.3721 / N);
getpid avg: 101.0 ns
101 nanoseconds. That is the hardware floor for a single CPL3->CPL0->CPL3 round-trip on this machine with KPTI active, under zero memory pressure, with a hot TLB and no competing threads.
What 101 ns Predicts, and What It Does Not
A pipe round-trip has four crossings. The arithmetic predicts 4 x 101 = 404 ns. The measurement delivers 6,236 ns. The
ratio is 15x.
The gap is not measurement error. The getpid baseline strips away every cost that is not the privilege transition
itself. A real write syscall does more: it validates the file descriptor, acquires the pipe’s internal lock, copies
the payload into the kernel pipe buffer, and wakes any process sleeping on the read end. read mirrors that sequence.
Both crossings carry data. Both crossings interact with kernel data structures that may not be in the L1 cache of either
core.
The 101 ns is the tax rate. The 6,236 ns is the full bill.
Counting the Crossings
perf stat makes the crossing count exact and incontrovertible:
sudo perf stat -e syscalls:sys_enter_write,syscalls:sys_enter_read ./bench_ipc pipe
2,200,001 syscalls:sys_enter_write
2,200,005 syscalls:sys_enter_read
2,200,000 is (100,000 warmup + 1,000,000 samples) x 2. The count is exact. Every single round-trip in the benchmark
crossed the kernel boundary four times. There are no batched sends, no deferred flushes, no path through which a message
transits without a privilege transition.
sudo perf stat -e syscalls:sys_enter_sendto,syscalls:sys_enter_recvfrom ./bench_ipc uds
2,200,000 syscalls:sys_enter_sendto
2,200,001 syscalls:sys_enter_recvfrom
Identical arithmetic for the Unix domain socket. The abstraction is different. The crossing count is not.
These two perf stat outputs are the forensic record. The latency in section 1 is not a profiling artifact or a
scheduler anomaly. It is the deterministic consequence of four mandatory privilege transitions per round-trip, each
carrying its own minimum tax.
3. Inside the Crossing: What the Kernel Actually Does
The arithmetic is 4 x 101 = 404 ns. The measurement is 6,236 ns. The ratio is 15x. The 101 ns getpid baseline is
the floor of the privilege transition in isolation: no data, no contention, no waiting. A real write on a pipe is not
a bare privilege transition. It is a sequence of kernel operations, each carrying its own cost, stacked on top of that
floor.
The Pipe Write Path
When write enters the kernel with a 32-byte payload destined for a pipe, the kernel executes pipe_write in
fs/pipe.c. The first operation is a mutex acquisition: pipe->mutex is a full sleeping mutex, not a spinlock. If the
consumer process is currently inside pipe_read holding the same mutex, the producer blocks immediately and a context
switch follows.
Once the lock is acquired, the kernel checks whether the pipe buffer has space. A pipe buffer on Linux defaults to 16
pages (65,536 bytes). For a 32-byte payload there is always space, so no blocking occurs on the capacity check. The
kernel copies the payload from user space into the pipe’s internal buffer via copy_from_user. This is not a memcpy.
copy_from_user includes a fault handler path, an access check, and must cross the user/kernel address space boundary
under whatever page table configuration KPTI enforces. For 32 bytes the copy itself is cheap. The overhead is the
machinery surrounding it.
After the copy, the kernel must wake the consumer. pipe_write calls wake_up_interruptible on the pipe’s read wait
queue. This traverses the wait queue list, finds the consumer’s task_struct, and calls try_to_wake_up. If the
consumer is sleeping on a different physical core, try_to_wake_up must send an inter-processor interrupt to notify
that core’s scheduler. The scheduler on Core 9 receives the IPI, evaluates the consumer’s priority, and schedules it.
The consumer core exits its idle state, restores the consumer process context, and begins executing pipe_read.
pipe_read mirrors this sequence: mutex acquisition, copy_to_user from the pipe buffer to user space, wake the
producer’s wait queue. Two more crossings, two more copies, one more mutex cycle.
The total is not four privilege transitions. It is four privilege transitions plus two mutex acquisitions, four address-space copies, two wait queue traversals, and at minimum one round of cross-core scheduler signaling per round-trip. The 5,832 ns above the 404 ns floor is not waste. It is the cost of a correctly implemented kernel IPC mechanism doing exactly what it was designed to do.
Why the Unix Domain Socket is Faster
The Unix domain socket at p50 is 4,611 ns against the pipe’s 6,236 ns. The 26% reduction has a specific mechanical cause.
A Unix domain socket in SOCK_STREAM mode uses a per-socket spinlock rather than a sleeping mutex for its internal
state. On a lightly contended path between two processes on adjacent cores, a spinlock resolves in tens of nanoseconds
rather than hundreds. The pipe mutex can block and yield the CPU. The socket spinlock spins and returns.
The buffer management also differs. unix_stream_sendmsg in the kernel writes into an sk_buff chain managed
per-socket, which avoids the page-aligned pipe buffer accounting overhead for small payloads. For 32 bytes, the socket
path through the kernel is shorter by several function call levels than pipe_write.
The fundamental cost structure is identical: four CPL3 to CPL0 crossings, four copies, scheduler involvement on wakeup. The socket is faster because each of those operations is lighter, not because any of them is absent.
This is the distinction that matters. Optimizing within the kernel boundary reduces the per-crossing overhead. It does not change the crossing count. At 56.5 ns, Tachyon is not a faster implementation of the same mechanism. It is a different mechanism entirely.
4. The One-Shot Bootstrap
Tachyon uses a Unix domain socket exactly once. Not as a data channel. As a syringe.
The connection sequence is a single sendmsg call that transfers one file descriptor from producer to consumer via
SCM_RIGHTS. SCM_RIGHTS is a Linux ancillary data mechanism that passes open file descriptors between processes
through the kernel’s file descriptor table, without copying the underlying resource. The producer holds an anonymous
memfd region. After the handshake, both processes map the same physical pages.
struct msghdr msg{};
char buf[CMSG_SPACE(sizeof(int))] = {};
struct iovec io = { const_cast<void *>(
reinterpret_cast<const void *>(&handshake)), sizeof(TachyonHandshake) };
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
std::memcpy(CMSG_DATA(cmsg), &shm_fd, sizeof(int));
::sendmsg(client_sock, &msg, 0);
::close(client_sock);
::close(sock);
::unlink(addr.sun_path);
The three lines after sendmsg are the architectural statement. The client socket is closed. The listening socket is
closed. The socket file is unlinked from the filesystem. From this point forward, the socket path does not exist. A
second connect attempt returns ENOENT.
The handshake payload is a 20-byte struct:
struct TachyonHandshake {
uint32_t magic; // 0x54414348 ("TACH")
uint32_t version; // 0x02
uint32_t capacity; // ring size in bytes
uint32_t shm_size; // sizeof(MemoryLayout) + capacity
uint32_t msg_alignment; // TACHYON_MSG_ALIGNMENT (64)
};
If magic, version, or msg_alignment do not match on the consumer side, the connection is rejected before the first
byte of data is exchanged. The ABI contract is enforced at connection time, not at runtime.
The memfd created by the producer is anonymous. It has no filesystem path. It cannot be opened by a third process. It
exists only as a file descriptor in the kernel’s file table, mapped into both address spaces via mmap(MAP_SHARED). On
Linux, memfd_create is called with MFD_ALLOW_SEALING, and F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL are applied
immediately after ftruncate. The region cannot be resized by either party after the handshake completes.
This is the only moment the kernel is involved in the data path. One sendmsg. One recvmsg. The socket is discarded.
Everything that follows operates in user space.
5. The Silent Hot Path
The difference between 56 ns and 6,236 ns is visible in the binary.
Every write call in bench_ipc resolves through three levels of indirection. The call site in the producer loop jumps
to the PLT stub:
call 4003a0 <write@plt>
The PLT stub resolves to __write in libc:
__write:
endbr64
movsxd rdi, edi
xor r9d, r9d
xor r8d, r8d
xor ecx, ecx
sub rsp, 0x8
push 0x1
call 6ec70 <__syscall_cancel> ; privilege transition
__syscall_cancel issues the syscall instruction. The CPU switches from CPL3 to CPL0. The kernel validates the file
descriptor, acquires the pipe's internal mutex, copies the payload into the kernel pipe buffer, and wakes the
consumer if it is sleeping. The return path restores user state. On a kernel with KPTI active, CR3 switches twice per
crossing.
This executes four times per round-trip. There is no path through which a message transits a pipe without it.
The Tachyon Commit
tachyon_commit_tx writes the message header into the ring and increments the batch counter. Under pure-spin mode with
fewer than 32 pending messages, this is the complete execution path:
tachyon_commit_tx:
mov rax, QWORD PTR [rdi+0x80] ; load tx_reserved_size
lea rcx, [rax-0x40]
cmp rsi, rcx ; validate actual_size <= reserved - header
ja 5448 ; reject if oversized
mov rcx, QWORD PTR [rdi+0x40] ; load shm base
mov r8, QWORD PTR [rdi+0x48] ; load local_head
and r8, QWORD PTR [rdi+0x50] ; apply capacity mask
mov DWORD PTR [rcx+r8*1+0x300], esi ; write size
mov DWORD PTR [rcx+r8*1+0x304], edx ; write type_id
mov DWORD PTR [rcx+r8*1+0x308], eax ; write reserved_size
add QWORD PTR [rdi+0x48], rax ; advance local_head by reserved size
mov QWORD PTR [rdi+0x80], 0x0 ; clear tx_reserved_size
inc rax ; pending_tx++
cmp rax, 0x20 ; pending_tx < 32?
jae 546d ; flush if batch full
ret ; <- returns here on hot path
No call. No syscall. No privilege transition. The message header is written to the ring buffer with three MOV
instructions and the function returns. The kernel has no knowledge this occurred.
The syscall instruction exists in tachyon_commit_tx, at address 0x54e5. It is a futex(FUTEX_WAKE). It is reached
only when two conditions are simultaneously true: the batch counter has reached 32, and the consumer has set its
consumer_sleeping flag to CONSUMER_SLEEPING rather than CONSUMER_PURE_SPIN. In a benchmarking configuration with
both sides in pure-spin mode, the branch at 0x54a5 never fires. The lock or DWORD PTR [rsp-0x40], 0x0 at 0x54a7,
the x86 encoding of a full memory barrier, is equally skipped. The hot path is MOV instructions and a comparison.
This is not an optimization. It is the consequence of confining the kernel to the bootstrap phase and removing it from the data path entirely. There is nothing to call. The shared memory region is already mapped in both address spaces. The consumer is already polling. The producer writes, increments, and returns.
perf stat makes the absence countable:
pipe (1,100,000 round-trips): 2,200,001 sys_enter_write + 2,200,005 sys_enter_read
Tachyon (1,000,000 messages): 0 sys_enter on hot path
The crossing count is not reduced. It is zero.
Conclusion: Count the Crossings
The 110x delta between pipe and Tachyon is not a benchmark artifact. It is long division.
A pipe round-trip has four user/kernel boundary crossings. Each crossing carries a minimum hardware tax of 101 ns, measured in isolation on this machine. Four crossings plus kernel buffer management, scheduler wakeups, and TLB pressure under KPTI produce a p50 of 6,236 ns. The arithmetic is not subtle. The measurement confirms it to within the expected margin.
Tachyon has zero crossings on the hot path. perf stat confirms zero sys_enter events during message transfer. The
assembly confirms no call to any syscall wrapper in the commit path. The 56.5 ns p50 is the cost of writing three
DWORD values into a shared memory region, incrementing a counter, and returning. That is the entire operation.
The practical consequence for latency-sensitive systems is direct:
- Count your crossings before profiling anything else. A single write per message is a syscall. A single read is another. If your IPC path calls into the kernel on every message, the microsecond floor is structural, not tunable. No thread priority, no CPU isolation, no compiler flag removes it.
- Lock-free is not kernel-free. An atomic queue between threads needs no syscall. An atomic queue between processes over a pipe or socket does. The synchronization primitive is not the cost. The transport is.
- The bootstrap cost is paid once. The Unix domain socket handshake that transfers the memfd is a syscall. It happens once per session. Amortized across 1,000,000 messages, it contributes zero nanoseconds to the per-message latency. Design your IPC so the kernel is present at connection time and absent at runtime.
- Verify with perf stat: count syscalls:sys_enter_write, syscalls:sys_enter_read, syscalls:sys_enter_sendto, and syscalls:sys_enter_recvfrom. If these counters scale linearly with your message count, the kernel is on your critical path. The fix is not in the code. It is in the transport.
The kernel is correct, reliable, and battle-tested across decades of production workloads. It is also, on a modern P-core, 101 nanoseconds per crossing at minimum. In a system exchanging a million messages per second, that minimum compounds to 404 ms of mandatory kernel time per second of operation, before a single byte of payload is processed.
Shared memory does not make the kernel faster. It invoices the kernel once, at connection time, and never again.
The crossing count is the latency floor. Make it zero.