“The latency floor of an IPC channel is set by the number of user/kernel boundary crossings per round trip, each carrying a minimum tax of 101 ns on the hardware, a tax that shared memory reduces to exactly zero.”
Abstract
The lock-free queue eliminates the mutex. It does not eliminate the kernel. This distinction is not semantic. It is the difference between 56 nanoseconds and 6 microseconds.
When two processes exchange data over a pipe or a Unix domain socket, every read and every write is a privilege
level transition: CPL3 to CPL0 and back. On an Intel i7-12650H with KPTI active, I isolated the minimum cost of a single
such transition at 101 ns using a getpid loop as a zero-payload baseline. A pipe round-trip requires four of these
transitions. The kernel charges that tax unconditionally, on every message, at every throughput level, regardless of how
lock-free the surrounding code is.
I benchmarked 1,000,000 ping-pong round-trips at 32 bytes across two transports under identical conditions: cores 8 and
9 pinned, SCHED_FIFO priority 99, RDTSCP-serialized measurement. perf stat confirms the crossing arithmetic exactly:
2,200,000 sys_enter_write and 2,200,000 sys_enter_read for pipe, 2,200,000 sys_enter_sendto and 2,200,000
sys_enter_recvfrom for Unix domain socket. Four crossings per round-trip, without exception. The measured cost follows
directly: pipe p50 at 6,236 ns, Unix socket p50 at 4,611 ns.
Tachyon, a shared-memory IPC library I authored, reaches 56.5 ns p50 on the same hardware. Its hot path contains no
syscall instruction. The kernel participates exactly once per session: a Unix domain socket handshake that transfers
an anonymous memfd file descriptor via SCM_RIGHTS, after which the socket is permanently discarded. All subsequent
I/O operates directly in the shared memory segment via acquire/release atomics on a lock-free SPSC ring. The crossing
count on the hot path is not reduced. It is zero.
This article dissects the mechanical cost across two failure modes: the pipe boundary tax and the socket boundary
tax. Correlating RTT measurements with exact syscall counts from perf stat and the per-crossing floor from the
getpid baseline, I map the precise relationship between crossing count and observed latency. The 110x delta
between pipe and Tachyon and the 82x delta between Unix socket and Tachyon are not caused by a faster algorithm or a
smarter buffer. They are caused by the absence of the syscall instruction from the hot path.
The kernel is an extraordinary piece of engineering. It should not be on your data path.
1. The Tax Notice
When a developer profiles an IPC-heavy system and finds latency above acceptable bounds, the standard response is to reach for a lock-free queue. Remove the mutex, eliminate the contention, recover the throughput. The logic is sound. The scope is wrong.
A mutex is an OS-level synchronization primitive. Removing it eliminates one category of kernel involvement. It does not eliminate the others.
The Pipe
The pipe is the oldest IPC primitive on Unix. Its interface is two file descriptors, its semantics are a byte stream, and its cost is invisible until measured.
I passed 32-byte payloads through a ping-pong loop: producer writes, consumer reads and echoes, producer reads the echo. One million round-trips, cores 8 and 9 pinned, SCHED_FIFO priority 99.
Pipe RTT (1,000,000 samples, 32 bytes)
Min 3,426 ns
p50 6,236 ns
p90 6,651 ns
p99 7,411 ns
p99.9 10,298 ns
6.2 microseconds at p50 for 32 bytes between two processes on the same die. On a 2.7 GHz P-core, that is roughly 16,800 cycles spent moving data across a file descriptor boundary. Before examining the mechanism, the number alone should provoke a question: where did the cycles go?
The Unix Domain Socket
The Unix domain socket is the conventional upgrade from pipe for structured IPC. Same machine, same kernel, different abstraction.
Unix Socket RTT (1,000,000 samples, 32 bytes)
Min 3,573 ns
p50 4,611 ns
p90 4,834 ns
p99 8,730 ns
p99.9 9,842 ns
4.6 microseconds. Faster than pipe at the median, but the same fundamental cost structure: the data crosses a kernel boundary on every send and again on every receive.
The Actual Floor
Both transports were benchmarked against Tachyon, a shared-memory IPC library I built, on the same machine under identical pinning conditions.
Tachyon RTT (1,000,000 samples, 32 bytes)
p50 56.5 ns
Pipe is 110x slower. Unix socket is 82x slower. The payload is identical. The cores are the same. The ring buffer semantics are equivalent. The only structural difference is whether the data path crosses the user/kernel boundary or not.
The rest of this article is the proof of that single claim.
2. The Cost of a Privilege Transition
The syscall instruction is not expensive because it is complex. It is expensive because it is a contract between two
worlds that were designed to be isolated from each other.
When a process executes syscall, the CPU does not simply jump to a kernel function. It executes a state transition
enforced in silicon:
- The privilege level switches from CPL3 to CPL0. The CPU loads the kernel entry point from the IA32_LSTAR MSR and the kernel code segment selector from IA32_STAR. SWAPGS exchanges the user-space GS base with the kernel's GS base, giving the kernel access to its per-CPU data structures.
- On kernels with KPTI active (the Meltdown mitigation present on virtually every production deployment since 2018), the CPU switches CR3 to the kernel page table, flushing a portion of the TLB in the process.
- The kernel handler executes.
- The return path mirrors all of this in reverse: CR3 switches back to the user page table, SWAPGS restores the user GS base, and privilege returns to CPL3.
To isolate the cost of this sequence without payload interference, I ran getpid in a tight loop: one syscall per
iteration, no data copy, no scheduling, no blocking.
constexpr int N = 10'000'000;
uint64_t t0 = __rdtsc();
for (int i = 0; i < N; ++i) {
    getpid();
}
uint64_t t1 = __rdtsc();
// 0.3721 ns per TSC tick on this machine
printf("getpid avg: %.1f ns\n", (t1 - t0) * 0.3721 / N);
getpid avg: 101.0 ns
101 nanoseconds. That is the hardware floor for a single CPL3->CPL0->CPL3 round-trip on this machine with KPTI active, under zero memory pressure, with a hot TLB and no competing threads.
What 101 ns Predicts, and What It Does Not
A pipe round-trip has four crossings. The arithmetic predicts 4 x 101 = 404 ns. The measurement delivers 6,236 ns. The
ratio is 15x.
The gap is not measurement error. The getpid baseline strips away every cost that is not the privilege transition
itself. A real write syscall does more: it validates the file descriptor, acquires the pipe’s internal lock, copies
the payload into the kernel pipe buffer, and wakes any process sleeping on the read end. read mirrors that sequence.
Both crossings carry data. Both crossings interact with kernel data structures that may not be in the L1 cache of either
core.
The 101 ns is the tax rate. The 6,236 ns is the full bill.
Counting the Crossings
perf stat makes the crossing count exact and incontrovertible:
sudo perf stat -e syscalls:sys_enter_write,syscalls:sys_enter_read ./bench_ipc pipe
2,200,001 syscalls:sys_enter_write
2,200,005 syscalls:sys_enter_read
2,200,000 is (100,000 warmup + 1,000,000 samples) x 2. The count is exact. Every single round-trip in the benchmark
crossed the kernel boundary four times. There are no batched sends, no deferred flushes, no path through which a message
transits without a privilege transition.
sudo perf stat -e syscalls:sys_enter_sendto,syscalls:sys_enter_recvfrom ./bench_ipc uds
2,200,000 syscalls:sys_enter_sendto
2,200,001 syscalls:sys_enter_recvfrom
Identical arithmetic for the Unix domain socket. The abstraction is different. The crossing count is not.
These two perf stat outputs are the forensic record. The latency in section 1 is not a profiling artifact or a
scheduler anomaly. It is the deterministic consequence of four mandatory privilege transitions per round-trip, each
carrying its own minimum tax.
3. Inside the Crossing: What the Kernel Actually Does
The arithmetic is 4 x 101 = 404 ns. The measurement is 6,236 ns. The ratio is 15x. The 101 ns getpid baseline is
the floor of the privilege transition in isolation: no data, no contention, no waiting. A real write on a pipe is not
a bare privilege transition. It is a sequence of kernel operations, each carrying its own cost, stacked on top of that
floor.
The Pipe Write Path
When write enters the kernel with a 32-byte payload destined for a pipe, the kernel executes pipe_write in
fs/pipe.c. The first operation is a mutex acquisition: pipe->mutex is a full sleeping mutex, not a spinlock. If the
consumer process is currently inside pipe_read holding the same mutex, the producer blocks immediately and a context
switch follows.
Once the lock is acquired, the kernel checks whether the pipe buffer has space. A pipe buffer on Linux defaults to 16
pages (65,536 bytes). For a 32-byte payload there is always space, so no blocking occurs on the capacity check. The
kernel copies the payload from user space into the pipe’s internal buffer via copy_from_user. This is not a memcpy.
copy_from_user includes a fault handler path, an access check, and must cross the user/kernel address space boundary
under whatever page table configuration KPTI enforces. For 32 bytes the copy itself is cheap. The overhead is the
machinery surrounding it.
After the copy, the kernel must wake the consumer. pipe_write calls wake_up_interruptible on the pipe’s read wait
queue. This traverses the wait queue list, finds the consumer’s task_struct, and calls try_to_wake_up. If the
consumer is sleeping on a different physical core, try_to_wake_up must send an inter-processor interrupt to notify
that core’s scheduler. The scheduler on Core 9 receives the IPI, evaluates the consumer’s priority, and schedules it.
The consumer core exits its idle state, restores the consumer process context, and begins executing pipe_read.
pipe_read mirrors this sequence: mutex acquisition, copy_to_user from the pipe buffer to user space, wake the
producer’s wait queue. Two more crossings, two more copies, one more mutex cycle.
The total is not four privilege transitions. It is four privilege transitions plus two mutex acquisitions, four address-space copies, two wait queue traversals, and at minimum one round of cross-core scheduler signaling per round-trip. The 5,832 ns above the 404 ns floor is not waste. It is the cost of a correctly implemented kernel IPC mechanism doing exactly what it was designed to do.
Why the Unix Domain Socket is Faster
The Unix domain socket at p50 is 4,611 ns against the pipe’s 6,236 ns. The 26% reduction has a specific mechanical cause.
A Unix domain socket in SOCK_STREAM mode uses a per-socket spinlock rather than a sleeping mutex for its internal
state. On a lightly contended path between two processes on adjacent cores, a spinlock resolves in tens of nanoseconds
rather than hundreds. The pipe mutex can block and yield the CPU. The socket spinlock spins and returns.
The buffer management also differs. unix_stream_sendmsg in the kernel writes into an sk_buff chain managed
per-socket, which avoids the page-aligned pipe buffer accounting overhead for small payloads. For 32 bytes, the socket
path through the kernel is shorter by several function call levels than pipe_write.
The fundamental cost structure is identical: four CPL3 to CPL0 crossings, four copies, scheduler involvement on wakeup. The socket is faster because each of those operations is lighter, not because any of them is absent.
This is the distinction that matters. Optimizing within the kernel boundary reduces the per-crossing overhead. It does not change the crossing count. At 56.5 ns, Tachyon is not a faster implementation of the same mechanism. It is a different mechanism entirely.
4. The One-Shot Bootstrap
Tachyon uses a Unix domain socket exactly once. Not as a data channel. As a syringe.
The connection sequence is a single sendmsg call that transfers one file descriptor from producer to consumer via
SCM_RIGHTS. SCM_RIGHTS is a Linux ancillary data mechanism that passes open file descriptors between processes
through the kernel’s file descriptor table, without copying the underlying resource. The producer holds an anonymous
memfd region. After the handshake, both processes map the same physical pages.
struct msghdr msg{};
char buf[CMSG_SPACE(sizeof(int))] = {};
struct iovec io = { const_cast<void *>(
reinterpret_cast<const void *>(&handshake)), sizeof(TachyonHandshake) };
msg.msg_iov = &io;
msg.msg_iovlen = 1;
msg.msg_control = buf;
msg.msg_controllen = sizeof(buf);
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
std::memcpy(CMSG_DATA(cmsg), &shm_fd, sizeof(int));
::sendmsg(client_sock, &msg, 0);
::close(client_sock);
::close(sock);
::unlink(addr.sun_path);
The three lines after sendmsg are the architectural statement. The client socket is closed. The listening socket is
closed. The socket file is unlinked from the filesystem. From this point forward, the socket path does not exist. A
second connect attempt returns ENOENT.
The handshake payload is a 20-byte struct:
struct TachyonHandshake {
uint32_t magic; // 0x54414348 ("TACH")
uint32_t version; // 0x02
uint32_t capacity; // ring size in bytes
uint32_t shm_size; // sizeof(MemoryLayout) + capacity
uint32_t msg_alignment; // TACHYON_MSG_ALIGNMENT (64)
};
If magic, version, or msg_alignment do not match on the consumer side, the connection is rejected before the first
byte of data is exchanged. The ABI contract is enforced at connection time, not at runtime.
The memfd created by the producer is anonymous. It has no filesystem path. It cannot be opened by a third process. It
exists only as a file descriptor in the kernel’s file table, mapped into both address spaces via mmap(MAP_SHARED). On
Linux, memfd_create is called with MFD_ALLOW_SEALING, and F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL are applied
immediately after ftruncate. The region cannot be resized by either party after the handshake completes.
This is the only moment the kernel is involved in the data path. One sendmsg. One recvmsg. The socket is discarded.
Everything that follows operates in user space.
5. The Silent Hot Path
The difference between 56 ns and 6,236 ns is visible in the binary.
Every write call in bench_ipc resolves through three levels of indirection. The call site in the producer loop jumps
to the PLT stub:
call 4003a0 <write@plt>
The PLT stub resolves to __write in libc:
__write:
endbr64
movsxd rdi, edi
xor r9d, r9d
xor r8d, r8d
xor ecx, ecx
sub rsp, 0x8
push 0x1
call 6ec70 <__syscall_cancel> ; privilege transition
__syscall_cancel issues the syscall instruction. The CPU switches from CPL3 to CPL0. The kernel validates the file
descriptor, acquires the pipe's internal mutex, copies the payload into the kernel pipe buffer, and wakes the
consumer if it is sleeping. The return path restores user state. On a kernel with KPTI active, CR3 switches twice per
crossing.
This executes four times per round-trip. There is no path through which a message transits a pipe without it.
The Tachyon Commit
tachyon_commit_tx writes the message header into the ring and increments the batch counter. Under pure-spin mode with
fewer than 32 pending messages, this is the complete execution path:
tachyon_commit_tx:
mov rax, QWORD PTR [rdi+0x80] ; load tx_reserved_size
lea rcx, [rax-0x40]
cmp rsi, rcx ; validate actual_size <= reserved - header
ja 5448 ; reject if oversized
mov rcx, QWORD PTR [rdi+0x40] ; load shm base
mov r8, QWORD PTR [rdi+0x48] ; load local_head
and r8, QWORD PTR [rdi+0x50] ; apply capacity mask
mov DWORD PTR [rcx+r8*1+0x300], esi ; write size
mov DWORD PTR [rcx+r8*1+0x304], edx ; write type_id
mov DWORD PTR [rcx+r8*1+0x308], eax ; write reserved_size
add QWORD PTR [rdi+0x48], rax ; advance local_head by reserved size
mov QWORD PTR [rdi+0x80], 0x0 ; clear tx_reserved_size
inc rax ; pending_tx++
cmp rax, 0x20 ; pending_tx < 32?
jae 546d ; flush if batch full
ret ; <- returns here on hot path
No call. No syscall. No privilege transition. The message header is written to the ring buffer with three MOV
instructions and the function returns. The kernel has no knowledge this occurred.
The syscall instruction exists in tachyon_commit_tx, at address 0x54e5. It is a futex(FUTEX_WAKE). It is reached
only when two conditions are simultaneously true: the batch counter has reached 32, and the consumer has set its
consumer_sleeping flag to CONSUMER_SLEEPING rather than CONSUMER_PURE_SPIN. In a benchmarking configuration with
both sides in pure-spin mode, the branch at 0x54a5 never fires. The lock or DWORD PTR [rsp-0x40], 0x0 at 0x54a7,
the x86 encoding of a full memory barrier, is equally skipped. The hot path is MOV instructions and a comparison.
This is not an optimization. It is the consequence of confining the kernel to the bootstrap phase and removing it from the data path entirely. There is nothing to call. The shared memory region is already mapped in both address spaces. The consumer is already polling. The producer writes, increments, and returns.
perf stat makes the absence countable:
pipe (1,100,000 round-trips): 2,200,001 sys_enter_write + 2,200,005 sys_enter_read
Tachyon (1,000,000 messages): 0 sys_enter on hot path
The crossing count is not reduced. It is zero.
Conclusion: Count the Crossings
The 110x delta between pipe and Tachyon is not a benchmark artifact. It is long division.
A pipe round-trip has four user/kernel boundary crossings. Each crossing carries a minimum hardware tax of 101 ns, measured in isolation on this machine. Four crossings plus kernel buffer management, scheduler wakeups, and TLB pressure under KPTI produce a p50 of 6,236 ns. The arithmetic is not subtle. The measurement confirms it to within the expected margin.
Tachyon has zero crossings on the hot path. perf stat confirms zero sys_enter events during message transfer. The
assembly confirms no call to any syscall wrapper in the commit path. The 56.5 ns p50 is the cost of writing three
DWORD values into a shared memory region, incrementing a counter, and returning. That is the entire operation.
The practical consequence for latency-sensitive systems is direct:
- Count your crossings before profiling anything else. A single write per message is a syscall. A single read is another. If your IPC path calls into the kernel on every message, the microsecond floor is structural, not tunable. No thread priority, no CPU isolation, no compiler flag removes it.
- Lock-free is not kernel-free. An atomic queue between threads needs no syscall. An atomic queue between processes over a pipe or socket does. The synchronization primitive is not the cost. The transport is.
- The bootstrap cost is paid once. The Unix domain socket handshake that transfers the memfd is a syscall. It happens once per session. Amortized across 1,000,000 messages, it contributes zero nanoseconds to the per-message latency. Design your IPC so the kernel is present at connection time and absent at runtime.
- Verify with perf stat: count syscalls:sys_enter_write, syscalls:sys_enter_read, syscalls:sys_enter_sendto, and syscalls:sys_enter_recvfrom. If these counters scale linearly with your message count, the kernel is on your critical path. The fix is not in the code. It is in the transport.
The kernel is correct, reliable, and battle-tested across decades of production workloads. It is also, on a modern P-core, 101 nanoseconds per crossing at minimum. In a system exchanging a million messages per second, that minimum compounds to 404 ms of mandatory kernel time per second of operation, before a single byte of payload is processed.
Shared memory does not make the kernel faster. It invoices the kernel once, at connection time, and never again.
The crossing count is the latency floor. Make it zero.