Instruction count is a static property of your code. Execution time is a dynamic property of your dependency graph.


Abstract

When optimizing hot paths, engineering teams instinctively count instructions. This reflex is rooted in a reasonable intuition: fewer operations should mean faster execution. It relies on a mental model inherited from the era of scalar, in-order processors — a model that evaluates performance under a silently lethal assumption: that all instructions of the same type carry the same cost.

On modern superscalar Out-of-Order architectures, this assumption is physically false.

The instruction-count model possesses a fundamental blind spot: the distinction between Latency and Throughput. In Agner Fog’s instruction tables for Intel Alder Lake, the two-operand register form imul r64, r64 carries a latency of 3 cycles and a reciprocal throughput of 1 cycle. These are not the same number. They are not interchangeable. They describe two entirely different physical phenomena — and confusing them is the difference between a correctly optimized hot path and a serialized, stalled pipeline.

When a developer chains dependent imul instructions — where each multiplication consumes the result of the previous one — the Out-of-Order engine is powerless. The Reservation Station holds every subsequent instruction in a stall, waiting for the 3-cycle latency to resolve. The port is available. The execution unit is idle. The silicon is ready. But the Read-After-Write (RAW) hazard enforces a strict serial order that no amount of Out-of-Order magic can bypass. The result is an IPC of 0.33 on a machine capable of dispatching 6 instructions per cycle — a utilization rate of 5.5%.

This technical analysis deconstructs the physical cost of dependency chains across two independent failure modes: integer serialization via chained imul, and polynomial critical path collapse via Horner’s method. By benchmarking both against structurally independent equivalents under strict core-pinning conditions, we expose a 2.9x integer performance delta and a 1.97x polynomial delta, both driven entirely by dependency graph topology. We demonstrate how to recover the full throughput budget by transforming serial dependency chains into parallel Independent Instruction Streams. Correlating x86-64 assembly with llvm-mca cycle-accurate simulation across all four variants, we prove a truth that no profiler will surface directly: your bottleneck is not the instruction — it is the edge in the dependency graph.


1. The Two Traps of Instruction-Level Thinking

When a developer profiles a hot loop and finds it dominated by multiplications, the natural response is to minimize the number of imul instructions. The logic is intuitive: reduce the operation count, reduce the runtime.

This logic is correct in exactly one scenario — when every instruction is independent. In any other case, it is dangerously incomplete.

Trap 1: Counting Instructions Instead of Measuring the Critical Path

Consider the innermost loop of a dependency chain benchmarked under 1,000,000 iterations:

uint64_t a = x;
for (int i = 0; i < RUNS; ++i) {
    asm volatile("imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 "imulq %1, %0\n\t"
                 : "+r"(a)
                 : "r"(x)
    );
}

12 multiplications per iteration. On paper: 12 operations. On silicon, under strict core-pinning to a P-core with LFENCE/RDTSCP serialized measurement:

  • Dependent IMUL chain: 22.2 cycles/iter (TSC)

The developer who counts instructions sees 12 multiplications and expects roughly 12 cycles. The processor delivers 22. The discrepancy is not measurement noise — it is the physical signature of a serialized pipeline.

Trap 2: Misreading the Instruction Tables

Open Agner Fog’s instruction tables for Intel Alder Lake P-core and locate the imul family. This is where most engineers make a critical error — not in the reading, but in the selection.

imul is not a single instruction. It is a family of encodings with fundamentally different silicon characteristics:

Form                      Latency    RThroughput    Port
imul r64, r64               3           1.00        ALU0
imul r64, r64, imm          4           2.00        ALU0

The three-operand immediate form — imul rax, rbx, 7 — has a latency of 4 cycles and a reciprocal throughput of 2 cycles. A developer profiling a codebase that mixes both forms and extracts a single number from the table has already introduced a measurement error before touching the hardware.

Throughout this benchmark, every imul is the two-operand register-register form: imul rax, rdi. Latency 3 cycles, reciprocal throughput 1 cycle, confirmed empirically by llvm-mca on ADLPPort01. This distinction is not pedantic — it is the difference between predicting 36 cycles per dependent chain and predicting 48.

With the correct form identified, the two columns still describe entirely different physical phenomena.

Latency is the number of cycles a dependent instruction must wait before its result is available to a consumer. If instruction B reads the output of instruction A, and A has a latency of 3, then B cannot begin executing until cycle 4 relative to A’s start.

Reciprocal Throughput is the minimum number of cycles between two independent invocations of the same instruction on the same execution port. A throughput of 1.00 means the port can accept a new imul every single cycle — but only if the inputs are ready.

These two properties describe the two axes of execution: time and bandwidth. Latency governs the critical path of a dependency chain. Throughput governs the saturation rate of an execution port when instructions are independent.

The performance of any instruction sequence is determined by whichever constraint is binding. A chain of 12 dependent imul r64, r64 instructions is bounded entirely by latency: 12 × 3 = 36 cycles. The throughput constraint — 1 cycle per instruction — is completely irrelevant, because the dependency chain prevents the port from ever seeing two independent instructions in sequence.
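This binding-constraint arithmetic can be folded into a first-order cost model. The sketch below is for intuition only: it ignores the front end, memory, and port conflicts, and the function names are mine, not the benchmark's.

```cpp
#include <algorithm>

// First-order cost of N dependent ops: the full latency, every time.
constexpr int chain_cycles(int ops, int latency) {
    return ops * latency;
}

// First-order cost of N independent ops on one port: one issue slot per
// reciprocal-throughput interval, floored by a single op's latency.
constexpr int stream_cycles(int ops, double rthroughput, int latency) {
    return std::max(static_cast<int>(ops * rthroughput), latency);
}
```

For the 12-multiply loop this predicts 36 cycles dependent versus 12 independent; for the degree-7 FMA chain, 28 cycles. Both predictions line up with the llvm-mca simulations in section 5 to within loop overhead.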

The Reality: Bending Software to Hardware

Systems engineering does not accept either trap. We do not accept the 22-cycle serialized chain of the dependent loop. We equally reject the misinterpretation of RThroughput = 1.00 as a flat cost of 1 cycle per instruction.

We need the throughput regime of independent instructions, achieved through a register topology that eliminates the RAW hazard.

  • Independent IMUL (12 accumulators): 7.5 cycles/iter (TSC)

By replacing one accumulator with twelve independent ones — the exact same number of imul instructions, the same inputs, the same output volume — we compress 22 cycles to 7.5: a 2.9x multiplier driven entirely by dependency graph restructuring.
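The same transformation applies outside inline assembly. Below is a minimal C++ sketch of accumulator splitting (function names are illustrative, not from the benchmark). Because uint64_t multiplication wraps mod 2^64 and remains associative, both variants return identical bits.

```cpp
#include <cstddef>
#include <cstdint>

// A single running product serializes: every multiply carries a RAW
// edge back to the previous one.
uint64_t product_single(const uint64_t* v, size_t n) {
    uint64_t acc = 1;
    for (size_t i = 0; i < n; ++i)
        acc *= v[i];                      // waits full latency each step
    return acc;
}

// Four accumulators give the scheduler four independent RAW chains
// to interleave on the multiply port.
uint64_t product_split4(const uint64_t* v, size_t n) {
    uint64_t a0 = 1, a1 = 1, a2 = 1, a3 = 1;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {          // no data edges between chains
        a0 *= v[i];
        a1 *= v[i + 1];
        a2 *= v[i + 2];
        a3 *= v[i + 3];
    }
    for (; i < n; ++i) a0 *= v[i];        // remainder
    return a0 * a1 * a2 * a3;             // single merge at the end
}
```

The merge at the end reintroduces a short dependency chain, but its cost is paid once per call rather than once per element.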


2. The Mechanics of the Out-of-Order Engine: Ports, Reservation Stations, and the Critical Path

To understand precisely why the dependency chain collapses to 22 cycles, you must stop thinking of instructions as a sequential list. You must think in terms of a dataflow graph.

The Execution Substrate

A modern superscalar P-core (Intel Golden Cove in Alder Lake, AMD Zen 5) does not execute instructions in program order. The Out-of-Order engine decodes the instruction stream and dispatches micro-ops into the Reservation Station (RS) — a holding buffer that tracks operand readiness. Once all input operands for a given micro-op are marked as ready by the Register File or the Writeback Bus, the RS dispatches the micro-op to the appropriate Execution Port.

On Alder Lake P-cores, imul r64, r64 dispatches exclusively to ADLPPort01. This port has a throughput of 1 imul per cycle and a pipeline depth of 3 cycles. In the absence of dependencies, it can accept a new multiplication every cycle. Floating-point FMA (vfmadd213sd) differs: it dispatches to either ADLPPort00 or ADLPPort01, with a reciprocal throughput of 0.50 — meaning two independent FMAs can begin execution each cycle, one per port.

The critical insight: the Reservation Station can only dispatch an instruction when its operands are available. If instruction B depends on instruction A, and A has not yet written its result to the register file, B remains in the RS — occupying a slot, blocking the scheduler, consuming power — for the full latency duration of A.

The Critical Path and ILP

The critical path of a computation is the longest chain of dependent operations in the dataflow graph. No amount of Out-of-Order execution, no number of execution ports, no microarchitectural sophistication can execute a dependent chain faster than its critical path length.

Instruction-Level Parallelism (ILP) is the inverse: independent instructions, with no data edges between them, can be dispatched simultaneously to available ports. The OoO engine’s job is to find and exploit ILP — but it can only exploit what the programmer has made available in the dependency structure of the code.

If you hand the engine a chain of 12 dependent imul instructions, the dependency graph is a straight line. The ILP is exactly 1. Every instruction waits for the previous one. The OoO engine, for all its complexity, is reduced to an in-order scalar processor.

If you hand it 12 independent multiplications, the dependency graph has no edges between them. The ILP is 12. The engine dispatches all 12 to the same port in a rolling window of 1 cycle each, hiding the 3-cycle latency entirely behind the instruction stream of subsequent iterations.


3. Case Study: Horner vs. Estrin — The Same Math, Physically Opposite

The clearest real-world manifestation of latency-vs-throughput is polynomial evaluation. Both methods below evaluate the same degree-7 polynomial over the same input. Both produce numerically equivalent results. Their runtime on silicon differs by 1.97x, and the faster variant actually issues more instructions: the delta is driven by dependency topology, not operation count.

The Horner Trap: Optimal Algebra, Pathological Physics

Horner’s method is the textbook algorithm for polynomial evaluation. It minimizes the number of multiplications via recursive factoring:

P(x) = c₀ + x(c₁ + x(c₂ + x(c₃ + x(c₄ + x(c₅ + x(c₆ + x·c₇))))))

In C++:

double bench_horner(const double x) {
    double res = x;
    for (int i = 0; i < RUNS; ++i) {
        res = c0 + res * (c1 + res * (c2 + res * (c3 + res * (c4 + res * (c5 + res * (c6 + res * c7))))));
        DoNotOptimize(res);
    }
    return res;
}

The dependency structure is a straight chain. Each FMA feeds directly into the next. The critical path is 7 × latency(vfmadd213sd). On Alder Lake, FMA latency is 4 cycles:

Critical path = 7 × 4 = 28 cycles/iter
  • Horner benchmark: 17.7 cycles/iter (TSC) — 28.3 cycles/iter (llvm-mca)

The Estrin Solution: Restructuring the Tree

Estrin’s scheme evaluates the same polynomial by splitting it into independent sub-expressions that form a balanced binary tree:

double bench_estrin(const double x) {
    double res = x;
    for (int i = 0; i < RUNS; ++i) {
        double r2 = res * res;
        double r4 = r2 * r2;

        double sub1 = (c0 + c1 * res);   // independent
        double sub2 = (c2 + c3 * res);   // independent
        double sub3 = (c4 + c5 * res);   // independent
        double sub4 = (c6 + c7 * res);   // independent

        res = (sub1 + sub2 * r2) + (sub3 + sub4 * r2) * r4;
        DoNotOptimize(res);
    }
    return res;
}

sub1, sub2, sub3, sub4 share no data edges. They can dispatch simultaneously. The critical path collapses from a linear chain of 7 to a binary tree of depth 3:

Critical path = 3 × 4 = 12 cycles/iter
  • Estrin benchmark: 8.97 cycles/iter (TSC) — 14.3 cycles/iter (llvm-mca)

The polynomial is mathematically identical. The execution time differs by 1.97x — not because Estrin does fewer operations, but because it exposes ILP that Horner structurally forbids.
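"Numerically equivalent" deserves one caveat: the two association orders round differently in the last bits, so agreement is to within floating-point tolerance, not guaranteed bit-exact. A self-contained cross-check, with illustrative coefficients and function names (not the benchmark's c0..c7):

```cpp
// Horner: depth-7 serial chain, one FMA feeding the next.
double horner(double x, const double c[8]) {
    double r = c[7];
    for (int i = 6; i >= 0; --i)
        r = r * x + c[i];
    return r;
}

// Estrin: shared powers plus four independent fused pairs,
// merged in a depth-3 tree.
double estrin(double x, const double c[8]) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double s0 = c[0] + c[1] * x;   // the four pairs share no data edges
    double s1 = c[2] + c[3] * x;
    double s2 = c[4] + c[5] * x;
    double s3 = c[6] + c[7] * x;
    return (s0 + s1 * x2) + (s2 + s3 * x2) * x4;
}
```

Expanding Estrin's tree term by term recovers c0 + c1x + ... + c7x⁷, the same polynomial Horner factors.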


4. Assembly Proof: The Dependency Made Visible

Dependent IMUL — One Register, One Chain

The compiler generates a textbook RAW dependency chain. Every imul writes to rax. Every subsequent imul reads from rax. There are no independent paths — the Reservation Station has no choice but to serialize:

; bench_dependent_imul hot loop — compiled -O3 -march=native
    mov    ecx, 0xf4240           ; loop counter = 1,000,000
    mov    rax, rdi               ; a = x (initial accumulator)
    imul   rax, rdi               ; a *= x  ← RAW: rax written here
    imul   rax, rdi               ; a *= x  ← RAW: must wait 3cy for [400570]
    imul   rax, rdi               ;          ← must wait 3cy for [400574]
    imul   rax, rdi               ;          ← must wait 3cy for [400578]
    imul   rax, rdi               ;          ← must wait 3cy for [40057c]
    imul   rax, rdi               ;          ← must wait 3cy for [400580]
    imul   rax, rdi               ;          ← must wait 3cy for [400584]
    imul   rax, rdi               ;          ← must wait 3cy for [400588]
    imul   rax, rdi               ;          ← must wait 3cy for [40058c]
    imul   rax, rdi               ;          ← must wait 3cy for [400590]
    imul   rax, rdi               ;          ← must wait 3cy for [400594]
    imul   rax, rdi               ;          ← must wait 3cy for [400598]
                                  ; critical path: 12 × 3 = 36 cycles/iter
    dec    ecx
    jne    400570
    ret

The entire loop body is a single dependency chain of depth 12. The OoO engine sees one edge after another, each enforcing a 3-cycle stall. ADLPPort01 is available every cycle — but the Reservation Station cannot dispatch to it. The port starves.

Independent IMUL — 12 Registers, Zero Edges

The independent version uses 12 distinct accumulators, each holding a separate chain rooted in its own register. No instruction reads the output of any other:

; bench_independent_imul hot loop — compiled -O3 -march=native
    push   rbp / r15 / r14 / r12 / rbx   ; 12 accumulators need callee-saved regs
    lea    r11, [rdi+0x1]                 ; b = x+1  ← distinct register
    lea    rsi, [rdi+0x2]                 ; c = x+2
    lea    r15, [rdi+0x3]                 ; d = x+3
    lea    rcx, [rdi+0x4]                 ; e = x+4
    lea    r14, [rdi+0x5]                 ; f = x+5
    lea    rbx, [rdi+0x6]                 ; g = x+6
    lea    rax, [rdi+0x7]                 ; h = x+7
    lea    r10, [rdi+0x8]                 ; i = x+8
    lea    r9,  [rdi+0x9]                 ; j = x+9
    lea    r8,  [rdi+0xa]                 ; k = x+10
    lea    rdx, [rdi+0xb]                 ; l = x+11
    mov    ebp, 0xf4240                   ; loop counter
    mov    r12, rdi                       ; a = x
    imul   r12, rdi                       ; a *= x  ← no dependency on b,c,d...
    imul   r11, rdi                       ; b *= x  ← no dependency on a
    imul   rsi, rdi                       ; c *= x  ← no dependency on a,b
    imul   r15, rdi                       ; d *= x  ← independent
    imul   rcx, rdi                       ; e *= x  ← independent
    imul   r14, rdi                       ; f *= x  ← independent
    imul   rbx, rdi                       ; g *= x  ← independent
    imul   rax, rdi                       ; h *= x  ← independent
    imul   r10, rdi                       ; i *= x  ← independent
    imul   r9,  rdi                       ; j *= x  ← independent
    imul   r8,  rdi                       ; k *= x  ← independent
    imul   rdx, rdi                       ; l *= x  ← independent
                                          ; 12 imul dispatched in rolling window: ~12 cycles
    dec    ebp
    jne    4005f0

The Reservation Station sees 12 micro-ops with no mutual dependencies. It dispatches all 12 to ADLPPort01 in a rolling window — one per cycle, back-to-back. The 3-cycle execution latency of each imul is fully pipelined behind the dispatch stream of the next iteration. The port is saturated. The critical path is max(12 × 1, 3) = 12 cycles — throughput-bound, not latency-bound.

Horner — 7 FMA Instructions, 1 Register, 1 Chain

; bench_horner hot loop — compiled -O3 -march=native
    vmovapd xmm9, xmm1                   ; 1. xmm9 = c0 (fresh copy each iter)
    vfmadd213sd xmm9, xmm0, xmm2         ; 2. xmm9 = xmm9*x + c1  ← depends on [1]
    vfmadd213sd xmm9, xmm0, xmm3         ; 3. xmm9 = xmm9*x + c2  ← depends on [2], wait 4cy
    vfmadd213sd xmm9, xmm0, xmm4         ; 4.                       ← depends on [3], wait 4cy
    vfmadd213sd xmm9, xmm0, xmm5         ; 5.                       ← depends on [4], wait 4cy
    vfmadd213sd xmm9, xmm0, xmm6         ; 6.                       ← depends on [5], wait 4cy
    vfmadd213sd xmm9, xmm0, xmm7         ; 7.                       ← depends on [6], wait 4cy
    vfmadd213sd xmm9, xmm0, xmm8         ; 8.                       ← depends on [7], wait 4cy
    vmovapd xmm0, xmm9                   ; 9. writeback for next iter ← depends on [8]
    dec    eax
    jne    4006a0

Seven vfmadd213sd instructions. All writing to xmm9. All reading from xmm9. The register file sees a single RAW chain of length 7. The Reservation Station cannot dispatch instruction [k+1] until instruction [k] has completed its 4-cycle FMA pipeline and written its result back. Every port slot between [k]’s dispatch and [k+1]’s eligibility is wasted.

Estrin — Same Operations, Broken Chain

; bench_estrin hot loop — compiled -O3 -march=native
    vmulsd  xmm9,  xmm0, xmm0            ; 1. r2 = x²
    vmulsd  xmm10, xmm9, xmm9            ; 2. r4 = r2²            ← depends on [1] only
    vmovapd xmm11, xmm1
    vfmadd213sd xmm11, xmm0, xmm2        ; 3. sub1 = c0 + c1*x    ← INDEPENDENT of [1],[2]
    vmovapd xmm12, xmm3
    vfmadd213sd xmm12, xmm0, xmm4        ; 4. sub2 = c2 + c3*x    ← INDEPENDENT
    vmovapd xmm13, xmm5
    vfmadd213sd xmm13, xmm0, xmm6        ; 5. sub3 = c4 + c5*x    ← INDEPENDENT
    vfmadd213sd xmm0,  xmm7, xmm8        ; 6. sub4 = c6 + c7*x    ← INDEPENDENT
    vfmadd213sd xmm12, xmm9, xmm11       ; 7. sub1 + sub2*r2       ← depends on [1],[3],[4]
    vfmadd213sd xmm0,  xmm9, xmm13       ; 8. sub3 + sub4*r2       ← depends on [1],[5],[6]
    vfmadd213sd xmm0,  xmm10, xmm12      ; 9. final result          ← depends on [2],[7],[8]
    dec    eax
    jne    400720

Instructions [3], [4], [5], [6] share no data edges with each other, and no edges with [1] or [2]. The OoO engine dispatches all four FMAs in parallel across Port00 and Port01. The critical path runs through [1] → [2] → [9] (equivalently [1] → [8] → [9]): a depth of 3 floating-point operations, not 7.

The same silicon. The same ports. The same execution units. A dependency graph restructured from depth-7 to depth-3, and the runtime collapses by 1.97x without removing a single instruction.


5. Hardware Telemetry: The llvm-mca Smoking Gun

Architectural theory is meaningless until validated. To expose the micro-architectural cost with cycle-accurate precision, we model all four variants through llvm-mca against the Alder Lake P-core (ADLP) execution model.
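The timelines in this section were produced by feeding each extracted loop body to llvm-mca. An invocation along these lines reproduces them (the .s filenames match the variants listed below; -mcpu=alderlake selects the ADLP scheduling model):

```shell
# Simulate 10 iterations of the dependent-chain loop against the
# Alder Lake P-core model; --timeline prints the per-instruction
# dispatch/execute/retire trace.
llvm-mca -mcpu=alderlake --timeline --iterations=10 dep.s
```

The same command over indep.s, horner.s, and estrin.s yields the remaining three reports.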

perf stat — The Global IPC Verdict

Performance counter stats for 'taskset -c 8 ./bench':
    548,080,540      cpu_core/cycles/
    323,594,072      cpu_core/instructions/

    IPC = 323,594,072 / 548,080,540 = 0.590

A CPU with a dispatch width of 6 and a theoretical peak IPC above 4 is executing at 0.590 IPC across all four tests combined. It is operating at less than 15% of its execution capacity. Every wasted cycle is a direct consequence of a RAW hazard in a dependency chain that the programmer could have eliminated.

llvm-mca --timeline — Four Variants, Two Realities

Dependent IMUL (dep.s) — 10 iterations:

Total Cycles: 363     IPC: 0.33    uOps/Cycle: 0.33
Port: ADLPPort01 exclusively — RThroughput 1.00

[0,0]  DeeeER                               imul rax, rdi   ← executes cycles 1–3
[0,1]  D===eeeER                            imul rax, rdi   ← stalls 3 cycles on [0,0]
[0,2]  D======eeeER                         imul rax, rdi   ← stalls 6 cycles
[0,3]  D=========eeeER                      imul rax, rdi   ← stalls 9 cycles
...
[0,11] D================================eeeER               ← stalls 33 cycles before executing

Average wait time: 170 cycles.

Independent IMUL (indep.s) — 10 iterations:

Total Cycles: 125     IPC: 0.96    uOps/Cycle: 0.96
Port: ADLPPort01 exclusively — RThroughput 1.00

[0,0]  DeeeER     imul r12, rdi   ← dispatches cycle 1, no wait
[0,1]  D=eeeER    imul r11, rdi   ← 1 cycle queue slot only
[0,2]  D==eeeER   imul rsi, rdi
...
[0,11] D==========eeeER           ← dispatches cycle 12

Average wait time: 51 cycles.

Horner (horner.s) — 10 iterations:

Total Cycles: 283     IPC: 0.25    uOps/Cycle: 0.25
Ports: ADLPPort00 + ADLPPort01 — FMA RThroughput 0.50 (dual-port)

[0,0]  DeeeeER        vfmadd213sd xmm9, xmm0, xmm2   ← executes, 4cy latency
[0,1]  D====eeeeER    vfmadd213sd xmm9, xmm0, xmm3   ← stalls 4cy on [0,0]
[0,2]  D========eeeeER             ...                ← stalls 8cy
[0,3]  D============eeeeER         ...                ← stalls 12cy
[0,4]  D================eeeeER     ...                ← stalls 16cy
[0,5]  D====================eeeeER ...                ← stalls 20cy
[0,6]  D========================eeeeER                ← stalls 24cy before executing

Average wait time: 133.7 cycles. 28.3 cycles/iter — matches 7 × 4 = 28 exactly.

Estrin (estrin.s) — 10 iterations:

Total Cycles: 143     IPC: 0.63    uOps/Cycle: 0.63
Ports: ADLPPort00 + ADLPPort01 — FMA RThroughput 0.50 (dual-port)

[0,0]  DeeeeER        vmulsd  xmm9, xmm0, xmm0    ← r2 = x²
[0,1]  D====eeeeER    vmulsd  xmm10, xmm9, xmm9   ← r4, depends on [0,0]
[0,2]  DeeeeE----R    vfmadd213sd xmm11 ...        ← sub1: INDEPENDENT, executes immediately
[0,3]  D=eeeeE---R    vfmadd213sd xmm12 ...        ← sub2: INDEPENDENT
[0,4]  D=eeeeE---R    vfmadd213sd xmm13 ...        ← sub3: INDEPENDENT
[0,5]  D==eeeeE--R    vfmadd213sd xmm0  ...        ← sub4: INDEPENDENT
[0,6]  D====eeeeER    vfmadd213sd xmm12 ...        ← sub1+sub2*r2
[0,7]  D=====eeeeER   vfmadd213sd xmm0  ...        ← sub3+sub4*r2
[0,8]  D=========eeeeER             ...            ← final result

Average wait time: 60.2 cycles. 14.3 cycles/iter.

The Estrin timeline requires careful reading. Instructions [0,2] through [0,5] show the pattern DeeeeE----R. The E fires immediately — the sub-expression executes at full speed with no dependency stall. The ---- is not a stall. It is a retirement wait: these instructions finished executing before older instructions in the Reorder Buffer were ready to commit. The execution ports are completely free during those cycles. This is the hardware signature of ILP working correctly — the independent sub-expressions execute so fast that they outpace the retirement queue.

The Complete Telemetry Table

--- Benchmark: 1,000,000 iterations, Alder Lake P-core (taskset -c 8) ---

                              TSC cy/iter   llvm-mca cy/iter   llvm-mca IPC
Dependent IMUL  (12 ops)         22.2            36.3              0.33
Independent IMUL (12 ops)         7.5            12.5              0.96
─────────────────────────────────────────────────────────────────────────
Ratio                            2.96x           2.90x             2.9x
Theoretical minimum (throughput)   —             12.0              1.00

---

Horner polynomial  (7 FMA)       17.7            28.3              0.25
Estrin polynomial  (7 FMA)        8.97           14.3              0.63
─────────────────────────────────────────────────────────────────────────
Ratio                            1.97x           1.98x             2.5x
Theoretical minimum (depth-3)      —             12.0               —

Note: imul r64,r64 — latency=3, RThroughput=1.00, Port01 only.
      imul r64,r64,imm — latency=4, RThroughput=2.00 — not benchmarked here.
      vfmadd213sd — latency=4, RThroughput=0.50, Port00+Port01 (dual-port).
TSC cycles reflect reference clock; llvm-mca reflects core-clock simulation.

The 2.9x integer delta and 1.97x polynomial delta both require zero architectural changes and zero instruction count reduction. They require only the recognition that a RAW edge in the dependency graph costs its full latency — every time, unconditionally, regardless of how sophisticated the OoO engine is.


Conclusion: The Dependency Graph is the Program

The performance deltas measured here — 2.9x on integer multiplication, 1.97x on polynomial evaluation — are not micro-benchmarking artifacts. They are the fundamental physical contract of modern Out-of-Order execution.

In latency-critical domains — high-frequency trading, real-time signal processing, kernel-level packet classification — this distinction is not academic. A hot path dominated by chained FMAs in a risk calculation, a pricing engine, or an order book update routine will execute at a fraction of the hardware’s capacity if its dependency structure is not deliberately designed.

True high-performance engineering demands a shift in perspective. Instruction count is a first-order approximation. Dependency graph depth is the ground truth.

The practical discipline is direct:

  1. Read the full table. When you consult Agner Fog, read both columns — latency and throughput — and verify you are reading the correct encoding. The imul family alone spans multiple entries with materially different characteristics. The binding constraint in your loop depends on whether your instructions form a chain or a parallel stream.
  2. Expose ILP structurally. Horner is mathematically optimal. Estrin is physically optimal. The compiler will not make this transformation for you: floating-point reassociation changes rounding behavior, so it is invalid at -O3 without -ffast-math. You must restructure the expression explicitly.
  3. Verify with llvm-mca. The --timeline output makes the dependency chain visible as a physical artifact, not an abstraction. If you see = characters growing iteration over iteration, you are looking at your critical path. If you see E----R, the ILP is working — those are instructions that finished executing before the retirement queue could keep up.
  4. Design the dataflow, not the control flow. Your hot loop is not a sequence of instructions. It is a graph of data dependencies. Every edge in that graph is a constraint on the scheduler. Every unnecessary edge is cycles left on the table.

The OoO engine is among the most sophisticated pieces of engineering in modern silicon. It reorders instructions, renames registers, speculates across branches, and dispatches to a dozen execution ports simultaneously. It will extract every available cycle of parallelism from the instruction stream you give it.

But it cannot break a data dependency. That is your job.

Profile your instructions. Audit your edges.