Skip to content

cybertronai/ByteDMD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

78 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A cost model of complexity for the 21st century: ByteDMD

Data movement matters more than FLOPs. Recently accessed bytes can be cached, penalize non-local reads using the following cost model:

$$C=\sum_{b \in bytes} \sqrt{D(b)}$$

where $D(b)$ is the depth of byte $b$ in the LRU stack. Square-root is motivated by VLSI routing cost in 2D.

Usage

from bytedmd import bytedmd

def dot(a, b):
    return sum(i1*i2 for (i1,i2) in zip(a,b))

a = [0, 1]
b = [2, 3]

# dot product
assert dot(a,b) == 3

# ByteDMD cost of dot product
assert bytedmd(dot, (a, b)) == 12

Interactive visualization: dot product on the LRU stack — animated walkthrough of how the stack evolves during a dot product computation.

Motivation

Modern architectures spend more energy moving data than doing arithmetic, making FLOP counts an outdated cost metric. Bill Dally (ACM Opinion) proposed penalizing data movement based on 2D spatial distance to the processor. To avoid manual spatial mapping, Ding and Smith (Beyond Time Complexity, 2022) automated this via Data Movement Distance (DMD): a rule treating memory as an LRU stack where reading a byte at depth $d$ costs $\sqrt{d}$, modeling a cache laid out in 2D.

To avoid floating point issues, we round up to the nearest integer.

ByteDMD

This rounding corresponds to routing wire length on a 2D grid with LRU stack arranged in the following order.

ByteDMD

Computation Model

An idealized processor operates directly on an element-level LRU stack. Computations and writes are free; only memory reads incur a cost.

  • Stack State: Ordered from least recently used (bottom) to most recently used (top). Depth is measured in bytes from the top (topmost byte = depth 1). Multi-byte scalars are treated as contiguous blocks of bytes.
  • Eager initialization: Arguments are loaded onto the stack left to right — the first argument sits at the top (depth 1). All input elements are live and addressable from the start.
  • Read Cost: Reading a byte at depth $d$ costs $\lceil\sqrt{d}\rceil$.
  • Simultaneous pricing: All inputs to an instruction are priced against the stack state before any LRU bumping. This guarantees commutativity: Cost(a+b) == Cost(b+a).
  • Only live contribute to depth of the stack: Any value that's dead (no longer used) is immediately removed from the stack and remaining elements slide up to close the gap. This models an optimal compiler that keeps the stack clamped to the active working set.

Instruction Semantics

See Instruction Set for the complete list of supported instructions.

For an instruction with inputs $x_1, \dots, x_m$ and outputs $y_1, \dots, y_n$ with $m\ge 1, n\ge 0$

  1. Price reads: Evaluate $\sum C(x_j)$ simultaneously against the stack state before the instruction begins. All inputs see the same pre-instruction snapshot. Repeated inputs are charged per occurrence at the same depth (e.g., a + a charges ⌈√d⌉ twice where d is a's pre-instruction depth).
  2. Update LRU: Batch-move unique inputs to the top of the stack in read order. b + c and c + b yield the same cost (commutativity) but may differ in final stack order.
  3. Push outputs: Allocate new output blocks and push them to the top at zero cost.

Example Walkthrough

Consider the following function with three scalar arguments:

def my_add(a, b, c):
    return (a + b) + c

1. Initial Stack (left = top, right = bottom) Arguments are loaded left to right — first argument at the top:

[a, b, c]    ← a at distance 1, b at distance 2, c at distance 3

2. First operation: a + b
Both operands are priced simultaneously against the initial stack:

$$C(a) + C(b) = \lceil\sqrt{1}\rceil + \lceil\sqrt{2}\rceil = 1 + 2 = 3$$

After LRU bumping and pushing the result t = a + b:

[t, b, a, c]    ← t at distance 1, b at distance 2, a at distance 3, c at distance 4

Liveness analysis evicts a and b (their last use just happened):

[t, c]    ← t at distance 1, c at distance 2

3. Second operation: t + c
$$C(t) + C(c) = \lceil\sqrt{1}\rceil + \lceil\sqrt{2}\rceil = 1 + 2 = 3$$

Total cost: $3 + 3 = 6$. Trace: [1, 2, 1, 2].

Inspecting the IR

The tracer also emits a small intermediate representation that makes the LRU stack lifecycle explicit. Three event types: STORE k (allocate vk on top), READ k@d (read vk at depth d and LRU-bump), OP name(vk@d, …) (summary of the preceding reads — this is what incurs cost). Op results are materialized by the STORE that immediately follows the OP.

from bytedmd import inspect_ir, format_ir, bytedmd

def matvec2(A, x):
    y0 = A[0][0]*x[0] + A[0][1]*x[1]
    y1 = A[1][0]*x[0] + A[1][1]*x[1]
    return [y0, y1]

print(format_ir(inspect_ir(matvec2, ([[1,2],[3,4]], [5,6]))))
STORE v1                                # x[0] loaded first (deepest)
STORE v2                                # x[1]
STORE v3                                # A[0][0]
STORE v4                                # A[0][1]
STORE v5                                # A[1][0]
STORE v6                                # A[1][1]
  READ v3@4  cost=2                     # A[0][0] (left-to-right: A at top)
  READ v1@6  cost=3                     # x[0] at bottom
OP    mul(v3@4, v1@6)  cost=5           # A[0][0]*x[0]
STORE v7
  READ v4@5  cost=3                     # A[0][1] (v3 evicted after last use)
  READ v2@6  cost=3                     # x[1]
OP    mul(v4@5, v2@6)  cost=6           # A[0][1]*x[1]
STORE v8
  READ v7@3  cost=2                     # hot hit: v7 sank as v4, v2 entered
  READ v8@1  cost=1                     # hot hit: v8 still at top
OP    add(v7@3, v8@1)  cost=3           # y0
STORE v9
  READ v5@5  cost=3                     # A[1][0] (dead temps evicted)
  READ v1@3  cost=2                     # hot hit: x[0] still on stack
OP    mul(v5@5, v1@3)  cost=5
STORE v10
  READ v6@4  cost=2                     # A[1][1]
  READ v2@3  cost=2                     # hot hit: x[1]
OP    mul(v6@4, v2@3)  cost=4
STORE v11
  READ v10@2  cost=2
  READ v11@1  cost=1
OP    add(v10@2, v11@1)  cost=3         # y1
STORE v12
# total cost = 26

Note the left-to-right initialization: A elements (the first argument) sit at the top of the stack, while x elements (the second argument) are deeper. Liveness analysis aggressively evicts dead variables: after v3's single read, it is removed and remaining elements slide up. This keeps the stack clamped to the active working set.

ByteDMD benchmarks

See "benchmarks/" folder

Linear algebra (benchmark_linalg.py)

ByteDMD cost by method and matrix size N:

|    |  matvec |   vecmat | naive matmul |
|  N | (y=A@x) | (y=xᵀ@A) |      (i-j-k) |
|----|---------|----------|--------------|
|  2 |      26 |       25 |           57 |
|  4 |     157 |      150 |          720 |
|  8 |     896 |      832 |        8,867 |
| 16 |   5,354 |    4,688 |      109,783 |

microGPT single-token forward pass

Architecture: vocab=4, embd=4, heads=2, head_dim=2, 1 layer, block_size=4. Based on Karpathy's microGPT.

Algorithm Operation ByteDMD Cost
microGPT (1 layer, embd=4) single token forward 3214

Reports

In-depth reports applying ByteDMD to specific algorithms and design questions:

Python Gotcha's

The tracer implements ByteDMD by wrapping Python objects. This means that the "Instruction Set" of this metric corresponds to Python built-ins, documented under docs/instruction_set.md.

Python behavior means this implementation occasionally doesn't match README semantics and it is possible to escape the wrapping mechanism (local arrays, exception side-channels, identity ops, type introspection, f-strings, math.trunc/ceil/floor on tracked values, etc.). Known failure cases are documented in test_gotchas.py — avoid those patterns when writing code you want measured.

Original Google Doc

About

Examples and checks for the ByteDMD metric of complexity

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages