Doppler · Doe · Cerebras

WebGPU models lowered to Cerebras hardware, checked against browser output

Doppler runs the model locally in WebGPU. Doe turns the same model program into kernel code for Cerebras hardware and a run script. The proof file records the reference output, the generated code, and the numeric comparison.

The practical value: less hand-authored Cerebras code per model, for models that already have clean WebGPU implementations. Two model paths have proof files today: Gemma 4 31B and Qwen 3.6 27B.

01 · EVIDENCE

What already works locally

Before a WSE run, we can already check real model state from WebGPU against generated code for Cerebras hardware, compile both model paths, and name the local simulator boundary.

MATCHED

Gemma selected token

WebGPU runs the prompt "<bos>The color of the sky is" through real Gemma 4 31B state. Generated CSL consumes the resulting hidden state and computes the output-head score for token 3730 (" blue"). Difference from WebGPU: 0.0087417.

MATCHED

Qwen selected token

The same check now exists for Qwen 3.6 27B. WebGPU supplies the real final-prompt state; generated CSL runs final norm plus the selected output-head row for token 760. Difference from WebGPU: 0.0133286.
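The selected-token check in both cards comes down to a small computation: normalize the final hidden state, take a dot product with one output-head row, and measure the distance to the browser's score. A minimal sketch follows, assuming an RMSNorm-style final norm. The function names, toy values, and tolerance are illustrative and not taken from the Doppler or Doe sources.

```python
import math

def rms_norm(hidden, weight, eps=1e-6):
    """RMSNorm: scale by 1/sqrt(mean(x^2) + eps), then by a per-channel weight."""
    mean_sq = sum(x * x for x in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(hidden, weight)]

def selected_token_score(hidden, norm_weight, head_row):
    """Score for one token: dot(final_norm(hidden), lm_head[token])."""
    normed = rms_norm(hidden, norm_weight)
    return sum(a * b for a, b in zip(normed, head_row))

# Toy data only; a real run uses the model's hidden width and trained weights.
hidden = [0.5, -1.0, 2.0, 0.25]
norm_weight = [1.0, 1.0, 1.0, 1.0]
head_row = [0.1, 0.2, -0.3, 0.4]

score = selected_token_score(hidden, norm_weight, head_row)
reference_score = score + 0.005      # stand-in for the WebGPU reference value
tolerance = 0.05                     # assumed tolerance, not the recorded one
diff = abs(score - reference_score)
passed = diff <= tolerance
```

Only one row of the output head is needed, which is what keeps this check cheap compared with a full-vocabulary logits pass.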

COMPILES

Both model paths compile

Gemma 4 31B has its full generated target set compiled cleanly. Qwen 3.6 27B now has its parallel target set compiled and packaged under the same Q4K-to-f16 model-weight path.

PACKAGED

Gemma and Qwen run commands

The repo now carries one-command hardware paths for both models. Each command fetches hosted model weights, verifies the evidence archive, compiles generated CSL, and runs against a provided endpoint.

CEILING

Local simulator boundary

For Gemma, a real full-size lm_head width tile transfers activation data, transfers trained weights, and reaches compute completion. The local CPU simulator then stops at device-to-host copyback. The full transcript needs a hardware run.

Also in the bundle: frozen browser reference outputs, per-kernel checks, width-tiled lm_head CSL, prep script, archive verifier, and publication boundaries.

02 · CHECK LOOP

Browser output to hardware output

The proof file records what WebGPU produced, what code was generated for Cerebras hardware, what ran, and how close the numbers were.

REFERENCE

Browser model

JavaScript and WGSL model code that runs locally.

DOCS Program Bundle

REFERENCE

Doppler bundle

Portable graph, kernels, shapes, parameters, and expected output.

HOSTED Doppler demo

Doe TSIR

Tiled Spatial IR keeps shape and operation intent explicit.

SOURCE TSIR tree

TARGET

Generated CSL

PE kernels, routing, memory movement, and run order emitted by Doe.

DOCS CSL architecture

CHECK

Diff against browser

Hardware or simulator output compared to the browser reference with a recorded tolerance.

SCHEMA Proof file
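As a rough illustration of the check stage, the sketch below compares a candidate output with a reference under a tolerance and binds the result to a hash of the generated code. The record layout and field names are hypothetical and are not the actual proof-file schema.

```python
import hashlib
import json

def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)

def build_check_record(reference, candidate, tolerance, generated_source):
    """Build a small record tying the comparison to the generated code's hash."""
    diff = max_abs_diff(reference, candidate)
    source_hash = hashlib.sha256(generated_source.encode("utf-8")).hexdigest()
    return {
        "generated_sha256": source_hash,   # hypothetical field name
        "tolerance": tolerance,
        "max_abs_diff": diff,
        "status": "matched" if diff <= tolerance else "mismatch",
    }

record = build_check_record(
    reference=[1.0, 2.0, 3.0],
    candidate=[1.0, 2.01, 2.995],
    tolerance=0.05,
    generated_source="// placeholder for generated CSL text",
)
print(json.dumps(record, indent=2))
```

Recording the tolerance next to the measured difference lets a reader judge the result without rerunning anything.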

Walk this loop with real bytes from the Gemma 4 31B splice receipt at /c/forge: recorded Doppler reference, live in-browser CSL tail, and hash-bound receipt, all in one tab.

03 · TECHNICAL MAP

WGSL concepts mapped to CSL

Both runtimes separate the math from the machine layout. Doe preserves that split while emitting CSL targeting Cerebras hardware.

WGSL and JS	Generated CSL and run script	Why it lowers
`@workgroup_size` and workgroup tiles	PE grid and tile dimensions	Parallel topology is explicit instead of inferred.
Storage buffers and typed array views	Named CSL symbols plus scripted input/output movement	Dtype, shape, and visibility stay auditable.
JavaScript dispatch call	Scripted launch on Cerebras hardware	The run boundary is recorded in the proof file.
`workgroup_id` and `local_invocation_id`	PE coordinates and tile indices	Indexing intent survives lowering.
Barrier or subgroup collective	Fabric collective or explicit host step	Synchronization becomes a declared target decision.
Kernel source + JS wrapper	Generated CSL plus proof file	The proof file binds source identity, generated code, and output comparison.
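To make the indexing row concrete, here is a small sketch of how a WGSL invocation index could be re-expressed as a PE coordinate plus an offset inside that PE's tile. It assumes one rectangular tile per PE stored row-major. That layout and the helper names are assumptions for illustration and may differ from what Doe emits.

```python
def wgsl_global_index(workgroup_id, local_id, workgroup_size):
    """WGSL rule: global_invocation_id = workgroup_id * workgroup_size + local_invocation_id."""
    return workgroup_id * workgroup_size + local_id

def pe_location(global_row, global_col, tile_rows, tile_cols):
    """Map a global element index to (pe_y, pe_x) and a row-major offset in that PE's tile."""
    pe_y, row_in_tile = divmod(global_row, tile_rows)
    pe_x, col_in_tile = divmod(global_col, tile_cols)
    return (pe_y, pe_x), row_in_tile * tile_cols + col_in_tile

row = wgsl_global_index(workgroup_id=2, local_id=5, workgroup_size=16)
col = wgsl_global_index(workgroup_id=1, local_id=3, workgroup_size=16)
pe, offset = pe_location(row, col, tile_rows=16, tile_cols=16)
```

With 16×16 tiles the PE coordinate coincides with the workgroup id; choosing a different tile size changes the split without changing which element is addressed.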

DOCS Doppler bundle SOURCE Doe TSIR DOCS CSL architecture SOURCE emit_csl_matmul.zig

04 · SIDE BY SIDE

Visible kernel lowering

GEMV/matmul is the main bridge for Q4K-to-f16 prefill and lm_head. RoPE shows transformer-specific position math and layout-sensitive indexing. Both are visible source-to-target examples, not opaque compiler output.

RoPE SOURCE rope.wgsl

let cos_val = freqs_cos[freq_idx];
let sin_val = freqs_sin[freq_idx];
let x0 = input[base_idx + first_idx];
let x1 = input[base_idx + second_idx];

let y0 = x0 * cos_val - x1 * sin_val;
let y1 = x0 * sin_val + x1 * cos_val;

input[base_idx + first_idx] = y0;
input[base_idx + second_idx] = y1;

RoPE SOURCE rope.csl

const q0: f32 = buf_out[q_base + a + 0];
const q1: f32 = buf_out[q_base + a + 1];
const qc: f32 = rope_cos_at(kv_len_per_head, d);
const qs_: f32 = rope_sin_at(kv_len_per_head, d);

q_rot[a + 0] = (qc * q0) - (qs_ * q1);
q_rot[a + 1] = (qs_ * q0) + (qc * q1);
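Both excerpts apply the same 2D rotation to a pair of values. They differ in how the pair is chosen and where the cosine and sine come from. The host-side sketch below uses the adjacent-pair layout seen in the CSL excerpt and the standard RoPE frequency formula. The base `theta` of 10000 and all function names are assumptions, not values read from either source file.

```python
import math

def rope_rotate_pair(x0, x1, angle):
    """Rotate (x0, x1) by angle: the y0/y1 and q_rot expressions in the excerpts."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c

def rope_angle(position, pair_index, head_dim, theta=10000.0):
    """Standard RoPE angle: position * theta^(-2i/d). theta is an assumed default."""
    return position * theta ** (-2.0 * pair_index / head_dim)

def apply_rope_adjacent(vec, position, theta=10000.0):
    """Rotate adjacent pairs (0,1), (2,3), ... of one head vector."""
    out = list(vec)
    for i in range(len(vec) // 2):
        angle = rope_angle(position, i, len(vec), theta)
        out[2 * i], out[2 * i + 1] = rope_rotate_pair(vec[2 * i], vec[2 * i + 1], angle)
    return out

q = [1.0, 0.0, 0.5, -0.5]
q_rot = apply_rope_adjacent(q, position=3)
```

Because each pair is only rotated, the vector norm is unchanged, which gives a cheap sanity check that is independent of the pairing layout.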

Where the authoring shapes split: WGSL stays in a 16×16 workgroup with shared-memory tiles, while CSL declares a P×P PE grid with explicit row/column broadcasts. Variant selection on the Doppler side lives in a JSON rule map; on the Doe side it is determined by which emitter runs.

Matmul (tiled f16) SOURCE matmul_f16_tiled.wgsl

const TILE_M: u32 = 64u;
const TILE_K: u32 = 16u;
var<workgroup> tileA: array<f16, 1024>;

@compute @workgroup_size(16, 16, 1)
fn main(...) {
    for (var t: u32 = 0u; t < num_tiles; t = t + 1u) {
        // load tile, barrier, accumulate 4x4 register tile
    }
}

Matmul (SUMMA on P×P PE grid) SOURCE emit_csl_matmul.zig

param Mt: i16; param Kt: i16; param Nt: i16; param P: u16;

const mpi_x = @import_module("<collectives_2d/pe>",
                              .{ .dim_params = c2d_params.x });

var A_tile = @zeros([Mt * Kt]f32);
var C_tile = @zeros([Mt * Nt]f32);

mpi_x.broadcast(step, &A_tile, Mt * Kt, x_done_id);
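For readers unfamiliar with SUMMA, the following host-side simulation shows the schedule that the row broadcast above takes part in. At each step, one column of PEs broadcasts its A tiles along their rows, one row of PEs broadcasts its B tiles along their columns, and every PE accumulates a local product. The matching B broadcast is the textbook algorithm and is assumed here, since the excerpt only shows the A side. The names `mt`, `kt`, `nt`, and `p` mirror the `Mt`, `Kt`, `Nt`, and `P` params; everything else is illustrative.

```python
def matmul(a, b):
    """Plain matmul on nested lists, used per tile and as the reference."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def block(m, bi, bj, br, bc):
    """Extract block (bi, bj) of size br x bc from matrix m."""
    return [row[bj * bc:(bj + 1) * bc] for row in m[bi * br:(bi + 1) * br]]

def summa(a, b, p):
    """Simulate SUMMA on a p x p grid; PE (i, j) accumulates one C tile."""
    m, k, n = len(a), len(b), len(b[0])
    mt, kt, nt = m // p, k // p, n // p          # per-PE tile sizes
    c_tiles = [[[[0.0] * nt for _ in range(mt)] for _ in range(p)] for _ in range(p)]
    for step in range(p):
        for i in range(p):
            a_tile = block(a, i, step, mt, kt)       # broadcast along PE row i
            for j in range(p):
                b_tile = block(b, step, j, kt, nt)   # broadcast along PE column j
                partial = matmul(a_tile, b_tile)
                for r in range(mt):
                    for c in range(nt):
                        c_tiles[i][j][r][c] += partial[r][c]
    # Stitch per-PE tiles back into one matrix for comparison.
    return [[c_tiles[r // mt][c // nt][r % mt][c % nt] for c in range(n)]
            for r in range(m)]

a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
b = [[float((i + j) % 3) for j in range(4)] for i in range(4)]
result = summa(a, b, p=2)
```

The sketch requires every dimension to be divisible by `p`; padding or uneven tiles would need extra handling.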

After CSL emission, a rewrite pass packs the f16 output into 128-word chunks for per-PE export. This keeps per-kernel byte identity stable across single-layer and full-shape generated run plans.

RMSNorm post-emission pack SOURCE emit_csl_rmsnorm_pack.zig

for (@range(i16, 128)) |word| {
    const base = (@as(u32, word) + chunk_offset) * 2;
    const lo: u32 = @as(u32, @bitcast(u16, output[base]));
    const hi: u32 = @as(u32, @bitcast(u16, output[base + 1]));
    output_chunk_0000[@as(u32, word)] = lo | (hi << 16);
}
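The pack loop places two 16-bit values in each 32-bit word, with the even-indexed element in the low half. The same layout can be reproduced on the host to check exported chunks. This is a sketch using Python's `struct` half-precision format; the helper names are hypothetical, and the chunk offset and 128-word chunking are left out for brevity.

```python
import struct

def f16_bits(value):
    """IEEE 754 binary16 bit pattern of a float, as an int in [0, 65535]."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def pack_f16_pairs(values):
    """Pack consecutive f16 values two per 32-bit word: lo | (hi << 16)."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of f16 values")
    words = []
    for base in range(0, len(values), 2):
        lo = f16_bits(values[base])
        hi = f16_bits(values[base + 1])
        words.append(lo | (hi << 16))
    return words

def unpack_f16_pairs(words):
    """Inverse of pack_f16_pairs: low half first, then high half."""
    out = []
    for word in words:
        for bits in (word & 0xFFFF, word >> 16):
            out.append(struct.unpack("<e", struct.pack("<H", bits))[0])
    return out

words = pack_f16_pairs([1.0, -2.0, 0.5, 0.0])
restored = unpack_f16_pairs(words)
```

Values that are exactly representable in f16 round-trip unchanged; other inputs are rounded to f16 when packed.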

05 · HARDWARE CHECK

Next hardware result

The next missing evidence is a returned hardware trace from CS-3/WSE. The repo now has direct run commands for both Gemma 4 31B and Qwen 3.6 27B.

REFERENCE

Browser reference

Prompt, model identity, WebGPU output, and tolerance stay pinned as the comparison source.

TARGET

Generated CSL

The Doe toolchain emits CSL targeting Cerebras hardware, with shape, memory movement, routing, and code identity recorded in the bundle.

VERIFY

Evidence bundle

The archive can be verified before it is run. The latest clean bundle is pinned to Doe commit 37358a057 and contains both Gemma and Qwen run commands.
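Verifying an archive before running it usually means recomputing file hashes and comparing them with a pinned manifest. The sketch below shows that pattern with SHA-256. The manifest shape (relative path to hex digest) and the function names are assumptions and do not describe the repo's actual verifier.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_bundle(root, manifest):
    """Return a list of problems; an empty list means every listed file matched."""
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        target = Path(root) / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_file(target) != expected:
            problems.append(f"hash mismatch: {rel_path}")
    return problems

# Self-contained demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    kernel = Path(tmp) / "kernel.csl"
    kernel.write_bytes(b"// generated code placeholder\n")
    manifest = {"kernel.csl": sha256_file(kernel)}
    clean = verify_bundle(tmp, manifest)
    kernel.write_bytes(b"// tampered\n")
    tampered = verify_bundle(tmp, manifest)
```

Returning a list of problems instead of raising on the first failure lets a single run report every missing or altered file.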

RETURN

Returned trace

A hardware run returns the trace and progress files. Those files tell us whether the model reached token output, matched the browser reference, or failed at a named hardware/runtime boundary.

ARCHIVE Verified bundle DOCS Hardware runbook DOCS Bundle README DOCS Archive pointer SOURCE Gemma run command SOURCE Qwen run command SOURCE Verifier

WebGPU to CSL for Cerebras hardware, with hash-checked proof files. Inference and validation only; not a training engine. · github.com/clocksmith/doppler · github.com/doe-gpu/doe · x@d4da.com