Doppler · Doe · Cerebras
Doppler runs the model locally in WebGPU. Doe turns the same model program into kernel code for Cerebras hardware and a run script. The proof file records the reference output, the generated code, and the numeric comparison.
The practical value: less per-model Cerebras hand-authoring for models that already have clean WebGPU implementations. Two model paths have proof files today: Gemma 4 31B and Qwen 3.6 27B.
01 · EVIDENCE
Before a WSE run, we can already check real model state from WebGPU against generated code for Cerebras hardware, compile both model paths, and name the local simulator boundary.
WebGPU runs the prompt "<bos>The color of the sky is" through Gemma 4 31B and captures the real hidden state. Generated CSL consumes that hidden state and computes the output-head score for token 3730 (" blue"). Difference from WebGPU: 0.0087417.
The same check now exists for Qwen 3.6 27B. WebGPU supplies the real final-prompt state; generated CSL runs final norm plus the selected output-head row for token 760. Difference from WebGPU: 0.0133286.
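The selected-token check reduces to a final norm followed by one dot product against a single output-head row. The following is a minimal sketch of that arithmetic, not the generated CSL: it assumes a plain RMSNorm-style final norm, and the function names, epsilon, and toy values are hypothetical.

```python
import math

def rms_norm(hidden, gain, eps=1e-6):
    """Scale `hidden` to unit root-mean-square, then apply a per-channel gain.
    Assumption: plain RMSNorm; the real models may use a variant."""
    mean_sq = sum(x * x for x in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * g for x, g in zip(hidden, gain)]

def selected_token_score(hidden, gain, head_row):
    """Score for one token: normed state dotted with that token's output-head row."""
    normed = rms_norm(hidden, gain)
    return sum(n * w for n, w in zip(normed, head_row))

def within_tolerance(reference, candidate, tolerance):
    """True when the candidate score is close enough to the reference score."""
    return abs(reference - candidate) <= tolerance

# Toy data only; real runs use the model's hidden size and trained weights.
hidden = [3.0, 4.0]
gain = [1.0, 1.0]
head_row = [1.0, 0.0]
score = selected_token_score(hidden, gain, head_row)
```

Only one row of the output head is touched, so the check stays small even when the full lm_head is large.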
The full generated target set for Gemma 4 31B compiles cleanly. The parallel target set for Qwen 3.6 27B is now compiled and packaged under the same Q4K-to-f16 model-weight path.
The repo now carries one-command hardware paths for both models. Each command fetches hosted model weights, verifies the evidence archive, compiles generated CSL, and runs against a provided endpoint.
For Gemma, a real full-size lm_head width tile transfers activation data, transfers trained weights, and reaches compute completion. The local CPU simulator then stops at device-to-host copyback. The full transcript needs a hardware run.
Also in the bundle: frozen browser reference outputs, per-kernel checks, width-tiled lm_head CSL, prep script, archive verifier, and publication boundaries.
02 · CHECK LOOP
The proof file records what WebGPU produced, what code was generated for Cerebras hardware, what ran, and how close the numbers were.
Walk this loop with real bytes from the Gemma 4 31B splice receipt at /c/forge: the recorded Doppler reference, a live in-browser CSL tail, and the hash-bound receipt, all in one tab.
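One way to picture a hash-bound record is a small structure that stores digests of the reference output and the generated code next to the numeric comparison, then hashes itself. This is a sketch under stated assumptions: the field names and layout are hypothetical and do not describe the actual proof-file format.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

def build_proof_record(reference_output, generated_code,
                       reference_score, generated_score):
    """Bind reference bytes, generated code, and the score comparison together.
    Hypothetical layout for illustration only."""
    record = {
        "reference_sha256": sha256_hex(reference_output),
        "generated_code_sha256": sha256_hex(generated_code),
        "reference_score": reference_score,
        "generated_score": generated_score,
        "difference": abs(reference_score - generated_score),
    }
    # Hash the canonical JSON form so any later edit to the record is detectable.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_sha256"] = sha256_hex(canonical)
    return record

def verify_proof_record(record, reference_output, generated_code):
    """Recompute every digest and compare against what the record claims."""
    body = {k: v for k, v in record.items() if k != "record_sha256"}
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return (
        record["record_sha256"] == sha256_hex(canonical)
        and body["reference_sha256"] == sha256_hex(reference_output)
        and body["generated_code_sha256"] == sha256_hex(generated_code)
    )
```

Verification needs only the record and the two byte strings, so it can run before anything is compiled or launched.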
03 · TECHNICAL MAP
Both runtimes separate the math from the machine layout. Doe preserves that split while emitting CSL targeting Cerebras hardware.
| WGSL and JS | Generated CSL and run script | Why it lowers |
|---|---|---|
| `@workgroup_size` and workgroup tiles | PE grid and tile dimensions | Parallel topology is explicit instead of inferred. |
| Storage buffers and typed array views | Named CSL symbols plus scripted input/output movement | Dtype, shape, and visibility stay auditable. |
| JavaScript dispatch call | Scripted launch on Cerebras hardware | The run boundary is recorded in the proof file. |
| `workgroup_id` and `local_invocation_id` | PE coordinates and tile indices | Indexing intent survives lowering. |
| Barrier or subgroup collective | Fabric collective or explicit host step | Synchronization becomes a declared target decision. |
| Kernel source + JS wrapper | Generated CSL plus proof file | The proof file binds source identity, generated code, and output comparison. |
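The indexing row of the table can be illustrated with a small mapping. In WGSL, a global invocation index is the workgroup index times the workgroup size plus the local index. The second step, splitting that index into a PE coordinate and an offset inside the PE's tile, is an assumed layout for illustration; the function names are hypothetical and the real Doe mapping may differ.

```python
def global_invocation(workgroup_id, local_id, workgroup_size):
    """WGSL rule: global id = workgroup_id * workgroup_size + local_invocation_id."""
    return tuple(w * s + l for w, l, s in zip(workgroup_id, local_id, workgroup_size))

def to_pe_coordinates(global_xy, tile_shape):
    """Assumed layout: each PE owns one contiguous tile of the index space.
    Returns (PE coordinate, offset inside that PE's tile)."""
    gx, gy = global_xy
    tile_w, tile_h = tile_shape
    pe = (gx // tile_w, gy // tile_h)
    within = (gx % tile_w, gy % tile_h)
    return pe, within

# A thread at local (3, 5) in workgroup (2, 1) of a 16x16 dispatch.
gid = global_invocation((2, 1), (3, 5), (16, 16))
pe, within = to_pe_coordinates(gid, (8, 8))
```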
04 · SIDE BY SIDE
GEMV/matmul is the main bridge for Q4K-to-f16 prefill and lm_head. RoPE shows transformer-specific position math and layout-sensitive indexing. Both are visible source-to-target examples, not opaque compiler output.
// WGSL (Doppler side): rotate one (x0, x1) pair in place.
let cos_val = freqs_cos[freq_idx];
let sin_val = freqs_sin[freq_idx];
let x0 = input[base_idx + first_idx];
let x1 = input[base_idx + second_idx];
let y0 = x0 * cos_val - x1 * sin_val;
let y1 = x0 * sin_val + x1 * cos_val;
input[base_idx + first_idx] = y0;
input[base_idx + second_idx] = y1;

// Generated CSL (Doe side): the same rotation applied to the pair (q0, q1).
const q0: f32 = buf_out[q_base + a + 0];
const q1: f32 = buf_out[q_base + a + 1];
const qc: f32 = rope_cos_at(kv_len_per_head, d);
const qs_: f32 = rope_sin_at(kv_len_per_head, d);
q_rot[a + 0] = (qc * q0) - (qs_ * q1);
q_rot[a + 1] = (qs_ * q0) + (qc * q1);
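Both excerpts apply the same 2D rotation to a pair of values. A host-side reference for that rotation might look like the sketch below. The angle formula and the base of 10000 are the commonly used RoPE defaults and are assumptions here, not values taken from either model's config; the function names are hypothetical.

```python
import math

def rotate_pair(x0, x1, cos_val, sin_val):
    """Rotate (x0, x1) by the angle whose cosine and sine are given."""
    y0 = x0 * cos_val - x1 * sin_val
    y1 = x0 * sin_val + x1 * cos_val
    return y0, y1

def rope_angle(position, pair_index, head_dim, theta=10000.0):
    """Assumed angle schedule: position * theta^(-2i / head_dim)."""
    return position * theta ** (-2.0 * pair_index / head_dim)

# Rotating by a quarter turn maps (1, 0) onto (0, 1).
quarter = math.pi / 2
y0, y1 = rotate_pair(1.0, 0.0, math.cos(quarter), math.sin(quarter))
```

Because the operation is a pure rotation, the length of each pair is unchanged, which gives a cheap sanity check on any port.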
Where authoring shapes split: WGSL stays in a 16×16 workgroup with shared-memory tiles, while CSL declares a P×P PE grid with explicit row/column broadcasts. Variant selection on the Doppler side lives in a JSON rule map; on the Doe side it lives in which emitter runs.
// WGSL (Doppler side): 16×16 workgroup with a shared-memory tile.
const TILE_M: u32 = 64u;
const TILE_K: u32 = 16u;
var<workgroup> tileA: array<f16, 1024>;
@compute @workgroup_size(16, 16, 1)
fn main(...) {
    for (var t: u32 = 0u; t < num_tiles; t = t + 1u) {
        // load tile, barrier, accumulate 4x4 register tile
    }
}

// Generated CSL (Doe side): per-PE tiles with an explicit broadcast along x.
param Mt: i16; param Kt: i16; param Nt: i16; param P: u16;
const mpi_x = @import_module("<collectives_2d/pe>",
    .{ .dim_params = c2d_params.x });
var A_tile = @zeros([Mt * Kt]f32);
var C_tile = @zeros([Mt * Nt]f32);
mpi_x.broadcast(step, &A_tile, Mt * Kt, x_done_id);
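The grid-with-broadcasts shape can be simulated on the host. The sketch below assumes a SUMMA-style schedule, where at each step one column of PEs broadcasts its A tile along the rows and one row of PEs broadcasts its B tile along the columns. Whether the generated CSL follows exactly this schedule is an assumption, and all names are hypothetical.

```python
def matmul(a, b):
    """Naive reference matmul on lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def block(matrix, row_block, col_block, block_rows, block_cols):
    """Extract one tile of `matrix`."""
    rows = matrix[row_block * block_rows:(row_block + 1) * block_rows]
    return [row[col_block * block_cols:(col_block + 1) * block_cols] for row in rows]

def summa(a, b, p):
    """Simulate a p x p PE grid; PE (r, c) owns one tile each of A, B, and C."""
    m, k, n = len(a), len(b), len(b[0])
    if m % p or k % p or n % p:
        raise ValueError("matrix dimensions must be divisible by p")
    mt, kt, nt = m // p, k // p, n // p
    a_tiles = [[block(a, r, c, mt, kt) for c in range(p)] for r in range(p)]
    b_tiles = [[block(b, r, c, kt, nt) for c in range(p)] for r in range(p)]
    c_tiles = [[[[0.0] * nt for _ in range(mt)] for _ in range(p)] for _ in range(p)]
    for step in range(p):
        for r in range(p):
            for c in range(p):
                a_recv = a_tiles[r][step]  # row broadcast from PE (r, step)
                b_recv = b_tiles[step][c]  # column broadcast from PE (step, c)
                partial = matmul(a_recv, b_recv)
                for i in range(mt):
                    for j in range(nt):
                        c_tiles[r][c][i][j] += partial[i][j]
    # Stitch the per-PE C tiles back into one matrix.
    return [[c_tiles[i // mt][j // nt][i % mt][j % nt] for j in range(n)]
            for i in range(m)]
```

Comparing `summa(a, b, p)` against `matmul(a, b)` on small integer-valued inputs is a quick check that the tiling and broadcast order are right.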
After CSL emission, a rewrite pass packs the f16 output, two values per 32-bit word, into 128-word chunks for per-PE export. This keeps per-kernel byte identity stable across single-layer and full-shape generated run plans.
for (@range(i16, 128)) |word| {
    const base = (@as(u32, word) + chunk_offset) * 2;
    const lo: u32 = @as(u32, @bitcast(u16, output[base]));
    const hi: u32 = @as(u32, @bitcast(u16, output[base + 1]));
    output_chunk_0000[@as(u32, word)] = lo | (hi << 16);
}
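The packing step in the loop above puts the even-indexed f16 value in the low 16 bits of each word and the odd-indexed value in the high 16 bits. A host-side sketch of the same bit layout, useful for checking exported chunks, could look like this. The function names are hypothetical, and chunking into 128-word groups is left out.

```python
import struct

def f16_bits(value):
    """IEEE 754 half-precision bit pattern of `value` as an unsigned 16-bit int."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def pack_f16_pairs(values):
    """Pack consecutive f16 values into 32-bit words: even index low, odd index high."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of f16 values")
    words = []
    for base in range(0, len(values), 2):
        lo = f16_bits(values[base])
        hi = f16_bits(values[base + 1])
        words.append(lo | (hi << 16))
    return words

def unpack_f16_pairs(words):
    """Inverse of pack_f16_pairs: split each word back into two f16 values."""
    values = []
    for word in words:
        for bits in (word & 0xFFFF, word >> 16):
            values.append(struct.unpack("<e", struct.pack("<H", bits))[0])
    return values
```

For example, 1.0 and 2.0 have the half-precision bit patterns 0x3C00 and 0x4000, so the pair packs to the single word 0x40003C00.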
05 · HARDWARE CHECK
The next missing evidence is a returned hardware trace from CS-3/WSE. The repo now has direct run commands for both Gemma 4 31B and Qwen 3.6 27B.
Prompt, model identity, WebGPU output, and tolerance stay pinned as the comparison source.
The Doe toolchain emits CSL targeting Cerebras hardware, with shape, memory movement, routing, and code identity recorded in the bundle.
The archive can be verified before it is run. The latest clean bundle is pinned to Doe commit 37358a057 and contains both Gemma and Qwen run commands.
A hardware run returns the trace and progress files. Those files tell us whether the model reached token output, matched the browser reference, or failed at a named hardware/runtime boundary.
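The three outcomes named above can be expressed as a small classifier over a returned trace. This is a sketch only: the trace is modeled as a dictionary with hypothetical keys (`failed_boundary`, `output_token`), which does not describe the real trace or progress file format.

```python
def classify_run(trace, reference_token):
    """Classify a run as a named failure, a reference match, or a mismatch.
    `trace` is a hypothetical dict, not the real trace format."""
    boundary = trace.get("failed_boundary")
    if boundary:
        return f"failed at {boundary}"
    token = trace.get("output_token")
    if token is None:
        return "no token output"
    return "matched reference" if token == reference_token else "token mismatch"

# The local simulator result described earlier would fall in the first branch.
simulator_result = classify_run({"failed_boundary": "device-to-host copyback"}, 3730)
```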
WebGPU to CSL for Cerebras hardware, with hash-checked proof files. Inference and validation only; not a training engine. · github.com/clocksmith/doppler · github.com/doe-gpu/doe · x@d4da.com