Under the Hood: Engineering Commonware Fuzzing

Often, fuzzing is a collection of ad hoc harnesses: someone writes a target, runs it, files the crashes, and moves on. Instead, we treat fuzzing as infrastructure. Fuzz targets, along with their generators, seed inputs, regressions, review rules, and platform execution, are engineered into the test system, so fuzzing produces durable coverage rather than one-off crashes.

Commonware is an open set of Rust-based blockchain primitives for building specialized blockchains or onchain applications with high throughput and low latency. It provides low-level cryptographic, storage, runtime, networking, and consensus implementations where vulnerabilities may lead to denial-of-service, state divergence, double spending, or improper authentication, depending on the target system's threat and asset models.

Given the role of these primitives in adversarial systems, fuzz tests are first-class tests in Commonware, treated as a “source of truth” similar to unit and integration tests. All fuzz tests that live in the main branch are reviewed by Commonware engineers. Experimental and research fuzzers live in separate branches and move into the main branch only after rigorous code review and hours of execution on the fuzzing platform.

Our fuzzing work on Commonware is organized into two lines: production and research engineering. The production line runs continuously and is intended to find and reproduce bugs in the monorepo codebase and to keep them from regressing. It has already found more than 60 bugs across multiple primitives and now serves as a regression harness for those cases. The research engineering line explores methods that are not yet mature enough for continuous deployment, but may improve how we test complex distributed protocols.

Runtime and Infrastructure

The core fuzzing engine is libFuzzer. Most Commonware fuzz targets use structured input generation through Rust’s arbitrary crate. Rather than treating fuzzer input only as an untyped byte stream, targets decode it into bounded operation plans, protocol messages, storage actions, schedules, or configuration choices. This lets the fuzzer spend more time exploring meaningful, valid, and near-valid states, while still preserving mutation-driven discovery.
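As a minimal illustration of what decoding into a bounded operation plan means, the sketch below hand-rolls the decoding step using only the standard library. In real targets this job is delegated to the arbitrary crate; the Op enum, MAX_OPS, and decode_plan are hypothetical names chosen for this example and are not taken from the Commonware codebase.

```rust
/// Hypothetical storage operations a fuzz target might replay.
#[derive(Debug, PartialEq)]
enum Op {
    Put { key: u8, value: u8 },
    Delete { key: u8 },
    Sync,
}

/// Upper bound on plan length, so one input cannot make a run arbitrarily slow.
const MAX_OPS: usize = 32;

/// Decode raw fuzzer bytes into a bounded list of operations.
/// Every byte string decodes to some valid plan, so no mutation is wasted
/// on inputs that are rejected before reaching the system under test.
fn decode_plan(data: &[u8]) -> Vec<Op> {
    let mut ops = Vec::new();
    let mut bytes = data.iter().copied();
    while ops.len() < MAX_OPS {
        // The first byte of each record selects the operation kind.
        let Some(tag) = bytes.next() else { break };
        let op = match tag % 3 {
            0 => {
                // Stop decoding if the record is truncated.
                let (Some(key), Some(value)) = (bytes.next(), bytes.next()) else { break };
                Op::Put { key, value }
            }
            1 => {
                let Some(key) = bytes.next() else { break };
                Op::Delete { key }
            }
            _ => Op::Sync,
        };
        ops.push(op);
    }
    ops
}
```

Because the mapping from bytes to operations is total and local, flipping one byte typically changes one operation in the plan rather than invalidating the whole input.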

Continuous execution is handled by our internal fuzzing platform, similar in purpose to ClusterFuzz: it runs fuzzers continuously, tracks coverage, reports crashes, stores artifacts, and supports triage. The important property is operational rather than technical: fuzzing isn't a periodic campaign, but runs 24 hours a day. All public fuzzers are open source and live in the relevant crates of the Commonware monorepo.

Fuzzer-Controlled Randomness with FuzzRng

Controlled randomness is a key part of fuzzing. Many targets need a large amount of data to instantiate and fill the system: keys, messages, storage values, network schedules, protocol choices, and fault configurations. One simple approach is to put a seed in the fuzz input, initialize a standard Rust PRNG from that seed, and let the PRNG generate the rest.

That is reproducible, but it isn't very useful as fuzzing feedback. The reason is that libFuzzer mutates bytes and learns from nearby changes in coverage or crashes. If a mutation changes a seed from 1 to 3456297643, the target doesn't get a slightly different system state; it gets a completely different random stream. The fuzzer is still mutating the seed, but the generated execution has little locality with the previous one, so mutation and minimization become much less effective.

This is related to the broader idea of grey-box fuzzing inside deterministic simulation. One approach is to record a concrete message trace from a normal test, then replay mutated versions of that trace under libFuzzer. FuzzRng and the deterministic runtime don't implement that full trace-recorder and trace-replayer model. Instead, they provide a lightweight version of the same principle: expose the hidden nondeterminism of the execution to libFuzzer as mutable input bytes. The deterministic runtime makes scheduling, time, storage, and task execution replayable. FuzzRng makes the large amount of randomness needed to instantiate and fill the system mutation-friendly. Together, they let protocol fuzzers explore executions under coverage feedback without first materializing every run as a recorded message trace.

Commonware implements FuzzRng to avoid the seeded-PRNG problem. FuzzRng turns the fuzzer's raw byte input into an infinite deterministic RngCore stream. For each 64-bit output block, it reads a wrapping eight-byte window from the input, mixes in a block counter and domain constant, and applies a SplitMix64-style finalizer. This gives the stream enough diffusion to behave like useful test randomness while preserving locality: a small mutation in the fuzz input affects only nearby random-output blocks rather than replacing the entire stream.
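The construction described above can be sketched with the standard library alone. This is not Commonware's implementation: the TapeRng name, the DOMAIN constant, the choice to start block i at byte 8*i, and the way the counter is mixed in are all assumptions made for illustration, and the sketch exposes a plain next_u64 method instead of implementing RngCore so that it has no external dependencies.

```rust
/// Sketch of a byte-tape RNG in the spirit of FuzzRng (not Commonware's code).
struct TapeRng {
    tape: Vec<u8>,
    counter: u64,
}

/// Assumed domain-separation constant; the real value is not given in the text.
const DOMAIN: u64 = 0x4655_5A5A_524E_4721;

/// SplitMix64 finalizer: a bijective 64-bit mixing function.
fn mix(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

impl TapeRng {
    fn new(input: &[u8]) -> Self {
        // An empty input is treated as a single zero byte so reads never fail.
        let tape = if input.is_empty() { vec![0] } else { input.to_vec() };
        Self { tape, counter: 0 }
    }

    /// Produce the next 64-bit block.
    fn next_u64(&mut self) -> u64 {
        let len = self.tape.len() as u64;
        // Assumption: block i reads the window starting at byte 8*i, wrapping.
        let start = self.counter.wrapping_mul(8) % len;
        let mut window = [0u8; 8];
        for (i, b) in window.iter_mut().enumerate() {
            *b = self.tape[((start + i as u64) % len) as usize];
        }
        let word = u64::from_le_bytes(window);
        // Mixing in the counter keeps the stream from repeating after wrapping.
        let salt = self.counter.wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ DOMAIN;
        self.counter = self.counter.wrapping_add(1);
        mix(word ^ salt)
    }
}
```

With a 64-byte tape, flipping one bit in byte 10 changes only the second of the first eight output blocks under these assumptions, whereas flipping one bit of a PRNG seed would change every block. That is the locality property the paragraph above relies on.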

This matters for fuzzing complex systems like the Commonware stack. In many protocol and storage targets, the explicit fuzz input describes the high-level operation plan, while FuzzRng drives the hidden degrees of freedom inside the execution: simulated runtime randomness, network fault choices, generated values, adaptive partitions, and helper-level sampling. When a crash is found, the artifact contains both the structured plan and the entropy tape needed to replay it.

FuzzRng is plugged directly into the deterministic runtime through deterministic::Config::with_rng. Together, they make executions reproducible at two levels: the runtime controls scheduling, time, storage, and task execution, while FuzzRng controls all randomness consumed during the run. For consensus protocol fuzzing, this is especially important because a failure may depend on a specific combination of message ordering, timer behavior, network faults, and randomized protocol choices.

Production Fuzzing

We divide production fuzzers into three classes: primitive fuzzers, protocol fuzzers, and LLM-driven fuzz targets.

Primitive Fuzzers

Primitive fuzzers exercise individual primitives in isolation. Typical targets include parsers, codecs, consensus, storage, network helpers, and cryptographic primitives.

These fuzzers are cheap to write and verify, and very fast to run compared to consensus protocol fuzzers. They are also good regression tests: once a bug is found, the minimized input becomes a durable witness for the violated invariant.
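A small sketch of this pattern: one invariant-checking function is shared by the fuzz target and by a regression replay over stored minimized inputs. The varint codec below is a toy stand-in for a real primitive, and the REGRESSIONS entries are invented examples, not actual Commonware findings.

```rust
/// Toy LEB128-style varint codec standing in for a real primitive.
fn encode(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Decode a varint, rejecting truncated or over-long encodings.
/// Returns the value and the number of bytes consumed.
fn decode(data: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in data.iter().enumerate() {
        // A u64 needs at most 10 groups of 7 bits.
        if i >= 10 {
            return None;
        }
        let chunk = u64::from(byte & 0x7F);
        // The tenth byte may only carry the single remaining bit.
        if i == 9 && chunk > 1 {
            return None;
        }
        value |= chunk << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// The invariant shared by the fuzz target and the regression replay:
/// decoding never panics, and any decoded value survives a round trip.
fn check(data: &[u8]) {
    if let Some((value, used)) = decode(data) {
        assert!(used <= data.len());
        let bytes = encode(value);
        assert_eq!(decode(&bytes), Some((value, bytes.len())));
    }
}

/// Hypothetical minimized inputs kept as durable witnesses.
const REGRESSIONS: &[&[u8]] = &[
    &[0xFF; 10], // over-long value in the tenth byte
    &[0x80],     // truncated continuation
    &[0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x01], // u64::MAX
];

/// Replay every stored input through the same invariant check.
fn replay_regressions() {
    for input in REGRESSIONS {
        check(input);
    }
}
```

In a cargo-fuzz setup, the fuzz target body would simply call check on the raw input, while replay_regressions would run as an ordinary unit test so the witnesses are exercised on every build.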

Protocol Fuzzers

Protocol fuzzers exercise larger constructions. Examples include consensus protocol primitives and threshold cryptographic mechanisms. These targets are harder to engineer because they require more state: actors, storage, network, p2p channels, runtime, broadcast, and protocol reconfiguration.

For consensus fuzzing, we use standard techniques for distributed systems testing: deterministic runtime execution, Twins-style Byzantine behavior, and ByzzFuzz-style fault injection. The deterministic runtime is essential and a precondition for efficient fuzzing and testing of any distributed protocol. It allows failures to be reproduced as executions, not just as inputs. This is especially important for consensus, where the bug is often not in a single message but in the ordering of messages, timers, and actor transitions.
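To make "reproduced as executions, not just as inputs" concrete, here is a deliberately small sketch of a message scheduler whose every decision comes from a byte tape. It is a simplification for illustration only: it models delivery order and message drops, not the Twins or ByzzFuzz techniques themselves, and Msg, Event, and run_schedule are hypothetical names.

```rust
/// A message in flight between two simulated nodes (hypothetical model).
#[derive(Clone, Debug, PartialEq)]
struct Msg {
    from: usize,
    to: usize,
    round: u64,
}

/// What the scheduler did with a message.
#[derive(Clone, Debug, PartialEq)]
enum Event {
    Delivered(Msg),
    Dropped(Msg),
}

/// Drain `pending` in an order chosen entirely by `tape`.
/// Each step consumes one byte: the low bit decides drop versus deliver,
/// and the remaining bits pick which pending message is affected.
/// When the tape runs out, the remaining messages are delivered in FIFO order.
fn run_schedule(mut pending: Vec<Msg>, tape: &[u8]) -> Vec<Event> {
    let mut trace = Vec::new();
    let mut bytes = tape.iter().copied();
    while !pending.is_empty() {
        let choice = bytes.next().unwrap_or(0);
        let index = (choice >> 1) as usize % pending.len();
        let msg = pending.remove(index);
        if choice & 1 == 1 {
            trace.push(Event::Dropped(msg));
        } else {
            trace.push(Event::Delivered(msg));
        }
    }
    trace
}
```

Since there is no other source of nondeterminism, the same tape always yields the same trace, so a failing ordering can be replayed and minimized like any other fuzz input.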

Spec-Driven ByzzFuzz Development

ByzzFuzz is developed with a spec-driven development approach. The implementation is not treated as the source of truth for the fuzzing method. Instead, the expected behavior of the harness, fault scheduler, network interception layer, process injector, liveness checks, and invariants is documented in an informal specification before major implementation changes are made.

We use LLM agents as engineering assistants in this workflow, but not as authorities. The specifications define the architecture, contract, architecture decision records (ADRs), domain, and invariants. Within those bounds, an LLM agent handles three kinds of tasks: it translates the specification into Rust code or the code back into specification text; it identifies inconsistencies between the two and keeps them in sync; and it summarizes the expected behavior of a primitive, verifies that the implementation and the specification correspond, and changes either one where needed.

This is especially useful for consensus fuzzing because many bugs in the fuzzer itself look like protocol bugs. A malformed fault schedule, an unrealistic Byzantine action, or an incorrect liveness oracle can produce misleading findings. Spec-driven development gives us a stable reference point: when a fuzz result is surprising, we can ask whether the implementation violated the protocol, the fuzz target violated the specification, the specification failed to model the intended adversary, or the ByzzFuzz method was incorrectly defined in the specification or implemented in the code.

In practice, ByzzFuzz development follows a loop:

1. Update the relevant part of the specification or code to reflect the behavior being changed.
2. Sync the spec with the code or vice versa.
3. Run the ByzzFuzz target and inspect crashes, traces, and decision logs.
4. Promote only reproducible and spec-consistent findings into the normal triage flow.

This keeps LLM assistance bounded. The LLM can accelerate implementation and review, but the durable artifacts are the specifications, tests, fuzz targets, minimized inputs, and reproducible traces.

LLM-Driven Fuzz Targets

LLM-based vulnerability scanning gives us another source of candidate bugs. This workflow is based on a local implementation of Shellphish’s DiscoveryGuy-style approach within the open-source AIxCC Sherpa tooling. DiscoveryGuy combines harness selection, vulnerability reasoning, and seed generation, then uses fuzzing to validate whether the generated seed or nearby mutations actually trigger a crash.

We use fuzzing as an oracle for this workflow. The process is:

1. An LLM agent identifies candidate bugs that could trigger panics, memory-safety issues, or invariant violations.
2. The agent generates a fuzz target as a proof of concept.
3. The agent derives seeds and runs the fuzz target.
4. If the finding is reproduced, the fuzz target becomes a candidate for the production fuzzing core set.

This separates speculation from evidence. The LLM may suggest a bug. The fuzzer must reproduce it.

For example, this approach found a trivial arithmetic overflow in the Reed-Solomon encoder when fuzzing config parameters. The generated target varied minimum_shards, extra_shards, input data, and shard-state vectors, then called ReedSolomon<Sha256>::encode. This reproduced an attempted addition overflow in coding/src/reed_solomon/mod.rs, turning the LLM-generated hypothesis into a concrete fuzzing finding.
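The bug class is easy to show in isolation. The sketch below is not the Reed-Solomon code and does not describe how the issue was fixed; ShardConfig is a hypothetical struct that borrows the parameter names from the paragraph above, and the u16 width is an assumption.

```rust
/// Hypothetical stand-in for a shard configuration. Field names follow the
/// text above, but the integer width is an assumption.
struct ShardConfig {
    minimum_shards: u16,
    extra_shards: u16,
}

impl ShardConfig {
    /// Total shard count, or `None` if the sum does not fit in the type.
    /// A plain `+` here panics with an overflow error when overflow checks
    /// are enabled, which is the kind of crash a fuzzer surfaces.
    fn total_shards(&self) -> Option<u16> {
        self.minimum_shards.checked_add(self.extra_shards)
    }
}
```

A fuzz target that draws both fields from the full integer range reaches the overflowing combination quickly, which is why varying configuration parameters is a cheap and productive target shape.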

A second example shows a different feedback loop. The LLM-generated target for EvaluationVector::empty found a shift-left overflow panic in math/src/ntt.rs when lg_rows was large enough that 1 << lg_rows overflowed. The issue was not ultimately fixed by changing the implementation. Instead, the fix clarified the function’s precondition: 2^lg_rows must fit in a usize. Once that invariant was documented, subsequent LLM-generated fuzz targets used it as an input constraint, so the target no longer treated out-of-contract inputs as bugs.
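A sketch of how a documented precondition turns into an input constraint, using hypothetical helpers rather than the actual EvaluationVector API. MAX_LG_ROWS is an assumed cap chosen for the example; the only fact taken from the text is that 2^lg_rows must fit in a usize.

```rust
/// Number of rows for a given log2 size, or `None` when 2^lg_rows does not
/// fit in a `usize` (hypothetical helper, not Commonware's API).
fn row_count(lg_rows: u32) -> Option<usize> {
    1usize.checked_shl(lg_rows)
}

/// Assumed practical cap for fuzzing, well inside the documented contract.
const MAX_LG_ROWS: u32 = 16;

/// Map an arbitrary fuzzer byte into the in-contract input domain, so the
/// target exercises only values the function promises to handle.
fn in_contract_lg_rows(raw: u8) -> u32 {
    u32::from(raw) % (MAX_LG_ROWS + 1)
}
```

Constraining the generator this way keeps the fuzzer's time on inputs where a panic would be a real finding rather than a restatement of the precondition.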

Experimental Fuzzing

Some bugs are difficult to reach with ordinary harnesses. This is especially true for distributed systems, where the interesting path may require a specific setup, coordination of network and process faults, a sophisticated attack, a message schedule, or an internal transition. We are therefore exploring two research directions.

LLM-Assisted Security Requirement-Guided Fuzzing

This is an experimental LLM-driven security verification workflow for the Commonware monorepo. Instead of relying solely on manually written fuzz harnesses or isolated checklist reviews, it combines LLM agents, project-specific invariant extraction, static security reasoning, and fuzzing into a single repeatable process.

The workflow begins by using LLM agents to understand the codebase's semantics. The agents identify core entities, trust boundaries, protocol roles, security policies, and invariants across Commonware’s different primitives. This is especially important for distributed systems, where consensus, networking, storage, cryptography, and runtime layers may each have their own notions, assumptions, and safety properties. From this analysis, the agents build a project-specific database of technical requirements, similar to OWASP ASVS. This database records what properties must hold, where they are enforced, which primitives depend on them, and what kinds of violations may lead to security issues.

The agents then verify whether these properties and invariants actually hold in the implementation. They compare expected behavior against source code, documentation, tests, and traditional secure-engineering rules. This helps identify places where the code may violate trust boundaries, skip validation, mishandle protocol state, rely on unsafe assumptions, or diverge from documented behavior.

Potential issues are not treated as findings immediately. LLM agents prioritize candidates where fuzzing can provide strong evidence. Fuzzing is used as an oracle: it attempts to determine whether a suspected vulnerability can be triggered by realistic inputs or executions.

When fuzzing confirms a candidate, the result becomes concrete evidence for a vulnerability. When fuzzing cannot reach the suspected behavior, the candidate is weakened or rejected. Each result is fed back into the verification and vulnerability database, improving future analysis and reducing duplicate work.
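One possible shape for a record in such a database, with the feedback step expressed as a state update. The schema is entirely hypothetical: the field names, the three statuses, and the rule that a confirmed finding is never downgraded are assumptions made to illustrate the loop, not a description of the actual tooling.

```rust
/// Verification state of one security requirement (hypothetical schema).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    /// Flagged by static reasoning, no dynamic evidence yet.
    Candidate,
    /// A fuzz target reproduced the violation.
    Confirmed,
    /// Fuzzing ran without reaching the suspected behavior.
    Weakened,
}

/// Outcome reported by a fuzzing run for one candidate.
enum FuzzOutcome {
    Reproduced,
    NotReached,
}

/// One requirement: what must hold, where it is enforced, and its status.
struct Requirement {
    id: &'static str,
    property: &'static str,
    enforced_in: Vec<&'static str>,
    status: Status,
}

impl Requirement {
    /// Fold a fuzzing result back into the record.
    fn record(&mut self, outcome: FuzzOutcome) {
        self.status = match (self.status, outcome) {
            // A reproduced violation is never downgraded by a later miss.
            (Status::Confirmed, _) | (_, FuzzOutcome::Reproduced) => Status::Confirmed,
            (_, FuzzOutcome::NotReached) => Status::Weakened,
        };
    }
}
```

Keeping the status on the requirement itself is what lets later analysis skip candidates that have already been confirmed or weakened.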

In short, the workflow follows this loop:

1. Identify core entities and security policies.
2. Build project-specific invariants and a vulnerability database.
3. Verify whether the invariants hold.
4. Select high-value fuzzing candidates.
5. Use fuzzing to prove or falsify vulnerability claims.

This approach is experimental, but it is well-suited to complex distributed systems where important security properties are spread across multiple primitives, documents, tests, and implementation paths.

Model-Guided Consensus Fuzzing

The second experimental direction is model-guided fuzzing for consensus protocols, currently focused on Commonware’s implementation of Simplex.

In this approach, the feedback signal is not code coverage. Instead, we use executable specifications written in Quint and TLA+ to measure state coverage. The fuzzer is guided toward protocol states and transitions that matter at the model level, not only toward new basic blocks in the implementation.

This is still research. The goal is to connect implementation fuzzing with protocol-level reasoning. Code coverage tells us which implementation paths were executed. Model coverage tells us which protocol behaviors were exercised. For consensus protocols, both signals are useful, but they answer different questions.
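The difference between the two signals can be sketched as a second coverage map keyed by abstract protocol state. ModelState and its fields are a hypothetical projection invented for this example; in practice the abstraction would be derived from the Quint or TLA+ specification rather than written by hand.

```rust
use std::collections::HashSet;

/// Abstract protocol state as a model might see it (hypothetical projection).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ModelState {
    view: u64,
    notarized: bool,
    finalized: bool,
}

/// Tracks which abstract states have been reached across all runs.
#[derive(Default)]
struct ModelCoverage {
    seen: HashSet<ModelState>,
}

impl ModelCoverage {
    /// Record the states visited by one execution and return how many were
    /// new. An input with a non-zero score is worth keeping in the corpus
    /// even if it covered no new code.
    fn observe(&mut self, trace: &[ModelState]) -> usize {
        let mut new_states = 0;
        for state in trace {
            if self.seen.insert(state.clone()) {
                new_states += 1;
            }
        }
        new_states
    }
}
```

Two executions can run exactly the same code paths and still land in different abstract states, which is the case this signal is meant to reward.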

Current Status

The production fuzzing line is part of Commonware’s continuous security process. Primitive fuzzers, protocol fuzzers, and validated LLM-generated fuzzers are run continuously on AR’s internal fuzzing platform, totaling 90 fuzz targets.

LLM-assisted security requirement-guided fuzzing and model-guided consensus fuzzing are experimental. They are not yet part of the production pipeline. Their purpose is to test whether project-specific security requirements, derived from code comments and implementation context, and model-level state coverage can guide fuzzing toward bugs that ordinary harnesses and ordinary coverage feedback may miss.

All fuzzing findings can be tracked in the Commonware monorepo issue list.

Conclusion

The main design principle is simple: fuzzing is infrastructure. Commonware’s codebase is engineered so fuzzers can be first-class parts of the test system, with explicit interfaces, documented invariants, reviewable targets, reproducible crashes, minimized inputs, and promoted regressions. The result is durable test coverage, not a collection of ad hoc harnesses.

This gives us two complementary modes of work. The production line continuously searches for concrete crashes and turns confirmed findings into regression coverage. The experimental line searches for better ways to expose protocol behavior to fuzzers. For distributed protocols, the deterministic runtime is the key enabler: it makes schedules, timers, network delivery, and actor transitions reproducible enough to fuzz efficiently. Both production and experimental fuzzing are necessary because the most important bugs often emerge not from isolated primitives, but from interactions between them.