Matching Patterns with Bits: What Works, What Doesn't, and What's Untested

May 23, 2026

Languages:Rust

Patterns:Hamming-distance classificationBit-space k-meansBit-flip learningComposition over substrate growth

Architecture:Bit-vector value domainSubstrate consumerPrototype classifierSimilarity-by-composition

Tags:research

The Hypothesis

Spatium is built on one idea: you can recognize patterns using only bit-strings and Hamming distance. No transformers, no gradient descent, no learned weight matrices. A pattern is a string of bits. Recognizing it means counting how many bits differ from a stored example. Learning means flipping bits.

There are no floats anywhere in the recognition path. The question we’re trying to answer is just: how far does this get you? This post is where things stand. What we measured, what failed, and the large parts of the idea we still haven’t tested.

How It Works

Everything is built from one value type, a bit-vector tagged with its width, and a small set of bitwise operations: XOR, AND, popcount (counting set bits), and a few others. There’s no special “neuron” type. A recognition unit is just a stored pattern, a distance calculation, and a threshold.

flowchart LR
    In["input bits<br/>(width-typed)"] --> H["Hamming distance<br/>popcount(input XOR stored)"]
    Stored["stored pattern"] --> H
    H --> Th{"distance &lt; threshold?"}
    Th -- yes --> Fire["unit fires"]
    Th -- no --> Quiet["silent"]

The first experiments checked that this composes and behaves. A bit-vector survives a round trip through the execution engine without a single bit changing. A hand-built similarity unit gives the exact activation we expect on every input. And with three units side by side, each holding a different pattern, only the matching one fires, even when the input differs from a stored pattern by a single bit.

Once those held, the real question was accuracy on real data.

What We’ve Tested

We ran this on MNIST: the 10,000-image test set, accuracy as the score, with the final pick made in plain code over the scores the graph produces. Here’s everything we measured.

Approach	Method	Result	Status
Single prototype / class	One average pattern per digit; nearest by Hamming	71.35%	Measured
Multi-prototype (K=5)	Bit-space k-means, 5 patterns per digit	80.68% (+9.33pp)	Measured
Shift-tolerant Hamming	Compare against shifted copies; keep the best	83.42% (+2.74pp)	Measured
Asymmetric capacity	Give the hardest digit more patterns	+≥5pp on digit 8	Measured
Shared-minus-different kernel	Score = shared bits − different bits	84.56% (+3.88pp)	Measured
8-bit Gray encoding	Same kernel, 8× wider bit-vectors	81.51% (−3.05pp)	Measured, lost
Bit-flip refinement	Nudge patterns toward/away from samples	≤75.15%	Measured, lost
Bit-flip learning rule	Masked XOR convergence toward a target	converges cleanly	Measured
Composable recognition unit	A reusable, author-friendly neuron	—	Built, not validated

Each step bought real accuracy. We always compared the change against the thing it replaced, not against zero.

flowchart LR
    A["single prototype<br/>71.35%"] --> B["multi-prototype K=5<br/>80.68%"]
    B --> C["shift-tolerant<br/>83.42%"]
    C --> D["shared-minus-different<br/>84.56%"]
    B -. "8-bit gray encoding" .-> E["81.51% ✗"]
    D -. "bit-flip refinement" .-> F["≤ 75.15% ✗"]

The biggest single gain didn’t come from more patterns or more compute. It came from changing how we measure similarity.

Plain Hamming distance has a bias. It quietly favors patterns with few set bits, because a mostly-empty pattern looks “close” to anything sparse. Our digit patterns aren’t balanced. The number of set bits ranges from about 56 for a thin digit like 1 to about 141 for a fat digit like 0. So the bias is real, and it costs accuracy.

The fix is to score by shared bits minus different bits, instead of just counting differences:

score = popcount(A AND B) − popcount(A XOR B)
      = (popcount(A) + popcount(B)) / 2  −  1.5 · Hamming(A, B)

The second line is the same formula rewritten. It shows what’s going on: the score now includes a term for each pattern’s own density, which cancels out the bias toward sparse patterns. Picking the pattern with the highest score got us to 84.56%, up 3.88 points from the plain multi-prototype version. The worst digit (8) improved by almost 14 points on its own, with no special handling.

flowchart LR
    S["test sample"] --> AND["AND with pattern"]
    S --> XOR["XOR with pattern"]
    P["50 patterns<br/>(5 per digit)"] --> AND
    P --> XOR
    AND --> PC1["popcount<br/>= shared bits"]
    XOR --> PC2["popcount<br/>= different bits"]
    PC1 --> Sub["score = shared − different"]
    PC2 --> Sub
    Sub --> Arg["highest score wins"]
    Arg --> Cls["digit = index ÷ 5"]

We also tested whether a bit-string can learn a target with no gradients at all. The rule is one line. Flip the bits that disagree with the target, but only the ones a mask allows:

next = current XOR ((target XOR current) AND mask)

target XOR current is the set of bits that disagree. AND mask keeps only the ones we’re allowed to change. The outer XOR flips them. Wired as a feedback loop, this settles in a predictable way. A full mask reaches the target in one step. An empty mask freezes the pattern. A partial mask settles to a stable point, and the leftover error is exactly the number of bits the mask never let it touch. You can predict the whole thing by hand.

flowchart LR
    Cur["current pattern"] --> X1["XOR"]
    Tgt["target"] --> X1
    X1 --> A["AND"]
    Mask["mask = which bits may change"] --> A
    A --> X2["XOR"]
    Cur --> X2
    X2 --> Next["next pattern<br/>(allowed bits flipped)"]
    Next -. "feeds back each step" .-> Cur

What Didn’t Work

Two experiments lost. Why they lost was more useful than another point of accuracy would have been.

The 8-bit Gray encoding dropped us by 3 points. The idea seemed sound: instead of one bit per pixel, encode each pixel’s brightness as an 8-bit Gray code, where similar brightness values differ by one bit. So “almost the same shade” would read as “almost the same bits.” It came out to 81.51%, and the worst-hit digits each lost about 9 points. Making the bit-vector 8 times wider added noise faster than the extra detail added signal. More bits isn’t more information if most of those bits don’t carry any.

The bit-flip refinement lost on every configuration we tried. We took the shared-minus-different patterns and tried to polish them: nudge each one toward samples it got right and away from ones it got wrong. All 13 settings lost. The best peaked at 75.15%, and most landed 10 to 13 points below the 84.56% baseline. The reason is the interesting part. The refinement pulls every pattern’s bit-count toward the average, and that’s exactly the density difference the shared-minus-different score relies on. The trainer and the scoring method aren’t separate dials. Tuning one against the other’s assumptions makes things worse.

The Untested Frontier

This is where most of the original idea still sits. The results above are about fixed, pre-computed patterns. The first version of Spatium was meant to be something more alive. None of that has been run yet. Some of it is built but never validated; some has never been attempted at all.

Shape questions, the things we haven’t built:

A real reusable recognition unit. A unit with inputs for signal, learning target, and which bits may change, and outputs for its score and its stored memory. It exists as a recipe and registers correctly, but it’s never been run end-to-end as something an outside author wires up, and it has no accuracy number of its own. This is the “what shape is a Spatium neuron” question, and a lot rides on the answer being easy to build from outside.
A multi-step think cycle as a graph. A perceive, resonate, settle, spread, learn, output pipeline. Sketched, never assembled or run.
Structure that emerges on a fixed graph. Can regions form just from per-connection plasticity, without the engine adding new connections at runtime? Never tried. It’s the surviving piece of the original “position is context” idea.
One big unit vs. many small ones. Does a single high-capacity recognizer beat a crowd of narrow ones? Untested.

Mechanism questions, the dynamics we haven’t measured:

Learning over time. Our bit-flip rule only settles and holds. Continuous, ongoing adaptation, and a rule that locks a pattern in as confidence grows, is designed but not measured.
Recall by binding. Store a pattern bound to a rotated reference, then recover it. Plausible to build from what we have. Needs one new rotate operation. Not built.
Other scoring methods that keep the density signal intact. This is the one thing that might let a trainer help the shared-minus-different patterns instead of fighting them.

Open Questions

Does the reusable recognition unit hit the same 84.56% when it’s assembled from outside, or does the wiring get in the way?
Is there a learning rule that improves the shared-minus-different patterns instead of flattening them?
Can a richer input encoding beat the plain one-bit-per-pixel version without drowning the signal the way Gray code did?
Does any real structure form from connection-local plasticity, or does a fixed graph cap what’s possible?
How far past 84.56% can this go before it hits a real ceiling, and where is that ceiling?

Considerations

We score by shared-minus-different bits instead of raw Hamming distance. It costs a little more to compute, but it cancels a bias that was quietly costing us about 4 points.

We kept the plain one-bit-per-pixel input after the Gray encoding lost. It’s a coarser input, but the wider version had a worse signal-to-noise ratio.

We treat the fixed-pattern classifier as the floor of the idea, not the ceiling. The more ambitious, structural version of Spatium is still just a hypothesis, and saying otherwise wouldn’t be research.