Bit-Neurons That Organize Themselves: Memory and Emergence Without Gradients
The Hypothesis
The bet is simple to state. A neuron is a bit-pattern in and a bit-pattern out. Alone it does little. Connect many, let their patterns interfere, and structure should emerge: recognition, memory, specialization. No gradients, no weight matrices, no floating point. Just bits and a few bitwise operations.
This post is the state of that bet: what we measured, what failed, and what we haven’t run. The short version is that the core works on real data, gradient-free and integer-only, but only after we fixed how the neurons compete, and only on inputs encoded to a uniform density.
How It Works
Everything is one value, a fixed-width bit-vector, plus a handful of bitwise ops. Two do most of the work.
Resonance measures how well an input matches a stored pattern. The basic form is shared bits minus differing bits. Higher means a closer match.
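As a minimal sketch, assuming "shared" means bits set in both patterns and "differing" means bits set in exactly one of them, the basic score fits in a few lines of Python. Python ints stand in for fixed-width bit-vectors, and the names `popcount` and `resonance` are illustrative rather than taken from the project code.

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def resonance(inp: int, stored: int) -> int:
    """Basic score under the assumed reading: shared set bits minus differing bits."""
    shared = popcount(inp & stored)      # bits set in both patterns
    differing = popcount(inp ^ stored)   # bits set in exactly one pattern
    return shared - differing


stored = 0b11110000
close = 0b11100000   # one stored bit missing
far = 0b00001111     # no overlap with the stored pattern

print(resonance(stored, stored))  # 4
print(resonance(close, stored))   # 2
print(resonance(far, stored))     # -8
```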
Learning flips bits toward a target through a mask that decides which bits may change:
next = current XOR ((target XOR current) AND mask)
No gradients. The mask is the only knob.
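The rule can be checked on small values. This sketch uses Python ints as bit-vectors; the function name `learn` and the 8-bit examples are illustrative assumptions.

```python
def learn(current: int, target: int, mask: int) -> int:
    """Flip only the masked bits of `current` that disagree with `target`."""
    return current ^ ((target ^ current) & mask)


current = 0b10101010
target = 0b11110000

# Mask covers the low nibble: only those four bits may move toward the target.
partial = learn(current, target, 0b00001111)
print(f"{partial:08b}")  # 10100000

# A full mask copies the target; an empty mask changes nothing.
print(learn(current, target, 0b11111111) == target)   # True
print(learn(current, target, 0b00000000) == current)  # True
```

How the mask is chosen per update (random, fixed, shrinking over time) is not shown here; the sketch only demonstrates what a given mask permits.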
The choice that matters is what counts as a neuron’s output. We read the pattern it holds, not a score. A score collapses the neuron to one number and turns the system into a nearest-neighbour classifier. The pattern keeps the information that lets neurons remember and organize.
A population learns by competition. For each input, the neuron that resonates most is the only one that updates. That is winner-take-all, and bounding the interference this way is load-bearing, for reasons the results below make concrete.
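One competitive step might look like the following sketch. It assumes the basic shared-minus-differing score, reads "shared" as bits set in both patterns, and breaks ties in favour of the lowest-indexed neuron; `wta_step` is a hypothetical name.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def resonance(inp: int, stored: int) -> int:
    # Assumed reading: bits set in both, minus bits set in exactly one.
    return popcount(inp & stored) - popcount(inp ^ stored)


def wta_step(neurons: list[int], inp: int, mask: int) -> int:
    """Update only the best-resonating neuron in place; return its index."""
    winner = max(range(len(neurons)), key=lambda i: resonance(inp, neurons[i]))
    # Masked bit-flip toward the input, applied to the winner alone.
    neurons[winner] ^= (inp ^ neurons[winner]) & mask
    return winner


neurons = [0b11110000, 0b00001111]
winner = wta_step(neurons, inp=0b11100000, mask=0b11111111)
print(winner)               # 0
print(f"{neurons[0]:08b}")  # 11100000 (moved to the input)
print(f"{neurons[1]:08b}")  # 00001111 (untouched)
```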
What We’ve Tested
| Approach | Method | Result | Status |
|---|---|---|---|
| Self-organizing memory, real digits | online winner-take-all with a density-fair score; recall and novelty in one loop | clusters self-organize to 0.48 purity (vs 0.10 chance); recall 3.8–4.5× chance at 10% corruption; novelty AUROC 1.00; running all three together degrades none | Measured |
| Same loop, a second dataset | identical loop, inputs rank-encoded to uniform density | purity 0.49–0.53; recall 4.7–5.0× chance; novelty AUROC 1.00; composition within ±2pp | Measured |
| Unsupervised specialization, synthetic clusters | 16 neurons, winner-take-all bit-flip | one neuron per cluster, purity 1.0 with enough neurons, holds at 20% noise | Measured |
| Associative recall, synthetic | store by bit-flip, recall from a corrupted cue by resonance | 256-bit neurons recall up to 128 patterns at 10% corruption, no ceiling reached | Measured |
| Confidence / novelty, synthetic | low resonance everywhere flags out-of-distribution | novelty AUROC 0.81 (to 50% corruption); recall-correctness AUROC 0.87 in-capacity | Measured |
| Continual learning | settle on clusters, then stream new ones in | with spare neurons 80–100% retained; below capacity retention falls to ~0.65 | Measured |
| Pattern classifier (the detour) | per-patch best-alignment over stored prototypes | 91.94%, up from 84.56% flat and past an 88.28% capacity ceiling | Measured |
| Two-layer composition | detect parts, then compose | 30.7%, lost to 88.2% flat | Measured |
| Classifier as an executable graph | run the same math through the graph engine | 0 of 50 predictions differ from the host computation, ~0.1 ms/image at 50 neurons | Measured |
| Large network on the graph path | n/a | bit-exactness verified only at small scale | Built, not validated |
| Compile to a tight bitwise loop | n/a | scoped, not built | Scoped, not built |
Self-organization, and the fix that made it work on real data
We stream unlabelled patterns into a group of neurons and let them compete.
flowchart LR
In["unlabelled input"] --> Cmp["who resonates most?<br/>(density-fair score)"]
N1["neuron 1"] --> Cmp
N2["neuron 2"] --> Cmp
N3["neuron ..."] --> Cmp
Cmp --> Win["winner only"]
Win --> Learn["winner flips bits toward the input"]
Learn -.->|"over many inputs"| Spec["one specialist per cluster"]
On synthetic clusters this worked at once. With at least as many neurons as clusters, each cluster claimed its own neuron, purity 1.0, and it held at 20% noise.
Real handwritten digits broke it. Every neuron drifted to the single sparsest digit, and purity fell to 0.11, about the chance floor. The unbounded control, which we keep as a warning case, did not blur on real data; it covered all ten digits at 0.27–0.35 purity and beat the competitive version. So the synthetic win had been measuring how cleanly synthetic clusters separate, not the mechanism.
The cause was the resonance score. Shared-minus-differing rewards sparse patterns, so on digits that all share strokes every neuron slid to the same blob, with prototypes filling to about 360 of 784 bits. We swapped in a density-fair score: shared bits over their union (a Jaccard-style ratio). Prototypes settled sparse and distinct at about 190 of 784 bits, purity rose to 0.48, and the competitive loop beat the unbounded control again, 0.48 vs 0.35. The same inter-neuron distance check that detects blur showed the competitive neurons spread about 219 bits further apart than the unbounded ones, so this is real specialization, not hidden blur.
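A small sketch of the two scores side by side. It assumes "shared" means bits set in both patterns, and it uses `fractions.Fraction` to keep the ratio exact and integer-only; how the real loop compares ratios without floating point is not stated, so that choice is an assumption, as are the function names.

```python
from fractions import Fraction


def popcount(x: int) -> int:
    return bin(x).count("1")


def raw_resonance(inp: int, stored: int) -> int:
    # Shared set bits minus differing bits (assumed reading of the basic score).
    return popcount(inp & stored) - popcount(inp ^ stored)


def density_fair(inp: int, stored: int) -> Fraction:
    """Jaccard-style ratio: shared set bits over the union of set bits."""
    union = popcount(inp | stored)
    if union == 0:
        return Fraction(0)  # assumption: two empty patterns score zero
    return Fraction(popcount(inp & stored), union)


# Two pairs whose overlap is one third of their union, at different densities.
sparse_pair = (0b00001100, 0b00000110)
dense_pair = (0b11110000, 0b00111100)

print(raw_resonance(*sparse_pair), raw_resonance(*dense_pair))  # -1 -2
print(density_fair(*sparse_pair), density_fair(*dense_pair))    # 1/3 1/3
```

The raw score separates the two pairs by density alone, while the ratio treats them as equally good matches.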
The ceiling is real too. 48% unsupervised purity is decent clustering, not classification, and the most confusable digits go uncovered when neurons are scarce.
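For readers who want to reproduce a purity figure, here is a sketch of the standard cluster-purity calculation: each neuron is credited with its most common true label, and purity is the credited fraction of all inputs. Whether the numbers above were computed exactly this way is an assumption; labels are used only for scoring, never for learning.

```python
from collections import Counter, defaultdict


def purity(winners: list[int], labels: list[str]) -> float:
    """Fraction of inputs whose label is the majority label of their winning neuron."""
    by_neuron: dict[int, Counter] = defaultdict(Counter)
    for neuron, label in zip(winners, labels):
        by_neuron[neuron][label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_neuron.values())
    return majority_total / len(labels)


# Hypothetical toy run: index of the winning neuron and true label per input.
winners = [0, 0, 0, 1, 1, 2]
labels = ["a", "a", "b", "b", "b", "c"]
print(purity(winners, labels))  # 5 of 6 inputs match their neuron's majority label
```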
Recall from a damaged cue
A neuron stores a pattern and emits it back. Give the population a corrupted version and it returns the original by resonance.
flowchart LR
Cue["corrupted cue"] --> R["resonance vs each stored pattern"]
M1["stored pattern A"] --> R
M2["stored pattern B"] --> R
M3["stored ..."] --> R
R --> W["best match wins"]
W --> Out["recalled PATTERN<br/>(not a label)"]
On synthetic patterns, 256-bit neurons recalled up to 128 stored patterns at 10% corruption without hitting a ceiling. Wider vectors store more and tolerate more damage. On real digits the same recall, run inside the self-organizing loop, returned the right class 3.8–4.5 times more often than chance at 10% corruption and recovered about 0.86 of the stored bits. It recalls the gist of a cluster, not an exact copy, which for a memory is the behaviour we want.
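The recall step can be sketched as below. The stored patterns here are uniformly random 256-bit ints, which is an assumption about the synthetic setup, and the score is the basic shared-minus-differing form with "shared" read as bits set in both; `corrupt` and `recall` are illustrative names.

```python
import random

WIDTH = 256


def popcount(x: int) -> int:
    return bin(x).count("1")


def resonance(inp: int, stored: int) -> int:
    # Assumed reading: bits set in both, minus bits set in exactly one.
    return popcount(inp & stored) - popcount(inp ^ stored)


def corrupt(pattern: int, fraction: float, rng: random.Random) -> int:
    """Flip a fixed fraction of distinct, randomly chosen bit positions."""
    for pos in rng.sample(range(WIDTH), int(WIDTH * fraction)):
        pattern ^= 1 << pos
    return pattern


def recall(cue: int, stored: list[int]) -> int:
    """Return the stored pattern (not a label) that resonates most with the cue."""
    return max(stored, key=lambda p: resonance(cue, p))


rng = random.Random(0)
stored = [rng.getrandbits(WIDTH) for _ in range(16)]
cue = corrupt(stored[3], 0.10, rng)
recalled = recall(cue, stored)

print(popcount(cue ^ stored[3]))  # 25 bits differ between cue and original
print(recalled == stored[3])
```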
One loop, and a second dataset
The three behaviours (self-organizing, recalling, and flagging novelty) run as one online loop rather than three programs. On real digits, running them together degraded none of them relative to measuring each alone, and novelty detection on noise reached AUROC 1.00.
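The novelty signal is "low resonance everywhere". One way to turn that into a number, sketched here under assumptions, is one minus the best density-fair match across the population. The `novelty` function, the use of `Fraction` for exact ratios, and any threshold applied to the result are hypothetical.

```python
from fractions import Fraction


def popcount(x: int) -> int:
    return bin(x).count("1")


def density_fair(inp: int, stored: int) -> Fraction:
    # Shared set bits over the union of set bits; empty-vs-empty scores zero (assumption).
    union = popcount(inp | stored)
    return Fraction(popcount(inp & stored), union) if union else Fraction(0)


def novelty(inp: int, neurons: list[int]) -> Fraction:
    """One minus the best match: high when no neuron resonates well."""
    return 1 - max(density_fair(inp, n) for n in neurons)


neurons = [0b11110000, 0b00001111]
print(novelty(0b11100000, neurons))  # 1/4 (close to the first neuron)
print(novelty(0b10011001, neurons))  # 2/3 (matches neither neuron well)
```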
Then the real test of a mechanism: change the dataset. We ran the same loop, untouched, on a clothing-image set. With the same encoding it failed and even inverted: the competitive neurons ended up more blurred than the unbounded ones. The cause was density again. These images set about 32% of their bits, with every class packed into one density band, so the density-fair score lost its grip.
The fix was the input. We re-encoded each image by rank: keep the brightest k pixels, set exactly those, so every image carries the same number of bits, landing near the 15% density the digits happened to have. With nothing else changed, the loop came back: purity 0.49–0.53, recall 4.7–5.0× chance, novelty AUROC 1.00, composition within 2 points. So the mechanism carries to a second real dataset on one condition: the inputs have to be encoded to a uniform density first. The digit benchmark only ever worked because it was already close to uniform.
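Rank encoding as described (keep the brightest k pixels, set exactly those bits) can be sketched as follows. The tie-breaking rule (lower pixel index wins) and the function name are assumptions. The exact k used in the experiments is not stated; 15% of 784 pixels would be about 118.

```python
def rank_encode(pixels: list[int], k: int) -> int:
    """Return a bit-vector with exactly k bits set, one per brightest pixel."""
    # Sort indices by brightness, brightest first; ties go to the lower index (assumption).
    order = sorted(range(len(pixels)), key=lambda i: (-pixels[i], i))
    bits = 0
    for i in order[:k]:
        bits |= 1 << i
    return bits


# Hypothetical 8-pixel image with a brightness tie at indices 5 and 6.
pixels = [10, 200, 30, 250, 0, 90, 90, 5]
code = rank_encode(pixels, k=3)
print(f"{code:08b}")         # 00101010 (pixels 1, 3 and 5)
print(bin(code).count("1"))  # 3
```

Because every image ends up with exactly k set bits, density no longer varies between inputs or classes, which is the property the loop turned out to depend on.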
Forgetting is a capacity limit, not decay
We let the loop settle on one set of clusters, then streamed new ones in. With spare neurons, nothing broke: new clusters took idle neurons and the old specialists kept their patterns, 80–100% retained. Forgetting only appeared when we forced fewer neurons than clusters, where retention dropped to about 0.65. Shrinking a specialized neuron’s mask after it settles recovered most of that, 0.68 to 0.87 in the tight case, until too few neurons were left. So forgetting here is a capacity wall, which points at adding neurons rather than changing the rule.
What Didn’t Work
The classifier was the wrong shape. Chasing accuracy, we built a recognition unit that scores how well an input matches stored prototypes and reached 91.94%, up from 84.56% flat and past an 88.28% capacity ceiling. The gain came from comparing local patches and letting each find its best small shift, structure beating brute template count. But to get a score we collapsed the neuron’s output to one number. That is a nearest-neighbour classifier, not the idea. The number is fine; it just doesn’t measure remembering or organizing.
Stacking layers bought nothing on digits. A two-level version, detect parts then compose them, scored 30.7% against 88.2% for the flat version. Forcing the image through a sparse part-code threw away the spatial detail digit identity lives in. The lesson is that this benchmark isn’t a composition problem, not that depth is wrong.
Unbounded interference collapses to blur on synthetic data. Let every neuron absorb every input and they converge to the same pattern. This is how an earlier version of the idea failed. The same loop with the gate on versus off is the clearest evidence we have that bounded competition is what makes structure appear. On real digits the failure mode flipped to the score bias above, which is why the gate alone wasn’t enough there.
The Untested Frontier
- Is uniform-density encoding a general fix? It carried the loop from one dataset to a second once inputs were rank-encoded. We don’t know if that’s a portable rule or a coincidence that fits these two. A third, different dataset would test the fix itself. (mechanism)
- A task that isn’t an image benchmark. Everything ran on synthetic clusters and two image sets. A different modality would test the core claim properly. (mechanism)
- Recovering the confusable classes. Competition leaves the hardest-to-separate classes uncovered when neurons are scarce. Whether adding neurons on demand fixes that or only defers the wall is open. (structural)
- The neuron as a 3-D volume. The original idea was a volume, not a flat string of bits. We haven’t tried it and suspect it adds nothing without a reason for the third dimension to mean something. (structural)
- Scale. The neurons run as a graph and match the host math exactly at fifty neurons, ~0.1 ms/image. A large network would need compiling to a bitwise loop, which we’ve scoped but not built. (structural)
Open Questions
- Is rank-based, uniform-density encoding a portable fix across datasets, or does it only happen to suit the two we’ve run?
- Below the capacity wall, does adding neurons on demand restore continual learning, or just move the cliff?
- Is there a task, not a digit or clothing benchmark, where composing parts in layers actually helps?
- Can the most confusable classes get their own specialist without supervision?
- Does a 3-D neuron ever earn its third dimension, or is a flat pattern enough?
Considerations
We read the neuron’s pattern as its output instead of a score. We give up a ready-made class label and get memory and self-organization, which a single number can’t carry.
We bound the competition to one winner. It’s less natural than letting every signal sum, but the free-for-all blurs everything to the same pattern, and the gate is why structure appears.
We made the score density-fair and, when that wasn’t enough, made the inputs uniform-density. That is a preprocessing assumption, not “works on anything raw.” Without it the neurons collapse onto a single blob; with it the same loop carried to a second dataset.
We treat the 91.94% classifier as a side-quest. It’s a fine number, but the things we care about, remembering and organizing and knowing what you don’t know, aren’t measured by it.