Axo hunting food, avoiding hazards and learning in real time, with its brain (eye place-cells, motor neurons, smell/value and hunger drive) rendered live. Recorded straight from `--watch`.
Axo is a small artificial creature with a brain that learns while it lives.
Most neural networks are trained once on a big dataset and then frozen. Axo is different: its brain is a simulated network of spiking neurons (like real ones, talking in electrical pulses) that keeps learning continuously as the creature moves around a tiny world — getting hungry, hunting food, avoiding danger — all in one unbroken "life".
The twist that makes it interesting: Axo learns using only local rules. Every connection (synapse) changes based only on what its own two neurons just did — there is no backpropagation, no global optimizer, no separate training phase. That's much closer to how a biological brain is thought to learn, and it's the question this project explores: how far can purely local learning go? The animation above is a real recording of Axo hunting, with its brain — what it sees, the action it picks, its hunger — drawn live next to the world.
This is a research/learning project built from scratch in C++23 + CUDA. It is not a library or a finished product — it's a series of runnable experiments, each honest about what works and where the walls are.
- An NVIDIA GPU and Docker with the NVIDIA Container Toolkit (so the container can reach the GPU).
- That's all — CUDA, the compiler and every other dependency live inside the container. Nothing is installed on your host.
Developed on an RTX 5090 (CUDA arch `sm_120`, CUDA 13.3). On a different NVIDIA card, set `CMAKE_CUDA_ARCHITECTURES` in `CMakeLists.txt` to your GPU's compute capability.
docker compose run --rm build ./build/axo --watch
Loads (or, the very first time, grows) a creature and shows it living in real time, with its brain rendered next to the world — exactly like the animation at the top. Press Ctrl-C to stop; its progress is saved.
Want it to keep living across runs? --live runs one "life phase" (≈6000 steps), then saves the
brain to being/. Call it again and the same creature picks up where it left off, a little
smarter each time:
docker compose run --rm build ./build/axo --live
- Spiking neural network (SNN) — neurons that fire discrete pulses ("spikes") over time, like biological ones, instead of passing smooth numbers in a single shot.
- Local / Hebbian learning — a synapse changes based only on the activity of the two neurons it connects ("cells that fire together wire together"). No global error signal is broadcast.
- STDP (spike-timing-dependent plasticity) — the local rule in detail: if neuron A fires just before neuron B, the A→B link strengthens; if A fires just after B, it weakens. This is how Axo's perception self-organizes, with no labels.
- R-STDP (reward-modulated STDP) — the same rule, but a global "dopamine" reward decides whether recent activity gets reinforced or suppressed. This is how Axo learns to act.
- Feedback Alignment — a trick to train hidden layers locally, without backpropagation, by sending the error back through fixed random connections. Lets depth be learned, not hand-wired.
- Continual learning / catastrophic forgetting — ordinary networks overwrite old skills when they learn new ones; a real brain keeps both. Axo learns new patterns without erasing old ones.
- "No backprop" — backpropagation (the standard training algorithm) needs a global backward pass and is considered biologically implausible. Everything here avoids it on principle.
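As a rough illustration of the STDP and R-STDP entries above, here is a minimal pair-based sketch. It is not taken from the project's code: the struct `StdpParams`, the functions `stdp_dw` and `rstdp_dw`, and all constants are hypothetical, and the exponential timing window is one common textbook choice rather than a confirmed detail of Axo.

```cpp
#include <cmath>

// Hypothetical parameters for a pair-based STDP rule (illustrative values only).
struct StdpParams {
    double a_plus  = 0.010;  // potentiation amplitude
    double a_minus = 0.012;  // depression amplitude
    double tau_ms  = 20.0;   // width of the timing window in milliseconds
};

// Weight change for one pre/post spike pair.
// dt_ms = t_post - t_pre: positive means the presynaptic neuron fired first.
inline double stdp_dw(double dt_ms, const StdpParams& p = {}) {
    if (dt_ms > 0.0) return  p.a_plus  * std::exp(-dt_ms / p.tau_ms);  // pre before post: strengthen
    if (dt_ms < 0.0) return -p.a_minus * std::exp( dt_ms / p.tau_ms);  // post before pre: weaken
    return 0.0;                                                        // simultaneous: no change here
}

// R-STDP variant: the same local term, gated by a global reward ("dopamine") scalar.
inline double rstdp_dw(double dt_ms, double reward, const StdpParams& p = {}) {
    return reward * stdp_dw(dt_ms, p);
}
```

Calling `stdp_dw(5.0)` gives a positive change and `stdp_dw(-5.0)` a negative one. With `rstdp_dw`, a negative reward flips the sign, so recently active pairings are suppressed instead of reinforced.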
Each ability is a self-contained, runnable experiment; the detailed sections follow below.
- 🐛 Lives — perceives, acts and learns in one continuous loop, no resets (`--live`, Phase J)
- 👁️ Learns to see — its place-cell vision self-organizes, unsupervised (Phase K)
- 🍎 Learns to act — hunts food from the consequences of its own moves (Phases A, B, D, J)
- 🧭 Gets curious — explores when full, driven by emergent novelty-seeking (Phase M)
- ☠️ Avoids danger & poison — remembers what hurt it, and anticipates poison before touching it (Phase X)
- 🧠 Learns depth without backprop — solves XOR and trains a hidden layer via Feedback Alignment (Phases E, F, G)
- ♾️ Doesn't forget — keeps absorbing new patterns without overwriting old ones (Phase I)
- 🔣 Grounds symbols — forms inner symbols tied to its own experience (Phases S, T)
# compile
docker compose run --rm build bash -c "cmake -S . -B build -G Ninja && cmake --build build -j"
# run the unit test suite (needs the GPU)
docker compose run --rm build ctest --test-dir build --output-on-failure
Visualizations (spike rasters, receptive fields, selectivity heatmaps) live in viz/:
docker compose run --rm viz bash -c "pip install -q -r viz/requirements.txt && python viz/plot_raster.py"
Everything below is the project's research log — one runnable experiment per section, ordered roughly as they were explored (not by difficulty), each honest about its limits. Skim freely; you don't need to read it top to bottom.
docker compose run --rm build ./build/axo --phase E
docker compose run --rm build ./build/axo --phase E-diag # proves: hidden code is separable
XOR is provably unsolvable with a single linear layer (no hidden units). The solution comes via the Marr-Albus cerebellum
model: a fixed sparse spiking hidden area (granule-cell-like random conjunctions) expands the
input nonlinearly, and a supervised delta-rule readout (Purkinje cell with climbing-fiber
teaching signal — local, no backprop) learns on top of it. The same readout on the raw input
fails (2/4), but on the hidden code it solves XOR (4/4) — which isolates the benefit of
depth. Note: purely local STDP/R-STDP forms only single-feature detectors and does NOT solve XOR
(shown via --phase E-diag); depth plus a teaching signal are required.
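The expansion-plus-readout idea can be sketched without spikes. The example below is a rate-based simplification under stated assumptions: the hidden code is a hand-built set of four conjunction units (the project uses a random sparse spiking area), and `expand` and `train_and_score` are hypothetical names. It only illustrates why the same delta-rule readout fails on the raw input and succeeds on the expanded code.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Fixed, non-learned expansion: one conjunction unit per input combination.
// Hand-built for this sketch; it stands in for a random sparse hidden area.
Vec expand(const Vec& x) {
    const bool a = x[0] > 0.5, b = x[1] > 0.5;
    return { double(!a && !b), double(!a && b), double(a && !b), double(a && b) };
}

// Train a thresholded linear readout with the delta rule and return how many
// of the four XOR cases it classifies correctly afterwards.
int train_and_score(bool use_hidden, int epochs = 200, double lr = 0.1) {
    const std::vector<Vec> inputs = { {0.0, 0.0}, {0.0, 1.0}, {1.0, 0.0}, {1.0, 1.0} };
    const std::vector<double> target = { 0.0, 1.0, 1.0, 0.0 };
    Vec w(use_hidden ? 4 : 2, 0.0);
    double bias = 0.0;
    auto features = [&](const Vec& x) -> Vec {
        if (use_hidden) return expand(x);
        return x;
    };
    auto predict = [&](const Vec& f) {
        double s = bias;
        for (std::size_t i = 0; i < f.size(); ++i) s += w[i] * f[i];
        return s > 0.5 ? 1.0 : 0.0;
    };
    for (int e = 0; e < epochs; ++e) {
        for (std::size_t n = 0; n < inputs.size(); ++n) {
            const Vec f = features(inputs[n]);
            const double err = target[n] - predict(f);  // error available locally at the readout
            for (std::size_t i = 0; i < f.size(); ++i) w[i] += lr * err * f[i];
            bias += lr * err;
        }
    }
    int correct = 0;
    for (std::size_t n = 0; n < inputs.size(); ++n)
        if (predict(features(inputs[n])) == target[n]) ++correct;
    return correct;
}
```

`train_and_score(true)` reaches 4 of 4, while `train_and_score(false)` cannot, because no single linear threshold separates XOR. The exact raw-input score depends on where the oscillating weights stop, so it need not match the 2/4 reported above.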
Phase F: Real local deep learning — the hidden layer learns BY ITSELF (Feedback Alignment) — validated
docker compose run --rm build ./build/axo --phase F
docker compose run --rm build ./build/axo --phase F-sweep # regime: where random weights don't separate XOR
In Phase E the hidden area was fixed (only the readout learned). Real deep learning means:
the hidden layer learns by itself — with credit assignment WITHOUT backprop and WITHOUT
weight transport. The mechanism is Feedback Alignment (Lillicrap 2016) as a spiking
three-factor rule: the output error e_k = target_k − rate_k is projected onto the hidden
layer through a fixed random feedback B (δ_j = Σ_k B[j,k]·e_k); each synapse updates purely
locally as ΔW = lr · (Pre×Post eligibility) · modulator. Output layer: modulator = e_k
(delta rule); hidden layer: modulator = δ_j (projected error). For this there is a new
primitive reward_update_vec (per-neuron modulator instead of a global scalar) — unit-tested
(test_fa).
For a small hidden area (H=16), in which random weights do not linearly separate XOR, FA rebuilds the hidden code purely locally: linearly separable before 2/4 → after 4/4. The equally sized frozen random depth stays at 2/4. This isolates the claim: the usable depth was learned, not handed over for free by random expansion. (The spiking 2-neuron greedy eval that is also printed is too noise-sensitive at H=16 and is only a secondary measure; what matters is the linear separability = what a local delta readout achieves.)
# place MNIST in data/ (train-images-idx3-ubyte, train-labels-idx1-ubyte)
docker compose run --rm build ./build/axo --phase G
The same mechanism, scaled up: 784 → H=200 → 10, spiking FA-learned hidden layer, local delta readout on the hidden rates. Three equally sized conditions separate "scaling" from "learning":
| Net (same size) | Test accuracy |
|---|---|
| flat (linear readout on pixels) | 82 % |
| deep, hidden frozen-random | 15 % |
| deep, hidden LEARNED via FA | 45 % |
At identical size (H=200), the learned depth beats the frozen random depth by ~3× — it's not the neuron count but the local learning of depth that delivers the performance (the LLM logic in miniature: large learned nets ≫ large random ones). Honestly: deep < flat, because the spike-rate encoding of raw pixels loses information on a nearly linear task — the depth advantage shows up on nonlinear XOR (Phase F), not on linearly separable MNIST. Together the two phases give the full picture of deep learning: learned depth beats random depth, and most strongly where the task demands nonlinearity.
docker compose run --rm build ./build/axo --phase I
A brain doesn't learn a single fixed task but keeps learning new things without unlearning the old. That is exactly the hallmark of a self-learning brain — and the point where standard AI fails (catastrophic forgetting: when it learns B, it overwrites A).
Here 14 overlapping patterns are introduced ONE AFTER ANOTHER — each old one is never shown again. After each new pattern we measure how many of the patterns seen so far the brain still distinguishes. The mechanism that lets knowledge grow is purely local and already built into the model: homeostasis (adaptive threshold — whoever fires gets "tired") recruits free neurons for each new pattern instead of overwriting occupied ones; lateral inhibition + weight normalization keep the code sparse and selective (little interference). The fatigue state decays slowly, so permanently occupied neurons stay protected while transiently firing ones recover (a free reserve).
| Stream | Coverage over the patterns seen so far (1…14) | Final (mean) |
|---|---|---|
| WITH homeostasis | 1 2 3 4 5 6 7 8 9 9 10 8 10 13 — follows the diagonal | ~10/14 |
| WITHOUT (control) | 1 2 3 4 5 5 6 6 6 5 5 6 6 6 — capped | ~6/14 |
Under capacity pressure (14 patterns, only 50 neurons) the brain continually expands its knowledge through recruitment and holds ~10 concepts; without the mechanism, new patterns overwrite the old ones and it stays at ~6. That is the core ability of a continually learning brain — learning more without unlearning — made visible with purely local rules.
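The recruitment effect of an adaptive threshold can be shown with a toy winner-take-all. This is a sketch under assumptions, not the project's spiking implementation: `HomeostaticWta`, its `bump` and `decay` constants, and the rate-style `drive` vector are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Winner-take-all with an adaptive threshold ("fatigue").
struct HomeostaticWta {
    std::vector<double> theta;  // per-neuron fatigue, starts at zero
    double bump  = 1.0;         // added to the winner's threshold
    double decay = 0.999;       // slow decay applied to every threshold per step

    explicit HomeostaticWta(std::size_t n) : theta(n, 0.0) {}

    // Pick the neuron with the largest drive minus fatigue, then tire it.
    // Ties go to the lowest index.
    std::size_t step(const std::vector<double>& drive) {
        std::size_t win = 0;
        for (std::size_t i = 1; i < drive.size(); ++i)
            if (drive[i] - theta[i] > drive[win] - theta[win]) win = i;
        for (double& t : theta) t *= decay;  // transient fatigue recovers slowly
        theta[win] += bump;                  // the winner becomes harder to recruit again
        return win;
    }
};
```

With three neurons and an identical drive of `{1.0, 1.0, 1.0}`, consecutive calls to `step` return 0, 1, 2: each new presentation lands on a fresh neuron instead of the one that just fired. Without the `theta` term the same neuron would win every time.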
docker compose run --rm build ./build/axo --phase J
The high point: no longer a single mechanism, but everything together in one living loop. A little creature lives in a 7×7 world and learns over one continuous life (24,000 steps, no episode reset) to hunt from the consequences of its own actions — purely local:
sensory area (egocentric place cells: where is food relative to me?)
→ motor area (4 actions, winner-take-all)
→ reward on approaching/eating → reward-modulated plasticity (R-STDP)
| Condition | Food per 3000 steps |
|---|---|
| random baseline (untrained) | 4 |
| life (learning): 143 → 434 → 441 → 415 → 420 → 394 → 365 → 380 | ~400 |
The creature starts essentially helpless (≈ random) and learns to hunt by itself — the food rate rises by a factor of ~95. At the end of life it greedily solves 97.7% of all start/food configurations, so it has learned the rule "move toward the food," not just one path. Perceiving, acting and learning in a single, continuous, purely local loop — no backprop, no central optimizer, no reset. So that the creature doesn't stay stuck forever in a rare policy trap, food "spoils" after a while (reappears) — the rest is what the brain builds from its own experience.
docker compose run --rm build ./build/axo --phase K
Instead of a hard-wired place-cell encoding, the being learns its perception by itself — and acts on it. From the egocentric retina (where is the food relative to me?), two separate spatial channels self-organize via STDP: an x-area over the columns (food left/right) and a y-area over the rows (up/down). Each forms its own place-cell map purely unsupervised (lateral inhibition, no fatigue). A "critical period" first develops vision (random exploration), then the motor (R-STDP) learns to hunt on the disentangled, self-learned code:
retina → x-area (STDP place cells) ┐
y-area (STDP place cells) ┴→ motor (4 actions, R-STDP)
| Condition | Food per 3000 steps |
|---|---|
| random baseline (untrained) | 3 |
| life (learning): 1 → 50 → 171 → 248 | ~156 |
The being hunts on its self-learned vision — food rate ~52× above random, and greedily it solves 94% of all start/food configurations. The key was the disentangling: a conjunctive vision map ("food is in the NE cell") is not controllable by a single local motor layer (the same wall as in deep learning, Phase F/G) — separate axis channels, on the other hand, are, exactly like the disentangled place cells in Phase J, except that here the perception is learned. Perception and control, both purely local, no backprop.
docker compose run --rm build ./build/axo --live # one life phase; call it repeatedly
Here the individual phases become a single, coherent being with a continuous biography. Each
call of --live is one life phase: the being awakens, shows its current abilities, lives &
learns for 6000 steps, sleeps and saves its entire state. Across container/process
restarts it continues where it was — the whole brain state lives in being/ (sx.bin, sy.bin,
motor.bin, life.txt; gitignored).
It unifies everything so far:
- lives continuously (persistence, stage 0) — no reset, an ongoing biography (age, meals, life phases),
- sees with self-learned eyes (stage 1): at birth it opens "its eyes for the first time" — two critical periods first develop vision, then the hunting skill,
- acts on this vision and gets better over its life (R-STDP),
- lives off its own hunger (stage 1, drives): an energy budget drives it; the energy is part of its biography and persists with it,
- explores out of curiosity (stage 2): when full, it doesn't keep hunting doggedly but explores its world out of habituation curiosity — it covers all cells over its life,
- avoids danger: a hazard cell costs energy; after the first pain it remembers it and avoids it.
- categorizes objects with grounded symbols & anticipates (stage 3 + expectation integrated): food comes as nutritious vs poisonous (each with a distinct appearance). A symbol area formed unsupervised at birth (purity 1.0, template readout) recognizes the type; the being learns the value of each symbol from its energy consequences. And it senses the type already at appearance and doesn't even approach poisonous food (anticipation, Phase X) — it spares itself the trips to the poison. This knowledge persists: born naive, it awakens experienced.
So it lives richly: hunt when hungry, explore when full, avoid danger, sense and avoid poison already at appearance — out of hunger, curiosity, learned values and prediction, all in one being. The anticipation gain is measurable: over four lives the meals rise 86 → 108 → 121 → 122 (previously ~57–74 without sensing), while hardly any poison is eaten anymore (~13/life, mostly born-naive/exploration) and ~190 poisonous foods per life are anticipatorily avoided — without walking up to them. Curiosity (49/49) and danger avoidance stay intact along the way. (Honestly: the symbol/value binding uses the same supplied structure as Phase S/T/X; a starving being still takes the occasional risk.)
Evidence (separate process restarts): the being awakens in the state in which it fell asleep:
| Life | awakens | meals (hungry/full) | explored | danger (early→late) |
|---|---|---|---|---|
| 1 (birth) | full | 96 (86 / 10) | 49/49 | 1 → 0 |
| 2 | hungry (8) | 112 (95 / 17) | 49/49 | 1 → 0 |
It hunts mainly when hungry, explores its whole world when full, and learns to avoid the danger (steps on it once, then never again). The earlier pure-hunger version:
| Life | awakens with | eaten (hungry/full) | sleeps with |
|---|---|---|---|
| 1 (birth) | full | 117 (110 / 7) | 50 (full) |
| 2 | 50 (full) | 150 (127 / 23) | 38 (hungry) |
A being that lives, sees, remembers and acts on its own drive — all from its own experience,
learned purely locally. The foundation that the next stages dock onto (ROADMAP.md).
docker compose run --rm build ./build/axo --watch
(See the animation at the top.)
Watch Axo, the living being, in real time (colored ASCII, English display, one frame per
step) — in its 12×10 world with three hazard cells (each learned individually through
pain). It immediately loads your saved being (from being/) — otherwise one is born (with
visible progress). Ctrl-C exits cleanly at any time — and the session counts toward Axo's ONE
life: brain, value memory and age are saved to being/. While you watch, it keeps learning
(R-STDP motor while hunting, food value from the consequence — poison now tastes the same as in
--live too: energy down, value learned).
Visible is the whole world and the brain:
Axo — a living being age 12640 step 25 meals 2 (lifetime 96)
energy [####------] 44 (hungry)
. . . . O . . . . . . O Axo * food x poison ! hazard
. . . . . . . . . ! . | hungry — hunting food >
. . * . . . . . . . .
== Brain (firing neurons) ==
Eye x:..+###+.. y:..+###+.. (place cells: where is the food)
Motor: N[####]< E[ ] S[ ] W[ ] -> N (direction)
Smell: food smells GOOD -> eat (symbol area)
Drive: hunger [######----] -> HUNT
You see everything: energy/hunger, hunting vs curious exploring (when full), the
danger (!, which it avoids), and above all the anticipation — when poisonous food
appears, an x briefly blinks with "smells POISON — refuses to even go there". In the brain
panel the eye place cells (food direction) fire live, along with the motor action neurons
(winner marked), the smell/value symbol (GOOD/TOXIC) and the hunger drive.
Note: `--live` and `--watch` share world size (12×10) and brain (48 place cells per vision axis, 8 symbol neurons). An older being from a smaller world doesn't fit the new retina and is reborn on the first start.

Why not bigger? The percept scales (food direction is still linearly decodable to 0.95 even at 21×21), but the local R-STDP motor (REINFORCE at its core) no longer reaches hunting competence beyond ~12×10. This is an honest, open research wall (details in `ROADMAP.md`, section "Motor scaling").
docker compose run --rm build ./build/axo --phase S
First step toward symbols anchored in one's own experience — no LLM talk, but a discrete
inner token that makes a nonlinear category actionable. Task: an object has two sensory
features (encoded noisily over the retina); the APPROACH/AVOID category is
feature1 XOR feature2 — i.e. not linearly readable from the raw senses.
Honest metric, fixed in advance (not fakeable):
| | measured | result |
|---|---|---|
| M1 emergence & correspondence | purity / NMI token↔combo (labels only for evaluation) | 0.99 / 0.95 ✓ |
| M2 causal grounding — symbol agent | correct-action rate | 0.99 ✓ |
| M2 ablation — raw agent (linear) | ditto | 0.52 (random) |
| M2 control — random-token agent | ditto | 0.51 (random) |
Passed — and both hurdles count:
- Two feature areas self-organize unsupervised (STDP + WTA + decaying habituation) into one clean detector per feature value each (purity 0.99). The token is the bound pair of the winners.
- The raw agent at 0.52 proves: the category is genuinely nonlinear — not actionable without a symbol. The random token at 0.51 proves: it's the meaning of the token, not just "an extra input". The symbol agent solves the task (0.99).
Honest about the scope: the feature detectors emerge (unsupervised); the binding of
the two into a combo token is supplied as a quantized index — the same "disentangle +
quantized relay" bias as the x/y axes in vision (Phase K) and the pairing in Phase Pc. The motor
solves the XOR nonlinearity via the clean token. Lesson from the build path (4 attempts): the
monopoly of competitive spiking cells is only broken by the decaying habituation
(tau_theta≈1500, like Phase I), and action selection needs the rate-based LinReadout
instead of a 2-neuron spiking WTA (WTA pathology, like Phase F/Pc) — both fixes came out of the
project's own findings.
docker compose run --rm build ./build/axo --phase T
From the single symbol (stage 3) to the temporal sequence. Task over 3 symbols {A,B,C}:
go = (symbol2 == successor(symbol1)) in the cycle A→B→C→A. This depends on both symbols and
their order (e.g. AB → go, BA → no).
| Agent | sees | correct rate |
|---|---|---|
| sequence (ordered token `[tok1·S+tok2]`) | both symbols + order | 0.99 ✓ |
| memoryless | only the second symbol | 0.46 |
| bag | both symbols, without order | 0.53 |
Passed — and both ablations count: the sequence agent solves the rule (0.99), while both controls clearly fail (even below the trivial baseline): the memoryless agent needs the memory of the first symbol, the bag agent needs the order.
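The task rule and the ordered token `tok1·S+tok2` are simple enough to write out. The sketch below maps A, B, C to 0, 1, 2. The function names are hypothetical, and `bag_token` is just one possible way to build an order-free index for comparison.

```cpp
// S symbols on a cycle 0 -> 1 -> ... -> S-1 -> 0 (A=0, B=1, C=2 for S=3).
constexpr int successor(int sym, int S) { return (sym + 1) % S; }

// Task rule: go iff the second symbol is the successor of the first.
constexpr bool go_rule(int tok1, int tok2, int S) { return tok2 == successor(tok1, S); }

// Ordered token: one index that keeps both symbols and their order.
constexpr int ordered_token(int tok1, int tok2, int S) { return tok1 * S + tok2; }

// Order-free "bag" index (illustrative choice): AB and BA collapse to the same value.
constexpr int bag_token(int tok1, int tok2, int S) {
    const int lo = tok1 < tok2 ? tok1 : tok2;
    const int hi = tok1 < tok2 ? tok2 : tok1;
    return lo * S + hi;
}
```

For S=3, AB gives `ordered_token(0, 1, 3) == 1` and BA gives `ordered_token(1, 0, 3) == 3`, so a lookup on the ordered token can assign them different actions. `bag_token` returns 1 for both, although the rule says go for AB and no for BA, which is why an agent without order information cannot solve the task.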
The key that cracked >2 symbols (purity 0.64 → 0.99): separating learning from readout.
The receptive fields arise via spiking STDP+WTA+habituation (emergent), but the token readout
runs through a deterministic template match (argmax_j w_j·input over the learned fields),
not through a spiking argmax — the latter is destabilized by the habituation accumulating within
the window, which collapses at >2 modes. With the decoupling (+ cells ~4·K) the area clusters 3
modes cleanly (purity 0.99).
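The deterministic template match, argmax_j w_j·input, might look like the following sketch. The function name `template_match` and the dense `std::vector` representation of the learned fields are assumptions for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<double>;

// Returns the index j that maximizes the dot product of field j with the input.
// Assumes `fields` is non-empty and every field has the same length as `input`.
std::size_t template_match(const std::vector<Vec>& fields, const Vec& input) {
    std::size_t best = 0;
    double best_score = std::inner_product(input.begin(), input.end(), fields[0].begin(), 0.0);
    for (std::size_t j = 1; j < fields.size(); ++j) {
        const double score = std::inner_product(input.begin(), input.end(), fields[j].begin(), 0.0);
        if (score > best_score) { best_score = score; best = j; }
    }
    return best;
}
```

Because the readout has no internal state, the result for a given input does not depend on what was presented earlier in the window. That is the property the spiking argmax loses once habituation accumulates.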
Honest about the scope: the binding of the two time steps into an ordered token is supplied by time slotting (a "first/then" register, like the feature slotting in S). And K≥4 still degrades (purity ~0.67–0.71 even with more cells): the coverage problem of competitive cluster formation grows with K, and getting K≥4 clean (init seeding / conscience tuning) is open work. Shown: K=2 and K=3 clean, K≥4 open.
docker compose run --rm build ./build/axo --phase X
From the sequence to prediction: a cue (A/B) at the appearance of the food predicts its value (A → nutritious, B → poisonous). The food is only reachable after a path — the cue lies in the past, the decision in the present, and the appearance on arrival is not value-predictive. Only by holding the cue across the time gap can one decide correctly in anticipation.
| Agent | decides from | correct rate |
|---|---|---|
| expectation (holds the past cue) | the past | 1.00 ✓ |
| memoryless | only the (non-predictive) present | 0.48 (random) |
Cue-symbol purity 1.00 (unsupervised, template readout). Passed: the expectation agent
acts from prediction (1.00), the memoryless one stays at random — the value-relevant
information lay in the past, so working memory is needed. Honest about the scope: the
value-from-consequence is learned; holding the cue is a supplied latch (a register, like the
time slotting in T). This is the validated mechanism — embedding it into the living --live
being (avoiding poison already at appearance, without walking up to it) is the next step.
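A minimal sketch of the expectation agent, assuming a two-entry value table and a latch that holds the cue until the outcome arrives. `ExpectationAgent`, its learning rate and the "approach while the value is not negative" policy are hypothetical choices for illustration, not the project's implementation.

```cpp
#include <array>
#include <cstddef>
#include <optional>

struct ExpectationAgent {
    std::array<double, 2> value{};        // learned value per cue (0 = A, 1 = B), starts neutral
    std::optional<std::size_t> held_cue;  // the supplied latch: remembers the cue across the gap
    double lr = 0.5;                      // illustrative learning rate

    // The cue is only visible when the food appears.
    void see_cue(std::size_t cue) { held_cue = cue; }

    // Decision taken later, from the held cue alone. A naive agent tries everything.
    bool approach() const { return !held_cue || value[*held_cue] >= 0.0; }

    // Energy consequence after eating: move the cue's value toward the outcome, then clear the latch.
    void feel_outcome(double energy_delta) {
        if (held_cue) value[*held_cue] += lr * (energy_delta - value[*held_cue]);
        held_cue.reset();
    }
};
```

After one bad experience with cue B (`see_cue(1)`, then `feel_outcome(-1.0)`), the next `see_cue(1)` makes `approach()` return false. A memoryless agent has no `held_cue` at decision time and can only guess.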
docker compose run --rm build ./build/axo --phase L
The step from the trained hunter (designer reward "closer = good") to a being with its own drive. It gets an energy/hunger budget: energy drops with every step (metabolism), eating refills it. The drive arises internally from the hunger — biologically incentive salience: when the being is full, food is barely appealing (retina dark) → it doesn't hunt; when it gets hungry, food lights up → it hunts. A critical period first learns the hunting skill (full salience), after which the drive steers the behavior.
| | energy (mean) | famines (early→late) | eats hungry vs full |
|---|---|---|---|
| WITH drive | 48 (healthy middle) | 14 → 6 (learns to survive) | 0.049 vs 0.010 (~5× more when hungry) |
| without drive | 95 (overeats) | 0 → 0 | 0.088 vs 0.149 (state-blind) |
The drive being regulates itself: it keeps its energy in a healthy middle and eats mainly when it is hungry — no more designed reward signal, the behavior springs from an inner need. The being without a drive doggedly overeats. A step toward the "self": the being acts of its own accord.
docker compose run --rm build ./build/axo --phase P
A controlled depth test instead of MNIST (which is nearly linear and never needs depth). N-bit parity (XOR over N bits) is linearly unsolvable, and with a small hidden width even one layer fails at larger N — the "open road" on which a depth benefit can become visible at all. Measured with the right ruler: the separability of the hidden code (offline perceptron), not the noisy spiking readout.
| N (combos) | flat (linear) | 1-hidden | 2-hidden (layer-wise) |
|---|---|---|---|
| 2 (XOR) | 2/4 (fail) | 4/4 (solves) | – |
| 4 (parity) | 8/16 | 5/16 (fail) | 6/16 (fail) |
Two findings, cleanly shown:
- The task forces depth: the same 1-hidden layer that fully solves XOR (4/4) fails at 4-bit parity (5/16, below random). One layer is not enough.
- The spiking multilayer coupling does NOT deliver the depth: the second layer doesn't catch the drop (6/16 ≈ 5/16). "Depth helps" is not demonstrable here — not because the stage is missing, but because the coupling wall is real.
Methodological lesson (learned the hard way): the behavioral readout sells a working layer short — it shows only 2/4 (50%) for XOR, even though the hidden code is 4/4-separable. Only the separability measurement reveals what was really learned; a single noisy number can mislead. Together with Phase F/G, "real deep learning" is achieved with one learned layer; multilayer remains the open research wall.
docker compose run --rm build ./build/axo --phase Pc
The multilayer wall (Phase P) cracked through decomposition:
parity-4 = XOR(XOR(b0,b1), XOR(b2,b3)) — two XOR modules (each on one bit pair) plus a
spiking combiner, the same "disentangle" lesson as in vision (Phase K). The decomposition
(pairing + subgoals) is the supplied inductive bias, like the x/y axes in vision.
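The identity behind the decomposition, and the majority vote that turns a noisy module ensemble into one clean relay bit, can be checked in a few lines. The function names are hypothetical, and plain Boolean logic replaces the learned spiking modules here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Majority vote over an ensemble of module outputs: yields one clean relay bit.
bool majority(const std::vector<bool>& votes) {
    std::size_t ones = 0;
    for (bool v : votes) if (v) ++ones;
    return 2 * ones > votes.size();
}

// Parity of four bits computed in one go.
bool parity4(const std::array<bool, 4>& b) {
    return (b[0] != b[1]) != (b[2] != b[3]) ? true : false;
}

// The same function as a composition: two XOR modules feed quantized relay
// bits into a third XOR (the combiner).
bool parity4_composed(const std::array<bool, 4>& b) {
    const bool relay_a = b[0] != b[1];  // module A on the first bit pair
    const bool relay_b = b[2] != b[3];  // module B on the second bit pair
    return relay_a != relay_b;          // combiner sees only two clean bits
}
```

Looping over all 16 inputs confirms that `parity4_composed` agrees with a direct count of set bits modulo 2. `majority({true, true, false})` returns true, which illustrates how a single unreliable module is outvoted before the bit is relayed.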
| Path | Result |
|---|---|
| module A / B (XOR on 1 pair each) | 4/4 / 4/4 |
| ensemble relay A / B (majority over modules) | [0110] / [0110] (clean) |
| B) combiner over learned modules → quantized relay | 16/16 behavioral, 16/16 separable |
| C) oracle relay [0110] (control) | 16/16 separable (12/16 behav., readout seed) |
| A) combiner over distributed code [codeA\|codeB] | 6/16 (fails) |
The real path solves parity-4 with 16/16 (learned XOR modules → ensemble-decoded relay → spiking FA combiner; deterministically reproduced). Three ingredients are needed — and that is the payoff:
- Robust base operation. Decisive was the exact Phase-F XOR config (16 active neurons/bit, `w_norm=0.2·N`, `present=60`). With a weak input (8/bit, `w_norm=0.5·D`) the base XOR is only marginally separable and composition fails (8/16). The wall was never "depth" but the robustness of the base operation. Marginal XOR can't be stacked, solid XOR can.
- Quantized inter-area relay. Over the distributed code it keeps failing (A: 6/16). The signal between the areas must be a quantized decision bit (like labeled-line spikes in the brain), not a raw distributed code. An ensemble (population coding, "many neurons") makes the relay bit reliably clean.
- Compositional structure instead of a monolithic layer that is supposed to disentangle the entangled whole.
How this finding came about (an honest research arc): an initial run reported "13/16 cracks it", which was an artifact (the online readout collapsed to a constant relay `[1111]`, the offline probe overfit noise). The correction then yielded an apparently decisive negative finding (8/16, "XOR fundamentally marginal"), which was, however, config-dependent: it only held with the weak input. Only the robust Phase-F config revealed the true picture: 16/16. Lesson to myself: use behavioral + oracle control + high repetition against measurement artifacts, and check negative findings for config dependence before calling them "fundamental".
docker compose run --rm build ./build/axo --phase M
Stage 2: intrinsic motivation that emerges from a neuronal building block instead of a
hand-coded counter. No food, no goal, no reward — the novelty is the homeostasis (habituation,
theta) in a spiking place-cell area: visited places habituate (their place signal goes
weak), and the being is drawn toward the un-habituated (the new). World: two rooms connected by
a single door; walls are discovered as a map.
| Step | 300 | 800 | 1500 | 6000 |
|---|---|---|---|---|
| curious | 85 | 85 | 108 | 111 |
| random | 47 | 55 | 56 | 111 |
The curious being finds the door and covers the world much faster than a random wanderer that stays trapped in the starting room (@800: 85 vs 55 of 111 cells). The decisive emergent addition: because habituation slowly decays, the old becomes new again → the being stays curious for life. In the last third of its life it still visits 95 distinct cells (vs 56 for random) — it never "ends" the exploration but keeps its world fresh. A (monotonic) counter couldn't do that. Nicely: the same homeostasis that protects knowledge from forgetting in Phase I drives the curiosity here.
Honestly: the novelty drive is now neuronal/emergent (habituation), and the lifelong
behavior follows from the theta decay dynamics. The action selection still explicitly compares
the habituation of the neighboring cells (not a pure spike reflex). A first attempt to drive this
purely via an R-STDP motor failed instructively: a reward-driven policy converges to a fixed
habit and then precisely cannot explore anymore — curiosity needs a never-freezing driving
force, and the decaying habituation provides it.
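A toy version of the habituation-driven walker, reduced to a one-dimensional corridor for brevity (the experiment itself uses two rooms and a door). `CuriousWalker` and its constants are hypothetical. As in the text, the action selection explicitly compares the habituation of neighboring cells.

```cpp
#include <cstddef>
#include <vector>

struct CuriousWalker {
    std::vector<double> theta;  // habituation per place cell
    std::size_t pos = 0;        // current cell
    double bump  = 1.0;         // habituation added on each visit
    double decay = 0.99;        // slow decay: old places become new again

    // Requires at least one cell; the starting cell counts as visited.
    explicit CuriousWalker(std::size_t cells) : theta(cells, 0.0) { theta[0] = bump; }

    // Move to the less habituated neighbor (ties go right), then habituate to it.
    void step() {
        for (double& t : theta) t *= decay;
        std::size_t next = pos;
        bool has_right = false;
        if (pos + 1 < theta.size()) { next = pos + 1; has_right = true; }
        if (pos > 0 && (!has_right || theta[pos - 1] < theta[next])) next = pos - 1;
        pos = next;
        theta[pos] += bump;
    }
};
```

In a corridor of five cells the walker reaches the far end in four steps, then turns around and sweeps back, because the cells it left longest ago have decayed the most. A monotonic visit counter without decay would lose that preference once every cell had been seen equally often.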
Associative choice: 4 stimuli, 4 actions, immediate reward. Runs in seconds.
docker compose run --rm build ./build/axo --phase A
Result: the moving hit rate rises from ~25% (random) to ~99% — the agent learns, via reward-modulated plasticity (eligibility traces + global dopamine signal d(t)), to choose the right action per stimulus. Two-area mini-brain: sensory input → motor area (winner-take-all), learning via the reward signal. Next stage: delayed reward (1D food search).
docker compose run --rm build ./build/axo --phase B
The agent senses its position on a track (food in the middle), moves left/right and only gets the reward at the food. Via eligibility traces, R-STDP distributes the delayed reward back onto the steps that led there. With learning-rate annealing it is stably ~99% successful, mean steps ~2.3 (optimum ~2.0). First step with temporal credit assignment; next stage: variable food position / 2D.
docker compose run --rm build ./build/axo --phase D
The food changes every episode. The agent perceives position AND food via two place-cell populations and learns the relational rule "move toward the food". Greedy eval over ALL (position×food) pairs: 42/42 solved (100% generalization) — the brain learns a rule, not just a mapping.
Finding: episode-wide sparse reward fails here (a random policy reaches the food ~50% of the time anyway → no learning gradient). Only reward shaping (immediate reward per step: closer to the food = +1) gives a clean signal. A classic RL principle, made visible.
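The difference between the two reward schemes on a one-dimensional track fits in two small functions. The names are hypothetical, and returning 0 for a step that does not get closer is an assumption, since the text only specifies the +1 case.

```cpp
#include <cstdlib>

// Shaped reward: +1 for a step that reduces the distance to the food.
// Assumption: steps that do not get closer earn 0.
inline double shaped_reward(int pos_before, int pos_after, int food) {
    const int d_before = std::abs(pos_before - food);
    const int d_after  = std::abs(pos_after - food);
    return d_after < d_before ? 1.0 : 0.0;
}

// Sparse reward: only reaching the food pays off.
inline double sparse_reward(int pos_after, int food) {
    return pos_after == food ? 1.0 : 0.0;
}
```

With the food at cell 3, a move from 0 to 1 earns `shaped_reward(0, 1, 3) == 1.0` but `sparse_reward(1, 3) == 0.0`. The shaped signal tells every single step whether it helped, which is the clean per-step signal the finding above refers to.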
No external dataset needed, runs in seconds.
- Run: docker compose run --rm build ./build/axo --phase C
- Heatmap: docker compose run --rm viz bash -c "pip install -q -r viz/requirements.txt && python viz/plot_selectivity.py selectivity_phaseC.bin viz/out/selectivity.png"
Result: the network organizes its neurons by itself so that each of the 6 patterns is represented
by its own selective neurons — patterns_covered = 6/6, mean_selectivity ≈ 0.98. This is the
validated proof that the brain forms structure from a raw spike stream unsupervised. These
stabilized learning parameters (lateral inhibition + weight normalization) carry over into Phase A
(agentic, R-STDP).
- Place MNIST in `data/` (train-images-idx3-ubyte, train-labels-idx1-ubyte).
- Training: docker compose run --rm build ./build/axo --phase 3
- Fields: docker compose run --rm viz bash -c "pip install -q -r viz/requirements.txt && python viz/plot_receptive_fields.py weights_phase3.bin viz/out/receptive_fields.png"
- Accuracy: docker compose run --rm build ./build/axo --phase 3-eval
Status: the pipeline runs fully, but digit selectivity still needs hyperparameter tuning (cf. Phase C, where the learning dynamics were cleanly stabilized on controlled patterns). Currently classification accuracy is at random level — deliberately not tuned further, since the focus is on the agentic direction (Phase A).
MIT — see LICENSE.
