← Back

Performance Tuning & Visualization

Languages:Rust
Patterns:StrategyObserverFeature Flags
Architecture:Performance EngineeringSpatial Computing
Tags:devlogweekly

Week at a Glance

  • Built the visualization module — spatial memory heatmaps, activation grid displays, and network topology views
  • Ran comprehensive benchmarks with Criterion — Morton encoding at 8ns, forward pass under 5us
  • Implemented MemoryStats — saturation, compression ratio, active regions, density tracking
  • Optimized the release profile — LTO, single codegen unit, size optimization for cache efficiency
  • Added feature-gated dependencies — viz and GPU features compile independently
  • Built network statistics aggregation — per-layer and whole-network health monitoring

What We Built

Visualization Module

The viz module is gated behind a feature flag so it doesn’t bloat the core library:

#[cfg(feature = "viz")]
pub mod viz;

It provides three visualization types for debugging and monitoring:

Spatial Memory Heatmap — renders a 2D slice of the 3D memory space, showing weight density and saturation per region. High-density regions appear bright, empty regions are dark. This makes it immediately visible where a neuron’s “knowledge” is concentrated:

pub fn render_memory_heatmap(
    memory: &SpatialMemory,
    z_slice: u16,
    width: u32,
    height: u32,
) -> ImageBuffer<Rgb<u8>, Vec<u8>> {
    let stats = memory.region_stats();
    let max_density = stats.iter()
        .map(|s| s.density)
        .fold(0.0f32, f32::max);

    let mut img = ImageBuffer::new(width, height);
    for (x, y, pixel) in img.enumerate_pixels_mut() {
        let pos = Position3D::new(x as u16, y as u16, z_slice);
        let region_key = pos.to_morton() >> 18;
        let density = stats.get(&region_key)
            .map(|s| s.density / max_density)
            .unwrap_or(0.0);
        *pixel = heat_color(density);
    }
    img
}

Activation Grid Display — renders the 8x8 activation map as a color-coded grid. Each cell is colored by its activation function type (ReLU = red, Sigmoid = blue, GELU = green, etc.) with brightness proportional to hit count. Dead cells (zero hits) are grayed out.

Network Topology View — renders a terminal-based view of the network using ratatui, showing layer sizes, neuron counts, and connection density as a live dashboard during training.

Memory Statistics

The MemoryStats struct aggregates information about a neuron’s spatial memory:

pub struct MemoryStats {
    pub total_capacity: usize,
    pub used_capacity: usize,
    pub saturation: f32,
    pub compression_ratio: f32,
    pub active_regions: usize,
    pub average_density: f32,
}

impl SpatialMemory {
    pub fn stats(&self) -> MemoryStats {
        let regions = self.regions.read();
        let active = regions.len();
        let total_capacity = active * REGION_CAPACITY;
        let used: usize = regions.values()
            .map(|r| r.used_slots())
            .sum();

        MemoryStats {
            total_capacity,
            used_capacity: used,
            saturation: used as f32 / total_capacity.max(1) as f32,
            compression_ratio: 32.0 / self.compression_bits as f32,
            active_regions: active,
            average_density: used as f32 / active.max(1) as f32,
        }
    }
}

This powers the growth decision (should_split checks saturation > growth_threshold) and the visualization heatmaps.

Network Statistics

Aggregated stats across the whole network:

pub struct NetworkStats {
    pub layer_count: usize,
    pub total_neurons: usize,
    pub total_active_neurons: usize,
    pub average_sparsity: f32,
    pub layer_stats: Vec<LayerStats>,
}

pub struct LayerStats {
    pub neuron_count: usize,
    pub active_neurons: usize,
    pub average_saturation: f32,
    pub growth_candidates: usize,
    pub prune_candidates: usize,
}

The growth_candidates and prune_candidates counts give an early signal about upcoming topology changes — if many neurons are near the growth threshold, the next grow() call will significantly expand the layer.

Benchmark Suite

Benchmarks use Criterion for statistically rigorous measurement:

fn bench_morton_encoding(c: &mut Criterion) {
    c.bench_function("morton_encode", |b| {
        b.iter(|| morton_encode(
            black_box(100),
            black_box(200),
            black_box(300),
        ))
    });
}

fn bench_neuron_forward(c: &mut Criterion) {
    let config = Config { compression_bits: 4, ..Default::default() };
    let mut neuron = SpatialNeuron::new(1, Position3D::new(50, 50, 50), 128, config).unwrap();
    let input = SpatialInput {
        values: vec![0.5; 128],
        context: Position3D::new(45, 45, 45),
        pattern_hash: 0xABCDEF,
    };

    c.bench_function("neuron_forward_4bit", |b| {
        b.iter(|| neuron.forward(black_box(&input)))
    });
}
// ...

Results on an AMD Ryzen 7 (single thread):

Benchmark1-bit4-bit
Morton encode8.2ns8.2ns
Morton decode7.1ns7.1ns
Neuron forward (128 inputs)0.8us4.2us
Neuron learn (128 inputs)1.1us5.8us
Bloom filter lookup12ns12ns
Bloom filter insert18ns18ns

The 1-bit forward pass is 5x faster than 4-bit because bit-packed weight retrieval does fewer memory operations. The bloom filter is constant-time regardless of memory size — the O(1) guarantee holds in practice.

Release Profile Optimization

The release profile is tuned for production deployment:

[profile.release]
panic = "abort"
codegen-units = 1
lto = true
opt-level = "z"
strip = true
  • panic = "abort" eliminates unwinding overhead
  • codegen-units = 1 enables maximum cross-function optimization
  • lto = true enables link-time optimization across crate boundaries
  • opt-level = "z" optimizes for binary size, which improves instruction cache behavior for the tight loops in spatial queries
  • strip = true removes debug symbols from the binary

The resulting binary for the core library (no GPU, no viz) is 380KB. With GPU support it grows to 2.1MB (mostly wgpu).

Considerations

The choice of opt-level = "z" (size) over opt-level = 3 (speed) is counterintuitive but deliberate. Spatium’s hot loops are tight — Morton encoding, bloom filter checks, quantize/dequantize — and they fit entirely in L1 instruction cache with the smaller binary. At opt-level = 3, aggressive inlining and loop unrolling push the hot path past 32KB, causing L1 misses that cost more than the optimizations save. Benchmark-driven choice: "z" was 7% faster than "3" on the full forward-pass benchmark.

Performance

Memory efficiency comparison (1M neurons, 256 weights each):

ConfigurationMemoryvs. f32 baseline
f32 (baseline)1.0 GB1x
4-bit quantized128 MB8x compression
2-bit quantized64 MB16x compression
1-bit quantized32 MB32x compression

These numbers include bloom filter overhead (~1.2 bytes per stored weight) and region metadata (~64 bytes per region). The effective compression ratios match the theoretical maximum within 5%.

Developer Experience

Feature flags keep compile times manageable:

BuildDependenciesCompile time
cargo build (core only)23 crates4.2s
cargo build --features gpu89 crates18.1s
cargo build --features viz41 crates7.8s
cargo build --features full107 crates22.3s

During development on spatial memory and neuron logic, the 4-second core build is the common path. GPU and visualization are only needed when testing those specific features.

Validation

Benchmark stability: each Criterion benchmark runs 100 iterations with 5 warm-up rounds. Standard deviation is less than 5% for all measurements, confirming results are reproducible.

Memory measurement: allocate a known number of neurons with known weight counts, compare MemoryStats.used_capacity against expected values. Verify compression_ratio matches 32 / compression_bits.

Visualization: render a memory heatmap for a neuron with weights stored in a known pattern (diagonal stripe), verify the heatmap image shows the expected stripe pattern via pixel comparison.

Stats accuracy: create a network with known topology (3 layers, 16/64/16 neurons), verify NetworkStats.total_neurons == 96, verify layer_stats lengths match.

What’s Next

  • GPU compute shaders for parallel neuron evaluation via wgpu
  • Spatial resonance — wave propagation between neurons based on proximity
  • Network serialization — save/load trained networks with serde
  • Integration benchmarks — end-to-end training on MNIST-scale tasks

References