# Optimization Plan: Reach 1:1 Parity with CPU/GPU

Goal: Move from a fixed target of 32,768 Hz to true 1:1 parity with available CPU/GPU compute. That is, use available SIMD/AVX, multi-threading, and GPU offload so that evolution throughput tracks raw processor throughput, with no artificial Hz limit.

This doc lists prioritized steps, concrete implementation suggestions, metrics, and low-risk changes we can apply immediately.

## Priorities (High → Low)

1. Profiling and hotspot identification
2. Async rendering + IO offload
3. Hardware SHA-256 acceleration
4. SIMD vectorization (AVX2/AVX512) for RK4 math
5. Multi-threading and task partitioning (OpenMP / pthreads / threadpools)
6. Lock-free or reduced-lock state access (atomics, double-buffering)
7. GPU offload (CUDA / OpenCL) for batched RK4 across multiple simulation instances
8. Precision mode tuning (mixed precision): use DP where needed, SP where acceptable
9. GPU kernel auto-tuning & batched workloads
10. CI performance tests & regression benchmarks
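
For priority 3, one way to decide at run time whether a hardware SHA-256 path is worth trying is to query CPUID. The sketch below assumes an x86 target built with GCC or Clang (it uses `__get_cpuid_count` from `<cpuid.h>`) and reports the portable path everywhere else. The names `cpu_has_sha_ext`, `sha_backend`, and `select_sha_backend` are hypothetical; the plan does not name the actual code paths.

```c
#include <stdbool.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Returns true when the CPU reports the x86 SHA extensions
 * (CPUID leaf 7, sub-leaf 0, EBX bit 29). */
bool cpu_has_sha_ext(void) {
#ifdef HAVE_X86_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false; /* leaf 7 is not supported on this CPU */
    return ((ebx >> 29) & 1u) != 0;
#else
    return false; /* non-x86 or unknown compiler: use the portable path */
#endif
}

/* Hypothetical backend tags used only for this illustration. */
typedef enum { SHA_BACKEND_PORTABLE, SHA_BACKEND_HW } sha_backend;

/* Pick the hardware path only when it is both allowed and detected. */
sha_backend select_sha_backend(bool allow_hw) {
    return (allow_hw && cpu_has_sha_ext()) ? SHA_BACKEND_HW
                                           : SHA_BACKEND_PORTABLE;
}
```

Passing `allow_hw = false` forces the current implementation, which is handy for A/B timing of the two paths on the same machine.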

## Measurable Success Criteria

- Baseline: V4.2 ~7K Hz (float4096) on current hardware.
- Target 1:1 CPU parity: evolution throughput matches a per-core (or per-SM) throughput baseline for the chosen hardware, measured as evolutions/sec per core and in total across all cores / GPU SMs.
- GPU parity: benchmark the time to process an RK4 step on GPU vs CPU; target >5× speedup in wall-time or per watt.
- Maintain functional correctness: CV and phase transitions equivalent within tolerances.
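
The correctness criterion can be made mechanical with a tolerance comparison between a reference run and an optimized run. This is a minimal sketch: the function names and the choice of a combined relative/absolute tolerance are assumptions, and the actual tolerances for CV and phase transitions still need to be chosen.

```c
#include <stdbool.h>
#include <stddef.h>

/* Absolute value without <math.h>, so the sketch has no libm dependency. */
static double abs_diff(double a, double b) {
    double d = a - b;
    return d < 0.0 ? -d : d;
}

/* True when `candidate` matches `reference` within either tolerance.
 * A NaN on either side compares false, so it counts as a mismatch. */
bool within_tolerance(double reference, double candidate,
                      double rel_tol, double abs_tol) {
    double diff = abs_diff(reference, candidate);
    double scale = reference < 0.0 ? -reference : reference;
    return diff <= abs_tol || diff <= rel_tol * scale;
}

/* Counts elements of `got` that fall outside tolerance of `ref`. */
size_t count_mismatches(const double *ref, const double *got, size_t n,
                        double rel_tol, double abs_tol) {
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (!within_tolerance(ref[i], got[i], rel_tol, abs_tol))
            bad++;
    }
    return bad;
}
```

A regression check would then fail whenever `count_mismatches` returns a non-zero value for the recorded trajectories.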

## Actionable Steps (with short-term code changes)

1. Add a profiling harness to run N evolutions and report per-stage times (RK4, feedback, SHA, rendering).
   Deliverable: `bench_opt.c` and `make profile` target.

2. Make rendering async. Move ASCII graph rendering into a thread and use a lock-free ring buffer for state snapshots.
   Deliverable: `async_render.c`, `render_thread()` integration, reduced blocking in evolution loop.

3. Offload SHA-256 to a hardware-accelerated library if detected (Intel SHA extensions / OpenSSL EVP with hardware accel). Fall back to the current implementation otherwise.
   Deliverable: detection logic + compile-time flags.

4. Vectorize math kernels. Replace scalar loops with intrinsics (AVX2/AVX512) or rely on compiler auto-vectorization with aligned data + restrict pointers. Use FMA intrinsics where useful.
   Deliverable: `rk4_simd.c` with AVX2/AVX512 paths and fallback.

5. Add OpenMP build path (fast multi-core) and an optional CUDA/OpenCL path for GPU.
   Deliverable: Makefile flag `-fopenmp` and a `cuda` target that builds `analog_codec_cuda` using nvcc if available.

6. Introduce mixed-precision modes:
   - `FAST` mode (`--precision fast`): single-precision floats + SIMD (max throughput)
   - `RELIABLE` mode (`--precision double` or `--precision gmp`): double-precision or GMP for correctness
   Deliverable: runtime flag `--precision fast|double|gmp`.

7. Add unit and regression benchmarks to CI (time-based and accuracy checks).
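
As an illustration of step 6, the `--precision fast|double|gmp` flag could be parsed along these lines. The enum and function names are hypothetical, and the return-code convention (1 parsed, 0 absent, -1 invalid) is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <string.h>

typedef enum { PRECISION_FAST, PRECISION_DOUBLE, PRECISION_GMP } precision_mode;

/* Maps the flag value to a mode; returns false for unknown values. */
bool parse_precision(const char *value, precision_mode *out) {
    if (value == NULL) return false;
    if (strcmp(value, "fast") == 0)   { *out = PRECISION_FAST;   return true; }
    if (strcmp(value, "double") == 0) { *out = PRECISION_DOUBLE; return true; }
    if (strcmp(value, "gmp") == 0)    { *out = PRECISION_GMP;    return true; }
    return false;
}

/* Scans argv for "--precision <value>".
 * Returns 1 if parsed, 0 if the flag is absent (*out is left untouched),
 * and -1 if the value is missing or not recognized. */
int precision_from_argv(int argc, char **argv, precision_mode *out) {
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--precision") == 0) {
            if (i + 1 >= argc) return -1;
            return parse_precision(argv[i + 1], out) ? 1 : -1;
        }
    }
    return 0;
}
```

Leaving `*out` untouched when the flag is absent lets the caller pre-load whichever default mode the build prefers.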

## Low-Risk Immediate Changes (I'll implement them now)

- Add `OPTIMIZATION_PLAN.md` (this file)
- Add Makefile targets: `optimize`, `profile`, `gpu` (stubs)
- Add `bench_opt.c` skeleton that runs evolutions and reports timing
- Add `async_render.c` skeleton and wire to Makefile

These are minimal, low-risk, and give immediate ability to profile and iterate.
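
For the `async_render.c` skeleton, the lock-free ring buffer between the evolution loop and the render thread could look roughly like this. It is a single-producer/single-consumer sketch using C11 atomics; the `snapshot` fields, the capacity, and all names are placeholders, since the plan does not specify the state layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SNAP_RING_CAP 64u /* must be a power of two */
_Static_assert((SNAP_RING_CAP & (SNAP_RING_CAP - 1u)) == 0,
               "capacity must be a power of two");

/* Hypothetical snapshot layout; the real state struct will differ. */
typedef struct {
    unsigned long evolution;
    double value;
} snapshot;

typedef struct {
    snapshot slots[SNAP_RING_CAP];
    atomic_size_t head; /* next slot to write; stored only by the producer */
    atomic_size_t tail; /* next slot to read; stored only by the consumer */
} snap_ring;

void snap_ring_init(snap_ring *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side (evolution loop). Never blocks: returns false when the
 * ring is full, so the caller can simply drop that frame. */
bool snap_ring_push(snap_ring *r, const snapshot *s) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == SNAP_RING_CAP) return false;
    r->slots[head & (SNAP_RING_CAP - 1u)] = *s;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (render thread). Returns false when the ring is empty. */
bool snap_ring_pop(snap_ring *r, snapshot *out) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;
    *out = r->slots[tail & (SNAP_RING_CAP - 1u)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Dropping frames when the ring is full keeps the evolution loop from ever waiting on rendering, which is the point of the async change. The scheme is only safe with exactly one producer thread and one consumer thread.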

## Longer-Term (once profiling shows hotspots)

- Implement AVX512 intrinsics for RK4 inner loops
- Implement CUDA kernel for batched RK4 across many independent instances (for ML-style batched simulation)
- Implement a GPU/CPU hybrid scheduler
- Implement runtime autotuner to select best code path
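
Before committing to hand-written intrinsics or a CUDA kernel, the batched RK4 layout can be prototyped as a plain loop over independent instances with `restrict` pointers, which leaves vectorization to the compiler. The sketch below is not the project's kernel: the right-hand side `dy/dt = -k*y` is a stand-in chosen because it has a known solution, and `rk4_step_batch` is a hypothetical name.

```c
#include <stddef.h>

/* Stand-in derivative dy/dt = -k*y. The project's real right-hand side
 * is not shown in this plan and would replace this function. */
static inline double rhs(double k, double y) { return -k * y; }

/* Advances n independent instances by one classical RK4 step of size dt.
 * Each iteration touches only y[i] and k[i], and `restrict` promises the
 * arrays do not alias, so the loop is a candidate for auto-vectorization. */
void rk4_step_batch(double *restrict y, const double *restrict k,
                    size_t n, double dt) {
    for (size_t i = 0; i < n; i++) {
        double yi = y[i], ki = k[i];
        double k1 = rhs(ki, yi);
        double k2 = rhs(ki, yi + 0.5 * dt * k1);
        double k3 = rhs(ki, yi + 0.5 * dt * k2);
        double k4 = rhs(ki, yi + dt * k3);
        y[i] = yi + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
    }
}
```

The same one-instance-per-lane layout maps onto one-instance-per-thread on a GPU, so a scalar version like this can also serve as the reference when checking the AVX512 and CUDA paths.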

## Measurement Plan

- Baseline profile: `make profile && ./bench_opt --iters 1e6`
  Report: time_RK4, time_hash, time_render, time_feedback, overall Hz

- After each optimization: run profile and compute speedup ratios per stage and global.
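
A possible shape for the per-stage timing in `bench_opt.c` is sketched below. Everything here is an assumption rather than the existing code: the stages are passed in as function pointers, `placeholder_stage` stands in for the real RK4/hash/render/feedback work, and timing uses C11 `timespec_get` for portability (on POSIX, `clock_gettime(CLOCK_MONOTONIC, ...)` would be the steadier clock).

```c
#include <stdio.h>
#include <time.h>

enum { STAGE_RK4, STAGE_HASH, STAGE_RENDER, STAGE_FEEDBACK, STAGE_COUNT };

typedef struct {
    double seconds[STAGE_COUNT]; /* accumulated wall time per stage */
    unsigned long iters;         /* evolutions executed so far */
} bench_profile;

typedef void (*stage_fn)(void);

/* Wall-clock seconds from the C11 calendar clock. */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `iters` evolutions, timing each stage call separately. */
void bench_run(bench_profile *p, const stage_fn stages[STAGE_COUNT],
               unsigned long iters) {
    for (unsigned long i = 0; i < iters; i++) {
        for (int s = 0; s < STAGE_COUNT; s++) {
            double t0 = now_seconds();
            stages[s]();
            p->seconds[s] += now_seconds() - t0;
        }
    }
    p->iters += iters;
}

double bench_total_seconds(const bench_profile *p) {
    double total = 0.0;
    for (int s = 0; s < STAGE_COUNT; s++) total += p->seconds[s];
    return total;
}

/* Evolutions per second over all stages; 0 if nothing measurable ran. */
double bench_overall_hz(const bench_profile *p) {
    double total = bench_total_seconds(p);
    return total > 0.0 ? (double)p->iters / total : 0.0;
}

void bench_report(const bench_profile *p, FILE *out) {
    static const char *const names[STAGE_COUNT] = {
        "time_RK4", "time_hash", "time_render", "time_feedback"
    };
    for (int s = 0; s < STAGE_COUNT; s++)
        fprintf(out, "%s=%.6f s\n", names[s], p->seconds[s]);
    fprintf(out, "overall=%.1f Hz\n", bench_overall_hz(p));
}

/* Placeholder workload standing in for a real stage. */
static volatile double bench_sink;
void placeholder_stage(void) {
    for (int i = 0; i < 64; i++) bench_sink += 1e-9 * i;
}
```

Reading the clock around every stage call adds overhead that can dominate very short stages, so for fine-grained numbers it may be better to time each stage over a block of iterations instead. The speedup ratio for a stage is then the baseline time divided by the time after the optimization.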

---

If you want, I'll implement the low-risk changes now (Makefile targets, `bench_opt.c`, `async_render.c`) and run a quick local profile. Say "Go ahead implement low-risk changes" and I'll proceed. If you'd prefer a different ordering (e.g., GPU-first), tell me.