Reverse-Engineered APIs Unlock Neural Training on Apple Neural Engine
maderix/ANE enables backpropagation directly on Apple's powerful ANE hardware, bypassing CoreML and GPU for pure accelerator compute.
In a breakthrough for Apple Silicon developers, the maderix/ANE project delivers a from-scratch implementation of transformer training—complete with forward and backward passes—running natively on the Apple Neural Engine (ANE). This specialized hardware, clocking up to 15.8 TFLOPS on M4 chips, has long been reserved by Apple for inference only, leaving its immense potential for training untapped. By reverse-engineering private APIs like _ANEClient and _ANECompiler, along with the opaque MIL (Model Intermediate Language) format, ANE sidesteps official restrictions, enabling custom compute graphs on the ANE without relying on CoreML training tools, Metal shaders, or even the GPU.
The core innovation lies in crafting six specialized ANE kernels per training step, pushing the boundaries of what's possible on this inference-optimized accelerator:
- kFwdAttn: Handles RMSNorm, QKV projections, scaled dot-product attention (SDPA), and output projection.
- kFwdFFN: Executes RMSNorm and the SwiGLU feed-forward network.
- kFFNBwd: Computes FFN backward passes with transposed weights.
- kSdpaBwd1 and kSdpaBwd2: Break SDPA backpropagation into manageable chunks for dV, probabilities, and gradients on Q/K.
- kQKVb: Finalizes QKV backward to derive input gradients (dx).
On an M4 Mac, a single transformer layer (dim=768, seq=512) achieves 9.3 ms per step at 11.2% ANE utilization—sustaining 1.78 TFLOPS—with just six kernel dispatches. While forward and most backward passes (dx) run on ANE, CPU handles RMSNorm backward, residuals, loss computation, dW gradient accumulation via Accelerate's cblas_sgemm, and Adam optimization. This hybrid approach maximizes ANE's strengths in matrix-heavy ops while leveraging CPU for sequential logic.
What makes ANE technically mesmerizing are its ruthless optimizations, born from deep dives into ANE's quirks:
- Channel-first CPU layouts matching ANE's IOSurface format [1,C,1,S], slashing transpose overhead to zero.
- vDSP-accelerated RMSNorm, boosting speed 10x (from 6.7 ms to 0.7 ms).
- GCD-async cblas overlap, parallelizing gradient computations with ANE evals on a serial dispatch queue.
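The layout trick is easy to picture: in a [1,C,1,S] buffer, all S sequence positions of a channel sit contiguously, so per-channel CPU ops stream linearly and hand the same memory straight to the ANE. A small C sketch of the two indexing schemes (helper names are hypothetical, for illustration only):

```c
#include <stddef.h>

// ANE-style [1, C, 1, S] channel-first index:
// all S sequence positions of channel c are contiguous.
size_t idx_c1s(int c, int s, int S) { return (size_t)c * S + s; }

// Conventional token-major [S, C] row-major index, for comparison.
size_t idx_sc(int s, int c, int C) { return (size_t)s * C + c; }

// One-time conversion from token-major to channel-first; once CPU-side
// buffers live in this layout, no per-step transpose is needed at all.
void to_channel_first(const float *sc, float *c1s, int S, int C) {
    for (int s = 0; s < S; s++)
        for (int c = 0; c < C; c++)
            c1s[idx_c1s(c, s, S)] = sc[idx_sc(s, c, C)];
}
```

Keeping every CPU-side tensor in the ANE's native layout is what turns the transpose cost into a one-time setup rather than a per-step tax.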
These tweaks expose ANE's raw efficiency for training workloads, a feat previously unthinkable given Apple's lockdown. For machine learning engineers targeting Apple hardware—think on-device fine-tuning of LLMs or edge models—this opens a path to sub-10ms steps without power-hungry GPUs. It's not full end-to-end ANE training yet (dW and the optimizer remain CPU-bound), but the project's checkpointing, gradient accumulation, and resume features make it robust enough for sustained experimentation.
Gaining explosive traction in just days, ANE spotlights the developer hunger for untapped Silicon potential. Written in Objective-C for low-level control, it invites tinkerers to extend kernels or scale to multi-layer transformers. As Apple doubles down on ANE in future chips, projects like this could redefine local ML training, proving reverse-engineering can turn black-box hardware into a trainable powerhouse.