Projects
Educational implementations built from scratch to understand how things work under the hood.
- Flash Attention (CUDA)
Minimal CUDA implementation of Flash Attention with tiled computation and online softmax. Compares a CPU reference, a naive GPU kernel, and the memory-efficient Flash kernel side by side.
- Transformer from Scratch (PyTorch)
~30M-parameter GPT-style decoder trained on FineWeb-Edu. Plain PyTorch in ~1000 lines of code.
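
To illustrate the tiled computation and online softmax mentioned in the Flash Attention entry, here is a minimal pure-Python sketch for a single query vector. It is not the project's CUDA kernel. The function names (`naive_attention`, `tiled_attention`), the `tile_size` parameter, and the use of plain lists are all assumptions made for illustration. The idea is that keys and values are consumed one tile at a time while a running maximum and running sum are kept, so the full score vector is never stored.

```python
import math

def score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def naive_attention(q, keys, values):
    """Reference: materialize every score, then apply a stable softmax."""
    scores = [score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, tile_size=2):
    """Same result, visiting keys/values tile by tile with online softmax."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [score(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
            [-1.0, 0.5, 0.0], [0.2, 0.2, 0.2]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(tiled_attention(q, keys, values, tile_size=2))
```

Both functions should agree up to floating-point rounding for any tile size. A GPU kernel would typically apply the same update per block of queries, holding each tile in on-chip memory.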