Projects

Educational implementations built from scratch to understand how things work under the hood.

Flash Attention (CUDA) CUDA
Minimal CUDA implementation of Flash Attention with tiled computation and online softmax. Compares a CPU reference, a naive GPU kernel, and the memory-efficient Flash kernel side by side.
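To make the online-softmax idea concrete, here is a minimal pure-Python sketch for a single query vector. It is not the project's CUDA kernel: the function names (`naive_attention`, `tiled_attention`), the `tile` parameter, and the list-of-lists layout are assumptions chosen for illustration. The tiled version never holds all scores at once; it keeps only a running max, a running normalizer, and an unnormalized output accumulator, rescaling them whenever a new tile raises the max.

```python
import math


def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax and weight the values."""
    scores = [_score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Online softmax: visit keys/values one tile at a time."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax normalizer relative to running_max
    acc = [0.0] * dim             # unnormalized weighted sum of values

    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [_score(q, k) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are relative to new_max.
        # On the first tile this factor is exp(-inf) == 0.0, which is harmless
        # because the accumulators are still zero.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]

        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any tile size, which is the same kind of check a CPU reference provides for a GPU kernel.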
Distributed Transformer PyTorch · DDP · FSDP
Decoder-only Transformer training with PyTorch DDP and FSDP. Reproducible experiments measure throughput, memory pressure, and checkpoint overhead, and show when to choose single-GPU, DDP, or FSDP.
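As a rough illustration of why the memory picture differs between these strategies, the sketch below estimates per-GPU model-state memory. It is a back-of-the-envelope model, not a result from the project: the function name `model_state_bytes_per_gpu` is hypothetical, and it assumes fp32 parameters and gradients, an Adam-style optimizer with two fp32 moments per parameter, and full sharding under FSDP. It ignores activations, buffers, and the transient memory FSDP needs while parameters are gathered for a layer.

```python
def model_state_bytes_per_gpu(num_params, world_size, strategy,
                              param_bytes=4, grad_bytes=4, optim_bytes=8):
    """Estimate bytes of parameters + gradients + optimizer state per GPU.

    Assumptions (illustrative only): fp32 params and grads, Adam with two
    fp32 moments, and full sharding for FSDP.
    """
    total = num_params * (param_bytes + grad_bytes + optim_bytes)
    if strategy in ("single", "ddp"):
        # DDP replicates the full model state on every rank.
        return total
    if strategy == "fsdp":
        # Full sharding splits model state evenly across ranks.
        return total / world_size
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for strategy in ("single", "ddp", "fsdp"):
        gib = model_state_bytes_per_gpu(1_000_000_000, 8, strategy) / 2**30
        print(f"{strategy:>6}: {gib:.1f} GiB of model state per GPU")
```

Under these assumptions DDP buys throughput but no memory headroom, while FSDP trades extra communication for a smaller per-GPU footprint; measured numbers will differ once activations and communication buffers are included.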
LLM Inference Benchmarking PyTorch · Inference
Reproducible LLM serving benchmarks comparing static Hugging Face batches vs continuous KV-cache scheduling. Measures time to first token (TTFT), inter-token latency (ITL), throughput, GPU memory, and KV-cache growth on real PyTorch code paths.
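For readers unfamiliar with these metrics, the following sketch shows one common way to derive them from per-token timestamps, plus a standard size estimate for a KV cache. The helper names (`latency_metrics`, `kv_cache_bytes`) and their parameters are assumptions for illustration and are not taken from the benchmark code; the benchmark itself may define or aggregate the metrics differently.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, mean ITL, and tokens/s for one request.

    request_start: time the request was submitted (seconds).
    token_times:   arrival time of each generated token (seconds, ascending).
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    # Time to first token: delay before any output appears.
    ttft = token_times[0] - request_start
    # Inter-token latency: gaps between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    # End-to-end throughput for this request.
    elapsed = token_times[-1] - request_start
    tokens_per_s = len(token_times) / elapsed if elapsed > 0 else float("inf")
    return {"ttft": ttft, "mean_itl": mean_itl, "tokens_per_s": tokens_per_s}


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Size of a dense KV cache: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes
```

Because the cache size is linear in `seq_len` and `batch`, plotting it against generated tokens gives the KV-cache growth curve for a given model shape.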