1. Compiles any HuggingFace model into a single persistent megakernel (twitter.com) | 2 points by OsamaJaber 10d ago | 0 comments
3. AutoMegaKernel: Compiling a LLM into a single CUDA kernel (arxiv.org) | 3 points by OsamaJaber 18d ago | 0 comments
4. AutoMegaKernel: Compile an LLM into one provably-correct CUDA megakernel (github.com) | 4 points by OsamaJaber 19d ago | 0 comments
5. StreamIndex: Memory-bounded compressed sparse attention via streaming top-k (arxiv.org) | 4 points by OsamaJaber 1mo ago | 0 comments
6. Show HN: AutoKernel, Auto GPU Kernel Optimization (arxiv.org) | 2 points by OsamaJaber 1mo ago | 0 comments
7. DeepSeek V4's indexer dies at 65K. We got it to 1M on 6GB (arxiv.org) | 5 points by OsamaJaber 1mo ago | 0 comments
8. AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search (arxiv.org) | 4 points by OsamaJaber 1mo ago | 0 comments
9. DeepSeek V4's indexer OOMs at 65K context. We got it to 1M in 6G (arxiv.org) | 8 points by OsamaJaber 1mo ago | 0 comments
10. Ouroboros: Dynamic Weight Generation for Recursive Transformers (arxiv.org) | 2 points by OsamaJaber 2mo ago | 0 comments
11. Tide: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference (arxiv.org) | 3 points by OsamaJaber 2mo ago | 1 comment
15. PicoLM: Run a 1B parameter LLM on a $10 board (github.com) | 4 points by OsamaJaber 4mo ago | 1 comment