Co-created the metal-flash-attention implementation, which influenced a fast matrix multiplication program for Apple GPUs.
How media typically covers Philip Turner
Referenced in coverage
An optimized matrix multiplication implementation for Apple GPUs achieves performance parity with Apple's Metal Performance Shaders by using the undocumented simdgroup_async_copy instruction, reaching 2.5 trillion 32-bit floating-point operations per second on a 2022 MacBook Air.
“Co-created metal-flash-attention implementation that influenced fast matrix multiplication program on Apple GPU.”