path: root/ggml-cuda
Age         Commit message                                                     Author
2024-05-21  CUDA: deduplicate mmq code (#7397)                                 Johannes Gäßler
2024-05-18  CUDA: deduplicate FlashAttention code (#7352)                      Johannes Gäßler
2024-05-18  cuda : add half2 __shfl_xor() for ROCm 5.5 (#7263)                 Engininja2
2024-05-17  CUDA: faster large batch FA without tensor cores (#7314)           Johannes Gäßler
2024-05-15  ggml : add `ggml_upscale_ext` (ggml/814)                           John Balis
2024-05-12  CUDA: add FP32 FlashAttention vector kernel (#7188)                Johannes Gäßler
2024-05-11  feat: implemented sigmoid function (ggml/806)                      Justina Cho
2024-05-11  ggml : full ALiBi support (#7192)                                  Georgi Gerganov
2024-05-09  CUDA: generalize FP16 fattn vec kernel (#7061)                     Johannes Gäßler
2024-05-08  Introduction of CUDA Graphs to LLama.cpp (#6766)                   agray3
2024-05-01  CUDA: CUDART < 11.7 workaround for __hmax, __hmax2 (#7019)         Johannes Gäßler
2024-04-30  ggml : add Flash Attention (#5021)                                 Georgi Gerganov
2024-04-29  Fix more int overflow during quant (PPL/CUDA). (#6563)             DAN™
2024-04-18  ggml : group all experts in a single ggml_mul_mat_id (#6505)       slaren
2024-04-09  llama : add Command R Plus support (#6491)                         Carolinabanana
2024-04-03  ggml : mul_mat_id use the same tensor for all the experts (#6387)  slaren
2024-03-29  sync : ggml (#6351)                                                Georgi Gerganov
2024-03-26  IQ1_M: 1.75 bpw quantization (#6302)                               Kawrakow
2024-03-25  cuda : fix LLAMA_CUDA_F16 build (#6298)                            slaren
2024-03-25  cuda : refactor into multiple files (#6269)                        slaren