path: root/ggml/src/ggml-cuda
Age | Commit message | Author
2025-06-05 | IQ1_M_R4 CUDA implementation (#494) | Kawrakow
2025-06-05 | MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493) | Kawrakow
2025-06-05 | CUDA implementation for IQ1_S_R4 (#492) | Kawrakow
2025-06-01 | Minor (~2%) iq2_ks TG performance improvement on CUDA (#468) | Kawrakow
2025-05-27 | CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4 (#462) | Kawrakow
2025-05-26 | CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4 (#461) | Kawrakow
2025-05-24 | Legacy quants conversion schemes in convert_hf_to_gguf.py (#449) | Nexes the Elder
2025-05-23 | Fix bug in MMVQ kernel (#446) | Kawrakow
2025-05-23 | Trellis quants with CPU inference (#441) | Andrew Chan
2025-05-20 | Bug fixes from mainline (#439) | Kawrakow
2025-05-18 | Forgotten MMQ ref and typo (#431) | Nexes the Elder
2025-05-15 | Adding forgotten template instance for iq5_ks (#424) | Kawrakow
2025-05-15 | Adding IQ5_KS - 5.25 bpw quants (#422) | Kawrakow
2025-05-15 | CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418) | Kawrakow
2025-05-14 | CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417) | Kawrakow
2025-05-14 | Fix SER (CUDA) (#416) | Kawrakow
2025-05-12 | Fix new CUDA FA on Turing (#413) | Kawrakow
2025-05-12 | Faster DeepSeek FA on CUDA (#408) | Kawrakow
2025-05-11 | Revert "Fix race in the CUDA DeepSeek FA kernel (#406)" | Iwan Kawrakow
2025-05-11 | Fix race in the CUDA DeepSeek FA kernel (#406) | Kawrakow
2025-05-10 | TG improvements for MoE models (#404) | Kawrakow
2025-05-09 | Fix CUDA FlashMLA-3 with quantized KV cache (#400) | Kawrakow
2025-05-07 | FlashMLA-3 for DeepSeek models on CUDA (#386) | Kawrakow
2025-05-05 | Fix DeepSeek FA (#382) | Kawrakow
2025-05-04 | CUDA: MMQ for IQ4_KS (#374) | Kawrakow
2025-05-04 | CUDA: faster FA TG for GQA models (#370) | Kawrakow
2025-04-24 | cuda: use switch in constexpr funcs (#343) | Kawrakow
2025-04-15 | Allow q8_0 KV cache for head size 256 (#330) | Kawrakow
2025-04-07 | Add copyright notices (#317) | Kawrakow
2025-03-18 | Make Q8_0 KV cache work with mla=2,fa on CUDA (#264) | Kawrakow
2025-03-18 | Compile-time option to use bf16 for quants without MMQ kernels (#261) | Kawrakow
2025-03-18 | FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260) | Kawrakow
2025-03-12 | MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252) | Kawrakow
2025-03-10 | DeepSeek imatrix stuff (#250) | Kawrakow
2025-03-10 | Faster MoE token generation on CUDA (#248) | Kawrakow
2025-03-05 | DeepSeek CUDA Flash Attention (#241) | Kawrakow
2025-03-02 | SER - Smart Expert Reduction (#239) | Kawrakow
2025-03-01 | Reduce size of compute buffers (#237) | Kawrakow
2025-02-27 | Option to use MLA without a transposed cache (#235) | Kawrakow
2025-02-27 | Faster MLA on CUDA (#234) | Kawrakow
2025-02-23 | Fused MoE ffn_up and ffn_gate (#229) | Kawrakow
2025-02-07 | cuda: non-contiguous rms norm (#190) | Kawrakow
2024-11-21 | MMQ for Q6_0 (#115) | Kawrakow
2024-10-31 | Faster MoE inference (#112) | Kawrakow
2024-10-26 | Bitnet CUDA improvements (#109) | Kawrakow
2024-10-25 | Bitnet changes (#106) | Kawrakow
2024-10-24 | Fix quantized k-cache without FA (#105) | Kawrakow
2024-10-22 | Enable q6_0 for flash attention (#101) | Kawrakow
2024-10-21 | Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99) | Kawrakow
2024-10-16 | Adding IQ4_KSS: 4.0 bpw quants (#89) | Kawrakow