path: root/ggml/src/ggml-cuda/template-instances

2024-11-21  MMQ for Q6_0 (#115)  Kawrakow

  * MMQ for Q6_0
  * Add Q6_0 MMQ to template generator

  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

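In mainline llama.cpp, the files in this directory are autogenerated by generate_cu_files.py: one explicit instantiation per quantization type, so each kernel lands in its own translation unit and the instances compile in parallel. A minimal sketch of what the added Q6_0 MMQ instance would look like, assuming this fork follows mainline's DECL_MMQ_CASE convention (the file name is hypothetical; GGML_TYPE_Q6_0 is the quantization type this fork adds):

    // mmq-instance-q6_0.cu -- hypothetical name, following the autogenerated
    // per-type layout. One instantiation per file keeps compile times down
    // and lets the build parallelize across translation units.
    #include "../mmq.cuh"

    DECL_MMQ_CASE(GGML_TYPE_Q6_0);
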
2024-10-22  Enable q6_0 for flash attention (#101)  Kawrakow

  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-10-21  Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99)  Kawrakow

  * Enable IQ4_NL for V-cache in token generation
  * We don't need these
  * Update printout of allowed quantized KV-cache combinations
  * Add IQ4_NL + IQ4_NL to FA: this is a better alternative than Q4_0 + Q4_0 for the VRAM poor.
  * Remove file added by mistake
  * Fix typo, which is not really a bug

  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

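The flash-attention kernels in this directory are instantiated the same way: one autogenerated file per (head size, K-cache type, V-cache type) combination, so enabling a new quantized KV-cache pairing such as IQ4_NL + IQ4_NL amounts to emitting the matching instances. A hedged sketch, assuming the fork keeps mainline llama.cpp's DECL_FATTN_VEC_F16_CASE convention (the head size 128 and the file name are illustrative):

    // fattn-vec-f16-instance-hs128-iq4_nl-iq4_nl.cu -- hypothetical name.
    // Arguments: head size, K-cache quantization type, V-cache quantization type.
    #include "../fattn-vec-f16.cuh"

    DECL_FATTN_VEC_F16_CASE(128, GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_NL);

At runtime the combination is selected through the KV-cache type flags, e.g. llama.cpp's -ctk iq4_nl -ctv iq4_nl together with -fa.
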
2024-07-27  Merge mainline llama.cpp (#3)  Kawrakow

  * Merging mainline - WIP
  * Merging mainline - WIP: AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made.
  * Merging mainline - fix Metal
  * Remove check

  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>