path: root/ggml-cuda.cu
Age        | Commit message | Author
2023-10-10 | llm : add MPT support (#3417) | Jan Ploski
2023-10-08 | sync : ggml (ggml-backend) (#3548) | Georgi Gerganov
2023-09-30 | ggml-cuda : perform cublas mat mul of quantized types as f16 (#3412) | slaren
2023-09-28 | llama.cpp : split llama_context_params into model and context params (#3301) | slaren
2023-09-28 | llama : custom attention mask + parallel decoding + no context swaps (#3228) | Georgi Gerganov
2023-09-28 | ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (#3370) | slaren
2023-09-17 | CUDA: fix peer access logic (#3231) | Johannes Gäßler
2023-09-17 | CUDA: enable peer access between devices (#2470) | Johannes Gäßler
2023-09-17 | CUDA: fix scratch malloced on non-main device (#3220) | Johannes Gäßler
2023-09-16 | Enable build with CUDA 11.0 (make) (#3132) | Vlad
2023-09-13 | CUDA: mul_mat_q RDNA2 tunings (#2910) | Johannes Gäßler
2023-09-13 | CUDA: fix LoRAs (#3130) | Johannes Gäßler
2023-09-11 | CUDA: fix mul_mat_q not used for output tensor (#3127) | Johannes Gäßler
2023-09-11 | CUDA: lower GPU latency + fix Windows performance (#3110) | Johannes Gäßler
2023-09-11 | CUDA: add device number to error messages (#3112) | Johannes Gäßler
2023-09-08 | sync : ggml (CUDA GLM RoPE + POSIX) (#3082) | Georgi Gerganov
2023-09-04 | 2x faster (rms) norm cuda kernels (3.7% e2e improvement) (#2985) | Jiahao Li
2023-09-01 | cuda : vsubss4 for older versions of ROCm/clang (#2942) | Engininja2
2023-08-28 | CUDA: fix RoPE asserts, block sizes (#2833) | Johannes Gäßler
2023-08-27 | falcon : fix CUDA inference by making K and Q contiguous (#2830) | Georgi Gerganov
2023-08-27 | k_quants tuning for Falcon-7b (#2816) | Kawrakow
2023-08-25 | ROCm Port (#1087) | Henri Vasserman
2023-08-25 | cuda : add RoPE kernel for mode == 2 (NeoX) (#2760) | Georgi Gerganov
2023-08-23 | llm : add Falcon support (#2717) | Georgi Gerganov
2023-08-22 | CUDA: use mul_mat_q kernels by default (#2683) | Johannes Gäßler
2023-08-22 | Fix CUDA softmax by subtracting max value before exp (#2665) | Jiahao Li
2023-08-22 | ggml-cuda : use graph allocator (#2684) | slaren
2023-08-22 | ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709) | Georgi Gerganov
2023-08-18 | llama : add benchmark example (#2626) | slaren
2023-08-14 | CUDA: launch_bounds, small q4_K, q5_K mmq refactor (#2596) | Johannes Gäßler
2023-08-13 | CUDA: Fixed OpenLLaMA 3b mmq, reduced compile time (#2590) | Johannes Gäßler
2023-08-09 | CUDA: tuned mul_mat_q kernels (#2546) | Johannes Gäßler
2023-08-05 | CUDA: faster k-quant mul_mat_q kernels (#2525) | Johannes Gäßler
2023-08-04 | CUDA: use min compute capability of GPUs actually used (#2506) | Cebtenzzre
2023-08-04 | CUDA: check if event is NULL before cudaStreamWaitEvent (#2505) | Cebtenzzre
2023-08-02 | CUDA: faster non k-quant mul_mat_q kernels (#2483) | Johannes Gäßler
2023-08-02 | CUDA: Fix models with output size != 32000 (#2480) | Johannes Gäßler
2023-07-31 | CUDA: mmq CLI option, fixed mmq build issues (#2453) | Johannes Gäßler
2023-07-31 | CUDA: Implemented row flattening for non-glm RoPE (#2468) | Johannes Gäßler
2023-07-31 | CUDA: fewer memory bank conflicts for mul_mat_q (#2458) | Johannes Gäßler
2023-07-29 | CUDA: Quantized matrix matrix multiplication (#2160) | Johannes Gäßler
2023-07-29 | CUDA: faster multi GPU synchronization (#2448) | Johannes Gäßler
2023-07-25 | Fix Q4_K and Q5_K for QK_K = 64 on CUDA (#2359) | Kawrakow
2023-07-24 | make rms_norm_eps a parameter (#2374) | slaren
2023-07-24 | ggml : sync (unary ops refactor, static-correctness) (#2370) | Georgi Gerganov
2023-07-24 | Some more Q4_K and Q5_K speedup on CUDA (#2346) | Kawrakow
2023-07-23 | ggml: move op parameters from tensors to ggml_tensor::op_params (#2333) | slaren
2023-07-23 | llama : grouped-query attention + LLaMAv2 70B support (#2276) | Georgi Gerganov
2023-07-23 | Speed up Q4_K (#2322) | Kawrakow
2023-07-22 | CUDA: Fixed 7b q3_K_S with mul_mat_vec_q (#2313) | Johannes Gäßler