ik_llama.cpp.git

Age	Commit message	Author
2024-01-12	llama : ggml-backend integration (#4766)	slaren
	* llama : ggml-backend integration
	* ggml-backend : add names to buffers
	* fix unmap after loading
	* batched-bench : add tensor_split param
	* llama : check for null tensor_split
	* ggml-backend : increase GGML_MAX_BACKENDS
	* improve graph splitting, partial fix for --no-kv-offload
	* cuda : add ggml-backend split buffer support
	* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
	* ggml : fix null backend dereference (#4807)
	* ggml : fix null backend dereference
	* ggml : also check ggml_backend_is_cpu
	* test-backend-ops : check buffer allocation failures
	* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
	* ggml : fix mul_mat_id work size
	* llama : rewrite session kv load/set without graphs
	* minor
	* llama : only initialize used backends, free backends on context free
	* llama : abort ctx if cuda backend init fails
	* llama : rewrite lora with ggml-backend and compute on CPU
	ggml-ci
	* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
	* opencl : add ggml-backend buffer type
	* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
	* llama : on Metal, by default offload the full model
	ggml-ci
	* metal : page align the data ptr (#4854)
	* Apply suggestions from code review
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
	* cuda : fix split buffer free
	* address review comments
	* llama-bench : add split-mode parameter
	* fix whitespace
	* opencl : fix double initialization
	* server : add --split-mode parameter
	* use async copy and compute to improve multi-gpu performance
	ggml-ci
	* use async memcpys to copy the graph outputs to the CPU
	* fix opencl
	* use a host buffer for the cpu compute buffer for faster copies to the gpu
	---------
	Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-01	build : fix build info generation and cleanup Makefile (#3920)	Jared Van Bortel
	* cmake : fix joining of REAL_GIT_DIR
	* fix includes with help from include-what-you-use
	* make : remove unneeded deps and add test-rope target
	* fix C includes in C++ source files
	* Revert "fix includes with help from include-what-you-use"
	This reverts commit 635e9fadfd516d4604a0fecf4a854bfb25ad17ae.
2023-10-20	CLBlast: Add outer loops over src0 for broadcasting in mulmat	shibe2
	Reduce repeated dequantization of the same data.
2023-10-18	opencl : fix element-wise multiplication (#3656)	shibe2

2023-10-17	CLBlast: Fix temporary buffer size for f16 conversion (wsize)	shibe2
	Fix buffer overflow. Reduce the size to fit just one 2D slice. Assert sufficient size.
2023-10-12	CLBlast: Fix matrix-vector multiplication (#3544)	shibe2

2023-10-05	CLBlast: Fix handling of on-device tensor data	shibe2
	Fix uploading tensor data to device, including 3D, 4D, and non-contiguous tensors. Use correct offsets into data that is already in VRAM. Correct handling of OpenCL events when multiple commands are queued.
2023-10-02	CLBlast: Add broadcast support for matrix multiplication (#3402)	shibe2
	Broadcast src0 into src1 across dimensions 2 and 3 when needed. This is required for models that use GQA.
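The broadcast described in this entry can be illustrated with a small index-mapping sketch. It is not taken from the commit: the function name is hypothetical, the `ne02`/`ne03`/`ne12`/`ne13` names follow the usual ggml convention for the sizes of dimensions 2 and 3 of src0 and src1, and the integer-division mapping is an assumption about how each src1 slice picks its src0 slice.

```python
def broadcast_src0_index(i12, i13, ne02, ne03, ne12, ne13):
    """Map a src1 slice index (i12, i13) to the src0 slice it is multiplied with.

    Hypothetical helper: assumes src1's sizes in dimensions 2 and 3 are whole
    multiples of src0's, so several src1 slices share one src0 slice.
    """
    if ne12 % ne02 != 0 or ne13 % ne03 != 0:
        raise ValueError("src1 dims 2 and 3 must be multiples of src0 dims 2 and 3")
    r2 = ne12 // ne02  # number of src1 slices sharing one src0 slice in dim 2
    r3 = ne13 // ne03  # same ratio for dim 3
    return i12 // r2, i13 // r3


# GQA-style shapes: 8 slices in src1 (query heads) share 2 slices in src0 (KV heads).
mapping = [broadcast_src0_index(i12, 0, ne02=2, ne03=1, ne12=8, ne13=1)[0]
           for i12 in range(8)]
```

With these shapes, the first four src1 slices map to src0 slice 0 and the last four to slice 1, which is the sharing pattern GQA models need.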
2023-09-21	ggml-opencl.cpp: Make private functions static (#3300)	shibe2

2023-09-04	ggml-opencl : store GPU buffer in ggml_tensor::extra (#2994)	slaren

2023-09-03	opencl : fix a bug in ggml_cl_pool_malloc() for ggml_cl_mul_mat_f32() (#2955)	Wentai Zhang
	Co-authored-by: Wentai Zhang <wentaizhang@tencent.com>
2023-07-07	Fix opencl by wrap #if-else-endif with \n (#2086)	Howard Su

2023-07-04	[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088)	Govlzkoy

2023-06-29	Porting the improved K-Quant CUDA kernels to OpenCL (#1966)	LostRuins
	* Added broken new q4k quant
	* xx + ib0
	* Fix q2_k fast kernel
	* Use preprocessor for QK_K
	* Add q6_k fast matmul kernel
	* ported q3k speedup successfully
	* ported q2k and q5k speedups
	* remove old dot kernels and template
	* fixed global const struct types
	* fixing address spaces
	* fixed string too long CI issue
	---------
	Co-authored-by: 0cc4m <picard12@live.de>
2023-06-17	ggml : fix warnings under MSVC (#1908)	Howard Su

2023-06-16	opencl : support k-quants (#1836)	0cc4m
	* Porting q2_k kernel to OpenCL
	* Set global and local sizes for kernel calls for dequantizing k-quants
	* Added q6_k kernel
	* Fix q4_k opencl struct order
	* Replace uchar with uint8_t
	* Finish dequant kernels
	* Added OpenCL DMMV kernels
	* Fix q2_k, improve code
	* Fix q3_k
	* Shorten switch statements
	* Improve code formatting
	---------
	Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
2023-06-12	Leverage mmap for offloading tensors to GPU (#1597)	Howard Su
	* Rebase to latest
	* Show progress
	* Add assert to make sure we only allocate temp buffer for non-CPU backend tensor
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
	---------
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-06-09	OpenCL: Add release memory (#1741)	Robert Sung-wook Shin
	* Add opencl release memory
	* Rename function name
2023-06-06	Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703)	Johannes Gäßler
	* CUDA multi GPU + scratch
	ggml_cuda_compute_forward
	Tensor parallelism
	ggml_cuda_add
	ggml_cuda_rms_norm
	ggml_cuda_silu
	CUDA scratch buffer
	--main-gpu CLI option
2023-06-06	Clblast fixes + enhancements to save VRAM and offload more layers (#1675)	LostRuins
	* Use events instead of clFinish, where possible
	* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel
	* Reduce queueing overhead for contiguous tensors by using single mul kernel call
	* Adapt to #1612 cl_mem malloc changes
	* Reduce code duplication between cuda and opencl branches
	* Improve implementation
	* Clblast fixes + enhancements to save VRAM:
	1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them.
	2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer
	3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it.
	* change max value size_t to use limits
	* removed flags from the CL pool malloc, apply code tidying suggestions.
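The pool recycling policy in points 2 and 3 of this entry can be sketched as a toy model. It is not the actual `ggml_cl_pool_malloc()` code: the `BufferPool` class and its method names are hypothetical, and plain integer sizes stand in for real `cl_mem` buffers.

```python
class BufferPool:
    """Toy model of the recycling policy; integers stand in for device buffers."""

    def __init__(self, free_sizes=None):
        self.free_sizes = list(free_sizes or [])

    def acquire(self, size):
        """Return (buffer_size, how), where how is 'reused', 'resized' or 'new'."""
        fitting = [s for s in self.free_sizes if s >= size]
        if fitting:
            # Point 2: take the SMALLEST free buffer that fits, not the first one.
            best = min(fitting)
            self.free_sizes.remove(best)
            return best, "reused"
        if self.free_sizes:
            # Point 3: nothing fits, so recycle the LARGEST free buffer by
            # resizing it to the requested size.
            self.free_sizes.remove(max(self.free_sizes))
            return size, "resized"
        # Empty pool: a fresh allocation is the only option.
        return size, "new"

    def release(self, size):
        """Hand a buffer back to the pool for later reuse."""
        self.free_sizes.append(size)


pool = BufferPool([64, 256, 128])
first = pool.acquire(100)    # best fit is 128, leaving 64 and 256 free
second = pool.acquire(1000)  # nothing fits, so the 256 buffer is resized
```

Picking the smallest fitting buffer keeps large buffers available for large requests, and resizing the largest free buffer on a miss avoids growing the pool with an extra allocation.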
2023-06-04	OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)	0cc4m
	* Use events instead of clFinish, where possible
	* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel
	* Reduce queueing overhead for contiguous tensors by using single mul kernel call
	* Adapt to #1612 cl_mem malloc changes
	* Reduce code duplication between cuda and opencl branches
	* Improve implementation
2023-05-28	opencl : no need to allocate cl_mem on heap (#1612)	Howard Su

2023-05-28	opencl : use strstr to check if fp16 supported (#1611)	Howard Su
	* Use strstr to check if fp16 supported
	* Ensure ext_buffer is null terminated
2023-05-23	Fix handling of "invalid property" when creating OpenCL command queue (#1565)	Maarten ter Huurne
	The `clCreateCommandQueue()` function will return the code `CL_INVALID_QUEUE_PROPERTIES` when passed unsupported properties, not `CL_INVALID_PROPERTY` as the original code was checking for.
2023-05-23	OpenCL Token Generation Acceleration (#1459)	0cc4m
	* Move back to C++ for OpenCL
	* Refactor OpenCL code to work more like the CUDA code, add missing functions
	* Deduplicate dequant kernels
	* Add OpenCL compile options
	* Use compile args for preprocessing constants
	* Restore default platform + device selection by id behavior
	---------
	Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
	Co-authored-by: Henri Vasserman <henv@hot.ee>