Age | Commit message (Collapse) | Author |
|
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
|
|
|
|
* sync : ggml (part 1)
* sync : ggml (part 2, CUDA)
* sync : ggml (part 3, Metal)
* ggml : build fixes
ggml-ci
* cuda : restore lost changes
* cuda : restore lost changes (StableLM rope)
* cmake : enable separable compilation for CUDA
ggml-ci
* ggml-cuda : remove device side dequantize
* Revert "cmake : enable separable compilation for CUDA"
This reverts commit 09e35d04b1c4ca67f9685690160b35bc885a89ac.
* cuda : remove assert for rope
* tests : add test-backend-ops
* ggml : fix bug in ggml_concat
* ggml : restore `ggml_get_n_tasks()` logic in `ggml_graph_plan()`
* ci : try to fix macOS
* ggml-backend : remove backend self-registration
* ci : disable Metal for macOS cmake build
ggml-ci
* metal : fix "supports family" call
* metal : fix assert
* metal : print resource path
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
ggml-ci
|
|
* sync : ggml (backend v2) (wip)
* sync : migrate examples and llama.cpp to dynamic graphs (wip)
* sync : update tests + fix max op params to 64
ggml-ci
* sync : ggml-cuda
ggml-ci
* llama : fix save/load state context size
ggml-ci
* sync : try to fix build on tvOS
* sync : pass custom graph sizes in training examples
* sync : update graph copies to new ggml API
* sync : update sync-ggml.sh with new files
* scripts : fix header in sync script
* train : fix context size calculations
* llama : increase inference graph size up to 4096 nodes
* train : allocate grads for backward graphs
* train : allocate grads for gb_tmp
|
|
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
|