Age | Commit message (Collapse) | Author |
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* llama-bench: enable having different number of threads for tg and pp
* Add -tgb to usage
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding ability to use THP on Linux
* Use the actual page size4 used for mmap also in munmap
* Add -thp to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* A better way to measure the cost of ggml_barrier
* Smart expert selection
* Add ser option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* This reduces compute buffer size for MLA
* This should accomplish it for standard attention
* Much better
* Better concat for contiguous tensors
If all the op does is to concatenate the second tensor
to the first, why would we want to have a loop?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
The `-mla` command line option turns into an int from a bool.
mla = 0: use standard attention
mla = 1: use MLA with transposed cache
mla > 1: use MLA without transposed cache
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Give the user the option to override where model weights are stored
* Fix ggml_nbytes() problem and cleanup
For a tensor with zero elements ggml_nbytes() was returning
uint64_t::max, and this was causing graph allocation failure.
* Add timing info to CUDA graph evaluation
* Add more timing info
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Fusing MoE up * unary(gate)
* Fusing MoE up * unary(gate): CUDA
We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite
* On CUDA also fuse MoE down * (up * unary(gate))
in case the MUL_MAT_ID op for the down experts is the next
op in the graph.
* Command line option to enable fused MoE up*unary(gate)
* Add fmoe option to llama-bench
* Adding forgotten gelu, relu, silu on ARM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q8_KV - Basics + AVX2 gemm/gemv
* q8_KV: Better AVX2 gemm
* q8_KV: Better Zen4 gemm
We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-tinme-repacking is at 169 t/s.
* q8_KV: AVX2 gemm/gemv
We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.
* q8_KV: be able to use it for K cache
This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
removed most of it when implementing the quants with per row scale,
bit it was stull lurking in ggml_copy. Not sure if these were the last
remnants of ggmil-style row sizes, or if there are still places left
* llama.cpp: get rid of the the 1d K cache assumption. Create and manage
the K-cache as a 2D tensor so we can have per row meta data as needed
by q8_KV.
Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.
* q8_KV: be able to use it for K cache in FA
* q8_KV: repack it for K*Q in FA
* q8_KV: slightly faster gemv on Zen4
* q8_KV: slightly faster gemv on Zen4
* q8_KV: ARM_NEON
We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.
* q8_KV: use it in FA on NEON
* q8_KV_r8 - repacked q8_KV
On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).
* q8_KV_r8: don't use nrc_y = 16 on Zen4
This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.
* q8_KV: nrc_y = 16 also doesn't pay off in FA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Load all MoE experts during warmup
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Unify warmup to one token
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
|
|
* Deepseek MLA Optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make MLA optional
* Remove some unnecessary copies in the MLA attention
* Deepseek MLA Optimizations V2 (#195)
* Avoid allocating MHA KV cache when MLA is turned on
* Added missing gguf-py file
* Added final optimizations
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* Make sure we do have wk_b and wv_b before enabling MLA
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use type_k and type_v to set the types of the MLA caches
They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.
* Better gemm strategy when nth > nhead
It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we ind the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads.
---------
Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Slightly faster AVX2 implementation for q4_k_r4
* Even better AVX2 implementation for q4_k_r4
We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f8722.
With FA and Q8_0 K-cache we get to 339.5 t/s.
* Fix llama-bench labels that I broke with #181
* Faster AVX2 implementation for q5_k_q4
We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.
* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4
After the changes I made to AVX2, it ends up being slightly faster
compared to what I had for Zen4.
* Minor tweak
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding gp option to llama-bench
Similar to pg, but it only looks at TG speed with a given
prompt length.
* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128
They still need to be divisible by 32.
* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128
.. on NEON
* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128
.., on AVX2
* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128
.., on AVX2
* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128
... on NEON
* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128
... on Zen4.
Also fix q8_0 K-cache for head sizes that are not multiple of 128.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
iq4_k_r4 (#148)
* Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4
More importantly: simplify.
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Be able to repack tensors at run time
* Repack: also add bf16 as repackable type
* Repack: make sure number of rows is a multiple of the packing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Adding q6_0 - basics + AVX2/Zen4 working
* Adding q6_0: CUDA dequantize works, but not mmvq
* Adding q6_0: CUDA mmvq works
* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache
* Add q6_0 to CPU flash attention
Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?
* q6_0: slightly better kv-cache result
Better than q8_0+q4_0, but not as good as q8_0+iq4_nl
* q6_0: works on ARM_NEON
* q6_0: dequantize works on Metal, but not vector dot product
* q6_0: it now works on Metal
Outperforms q5_0 by a significant margin. E.g.
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 44.02 ± 0.08 |
| llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | tg128 | 40.13 ± 0.12 |
| llama 8B Q6_0 | 6.08 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 500.55 ± 0.32 |
| llama 8B Q5_0 | 5.21 GiB | 8.03 B | Metal | 100 | 4 | pp512 | 448.02 ± 0.27 |
* q6_0: can now be used for kv-cache on Metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Zen4 Flash Attnetion: WIP bf16
* Zen4 Flash Attnetion: bf16 seems to be working
* Zen4 Flash Attnetion: improving bf16
* Zen4 Flash Attnetion: improving bf16
It is better (slightly faster) to first convert Q
to bf16 before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, so at
most 4 kb for the head sizes we support, so we can
just allocate on the stack instead of reserving and
passing a work buffer in ggml.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Softcap: WIP
Fuses scale + tanh + scale as used for softcaping in some
models.
Just CPU for now. ~1.4% for PP-512 on Gemma2-9b, no effect on TG.
Somewhat surprisingly the improvement does not increase as I
go to longer contexts. Gemma2 does softcap on K*Q, which grows
quadratically with context length, so I would have thought
the benefit from fusing scale, tanh, scale would increase.
But no, no luck.
* softcap: CUDA
* softcap: CUDA
~1% speedup for Gemma2-9b
* softcap: Metal and NEON
About 1% speedup.
* Simdified gelu
Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2.
It looks like the gelu operation is memory bound on my CPU's
after SIMD-ifying it. By not using the 128 kb gelu lookup table
we gain a small advantage.
On the M2-Max the lookup table is slightly faster than the SIMD
version, so left the lookup table for ARM_NEON.
* softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512)
Not that I have encountered this in practice, but just to be sure.
This does it for AVX512 and AVX2, still need a guard for ARM_NEON.
* llama-bench: add ability to turn off warmup runs
So we don't need to wait forever on, e.g., benchmarks involving
long contexts.
* softcap, tanh: avoid NaNs for large arguments (NEON)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Show "<backend_name>+RPC" when RPC offloading is used
|
|
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse same buffer when the same buffer type if used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
llama-llava-cli, etc... (#7809)
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
* server: update refs -> llama-server
gitignore llama-server
* server: simplify nix package
* main: update refs -> llama
fix examples/main ref
* main/server: fix targets
* update more names
* Update build.yml
* rm accidentally checked in bins
* update straggling refs
* Update .gitignore
* Update server-llm.sh
* main: target name -> llama-cli
* Prefix all example bins w/ llama-
* fix main refs
* rename {main->llama}-cmake-pkg binary
* prefix more cmake targets w/ llama-
* add/fix gbnf-validator subfolder to cmake
* sort cmake example subdirs
* rm bin files
* fix llama-lookup-* Makefile rules
* gitignore /llama-*
* rename Dockerfiles
* rename llama|main -> llama-cli; consistent RPM bin prefixes
* fix some missing -cli suffixes
* rename dockerfile w/ llama-cli
* rename(make): llama-baby-llama
* update dockerfile refs
* more llama-cli(.exe)
* fix test-eval-callback
* rename: llama-cli-cmake-pkg(.exe)
* address gbnf-validator unused fread warning (switched to C++ / ifstream)
* add two missing llama- prefixes
* Updating docs for eval-callback binary to use new `llama-` prefix.
* Updating a few lingering doc references for rename of main to llama-cli
* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.
* Updating documentation references for lookup-merge and export-lora
* Updating two small `main` references missed earlier in the finetune docs.
* Update apps.nix
* update grammar/README.md w/ new llama-* names
* update llama-rpc-server bin name + doc
* Revert "update llama-rpc-server bin name + doc"
This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.
* add hot topic notice to README.md
* Update README.md
* Update README.md
* rename gguf-split & quantize bins refs in **/tests.sh
---------
Co-authored-by: HanClinto <hanclinto@gmail.com>
|
|
|
|
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remote --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
|
|
ggml-ci
|
|
compare-commits.sh : hide stdout, use -oe to print markdown
|
|
|
|
* common : normalize naming style
ggml-ci
* common : match declaration / definition order
* zig : try to fix build
|
|
* phi3 : duplicate rope factors in each layer
phi3 : set phi-3 model type as 14B
model loader : simplify the process for duplicating model tensors
llama-bench : remove default pg test
* replace bool parameters in llama_model_loader with named flags
|
|
|
|
|
|
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
|
|
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.
This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leap frogs to 464 tokens/second.
On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt will go 13 tok/sec. This change has fixes making
it go 52 tok/sec. It's mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.
This work was sponsored by Mozilla who's given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
|
|
|
|
* k_cache: be able to use Q5_0
* k_cache: be able to use Q5_1 on CODA
* k_cache: be able to use Q5_0 on Metal
* k_cache: be able to use Q5_1 on Metal
* k_cache: be able to use IQ4_NL - just CUDA for now
* k_cache: be able to use IQ4_NL on Metal
* k_cache: add newly added supported types to llama-bench and CUDA supports_op
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* backend : offload large batches to GPU
* fix hip
* code cleanup
* fix CUDA split buffers
* Update ggml-backend-impl.h
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix memset without set_device
* imatrix : remove sched affix from weight names
* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup
* update backends
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
|
|
|
|
There several places where a gguf context is allocated. A call to gguf_free
is missing in some error paths. Also on linux, llama-bench was missing a
fclose.
|
|
* llama : add pipeline parallelism support for batch processing with multiple CUDA GPUs
ggml-ci
* server : add -ub, --ubatch-size parameter
* fix server embedding test
* llama : fix Mamba inference for pipeline parallelism
Tested to work correctly with both `main` and `parallel` examples.
* llama : limit max batch size to n_batch
* add LLAMA_SCHED_MAX_COPIES to configure the number of input copies for pipeline parallelism
default increase to 4 (from 2)
changing this value may improve performance for some systems, but increases memory usage
* fix hip build
* fix sycl build (disable cpy_tensor_async)
* fix hip build
* llama : limit n_batch and n_ubatch to n_ctx during context creation
* llama : fix norm backend
* batched-bench : sync after decode
* swiftui : sync after decode
* ggml : allow ggml_get_rows to use multiple threads if they are available
* check n_ubatch >= n_tokens with non-casual attention
* llama : do not limit n_batch to n_ctx with non-casual attn
* server : construct batch with size of llama_n_batch
* ggml_backend_cpu_graph_compute : fix return value when alloc fails
* llama : better n_batch and n_ubatch comment
* fix merge
* small fix
* reduce default n_batch to 2048
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama-bench : add embeddings option
* llama-bench : do not hard code embd default value
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* suport multiple cards: split-mode - layer|row
* rm warning
* rebase with master, support tow new OPs, close feature for -sm=row, fix for unit test
* update news
* fix merge error
* update according to review comments
|
|
* cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q
* remove: mul_mat_q in compare llama bench and usage
* update llama-bench
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* coda : normalize enum names
ggml-ci
* code : cont
* code : cont
|
|
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverted Makefile
* Fixed include
* Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
* removed trailing whitespace
* Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
* Reverting Makefile
* Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode note being implemented yet
* Removing MIRROR_MODE code for this PR
* Removing last bit of MIRROR_MODE code for this PR
* Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
* Fixed lingering init_llama_backend() bool calls in tests and examples
* Remote enum llama_numa_strategies
* Revert bad merge with dynatemp flags
* add missing enum ggml_numa_strategies declaration and revert sync problem with master
* add missing enum ggml_numa_strategies declaration
* fixed ggml_init_numa variable
* Update ggml.h
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
* split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
* Fix up some boolean vs enum comparisons
* Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
* Update ggml.h
Align enum values
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
Remove whitespace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml.c
align paremeters
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/server.cpp
remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Remove whitespace and align brace
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
* Update ggml.c
simplified return for platforms without NUMA support
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* removed redundant else from cli argument processing of --numa
* whitespace
---------
Co-authored-by: root <root@nenya.lothlorien.ca>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
|
|
|
|
* update guide for make installation, memory, gguf model link, rm todo for windows build
* add vs install requirement
* update for gpu device check
* update help of llama-bench
* fix grammer issues
|
|
* add --no-mmap, show sycl backend
* fix conflict
* fix code format, change print for --no-mmap
* ren no_mmap to mmap, show mmap when not default value in printer
* update guide for mmap
* mv position to reduce model reload
|
|
* llama : remove LLAMA_MAX_DEVICES from llama.h
ggml-ci
* Update llama.cpp
Co-authored-by: slaren <slarengh@gmail.com>
* server : remove LLAMA_MAX_DEVICES
ggml-ci
* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD
ggml-ci
* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD
* readme : add deprecation notice
* readme : change deprecation notice to "remove" and fix url
* llama : remove gpu includes from llama.h
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|