Age | Commit message | Author |
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Only on the files where I have contributed in a significant way,
or the files I wrote myself.
|
|
For x86 slaren was generous enough to add _mm_pause() to the busy
spin wait loop in ggml_barrier(), but everything else just busy
spins, loading an atomic int on every iteration, thus forcing cache
sync between the cores. This results in a massive drop in performance
on my M2-Max laptop when using 8 threads. The closest approximation
to _mm_pause() on Arm seems to be
__asm__ __volatile__("isb\n");
After adding this to the busy spin loop, performance for 8 threads
recovers back to expected levels.
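As a concrete illustration, here is a minimal sketch (not the actual ggml_barrier code) of an architecture-aware pause inside such a busy-wait loop; the function and variable names are illustrative.

    #include <stdatomic.h>
    #if defined(__x86_64__) || defined(__i386__)
    #include <immintrin.h>
    #endif

    static inline void spin_pause(void) {
    #if defined(__x86_64__) || defined(__i386__)
        _mm_pause();                   // x86: hint to the core that this is a spin-wait
    #elif defined(__aarch64__)
        __asm__ __volatile__("isb\n"); // Arm: closest approximation, per the note above
    #endif
    }

    // busy-wait until enough threads have passed the barrier
    static void spin_wait(atomic_int * n_passed, int n_target) {
        while (atomic_load_explicit(n_passed, memory_order_seq_cst) < n_target) {
            spin_pause(); // without this, every iteration hammers the shared cache line
        }
    }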
|
|
K*Q and KQ*V are n_kv_embed x n_token x n_head matrix multiplications.
Before this PR, this meant n_head calls to iqk_mul_mat to perform
n_kv_embed x n_token 2D multiplications, each using nth threads.
Instead, in this PR, if n_head is a multiple of nth, each thread
does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices.
This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from
409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B,
we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from
139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.
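A hedged sketch of that scheme, with illustrative names (iqk_mul_mat_2d, the strides and pointer types are stand-ins, not the real iqk_mul_mat interface): when n_head is divisible by nth, each thread multiplies its own share of heads serially instead of all threads cooperating on one head at a time.

    // stand-in prototype for a single-threaded n_kv_embed x n_token 2D multiplication
    void iqk_mul_mat_2d(float * C, const float * A, const float * B, int n_kv_embed, int n_token);

    static void mul_mat_per_head(const float * A, const float * B, float * C,
                                 int n_kv_embed, int n_token, int n_head,
                                 long sa, long sb, long sc,   // per-head strides
                                 int ith, int nth) {          // thread index / thread count
        if (n_head % nth != 0) {
            return; // fall back to the old path: n_head calls, each using all nth threads
        }
        const int heads_per_thread = n_head / nth;
        for (int h = ith*heads_per_thread; h < (ith + 1)*heads_per_thread; ++h) {
            // one full n_kv_embed x n_token 2D multiplication, done entirely by this thread
            iqk_mul_mat_2d(C + h*sc, A + h*sa, B + h*sb, n_kv_embed, n_token);
        }
    }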
|
|
|
|
Also save one scale operation in the FFN by adjusting
rms_eps. We gain up to 3% in performance by doing this, but it
is a bit of a hack (we store the tensor scales in op_params
while loading the model).
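The identity being exploited, as a sketch assuming the standard RMS-norm formula (s is the constant tensor scale that gets folded away):

    % rescaling the input by s is equivalent to dividing eps by s^2
    \operatorname{rmsnorm}_{\varepsilon}(s\,x)_i
      = \frac{s\,x_i}{\sqrt{\tfrac{1}{n}\sum_j (s\,x_j)^2 + \varepsilon}}
      = \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_j x_j^2 + \varepsilon/s^2}}
      = \operatorname{rmsnorm}_{\varepsilon/s^2}(x)_i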
|
|
|
|
The scalar dot product already achieves 37 t/s for TG!
|
|
|
|
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat
It is marked WIP because it has only been tested on __aarch64__
|
|
|
|
|
|
|
|
|
|
|
|
|
|
* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse the same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
This will cause the weight to be copied to a backend that supports the
op, which is very costly. The weight should have been stored in a buffer
of a backend that can run the op, but llama.cpp cannot do this
automatically at the moment.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op"
ggml-ci
|
|
* ggml : improve ggml_is_contiguous logic
ggml-ci
* ggml : support more contiguous cases
ggml-ci
|
|
* ggml : unify rope norm/neox (CPU)
* ggml : fix compile warning
* ggml : remove GLM rope mode
ggml-ci
* metal : better rope implementation
ggml-ci
* cuda : better rope implementation
ggml-ci
* naming : n_orig_ctx -> n_ctx_orig
ggml-ci
* dev : add reminders to update backends
ggml-ci
* vulkan : fix ggml_rope_ext() usage
* cuda : fix array size + indents
ggml-ci
|
|
ggml-ci
|
|
This enforces a check that -fno-finite-math-only was set and that the compiler
is not operating in finite-math mode. This is because during the rewrite of
silu and softmax for CPU in #7154 an issue emerged where the results observed
with >1 slot were nondeterministic, as found by @JohannesGaessler.
@LostRuins narrowed the problem down to -ffinite-math-only, which was theorised
to be because SiLU, instead of flushing small values to 0, returns NaN or some
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix
ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
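A minimal sketch of such a compile-time guard, using the __FINITE_MATH_ONLY__ predefine that GCC and Clang set under -ffinite-math-only (the exact check in the referenced fix may differ):

    #if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
    // -ffinite-math-only lets the compiler assume NaN/Inf never occur, which breaks
    // the NaN/flush-to-zero behaviour that silu and softmax rely on; refuse to build.
    #error "compile with -fno-finite-math-only (do not use -ffinite-math-only / -Ofast)"
    #endif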
|
|
* ggml: add OpenMP for multi-threaded processing
* ggml : Limit the number of threads used to avoid deadlock
* update shared state n_threads in the parallel region (see the sketch after this list)
* clear numa affinity for main thread even with openmp
* enable openmp by default
* fix msvc build
* disable openmp on macos
* ci : disable openmp with thread sanitizer
* Update ggml.c
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
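A minimal sketch of the parallel-region pattern the bullets above describe, under illustrative names (not the real ggml compute loop): request n_threads, let one thread record how many threads OpenMP actually granted, then have every thread do its share of the work.

    #include <omp.h>

    static void compute_parallel(int n_threads_requested,
                                 void (*worker)(int ith, int nth, void * ctx), void * ctx) {
        int n_threads = n_threads_requested;
        #pragma omp parallel num_threads(n_threads_requested)
        {
            #pragma omp single
            {
                // OpenMP may hand us fewer threads than requested; update the shared count
                n_threads = omp_get_num_threads();
            } // implicit barrier: all threads see the updated n_threads below
            worker(omp_get_thread_num(), n_threads, ctx);
        }
    }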
|
|
* ggml : fix loongson compile warnings
ggml-ci
* Fix loongarch quantize test failure.
Fix an unexpected error introduced during rebase.
* tests : disable json test due to lack of python on the CI node
ggml-ci
---------
Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
|
|
* faster avx512 exp implementation
* x->r
* improve accuracy, handle special cases
* remove `e`
|
|
|
|
* tests : add rope tests
ggml-ci
* ggml : fixes (hopefully)
ggml-ci
* tests : add non-cont tests
ggml-ci
* cuda : add asserts for rope/norm + fix DS2
ggml-ci
* ggml : assert contiguousness
* tests : reduce RoPE tests
ggml-ci
|
|
|
|
* ggml : use atomic_flag for critical section (see the sketch below)
* add windows shims
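A minimal sketch of a critical section built on C11 atomic_flag (illustrative, not the exact ggml code; on Windows the shims would map these C11 atomics to compiler intrinsics):

    #include <stdatomic.h>

    static atomic_flag g_lock = ATOMIC_FLAG_INIT;

    static void critical_section_start(void) {
        while (atomic_flag_test_and_set_explicit(&g_lock, memory_order_acquire)) {
            // spin until whoever holds the lock clears the flag
        }
    }

    static void critical_section_end(void) {
        atomic_flag_clear_explicit(&g_lock, memory_order_release);
    }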
|
|
ggml-ci
|
|
|
|
* ggml : generalize GGML_OP_CONCAT (WIP)
ggml-ci
* tests : add dim != 2 tests
* metal : generalize concat kernel
* tests : naming
* cuda : generalize concat kernel
ggml-ci
* sycl : add warning and assert
* ggml : fix op params handling
* metal : bugfix kernel
ggml-ci
* ggml : reimplement CPU and Metal
* cuda : add asserts
ggml-ci
* ggml : fix ptrs
ggml-ci
|
|
* Add SVE support for q4_0_q8_0 q8_0_q8_0
* remove ifdef
|
|
ggml-ci
|
|
* ggml : drop support for QK_K=64
ggml-ci
* opencl : restore QK_K=256 define
|
|
* cuda : fix rope pos data
ggml-ci
* ggml : drop mode & 1 == 1 support for ggml_rope
ggml-ci
* ggml : support freq_factors for f16 rope (CPU)
ggml-ci
* tests : add rope tests using frequency factors
ggml-ci
|
|
* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from the gguf converter
* fix lint warnings on convert-hf-to-gguf.py
* set the short freq factor when the context size is smaller than the trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* add loongarch lsx and lasx optimization code
* Add loongarch compilation support to makefile
* revert stb_image.h
* opt bytes_from_nibbles_32 and sum_i16_pairs_float
* fix undeclared
* format code
* update
* update 2
---------
Co-authored-by: Jinyang He <hejinyang@loongson.cn>
|
|
for enabling AVX512_BF16 (#7258)
|
|
|
|
* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM
ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir
|
|
This change upstreams llamafile's vectorized expf() functions. This lets
us compute softmax and silu more accurately than the short[65536] lookup
table that GGML previously used to make this operation go faster. We can
support aarch64 and sse2+ with a worst-case rounding error of 2 ulp. It
makes `make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf`
go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x on AVX512.
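For reference, a scalar illustration of the classic range-reduction scheme such vectorized expf kernels are built on; this is not the llamafile code, makes no 2-ulp claim, and skips overflow/underflow handling.

    #include <math.h>

    // exp(x) = 2^k * exp(r), with k = round(x/ln2) and r = x - k*ln2, so |r| <= ln2/2
    static float expf_sketch(float x) {
        const float ln2 = 0.6931471805599453f;
        const float k   = nearbyintf(x/ln2);
        const float r   = x - k*ln2;
        // short Taylor polynomial for exp(r) on the reduced range (Horner form)
        const float p = 1.0f + r*(1.0f + r*(0.5f + r*(1.0f/6 + r*(1.0f/24 + r*(1.0f/120)))));
        return ldexpf(p, (int) k); // p * 2^k
    }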
|
|
* Just reordering some structs.
* Adding in the calls to mm_pause
* Passing around the state
* Renaming and moving a bunch of variables around.
* Extracting the logic to its own function.
* Moving some variable definitions into the chunk function.
* Moving some variables around
* moving src1_cont inside
* Moving row_size
* adding the current_chunk
* Reorg the code.
* Formatting to match the orig patch
* starting to setup the chunking variables
* Starting the buildup of the loop
* The yield shouldn't be necessary.
* adding the looping structure based on the chunk configuration (see the sketch after this list)
* Add in the re-chunking code.
* Making it much more likely to rechunk.
* disable resizing if numa is enabled.
* Updating comments with what we've learned.
* Fix formatting
* Couple more formatting fixes.
* More style fixes.
* Fix Warnings
* Going with unused because there's conditional logic that needs it.
* Update ggml.c
* Update ggml.c
---------
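A hedged sketch of the chunked scheduling these commits build up (struct and function names are illustrative, not the real ggml ones): threads pull chunk indices from a shared atomic counter instead of being statically assigned fixed row ranges.

    #include <stdatomic.h>

    typedef struct {
        atomic_int current_chunk; // next chunk index to hand out
        int        n_chunks;      // how many chunks the matrix was split into
    } chunk_state;

    static void process_chunks(chunk_state * state, int ith,
                               void (*do_chunk)(int chunk, int ith)) {
        // grab a chunk, work on it, repeat until no chunks are left
        int chunk = atomic_fetch_add(&state->current_chunk, 1);
        while (chunk < state->n_chunks) {
            do_chunk(chunk, ith);
            chunk = atomic_fetch_add(&state->current_chunk, 1);
        }
    }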
|
|
|
|
* initial commit with CPU implementation of upscale to shape and test, cuda implementation next
* experimental commit to see if dst shape is correct
* test version
* test
* removed unnecessary params
* refactor
* fixed tests
* ggml : metal impl + cleanup + sycl dev warnings
* patched ggml_upscale cuda op to handle non-contiguous tensors, added test for non-contiguous behavior
* metal : fix upscale op to support nb00 + style
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* ggml : fa without mask + add asserts
ggml-ci
* metal : support non-contiguous KV
ggml-ci
|
|
|
|
ggml-ci
|
|
* added sigmoid function (reference form sketched below)
* implemented metal kernel for sigmoid
* implemented cuda kernel for sigmoid
* added sigmoid unary op and incremented count
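For reference, the textbook form of the op these kernels implement (a trivial scalar sketch, not any of the kernels above):

    #include <math.h>

    // sigmoid(x) = 1 / (1 + e^-x)
    static float sigmoidf(float x) {
        return 1.0f/(1.0f + expf(-x));
    }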
|
|
* ggml : full ALiBi support (bias form recalled below)
* ggml : update ggml_soft_max_ext() CUDA, SYCL
* ggml : ggml_flash_attn_ext() support ALiBi (CPU)
* ggml : ggml_flash_attn_ext() support ALiBi (Metal)
* ggml : fix warning
* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)
ggml-ci
* ggml : fix assert message
* vulkan : add dev notes
* ggml : require mask when using ALiBi
ggml-ci
* convert : fix convert for refact models
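For orientation, the general ALiBi form (per the ALiBi paper; ggml's exact parameterization via the max_bias parameter and per-head slopes may differ in detail):

    % per-head linear bias added to the attention logits, with geometric slopes m_h
    \operatorname{softmax}\!\left( s\, q_i k_j^{\top} - m_h\,(i - j) \right),
    \qquad m_h = 2^{-8h/H}, \quad h = 1, \dots, H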
|