From here on switching to GCC 12.
PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @ 4 threads
23.0 t/s @ 8 threads
25.1 t/s @ 16 threads
|
|
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s.
Very strange.
|
|
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
|
|
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
|
|
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
|
|
|
|
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
|
|
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
|
|
Still missing iq1_s and iq1_m, but I don't think I'll do those.
|
|
Here we get 3.65X (!) for PP-512 (53 t/s).
|
|
|
|
We get 2.66X for PP-512 (42.35 t/s)
|
|
We get 2.2X for PP-512 (52 t/s)
|
|
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
|
|
|
|
|
|
We get a ~5% speedup for TG-128 and 3X for PP-512.
|
|
We get 31 t/s up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
|
|
|
|
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.
* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.
* Merging improved schema test methods added by @ochafik in #7797
* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.
* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.
* Fixing grammar indentation to be consistent throughout file.
|
|
* vulkan: detect multiple devices by deviceUUID instead of deviceID
* vulkan: remove unneeded variables
* vulkan: fix id query
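The UUID-based detection in the first bullet can be sketched as follows. This is a simplified illustration, not the actual Vulkan backend code; the real implementation reads `deviceUUID` from `VkPhysicalDeviceIDProperties` via `vkGetPhysicalDeviceProperties2`:

```c
#include <stdbool.h>
#include <string.h>

#define UUID_SIZE 16 // matches VK_UUID_SIZE

// Two physical devices refer to the same GPU only if their UUIDs match.
// deviceID alone can collide when two identical cards are installed,
// which is why detection switched from deviceID to deviceUUID.
static bool same_device(const unsigned char a[UUID_SIZE],
                        const unsigned char b[UUID_SIZE]) {
    return memcmp(a, b, UUID_SIZE) == 0;
}
```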
|
|
* initial iq4_xs
* fix ci
* iq4_nl
* iq1_m
* iq1_s
* iq2_xxs
* iq3_xxs
* iq2_s
* iq2_xs
* iq3_s before sllv
* iq3_s
* iq3_s small fix
* iq3_s sllv can be safely replaced with sse multiply
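The last bullet relies on the identity that a variable left shift is a multiplication by the corresponding power of two, which is what allows `_mm256_sllv_epi32` to be swapped for an integer multiply where the multiply is faster. A scalar sketch of the identity (plain C, not the actual intrinsics):

```c
#include <stdint.h>

// For unsigned 32-bit x and s < 32, x << s equals x * 2^s.
// This is what makes the sllv -> multiply substitution safe.
static uint32_t shift_via_multiply(uint32_t x, uint32_t s) {
    return x * (uint32_t)(1u << s);
}
```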
|
|
ggml-ci
|
|
* create append_pooling operation; allow specifying attention_type; add last token pooling; update examples
* find result_norm/result_embd tensors properly; update output allocation logic
* only use embd output for pooling_type NONE
* get rid of old causal_attn accessor
* take out attention_type; add in llama_set_embeddings
* bypass logits when doing non-NONE pooling
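For context on the pooling bullets: last-token pooling reduces the per-token embedding matrix to the final token's vector, in contrast with mean pooling, which averages across tokens. A minimal scalar sketch (hypothetical helper names, not llama.cpp's actual API):

```c
#include <stddef.h>

// embd is an n_tokens x n_embd row-major matrix of per-token embeddings;
// both helpers reduce it to a single n_embd vector in out.

// Last-token pooling: copy the embedding of the final token.
static void pool_last(const float * embd, size_t n_tokens, size_t n_embd,
                      float * out) {
    const float * last = embd + (n_tokens - 1) * n_embd;
    for (size_t i = 0; i < n_embd; ++i) out[i] = last[i];
}

// Mean pooling: average each embedding dimension across tokens.
static void pool_mean(const float * embd, size_t n_tokens, size_t n_embd,
                      float * out) {
    for (size_t i = 0; i < n_embd; ++i) {
        float acc = 0.0f;
        for (size_t t = 0; t < n_tokens; ++t) acc += embd[t * n_embd + i];
        out[i] = acc / (float)n_tokens;
    }
}
```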
|
|
|
|
|
|
|
|
* common: fix warning
* Update common/common.cpp
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* add sycl preset
* fix debug link error. fix windows crash
* update README
|
|
* CUDA: stream-k decomposition for MMQ
* fix undefined memory reads for small matrices
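The stream-k idea in the first bullet: rather than assigning whole output tiles to thread blocks (which leaves SMs idle when the tile count does not divide evenly), the total iteration work is split as evenly as possible across SMs, with partial-tile results fixed up afterwards. A simplified scalar sketch of the even split (an illustration of the decomposition, not the actual CUDA kernel):

```c
// Number of K-iterations assigned to one SM under an even stream-k style
// split: each SM gets total_iters / n_sms, and the first
// total_iters % n_sms SMs get one extra iteration.
static int streamk_work_for_sm(int total_iters, int n_sms, int sm) {
    int base = total_iters / n_sms;
    int rem  = total_iters % n_sms;
    return base + (sm < rem ? 1 : 0);
}
```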
|
|
Currently the Metal backend does not support BF16. `ggml_metal_supports_op` was returning true in these cases, leading to a crash with models converted with `--leave-output-tensor`. This commit checks whether any of the first few source types are BF16 and returns false if that's the case.
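A self-contained sketch of the described guard, using stand-in types for illustration (the real check lives in `ggml_metal_supports_op` and operates on ggml tensors):

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical minimal stand-ins for the real ggml types.
typedef enum { TYPE_F32, TYPE_F16, TYPE_BF16 } tensor_type;

typedef struct tensor {
    tensor_type     type;
    struct tensor * src[2]; // the "first few sources" of the op
} tensor;

// Report the op as unsupported when any of the first sources is BF16,
// so the backend falls back instead of crashing.
static bool metal_supports_op(const tensor * op) {
    for (size_t i = 0; i < 2; ++i) {
        if (op->src[i] && op->src[i]->type == TYPE_BF16) {
            return false;
        }
    }
    return true; // the real code continues with per-op checks here
}
```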
|
|
|
|
* un-ignore `build-info.cmake` and `build-info.sh`
I am assuming that ignoring them was unintentional. If they are ignored, some tools, like cargo, will consider the files nonexistent for the purpose of publishing, even if they're committed. This leads to the build failing in such cases.
* un-ignore `build-info.cpp.in`
For the same reason as the previous two files.
* Reorganize `.gitignore`
* Add exceptions for files mentioned by @slaren
I did leave .clang-tidy since it was explicitly ignored before.
* Add comments for organization
* Sort some lines for pretty
* Test with `make` and `cmake` builds to ensure no build artifacts might be committed
* Remove `.clang-tidy` from `.gitignore`
Per comment by @ggerganov
* Remove `IDEWorkspaceChecks.plist` from root-level `.gitignore`
|
|
|
|
|
|
* separate lower precision GEMM from the main files
* fix workgroup size hardcode
|
|
* Random test: add_bos_token, add_eos_token
* Random test: add BPE models for testing
* Custom regex split fails with codepoint 0
* Fix falcon punctuation regex
* Refactor llm_tokenizer_bpe: move code to constructor
* Move 'add_special_bos/eos' logic to llm_tokenizer_bpe
* Move tokenizer flags to vocab structure.
* Default values for special_add_bos/eos
* Build vocab.special_tokens_cache using vocab token types
* Generalize 'jina-v2' per token attributes
* Fix unicode whitespaces (deepseek-coder, deepseek-llm)
* Skip missing byte tokens (falcon)
* Better unicode data generation
* Replace char32_t with uint32_t
|
|
* Only use FIM middle if it exists
* Only use FIM middle if it exists
|
|
|
|
On hosts that are not prepared/dedicated to run CUDA code it is still
possible to compile llama.cpp with CUDA support by installing just the
development packages. What is missing are the runtime libraries such as
/usr/lib64/libcuda.so*, so currently the link step fails.
The development environment is prepared for such situations: stub
libraries for all the CUDA libraries are available in the
$(CUDA_PATH)/lib64/stubs directory. Adding this directory to the end of
the search path changes nothing for environments that currently work
fine, but also enables compiling llama.cpp when the runtime code is not
available.
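The fix described above amounts to appending the stubs directory to the linker search path, along these lines (a sketch; the exact Makefile variable name depends on the build setup):

```make
# Append the CUDA stubs directory at the END of the library search path so
# linking against libcuda.so succeeds even without the runtime driver.
# Environments that already work are unaffected, since their real
# libraries are found first.
LDFLAGS += -L$(CUDA_PATH)/lib64/stubs
```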
|
|
Signed-off-by: thxCode <thxcode0824@gmail.com>
|
|
|
|
|
|
* whisper : use ggml_backend_sched (wip)
* use sched in whisper_allocr
* whisper : single backend in whisper_context
* whisper : remove whisper_state->backends_used
* whisper : remove whisper_context->backend
* whisper : reset scheduler after init
* whisper : fix external encoder (e.g. CoreML)
* whisper : cleanup
* whisper : handle null GPU buffer types + fix sycl
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* update: convert-hf-to-gguf.py to support Qwen2-57B-A14B
* fix: QWEN2MOE support for expert_feed_forward_length
previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH
n_ff_exp and n_ff_shared_exp are now properly calculated
* update: convert-hf-to-gguf.py cleanup for Qwen2MoeForCausalLM
* fix: QWEN2MOE support for expert_feed_forward_length
previously, expert ff was taken from n_ff (intermediate size) but it is now properly taken from LLM_KV_EXPERT_FEED_FORWARD_LENGTH
n_ff_exp and n_ff_shexp are now properly calculated
|
|
|
|
|
|
|
|
|
|
Signed-off-by: thxCode <thxcode0824@gmail.com>
|