Age | Commit message | Author |
|
* updated server readme to reflect the gg/server-token-probs-4088 commit
added an explanation of the API's completion result, which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.
* simplified the `completion_probabilities` JSON schema
It's now easier to understand what the structure of `completion_probabilities` looks like.
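To make the structure concrete, here is a hedged sketch of parsing a response that carries `completion_probabilities`. The field names (`content`, `probs`, `tok_str`, `prob`) and the probability values are illustrative, taken from the schema the README describes, not captured from a live server:

```python
import json

# Illustrative server response; the values are made up.
response = json.loads("""
{
  "content": "Hello",
  "completion_probabilities": [
    {
      "content": "Hello",
      "probs": [
        {"tok_str": "Hello", "prob": 0.93},
        {"tok_str": "Hi",    "prob": 0.05}
      ]
    }
  ]
}
""")

# One entry per generated token; each entry lists candidate token
# strings together with their probabilities.
for entry in response["completion_probabilities"]:
    for cand in entry["probs"]:
        print(f'{entry["content"]!r}: {cand["tok_str"]!r} -> {cand["prob"]}')
```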
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
|
|
* ggml : fix vld1q_s8_x4 32-bit compat
ggml-ci
* ggml : fix 32-bit ARM compat (cont)
ggml-ci
|
|
|
|
See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57
|
|
|
|
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change this later; for now, this is what we have.
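The -127...127 range mentioned above can be sketched with a plain absmax scheme. This is a toy illustration of symmetric 8-bit quantization, not ggml's actual Q8_K (which works on 256-value blocks and carries extra per-block data):

```python
def quantize_q8_symmetric(xs):
    """Toy symmetric quantization: scale so the largest magnitude
    maps to +/-127, then round and clamp into -127...127."""
    amax = max(abs(v) for v in xs)
    if amax == 0.0:
        return [0] * len(xs), 0.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in xs]
    return q, scale

q, scale = quantize_q8_symmetric([0.5, -1.0, 0.25])
# The largest-magnitude input maps to -127; q[i] * scale approximates xs[i].
```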
* iq2_xxs: ARM_NEON dot product
Strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
Dequantize works, something is still wrong with the
dot product.
* iq2_xxs: Metal dot product now works
We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s
Not the greatest performance, but not complete garbage either.
* iq2_xxs: slightly faster dot product
TG-128 is now 48.4 t/s
* iq2_xxs: slightly faster dot product
TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
TG-128 is now 54.1 t/s.
Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.
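For readers unfamiliar with the signs table mentioned above, here is a hedged sketch of how such a table can work; the name `ksigns` and the parity convention are assumed for illustration and may differ from ggml's actual tables:

```python
# Each 8-value group stores only 7 sign bits; bit 7 of the expanded byte
# is a parity bit derived from the other seven, so it never needs storing.
# The table maps the 7 stored bits to the full 8-bit sign mask.
ksigns = [i | ((bin(i).count("1") & 1) << 7) for i in range(128)]

def apply_signs(values, packed):
    """Negate values[j] wherever bit j of the expanded sign byte is set."""
    s = ksigns[packed]
    return [-v if (s >> j) & 1 else v for j, v in enumerate(values)]

apply_signs([1, 1, 1, 1, 1, 1, 1, 1], 0b0000001)  # -> [-1, 1, 1, 1, 1, 1, 1, -1]
```

Because the table is consulted for every group of eight values, keeping it in fast (shared) memory pays off, which matches the observation above.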
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
I had put the ggml_supports_mmq call in the wrong place.
* Fix bug in quantize_row_iq2_xxs
The 0.25f factor was missing.
Great detective work by @ggerganov!
* Fixing tests
* PR suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* llama : "self-extend"-like context extension
* passkey : add comment
* main : add Self-Extend support
* llama : add comment about llama_kv_cache_seq_div
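The effect of `llama_kv_cache_seq_div` on cached positions can be sketched as follows. This is a toy model of the position arithmetic only, not the actual KV-cache code:

```python
def kv_cache_seq_div(positions, p0, p1, divisor):
    """Toy sketch: cached positions in [p0, p1) are integer-divided,
    so groups of `divisor` neighbouring positions collapse onto one
    shared position -- the "self-extend" grouped-attention trick that
    stretches the effective context window."""
    return [p // divisor if p0 <= p < p1 else p for p in positions]

kv_cache_seq_div(list(range(8)), 0, 8, 4)  # -> [0, 0, 0, 0, 1, 1, 1, 1]
```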
|
|
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* make : add passkey target
* passkey : add "self-extend"-like context extension (#4810)
* llama : "self-extend"-like context extension
* passkey : add comment
* passkey : add readme
|
|
The 64-bit pkg-config file in OpenBLAS v0.3.22 is named openblas64.pc
https://github.com/OpenMathLib/OpenBLAS/issues/3790
|
|
betwen -> between
|
|
ggml-ci
|
|
|
|
|
|
* ggml : do not sched_yield when calling BLAS
ggml-ci
* ggml : fix do_yield logic
ggml-ci
* ggml : simplify do_yield logic
ggml-ci
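The idea behind the do_yield logic can be sketched in a few lines. This is a generic illustration (POSIX `sched_yield`), not ggml's actual thread pool: when the heavy work runs inside an external library such as BLAS, spinning workers should give up the CPU instead of busy-waiting against the BLAS threads.

```python
import os
import threading

def spin_wait(done: threading.Event, do_yield: bool) -> None:
    """Spin until `done` is set; optionally yield the CPU on each
    iteration so external (e.g. BLAS) threads can make progress."""
    while not done.is_set():
        if do_yield:
            os.sched_yield()  # hand the CPU to other runnable threads

done = threading.Event()
waiter = threading.Thread(target=spin_wait, args=(done, True))
waiter.start()
done.set()                 # pretend the BLAS call has completed
waiter.join(timeout=5.0)   # the waiter exits once `done` is set
```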
|
|
|
|
This commit removes unused includes from finetune.cpp.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* swiftui: support load model from file picker
* swiftui: remove trailing whitespace
|
|
* fix examples/server/README.md
* minor : fix whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* metal: fix metal backend init failure in swiftui
* metal: build ggml.metallib instead of copy src
* llama.swift : remove debug flags from metallib build
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
This commit fixes a typo in the help message for the
--overlapping-samples option.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* updates the Package.swift to use ggml as a dependency
* changes the ggml package URL source to ggerganov
|
|
Co-authored-by: slaren <slarengh@gmail.com>
|
|
ggml-ci
|
|
ggml-ci
|
|
ggml-ci
|
|
|
|
* add more int ops
* ggml_compute_forward_dup_bytes
* add tests
* PR comments
* tests : minor indentations
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (#4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
* metal : optimizing ggml_mul_mat_id (wip)
* metal : minor fix
* metal : opt mul_mm_id
|
|
* server: add token counts to stats
* server: generate hpp
---------
Co-authored-by: phiharri <ph@got-root.co.uk>
|
|
|
|
* replaced all API-facing `int`s with `int32_t`
* formatting and missed `int` in `llama_token_to_piece`
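The motivation for preferring `int32_t` in a public API can be shown from the binding side; this is a generic sketch, not llama.cpp's actual bindings:

```python
import ctypes

# C's plain `int` is only guaranteed to be at least 16 bits, so its size
# can vary by platform/ABI; `int32_t` is exactly 4 bytes everywhere,
# which lets FFI callers rely on a fixed layout.
fixed = ctypes.sizeof(ctypes.c_int32)  # always 4
plain = ctypes.sizeof(ctypes.c_int)    # usually 4, but not promised by C
```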
|