| Age | Commit message | Author |
|
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
On the very first shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105
Not bad, but slightly higher than
sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
PPL = 9.14 for LLaMA-v2-7B
PPL = 6.63 for LLaMA-v2-13B
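For reference, the expected values above follow from a simple worked formula: assuming log(PPL) is roughly linear in bits per weight over this range, the midpoint in bpw corresponds to the geometric mean of the two endpoint perplexities.
```latex
% Expected PPL for IQ1_M, given that it sits halfway between IQ1_S and
% IQ2_XXS in bits per weight and that log(PPL) is ~linear in bpw:
\mathrm{PPL}_{\mathrm{expected}}(\mathrm{IQ1\_M})
    = \sqrt{\mathrm{PPL}(\mathrm{IQ1\_S}) \cdot \mathrm{PPL}(\mathrm{IQ2\_XXS})}
```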
* iq1_m: go to 3-bit scales
There is a slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.
We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw
* iq1_m: scalar dot product
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.
* iq1_m: slightly faster ARM_NEON dot product
10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
14.9 t/s -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
After quantizing the block scales, redo the super-block scale fit.
PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B ) = 8.1624
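The refit described above is a one-parameter least-squares problem: with the per-block scales already quantized, the optimal super-block scale has a closed form. A minimal sketch of the idea, with illustrative names rather than the actual ggml quantization code:
```cpp
#include <cstddef>

// Least-squares refit of the super-block scale d after the per-block scales
// have been quantized, minimizing sum_i (x[i] - d * sb[i] * q[i])^2.
//   x  : original float weights of the super-block
//   q  : quantized values decoded to their grid points
//   sb : (already quantized) per-block scale applied to each weight
static float refit_superblock_scale(const float * x, const float * q,
                                    const float * sb, size_t n) {
    double sum_xy = 0.0; // sum of x[i] * (sb[i]*q[i])
    double sum_yy = 0.0; // sum of (sb[i]*q[i])^2
    for (size_t i = 0; i < n; ++i) {
        const double y = (double) sb[i] * q[i];
        sum_xy += (double) x[i] * y;
        sum_yy += y * y;
    }
    return sum_yy > 0.0 ? (float) (sum_xy / sum_yy) : 0.0f;
}
```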
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* quantize: be able to override metadata by key
* minor : spacing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* embedding: assign `n_ubatch` value, print error on `n_batch` overflow
* Update examples/embedding/embedding.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* use %ld instead of %lld
* Revert "use %ld instead of %lld"
This reverts commit ea753ede90a86a0699f65878cc8e2020ff5eabb8.
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
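The guard added here amounts to validating each prompt against the logical batch size before decoding, with n_ubatch forced to the same value. A rough standalone sketch; the helper and variable names are hypothetical, not the exact code in examples/embedding/embedding.cpp:
```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch: every prompt must fit into a single logical batch.
static bool prompt_fits_batch(const std::vector<int> & tokens, int n_batch) {
    if ((int) tokens.size() > n_batch) {
        std::fprintf(stderr,
            "error: prompt has %zu tokens, exceeding the batch size %d; increase it with -b\n",
            tokens.size(), n_batch);
        return false;
    }
    return true;
}

int main() {
    int n_batch  = 2048;
    int n_ubatch = n_batch;                  // mirrors "assign n_ubatch value"
    std::vector<int> prompt_tokens(4096, 1); // stand-in for a tokenized prompt
    if (!prompt_fits_batch(prompt_tokens, n_batch)) {
        return 1;
    }
    (void) n_ubatch;
    return 0;
}
```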
|
|
|
|
* Symlink to /usr/bin/xcrun so that the `xcrun` binary
is usable during the build (used for compiling Metal shaders)
Fixes https://github.com/ggerganov/llama.cpp/issues/6117
* cmake - copy default.metallib to install directory
When the Metal files are compiled to default.metallib, CMake needs to copy it to the install directory so that it's visible to llama-cpp
Also, update package.nix to use an absolute path for default.metallib (it's not finding the bundle)
* add `precompileMetalShaders` flag (defaults to false) to disable precompilation of Metal shaders
Precompilation requires Xcode to be installed and the sandbox to be disabled on nix-darwin
|
|
|
|
Since no blas was provided to buildInputs, the executable is built without BLAS support.
This is a backport of NixOS/nixpkgs#298567
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
→ 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
|
|
|
|
|
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
|
|
* fix LOG() error for SYCL, enhance error checking by CI
* rollback to bash
* add newline at end of file
|
|
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix the `--context-file` option so it can be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
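The ranking step behind "store similarities in separate vector" is cosine similarity between the query embedding and each chunk embedding, followed by a sort. A minimal standalone sketch with made-up data, not the actual retrieval.cpp code:
```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-6f);
}

int main() {
    // Made-up embeddings: one query and three context chunks.
    std::vector<float> query_emb = {0.1f, 0.9f, 0.2f};
    std::vector<std::vector<float>> chunk_embs = {
        {0.1f, 0.8f, 0.3f}, {0.9f, 0.1f, 0.0f}, {0.2f, 0.7f, 0.1f},
    };

    // Similarities kept in a separate vector so the chunks can be ranked.
    std::vector<std::pair<float, size_t>> sims;
    for (size_t i = 0; i < chunk_embs.size(); ++i) {
        sims.emplace_back(cosine_similarity(query_emb, chunk_embs[i]), i);
    }
    std::sort(sims.rbegin(), sims.rend()); // highest similarity first

    for (const auto & [score, idx] : sims) {
        std::printf("chunk %zu: similarity %.3f\n", idx, score);
    }
    return 0;
}
```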
|
|
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some
architectures (e.g. AMD Zen 4).
|
|
* would throw an error on VS2022 at GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc takes a size in bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* Fixes error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248
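The bullets above describe a classic wide-character sizing bug: the buffer is sized in characters while malloc expects bytes, so writing wchar_t elements runs past the end of the allocation. A minimal illustration of the bug and the fix, not the actual ggml code:
```cpp
#include <cstdlib>
#include <cstring>

// Convert a narrow fopen() mode string such as "rb" into a wide string.
static wchar_t * mode_to_wide(const char * mode) {
    const size_t len = strlen(mode) + 1;  // element count incl. terminator
    // BUG:   wchar_t * wmode = (wchar_t *) malloc(len);   // bytes, not elements
    // FIX: size the allocation in bytes, i.e. elements * sizeof(wchar_t):
    wchar_t * wmode = (wchar_t *) malloc(len * sizeof(wchar_t));
    if (!wmode) {
        return nullptr;
    }
    for (size_t i = 0; i < len; ++i) {
        wmode[i] = (wchar_t) mode[i];     // the buggy loop writes len wchar_t elements
    }
    return wmode;                         // caller must free()
}

int main() {
    wchar_t * wmode = mode_to_wide("rb");
    free(wmode);
    return 0;
}
```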
|
|
* imatrix : fix wname for mul_mat_id ops
* also filter tensor names in mul_mat_id ops
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|
|
|
|
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
|
|
* remove non-USM methods
* leave the schedule to ggml_backend_sched entirely
|
|
* support release win
* fix value
* fix value
* fix value
* fix error
* fix error
* fix format
|
|
also fix missing #defines before windows.h, and BPE LF token on MSVC
|
|
|
|
* llama: llama_split_prefix: fix strncpy not including the string terminator (illustrated in the sketch after this commit)
common: llama_load_model_from_url:
- fix case-sensitive header name handling
- support downloading additional splits in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: re-enable the Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
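The strncpy pitfall referenced in the first item of this commit is that strncpy(dst, src, n) does not write a terminating '\0' when src has n or more characters. A small standalone illustration with made-up strings, showing snprintf as the usual safer alternative:
```cpp
#include <cstdio>
#include <cstring>

int main() {
    const char * prefix = "some-fairly-long-model-prefix";
    char buf[8];

    // Pitfall: when strlen(prefix) >= sizeof(buf), strncpy fills buf completely
    // and does NOT append a terminating '\0'.
    strncpy(buf, prefix, sizeof(buf));
    buf[sizeof(buf) - 1] = '\0';   // manual termination is required after strncpy

    // Safer alternative: snprintf always null-terminates (truncating if needed).
    char buf2[8];
    snprintf(buf2, sizeof(buf2), "%s", prefix);

    printf("%s\n%s\n", buf, buf2);
    return 0;
}
```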
|
|
(#6254)
|
|
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
* lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
|
|
|
|
* convert-llama2c-to-ggml: enable conversion of multiquery models, #5608
* add test in build action
* Update build.yml
* Update build.yml
* Update build.yml
* gg patch
|
|
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* split: support in llama_model_loader
* avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedback:
- use only one gguf_context, for metadata only
- store all ggml_context objects in a vector, like the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional; make some layer-creation tensors optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer if the allocation succeeds only
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any backend buffer cannot be allocated
* spacing
Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if the declared n_tensors does not equal the number of tensors loaded from the split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at() instead of operator[] when the lookup should never insert into the map (see the sketch after this commit).
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version: pass the destination max length instead of the split path length.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
ggml-ci
* llama : introduce some typedef helpers
* docs: add model sharding to the hot topics
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
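One of the review items in this commit, "use at() instead of operator[] when the lookup should never insert into the map", hinges on a subtle standard-library behavior: operator[] default-constructs and inserts a value for a missing key, while at() throws std::out_of_range and leaves the map untouched. A small standalone illustration with made-up contents:
```cpp
#include <cstdio>
#include <stdexcept>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> tensor_index = {{"token_embd.weight", 0}};

    // operator[] on a missing key silently inserts a default-constructed value,
    // which can hide lookup bugs in a loader that expects the key to exist.
    int a = tensor_index["output.weight"];   // inserts {"output.weight", 0}
    std::printf("size after operator[]: %zu (a=%d)\n", tensor_index.size(), a);

    // at() never inserts; a missing key is reported immediately.
    try {
        int b = tensor_index.at("blk.0.attn_q.weight");
        (void) b;
    } catch (const std::out_of_range & e) {
        std::printf("at() threw: %s\n", e.what());
    }
    return 0;
}
```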
|
|
|
|
* common : add HF arg helpers
* common : remove defaults
|
|
IQ3_XS was not mentioned, and IQ3_S and IQ3_M were present twice.
This PR corrects that in the manner which was probably intended initially.
|
|
* json: only attempt python & node schema conversion tests if their bins are present
Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978
disabled in https://github.com/ggerganov/llama.cpp/pull/6198
* json: orange warnings when tests skipped
* json: ensure py/js schema conv tested on ubuntu-focal-make
* json: print env vars in test
|
|
* json: ordered json in server/schema converter to respect the original order
* json: ws nits
* json: support non-string const / enums
|
|
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy
* add LLAMA_CUDA_NO_PEER_COPY to HIP build
|
|
|
|
|
|
|
|
* metal : proper assert for mat-mat memory alignment
ggml-ci
* readme : add notice about the bug fix
* metal : fix the fix
ggml-ci
|
|
|
|
* metal : require ne00 >= 128 for mat-mat kernels
ggml-ci
* llama : pad n_ctx by 32
ggml-ci
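"Pad n_ctx by 32" means rounding the requested context size up to the next multiple of 32. A minimal sketch of that rounding; the helper name pad_to is illustrative, though ggml's GGML_PAD macro performs the same kind of power-of-two round-up:
```cpp
#include <cstdint>
#include <cstdio>

// Round x up to the next multiple of n, where n is a power of two.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    std::printf("%u\n", pad_to(4001, 32)); // 4032: a non-aligned n_ctx is rounded up
    std::printf("%u\n", pad_to(4096, 32)); // 4096: already a multiple of 32, unchanged
    return 0;
}
```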
|
|
|
|
* Fix conversion of underscores to dashes in params.
* Update common/common.cpp
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|