Age | Commit message (Collapse) | Author |
|
|
|
* quantize: be able to override metadata by key
* minor : spacing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* embedding: assign `n_ubatch` value, print error on `n_batch` overflow
* Update examples/embedding/embedding.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* use %ld instead of %lld
* Revert "use %ld instead of %lld"
This reverts commit ea753ede90a86a0699f65878cc8e2020ff5eabb8.
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
|
|
|
|
* Symlink to /usr/bin/xcrun so that `xcrun` binary
is usable during build (used for compiling Metal shaders)
Fixes https://github.com/ggerganov/llama.cpp/issues/6117
* cmake - copy default.metallib to install directory
When metal files are compiled to default.metallib, Cmake needs to add this to the install directory so that it's visible to llama-cpp
Also, update package.nix to use absolute path for default.metallib (it's not finding the bundle)
* add `precompileMetalShaders` flag (defaults to false) to disable precompilation of metal shader
Precompilation requires Xcode to be installed and requires disable sandbox on nix-darwin
|
|
|
|
Since no blas was provided to buildInputs, the executable is built without blas support.
This is a backport of NixOS/nixpkgs#298567
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
→ 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
|
|
|
|
|
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
|
|
* fix LOG() error for SYCL, enhance erro check by CI
* rollback to bash
* add newline at end of file
|
|
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix `--context-file` option to be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some
architectures (e.g. AMD Zen 4).
|
|
* would throw error on VS2022 on GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc wants bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* Fixes error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248
|
|
* imatrix : fix wname for mul_mat_id ops
* also filter tensor names in mul_mat_id ops
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|
|
|
|
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
|
|
* remove no USM methods
* leave the schedule to ggml_backend_sched entirely
|
|
* support release win
* fix value
* fix value
* fix value
* fix error
* fix error
* fix format
|
|
also fix missing #defines before windows.h, and BPE LF token on MSVC
|
|
|
|
* llama: llama_split_prefix fix strncpy does not include string termination
common: llama_load_model_from_url:
- fix header name case sensitive
- support downloading additional split in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url max length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
(#6254)
|
|
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
* lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
|
|
|
|
* convert-llama2c-to-ggml: enable conversion of multiqueries, #5608
* add test in build action
* Update build.yml
* Update build.yml
* Update build.yml
* gg patch
|
|
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* split: support in llama_model_loader
* avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedbacks:
- use only one gguf_context for metadata only
- store all ggml_context in a vector as the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional, switch some layer creation tensor optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer if the allocation succeeds only
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any of backend buffer cannot be allocated
* spacing
Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at instead of operator[] if this should never add to the map.
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version, not pass split path len but dest max len.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
ggml-ci
* llama : introduce some typedef helpers
* docs: add model shard in hot topic
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
|
|
|
|
* common : add HF arg helpers
* common : remove defaults
|
|
IQ3_XS was not mentioned, IQ3_S and IQ3_M were present twice.
That PR corrects this in the manner which was probably intended initially.
|
|
* json: only attempt python & node schema conversion tests if their bins are present
Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978
disabled in https://github.com/ggerganov/llama.cpp/pull/6198
* json: orange warnings when tests skipped
* json: ensure py/js schema conv tested on ubuntu-focal-make
* json: print env vars in test
|
|
* json: ordered json in server/schema converter to respect orig order
* json: ws nits
* json: support non-string const / enums
|
|
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy
* add LLAMA_CUDA_NO_PEER_COPY to HIP build
|
|
|
|
|
|
|
|
* metal : proper assert for mat-mat memory alignment
ggml-ci
* readme : add notice about the bug fix
* metal : fix the fix
ggml-ci
|
|
|
|
* metal : require ne00 >= 128 for mat-mat kernels
ggml-ci
* llama : pad n_ctx by 32
ggml-ci
|
|
|
|
* Fix params underscore convert to dash.
* Update common/common.cpp
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|
|
|