Age  Commit message  Author

2024-03-26  embedding : adjust `n_ubatch` value (#6296)  (Minsoo Cheong)
* embedding: assign `n_ubatch` value, print error on `n_batch` overflow
* Update examples/embedding/embedding.cpp
  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* use %ld instead of %lld
* Revert "use %ld instead of %lld"
  This reverts commit ea753ede90a86a0699f65878cc8e2020ff5eabb8.
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

2024-03-26  server : add `n_discard` parameter (#6300)  (Jan Boon)

2024-03-25  nix: make `xcrun` visible in Nix sandbox for precompiling Metal shaders (#6118)  (Joseph Stahl)
* Symlink to /usr/bin/xcrun so that the `xcrun` binary is usable during build (used for compiling Metal shaders)
  Fixes https://github.com/ggerganov/llama.cpp/issues/6117
* cmake: copy default.metallib to the install directory
  When Metal files are compiled to default.metallib, CMake needs to add it to the install directory so that it is visible to llama-cpp. Also, update package.nix to use an absolute path for default.metallib (it's not finding the bundle).
* add `precompileMetalShaders` flag (defaults to false) to disable precompilation of Metal shaders
  Precompilation requires Xcode to be installed and requires disabling the sandbox on nix-darwin.

2024-03-26  cuda : rename build flag to LLAMA_CUDA (#6299)  (slaren)

2024-03-25  nix: fix blas support (#6281)  (Christian Kögler)
Since no blas was provided to buildInputs, the executable is built without blas support. This is a backport of NixOS/nixpkgs#298567.

2024-03-25  tests : include IQ2_XXS and IQ2_XS in test-quantize-fns (#6303)  (Kawrakow)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-03-25  flake.lock: Update (#6266)  (Georgi Gerganov)
Flake lock file updates:
• Updated input 'nixpkgs':
  'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
  → 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

2024-03-25  cuda : fix LLAMA_CUDA_F16 build (#6298)  (slaren)

2024-03-25  cuda : refactor into multiple files (#6269)  (slaren)

2024-03-25  Server: clean up OAI params parsing function (#6284)  (Xuan Son Nguyen)
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs

2024-03-25  [SYCL] fix SYCL backend build on Windows broken by LOG() error (#6290)  (Neo Zhang Jianyu)
* fix LOG() error for SYCL, enhance error check by CI
* rollback to bash
* add newline at end of file

2024-03-25  examples : add "retrieval" (#6193)  (Minsoo Cheong)
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix `--context-file` option to be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

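For context, the scoring step such a retrieval example performs can be sketched as ranking chunks by cosine similarity against the query embedding, with the scores kept in a separate vector. This is an illustrative sketch only, not the actual code in examples/retrieval; the function names are assumptions.

```cpp
#include <cmath>
#include <vector>

// cosine similarity between two embeddings of equal length
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-6f);  // small epsilon avoids division by zero
}

// similarities are stored in a separate vector so the chunk embeddings stay untouched
static std::vector<float> score_chunks(const std::vector<float> & query_emb,
                                       const std::vector<std::vector<float>> & chunk_embs) {
    std::vector<float> sims;
    sims.reserve(chunk_embs.size());
    for (const auto & emb : chunk_embs) {
        sims.push_back(cosine_similarity(query_emb, emb));
    }
    return sims;
}
```
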
2024-03-25  ggml : support AVX512VNNI (#6280)  (Justine Tunney)
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some architectures (e.g. AMD Zen 4).

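As an illustration of the kind of kernel AVX512VNNI accelerates, here is a hedged sketch of an 8-bit dot product built around the `vpdpbusd` instruction, which fuses the multiply and 32-bit accumulate. This is not the actual ggml code; the function name and loop structure are assumptions, and real quant kernels (e.g. for signed Q8_0 data) first bias one operand into the unsigned range.

```cpp
// Build with e.g. -mavx512f -mavx512vnni.
#include <immintrin.h>
#include <cstdint>

// dot product of n bytes, unsigned * signed; n is assumed to be a multiple of 64 here
int32_t dot_u8_s8(const uint8_t * a, const int8_t * b, int n) {
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vb = _mm512_loadu_si512((const void *)(b + i));
        // acc += sum of 4 adjacent u8*s8 products per 32-bit lane (vpdpbusd)
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);  // horizontal sum of the 16 lanes
}
```
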
2024-03-24  Fix heap corruption from wmode out-of-bound writes on windows (#6272)  (Rick G)
* would throw an error on VS2022 at GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc wants a size in bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* fixes an error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248

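A minimal sketch of the class of bug described above, with illustrative names rather than the exact llama.cpp code: malloc takes a size in bytes, so a wide-character buffer must be scaled by sizeof(wchar_t), otherwise the widening copy runs off the end of the allocation.

```cpp
#include <cstdlib>
#include <cstring>

static wchar_t * mode_to_wide(const char * mode) {
    const size_t n = strlen(mode) + 1;                    // character count incl. terminating NUL
    // BUG: malloc takes a size in *bytes*, so this under-allocates when sizeof(wchar_t) > 1:
    //   wchar_t * wmode = (wchar_t *) malloc(n);
    // FIX: scale the element count by the element size:
    wchar_t * wmode = (wchar_t *) malloc(n * sizeof(wchar_t));
    if (wmode == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; ++i) {
        wmode[i] = (wchar_t) mode[i];                     // widen each byte, incl. the NUL
    }
    return wmode;                                         // caller releases it with free()
}
```
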
2024-03-24  imatrix : fix wname for mul_mat_id ops (#6271)  (Georgi Gerganov)
* imatrix : fix wname for mul_mat_id ops
* also filter tensor names in mul_mat_id ops
---------
Co-authored-by: slaren <slarengh@gmail.com>

2024-03-24  Fixed lookup compilation issues on Windows (#6273)  (Johannes Gäßler)

2024-03-24  ci : close inactive issue, increase operations per run (#6270)  (Pierrick Hymbert)

2024-03-24  sampling : deduplicated code for probability distribution access (#6240)  (Minsoo Cheong)
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`

2024-03-24  [SYCL] offload op (#6217)  (Meng, Hengyu)
* remove no USM methods
* leave the schedule to ggml_backend_sched entirely

2024-03-24  Support build win release for SYCL (#6241)  (Neo Zhang Jianyu)
* support release win
* fix value
* fix value
* fix value
* fix error
* fix error
* fix format

2024-03-23  use _wfopen instead of fopen on Windows (#6248)  (Jared Van Bortel)
also fix missing #defines before windows.h, and BPE LF token on MSVC

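For reference, a hedged sketch of the general technique (not the exact llama.cpp implementation): convert the UTF-8 path to UTF-16 with MultiByteToWideChar and open it with _wfopen, so non-ASCII paths work on Windows. The helper name is an assumption.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <cstdio>
#include <cstdlib>

static FILE * fopen_utf8(const char * path, const wchar_t * mode) {
    // first call computes the required buffer length in wchar_t units, including the NUL
    int n = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    if (n <= 0) {
        return NULL;
    }
    wchar_t * wpath = (wchar_t *) malloc(n * sizeof(wchar_t));
    if (wpath == NULL) {
        return NULL;
    }
    MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, n);  // second call performs the conversion
    FILE * f = _wfopen(wpath, mode);
    free(wpath);
    return f;
}
#endif
```
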
2024-03-23  gitignore : gguf-split  (Georgi Gerganov)

2024-03-23  common: llama_load_model_from_url split support (#6192)  (Pierrick Hymbert)
* llama: llama_split_prefix: fix strncpy does not include string termination
  common: llama_load_model_from_url:
  - fix header name case sensitivity
  - support downloading additional splits in parallel
  - hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url max length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: enable back Release test on PR
* spacing
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-03-23  server: docs: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` (#6254)  (Pierrick Hymbert)

2024-03-23  llama : add grok-1 support (#6204)  (Julius Arkenberg)
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-03-23  split: add gguf-split in the make build target (#6262)  (Pierrick Hymbert)

2024-03-23  server: flush stdout after logging in both text and json layout (#6253)  (Pierrick Hymbert)

2024-03-23  lookup: complement data from context with general text statistics (#5479)  (Johannes Gäßler)
* lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens

2024-03-22  common : default --hf-file to --model (#6234)  (Georgi Gerganov)

2024-03-22  convert-llama2c-to-ggml : enable conversion of GQA models (#6237)  (fraxy-v)
* convert-llama2c-to-ggml: enable conversion of multiqueries, #5608
* add test in build action
* Update build.yml
* Update build.yml
* Update build.yml
* gg patch

2024-03-22  quantize: options for output and token embedding tensors qtype (#6239)  (Kawrakow)
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-03-22  llama_model_loader: support multiple split/shard GGUFs (#6187)  (Pierrick Hymbert)
* split: support in llama_model_loader
* avoid copying the entire vector
  Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedbacks:
  - use only one gguf_context for metadata only
  - store all ggml_context in a vector as the files and mappings
  - store all weights in a vector along with the source tensor
  - rename ctx_gguf to meta
  - rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional, switch some layer creation tensor optional
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer if the allocation succeeds only
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any of backend buffer cannot be allocated
* spacing
  Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
  Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if n_tensors declared not equals to loaded tensors in split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at instead of operator[] if this should never add to the map.
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: be sure the model mappings has enough capacity before allocating backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version, not pass split path len but dest max len.
  Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
  ggml-ci
* llama : introduce some typedef helpers
* docs: add model shard in hot topic
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
  Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

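For orientation, a hedged sketch of how shard paths of the form `model-00001-of-00005.gguf` might be composed and matched back to their prefix. The helper names and the exact format string are assumptions for illustration, not the llama.cpp API; split indices are assumed to be 1-based and zero-padded to five digits.

```cpp
#include <cstdio>
#include <string>

// compose the path of shard i_split (0-based) out of n_split shards
static std::string split_path(const std::string & prefix, int i_split, int n_split) {
    char buf[512];
    snprintf(buf, sizeof(buf), "%s-%05d-of-%05d.gguf", prefix.c_str(), i_split + 1, n_split);
    return buf;
}

// recover the prefix from a shard path, or return an empty string if it does not match
static std::string split_prefix(const std::string & path, int i_split, int n_split) {
    char suffix[64];
    snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i_split + 1, n_split);
    const std::string s = suffix;
    if (path.size() > s.size() && path.compare(path.size() - s.size(), s.size(), s) == 0) {
        return path.substr(0, path.size() - s.size());
    }
    return "";
}
```
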
2024-03-22  ci: apply concurrency limit for github workflows (#6243)  (Minsoo Cheong)

2024-03-22  common : add HF arg helpers (#6234)  (Georgi Gerganov)
* common : add HF arg helpers
* common : remove defaults

2024-03-22  llama : correction of the attn.v.weight quantization for IQ3_XS (#6209)  (Nexesenex)
IQ3_XS was not mentioned, IQ3_S and IQ3_M were present twice. That PR corrects this in the manner which was probably intended initially.

2024-03-22  tests : conditional python & node json schema tests (#6207)  (Olivier Chafik)
* json: only attempt python & node schema conversion tests if their bins are present
  Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978, disabled in https://github.com/ggerganov/llama.cpp/pull/6198
* json: orange warnings when tests skipped
* json: ensure py/js schema conv tested on ubuntu-focal-make
* json: print env vars in test

2024-03-22  json-schema-to-grammar : fix order of props + non-str const/enum (#6232)  (Olivier Chafik)
* json: ordered json in server/schema converter to respect orig order
* json: ws nits
* json: support non-string const / enums

2024-03-22  cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy (#6208)  (slaren)
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy
* add LLAMA_CUDA_NO_PEER_COPY to HIP build

2024-03-22  readme : add RecurseChat to the list of UIs (#6219)  (Xiaoyi Chen)

2024-03-22  server : fix n_keep always showing as 0 in response (#6211)  (Jan Boon)

2024-03-22  server : enable continuous batching by default (#6231)  (Georgi Gerganov)

2024-03-22  metal : proper assert for mat-mat memory alignment (#6225)  (Georgi Gerganov)
* metal : proper assert for mat-mat memory alignment
  ggml-ci
* readme : add notice about the bug fix
* metal : fix the fix
  ggml-ci

2024-03-22  ci : add CURL flag for the mac builds (#6214)  (Vaibhav Srivastav)

2024-03-22  metal : pad n_ctx by 32 (#6177)  (Georgi Gerganov)
* metal : require ne00 >= 128 for mat-mat kernels
  ggml-ci
* llama : pad n_ctx by 32
  ggml-ci

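The padding step amounts to rounding the requested context size up to the next multiple of 32 so the Metal mat-mat kernels see aligned dimensions. A tiny illustrative helper (ggml has its own padding macro; this sketch is not that code):

```cpp
#include <cstdint>

// round x up to the next multiple of n; n is assumed to be a power of two
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// example: a requested n_ctx of 1000 becomes 1024
// pad_to(1000, 32) == 1024
```
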
2024-03-22  add blog link (#6222)  (Neo Zhang Jianyu)

2024-03-22  Fix params underscore convert to dash. (#6203)  (DAN™)
* Fix params underscore convert to dash.
* Update common/common.cpp
---------
Co-authored-by: slaren <slarengh@gmail.com>

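A minimal sketch of the kind of normalization this describes, assuming a hypothetical `normalize_arg` helper rather than the actual common.cpp logic: underscore-style long options are rewritten to dashes before matching, so both spellings behave the same.

```cpp
#include <algorithm>
#include <string>

static std::string normalize_arg(std::string arg) {
    if (arg.rfind("--", 0) == 0) {                       // only touch long options
        std::replace(arg.begin() + 2, arg.end(), '_', '-');
    }
    return arg;
}

// e.g. normalize_arg("--ctx_size") == "--ctx-size"
```
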
2024-03-21  server : update readme doc from `slot_id` to `id_slot` (#6213)  (Jan Boon)

2024-03-21  cuda : disable host register by default (#6206)  (slaren)

2024-03-21  Corrected typo to wrong file (#6199)  (semidark)
The stated file `./devops/main-server.Dockerfile` does not exist. I figure that `.devops/server-intel.Dockerfile` was meant.

2024-03-21  tests : disable system() calls (#6198)  (Georgi Gerganov)
ggml-ci