| Age | Commit message | Author |
|
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
On the very first shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105
Not bad, but slightly higher than
sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
PPL = 9.14 for LLaMA-v2-7B
PPL = 6.63 for LLaMA-v2-13B
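For reference, the expected values above follow from a simple worked formula: assuming log(PPL) is roughly linear in bits per weight over this range, the midpoint in bpw corresponds to the geometric mean of the two endpoint perplexities.
```latex
% Expected PPL for IQ1_M, given that it sits halfway between IQ1_S and
% IQ2_XXS in bits per weight and that log(PPL) is ~linear in bpw:
\mathrm{PPL}_{\mathrm{expected}}(\mathrm{IQ1\_M})
    = \sqrt{\mathrm{PPL}(\mathrm{IQ1\_S}) \cdot \mathrm{PPL}(\mathrm{IQ2\_XXS})}
```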
* iq1_m: go to 3-bit scales
There is a slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.
We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw
* iq1_m: scalar dot product
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.
* iq1_m: slightly faster ARM_NEON dot product
10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
14.9 t/s -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
After quantizing the block scales, redo the super-block scale fit.
PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B ) = 8.1624
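The refit described above is a one-parameter least-squares problem: with the per-block scales already quantized, the optimal super-block scale has a closed form. A minimal sketch of the idea, with illustrative names rather than the actual ggml quantization code:
```cpp
#include <cstddef>

// Least-squares refit of the super-block scale d after the per-block scales
// have been quantized, minimizing sum_i (x[i] - d * sb[i] * q[i])^2.
//   x  : original float weights of the super-block
//   q  : quantized values decoded to their grid points
//   sb : (already quantized) per-block scale applied to each weight
static float refit_superblock_scale(const float * x, const float * q,
                                    const float * sb, size_t n) {
    double sum_xy = 0.0; // sum of x[i] * (sb[i]*q[i])
    double sum_yy = 0.0; // sum of (sb[i]*q[i])^2
    for (size_t i = 0; i < n; ++i) {
        const double y = (double) sb[i] * q[i];
        sum_xy += (double) x[i] * y;
        sum_yy += y * y;
    }
    return sum_yy > 0.0 ? (float) (sum_xy / sum_yy) : 0.0f;
}
```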
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
* quantize: be able to override metadata by key
* minor : spacing
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* embedding: assign `n_ubatch` value, print error on `n_batch` overflow
* Update examples/embedding/embedding.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* use %ld instead of %lld
* Revert "use %ld instead of %lld"
This reverts commit ea753ede90a86a0699f65878cc8e2020ff5eabb8.
---------
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
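The guard added here amounts to validating each prompt against the logical batch size before decoding, with n_ubatch forced to the same value. A rough standalone sketch; the helper and variable names are hypothetical, not the exact code in examples/embedding/embedding.cpp:
```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch: every prompt must fit into a single logical batch.
static bool prompt_fits_batch(const std::vector<int> & tokens, int n_batch) {
    if ((int) tokens.size() > n_batch) {
        std::fprintf(stderr,
            "error: prompt has %zu tokens, exceeding the batch size %d; increase it with -b\n",
            tokens.size(), n_batch);
        return false;
    }
    return true;
}

int main() {
    int n_batch  = 2048;
    int n_ubatch = n_batch;                  // mirrors "assign n_ubatch value"
    std::vector<int> prompt_tokens(4096, 1); // stand-in for a tokenized prompt
    if (!prompt_fits_batch(prompt_tokens, n_batch)) {
        return 1;
    }
    (void) n_ubatch;
    return 0;
}
```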
|
|
|
|
* Symlink to /usr/bin/xcrun so that the `xcrun` binary
is usable during the build (used for compiling Metal shaders)
Fixes https://github.com/ggerganov/llama.cpp/issues/6117
* cmake - copy default.metallib to install directory
When the Metal files are compiled to default.metallib, CMake needs to copy it to the install directory so that it's visible to llama-cpp
Also, update package.nix to use an absolute path for default.metallib (it's not finding the bundle)
* add `precompileMetalShaders` flag (defaults to false) to disable precompilation of Metal shaders
Precompilation requires Xcode to be installed and the sandbox to be disabled on nix-darwin
|
|
|
|
Since no blas was provided to buildInputs, the executable is built without BLAS support.
This is a backport of NixOS/nixpkgs#298567
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/d691274a972b3165335d261cc4671335f5c67de9' (2024-03-14)
→ 'github:NixOS/nixpkgs/44d0940ea560dee511026a53f0e2e2cde489b4d4' (2024-03-23)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
|
|
|
|
|
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
|
|
* fix LOG() error for SYCL, enhance error checking by CI
* rollback to bash
* add newline at end of file
|
|
* add `retrieval` example
* add README
* minor fixes
* cast filepos on print
* remove use of variable sized array
* store similarities in separate vector
* print error on insufficient batch size
* fix error message printing
* assign n_batch value to n_ubatch
* fix param definitions
* define retrieval-only parameters in retrieval.cpp
* fix the `--context-file` option so it can be provided multiple times for multiple files
* use vector for `query_emb`
* add usage description in README
* fix merge conflict
* fix usage printing
* remove seed setting
* fix lint
* increase file read buffer size
* retrieval : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
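The ranking step behind "store similarities in separate vector" is cosine similarity between the query embedding and each chunk embedding, followed by a sort. A minimal standalone sketch with made-up data, not the actual retrieval.cpp code:
```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-6f);
}

int main() {
    // Made-up embeddings: one query and three context chunks.
    std::vector<float> query_emb = {0.1f, 0.9f, 0.2f};
    std::vector<std::vector<float>> chunk_embs = {
        {0.1f, 0.8f, 0.3f}, {0.9f, 0.1f, 0.0f}, {0.2f, 0.7f, 0.1f},
    };

    // Similarities kept in a separate vector so the chunks can be ranked.
    std::vector<std::pair<float, size_t>> sims;
    for (size_t i = 0; i < chunk_embs.size(); ++i) {
        sims.emplace_back(cosine_similarity(query_emb, chunk_embs[i]), i);
    }
    std::sort(sims.rbegin(), sims.rend()); // highest similarity first

    for (const auto & [score, idx] : sims) {
        std::printf("chunk %zu: similarity %.3f\n", idx, score);
    }
    return 0;
}
```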
|
|
This change causes some quants (e.g. Q4_0, Q8_0) to go faster on some
architectures (e.g. AMD Zen 4).
|
|
* would throw an error on VS2022 at GGML_FREE(wmode)
* wchar_t is usually 2 bytes, but malloc takes a size in bytes
* therefore `*wmode_p++ = (wchar_t)*mode;` could write off the end of the allocation
* Fixes error possibly introduced by https://github.com/ggerganov/llama.cpp/pull/6248
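The bullets above describe a classic wide-character sizing bug: the buffer is sized in characters while malloc expects bytes, so writing wchar_t elements runs past the end of the allocation. A minimal illustration of the bug and the fix, not the actual ggml code:
```cpp
#include <cstdlib>
#include <cstring>

// Convert a narrow fopen() mode string such as "rb" into a wide string.
static wchar_t * mode_to_wide(const char * mode) {
    const size_t len = strlen(mode) + 1;  // element count incl. terminator
    // BUG:   wchar_t * wmode = (wchar_t *) malloc(len);   // bytes, not elements
    // FIX: size the allocation in bytes, i.e. elements * sizeof(wchar_t):
    wchar_t * wmode = (wchar_t *) malloc(len * sizeof(wchar_t));
    if (!wmode) {
        return nullptr;
    }
    for (size_t i = 0; i < len; ++i) {
        wmode[i] = (wchar_t) mode[i];     // the buggy loop writes len wchar_t elements
    }
    return wmode;                         // caller must free()
}

int main() {
    wchar_t * wmode = mode_to_wide("rb");
    free(wmode);
    return 0;
}
```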
|
|
* imatrix : fix wname for mul_mat_id ops
* also filter tensor names in mul_mat_id ops
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|
|
|
|
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
|
|
* remove non-USM methods
* leave the schedule to ggml_backend_sched entirely
|
|
* support release win
* fix value
* fix value
* fix value
* fix error
* fix error
* fix format
|
|
also fix missing #defines before windows.h, and BPE LF token on MSVC
|
|
|
|
* llama: llama_split_prefix: fix strncpy not including the string terminator (illustrated in the sketch after this commit)
common: llama_load_model_from_url:
- fix case-sensitive header name handling
- support downloading additional splits in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: re-enable the Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
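The strncpy pitfall referenced in the first item of this commit is that strncpy(dst, src, n) does not write a terminating '\0' when src has n or more characters. A small standalone illustration with made-up strings, showing snprintf as the usual safer alternative:
```cpp
#include <cstdio>
#include <cstring>

int main() {
    const char * prefix = "some-fairly-long-model-prefix";
    char buf[8];

    // Pitfall: when strlen(prefix) >= sizeof(buf), strncpy fills buf completely
    // and does NOT append a terminating '\0'.
    strncpy(buf, prefix, sizeof(buf));
    buf[sizeof(buf) - 1] = '\0';   // manual termination is required after strncpy

    // Safer alternative: snprintf always null-terminates (truncating if needed).
    char buf2[8];
    snprintf(buf2, sizeof(buf2), "%s", prefix);

    printf("%s\n%s\n", buf, buf2);
    return 0;
}
```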
|
|
(#6254)
|
|
* Add support for Grok model architecture
* Revert convert-hf-to-gguf to default options
* Fixed f_norm_rms_eps bug
* Fix whitespaces
* llama : fix grok rope type
* llama : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
* lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
* fixup! lookup: evaluation tools, use corpus/previous gens
|
|
|
|
* convert-llama2c-to-ggml: enable conversion of multiquery models, #5608
* add test in build action
* Update build.yml
* Update build.yml
* Update build.yml
* gg patch
|
|
* quantize: be able to specify the output tensor type
* quantize: be able to specify the token embedding tensor type
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* split: support in llama_model_loader
* avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedback:
- use only one gguf_context, for metadata only
- store all ggml_context objects in a vector, like the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional; make some layer-creation tensors optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: map file to backend buffer if the allocation succeeds only
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any backend buffer cannot be allocated
* spacing
Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if the declared n_tensors does not equal the number of tensors loaded from the split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at() instead of operator[] when the lookup should never insert into the map (see the sketch after this commit).
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version: pass the destination max length instead of the split path length.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
ggml-ci
* llama : introduce some typedef helpers
* docs: add model sharding to the hot topics
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
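One of the review items in this commit, "use at() instead of operator[] when the lookup should never insert into the map", hinges on a subtle standard-library behavior: operator[] default-constructs and inserts a value for a missing key, while at() throws std::out_of_range and leaves the map untouched. A small standalone illustration with made-up contents:
```cpp
#include <cstdio>
#include <stdexcept>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> tensor_index = {{"token_embd.weight", 0}};

    // operator[] on a missing key silently inserts a default-constructed value,
    // which can hide lookup bugs in a loader that expects the key to exist.
    int a = tensor_index["output.weight"];   // inserts {"output.weight", 0}
    std::printf("size after operator[]: %zu (a=%d)\n", tensor_index.size(), a);

    // at() never inserts; a missing key is reported immediately.
    try {
        int b = tensor_index.at("blk.0.attn_q.weight");
        (void) b;
    } catch (const std::out_of_range & e) {
        std::printf("at() threw: %s\n", e.what());
    }
    return 0;
}
```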
|
|
|
|
* common : add HF arg helpers
* common : remove defaults
|
|
IQ3_XS was not mentioned, and IQ3_S and IQ3_M were present twice.
This PR corrects that in the manner which was probably intended initially.
|
|
* json: only attempt python & node schema conversion tests if their bins are present
Tests introduced in https://github.com/ggerganov/llama.cpp/pull/5978
disabled in https://github.com/ggerganov/llama.cpp/pull/6198
* json: orange warnings when tests skipped
* json: ensure py/js schema conv tested on ubuntu-focal-make
* json: print env vars in test
|
|
* json: ordered json in server/schema converter to respect the original order
* json: ws nits
* json: support non-string const / enums
|
|
* cuda : add LLAMA_CUDA_NO_PEER_COPY to workaround broken ROCm p2p copy
* add LLAMA_CUDA_NO_PEER_COPY to HIP build
|
|
|
|
|
|
|
|
* metal : proper assert for mat-mat memory alignment
ggml-ci
* readme : add notice about the bug fix
* metal : fix the fix
ggml-ci
|
|
|
|
* metal : require ne00 >= 128 for mat-mat kernels
ggml-ci
* llama : pad n_ctx by 32
ggml-ci
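"Pad n_ctx by 32" means rounding the requested context size up to the next multiple of 32. A minimal sketch of that rounding; the helper name pad_to is illustrative, though ggml's GGML_PAD macro performs the same kind of power-of-two round-up:
```cpp
#include <cstdint>
#include <cstdio>

// Round x up to the next multiple of n, where n is a power of two.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    std::printf("%u\n", pad_to(4001, 32)); // 4032: a non-aligned n_ctx is rounded up
    std::printf("%u\n", pad_to(4096, 32)); // 4096: already a multiple of 32, unchanged
    return 0;
}
```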
|
|
|
|
* Fix conversion of underscores to dashes in params.
* Update common/common.cpp
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
|