ik_llama.cpp.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2023-12-27	llama : add AWQ for llama, llama2, mpt, and mistral models (#4593)	Nam D. Tran
	* update: awq support llama-7b model * update: change order * update: benchmark results for llama2-7b * update: mistral 7b v1 benchmark * update: support 4 models * fix: Readme * update: ready for PR * update: readme * fix: readme * update: change order import * black * format code * update: work for bot mpt and awqmpt * update: readme * Rename to llm_build_ffn_mpt_awq * Formatted other files * Fixed params count * fix: remove code * update: more detail for mpt * fix: readme * fix: readme * update: change folder architecture * fix: common.cpp * fix: readme * fix: remove ggml_repeat * update: cicd * update: cicd * uppdate: remove use_awq arg * update: readme * llama : adapt plamo to new ffn ggml-ci --------- Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io> Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-27	finetune : fix output formatting in print_params (#4653)	Daniel Bevenius
	This commit fixes the output formatting in the print_params function which currently looks like this: ```console print_params: n_vocab: 32000 print_params: n_ctx: 128 print_params: n_embd: 4096 print_params: n_ff: 11008 print_params: n_head: 32 print_params: n_head_kv: 32 print_params: n_layer: 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` With this comit the output will look like this: ```console print_params: n_vocab : 32000 print_params: n_ctx : 128 print_params: n_embd : 4096 print_params: n_ff : 11008 print_params: n_head : 32 print_params: n_head_kv : 32 print_params: n_layer : 32 print_params: norm_rms_eps : 0.000010 print_params: rope_freq_base : 10000.000000 print_params: rope_freq_scale : 1.000000 ``` Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2023-12-27	scripts : add sync-ggml-am.sh	Georgi Gerganov

2023-12-27	ggml : fix dot product for ARM (#4630)	Georgi Gerganov
	ggml-ci
2023-12-27	Add byte token type when tokenizer.model is not exists (#4641)	wonjun Jang
	* Add byte token type to hf format * remove unused variable
2023-12-26	cuda : fix vmm pool with multi GPU (#4620)	slaren
	* cuda : fix vmm pool with multi GPU * hip * use recommended granularity instead of minimum * better error checking * fix mixtral * use cudaMemcpy3DPeerAsync * use cuda_pool_alloc in ggml_cuda_op_mul_mat * consolidate error checking in ggml_cuda_set_device * remove unnecessary inlines ggml-ci * style fixes * only use vmm for the main device * fix scratch buffer size, re-enable vmm pool for all devices * remove unnecessary check id != g_main_device
2023-12-26	Update comment for AdamW implementation reference. (#4604)	WillCorticesAI
	Co-authored-by: Will Findley <findley@gmail.com>
2023-12-26	Fix new CUDA10 compilation errors (#4635)	FantasyGmm

2023-12-25	Adding Emeltal reference to UI list (#4629)	Paul Tsochantaris

2023-12-24	simplify bug issue template (#4623)	slaren

2023-12-24	llama : add PLaMo model (#3557)	Shintarou Okada
	* add plamo mock * add tensor loading * plamo convert * update norm * able to compile * fix norm_rms_eps hparam * runnable * use inp_pos * seems ok * update kqv code * remove develop code * update README * shuffle attn_q.weight and attn_output.weight for broadcasting * remove plamo_llm_build_kqv and use llm_build_kqv * fix style * update * llama : remove obsolete KQ_scale * plamo : fix tensor names for correct GPU offload --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-24	cuda : improve cuda pool efficiency using virtual memory (#4606)	slaren
	* cuda : improve cuda pool efficiency using virtual memory * fix mixtral * fix cmake build * check for vmm support, disable for hip ggml-ci * fix hip build * clarify granularity * move all caps to g_device_caps * refactor error checking * add cuda_pool_alloc, refactor most pool allocations ggml-ci * fix hip build * CUBLAS_TF32_TENSOR_OP_MATH is not a macro * more hip crap * llama : fix msvc warnings * ggml : fix msvc warnings * minor * minor * cuda : fallback to CPU on host buffer alloc fail * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * ensure allocations are always aligned * act_size -> actual_size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-12-23	fallback to CPU buffer if host buffer alloc fails (#4610)	slaren

2023-12-23	ci(docker): fix tags in "Build and push docker image (tagged)" (#4603)	Samuel Maynard

2023-12-23	server : allow to specify custom prompt for penalty calculation (#3727)	Alexey Parfenov

2023-12-23	grammar : check the full vocab only if necessary (opt) (#4306)	kalomaze
	* Check the full vocab for grammar only if necessary * Fix missing logit restoration step (?) Does this matter, actually? * Fix whitespace / formatting * Adjust comment * Didn't mean to push test gbnf * Split sampling into the helper function (?) And also revert the changes made to the header * common : fix final newline --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-23	CUDA: fixed row rounding for 0 tensor splits (#4594)	Johannes Gäßler

2023-12-22	lookup : add prompt lookup decoding example (#4484)	LeonEricsson
	* initial commit, going through initializations * main loop finished, starting to debug * BUG: generates gibberish/repeating tokens after a while * kv_cache management * Added colors to distinguish drafted tokens (--color). Updated README * lookup : fix token positions in the draft batch * lookup : use n_draft from CLI params * lookup : final touches --------- Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22	sync : ggml (fix im2col) (#4591)	Georgi Gerganov
	* cuda : fix im2col_f32_f16 (ggml/#658) ggml-ci * ggml-alloc : fix ggml_tallocr_is_own --------- Co-authored-by: leejet <leejet714@gmail.com>
2023-12-22	cuda : fix jetson compile error (#4560)	FantasyGmm
	* fix old jetson compile error * Update Makefile * update jetson detect and cuda version detect * update cuda marco define * update makefile and cuda,fix some issue * Update README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update Makefile * Update README.md --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-22	Fix CudaMemcpy direction (#4599)	Henrik Forstén

2023-12-22	llama : fix platforms without mmap (#4578)	slaren
	* llama : fix platforms without mmap * win32 : limit prefetch size to the file size * fix win32 error clobber, unnecessary std::string in std::runtime_error
2023-12-22	ggml : add comment about backward GGML_OP_DIAG_MASK_INF (#4203)	Herman Semenov

2023-12-22	make : add LLAMA_HIP_UMA option (#4587)	Michael Kesper
	NB: LLAMA_HIP_UMA=1 (or any value) adds MK_CPPFLAG -DGGML_HIP_UMA
2023-12-22	ci : tag docker image with build number (#4584)	rhuddleston

2023-12-22	readme : add zig bindings (#4581)	Deins

2023-12-22	ggml : extend `enum ggml_log_level` with `GGML_LOG_LEVEL_DEBUG` (#4579)	bobqianic

2023-12-22	llama : add ability to cancel model loading (#4462)	crasm
	* llama : Add ability to cancel model load Updated llama_progress_callback so that if it returns false, the model loading is aborted. * llama : Add test for model load cancellation * Fix bool return in llama_model_load, remove std::ignore use * Update llama.cpp Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * Fail test if model file is missing * Revert "Fail test if model file is missing" This reverts commit 32ebd525bf7e5a87ee8a3dbaab3d92ce79fbf23d. * Add test-model-load-cancel to Makefile * Revert "Revert "Fail test if model file is missing"" This reverts commit 2796953257ee5383fa7c8fe8fa8fc888c048fb0b. * Simplify .gitignore for tests, clang-tidy fixes * Label all ctest tests * ci : ctest uses -L main * Attempt at writing ctest_with_model * ci : get ci/run.sh working with test-model-load-cancel * ci : restrict .github/workflows/build.yml ctest to -L main * update requirements.txt * Disable test-model-load-cancel in make * Remove venv before creation * Restructure requirements.txt Top-level now imports the specific additional requirements for each python file. Using `pip install -r requirements.txt` will fail if versions become mismatched in the per-file requirements. * Make per-python-script requirements work alone This doesn't break the main requirements.txt. * Add comment * Add convert-persimmon-to-gguf.py to new requirements.txt scheme * Add check-requirements.sh script and GitHub workflow * Remove shellcheck installation step from workflow * Add nocleanup special arg * Fix merge see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573 * reset to upstream/master * Redo changes for cancelling model load --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2023-12-21	ggml : change ggml_scale to take a float instead of tensor (#4573)	Georgi Gerganov
	* ggml : change ggml_scale to take a float instead of tensor * ggml : fix CPU implementation * tests : fix test-grad0 ggml-ci
2023-12-21	gguf-py : fix broken link	Georgi Gerganov

2023-12-21	gguf : simplify example dependencies	Georgi Gerganov

2023-12-21	ci : add `jlumbroso/free-disk-space` to docker workflow (#4150)	Samuel Maynard
	* [github][workflows][docker]: removes hardcoded `ggerganov` from `ghcr` repo * [github][workflows][docker]: adds `jlumbroso/free-disk-space`
2023-12-21	llama : initial ggml-backend integration (#4520)	slaren
	* llama : initial ggml-backend integration * add ggml-metal * cuda backend can be used though ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST access all tensor data with ggml_backend_tensor_get/set * add ggml_backend_buffer_clear zero-init KV cache buffer * add ggml_backend_buffer_is_hos, used to avoid copies if possible when accesing tensor data * disable gpu backends with ngl 0 * more accurate mlock * unmap offloaded part of the model * use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap * update quantize and lora * update session copy/set to use ggml-backend ggml-ci * use posix_fadvise instead of posix_fadvise64 * ggml_backend_alloc_ctx_tensors_from_buft : remove old print * llama_mmap::align_offset : use pointers instead of references for out parameters * restore progress_callback behavior * move final progress_callback call to load_all_data * cuda : fix fprintf format string (minor) * do not offload scales * llama_mmap : avoid unmapping the same fragments again in the destructor * remove unnecessary unmap * metal : add default log function that prints to stderr, cleanup code ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21	llama : allow getting n_batch from llama_context in c api (#4540)	Marcus Dunn
	* allowed getting n_batch from llama_context in c api * changed to use `uint32_t` instead of `int` * changed to use `uint32_t` instead of `int` in `llama_n_ctx` * Update llama.h --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21	metal : fix `ggml_metal_log` vargs (#4373)	Finn Voorhees

2023-12-21	cuda : ROCm AMD Unified Memory Architecture (UMA) handling (#4449)	Erik Garrison
	* AMD ROCm: handle UMA memory VRAM expansions This resolves #2797 by allowing ROCm AMD GPU users with a UMA to dynamically expand the VRAM allocated to the GPU. Without this, AMD ROCm users with shared CPU/GPU memory usually are stuck with the BIOS-set (or fixed) framebuffer VRAM, making it impossible to load more than 1-2 layers. Note that the model is duplicated in RAM because it's loaded once for the CPU and then copied into a second set of allocations that are managed by the HIP UMA system. We can fix this later. * clarify build process for ROCm on linux with cmake * avoid using deprecated ROCm hipMallocHost * keep simplifying the change required for UMA * cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
2023-12-21	ggml-cuda: Fix HIP build by adding define for __trap (#4569)	arlo-phoenix
	Regression of 139882392258671ffe5acdfcadc0bc08572d6eef HIP doesn't have trap, only abort
2023-12-21	common : remove incorrect --model-draft default (#4568)	Jared Van Bortel

2023-12-21	CUDA: mul_mat_id always on GPU for batches >= 32 (#4553)	Johannes Gäßler

2023-12-21	readme : update coding guidelines	Georgi Gerganov

2023-12-21	py : open merges file as 'utf-8' (#4566)	howlger
	Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error: ``` Traceback (most recent call last): File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module> model_instance.set_vocab() File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab self._set_vocab_gpt2() File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2 special_vocab = gguf.SpecialVocab(dir_model, load_merges=True) File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__ self._load(Path(path)) File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load self._try_load_merges_txt(path) File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt for line in fp: File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined> ```
2023-12-21	cuda : better error message for ggml_get_rows (#4561)	bobqianic
	* Update ggml-cuda.cu * Update ggml-cuda.cu * Update ggml-cuda.cu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-12-21	cuda : replace asserts in wrong architecture checks with __trap (#4556)	slaren
	* cuda : replace asserts in wrong architecture checks with __trap * make bad_arch noreturn, remove returns
2023-12-21	llama : disable per-tensor info prints on model load (#4562)	Johannes Gäßler

2023-12-21	Fix access violation in ggml_cuda_free_data if tensor->extra is NULL (#4554)	LoganDark

2023-12-20	CUDA: Faster Mixtral prompt processing (#4538)	Johannes Gäßler
	* CUDA: make MoE tensors contiguous for batch size>1 * Update ggml-cuda.cu Co-authored-by: slaren <slarengh@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com>
2023-12-19	ggml : fixed check for _MSC_VER (#4535)	Eric Sommerlade
	Co-authored-by: Eric Sommerlade <ersomme@microsoft.com>
2023-12-18	ggml-cuda: Fix HIP build (#4528)	arlo-phoenix
	regression of #4490 Adds defines for two new datatypes cublasComputeType_t, cudaDataType_t. Currently using deprecated hipblasDatatype_t since newer ones very recent.
2023-12-18	llama.swiftui : add tinyllama 1.1B F16	Georgi Gerganov

2023-12-18	llama.swiftui : add more models	Georgi Gerganov