Age  Commit message  Author
2024-01-24  cuda : fix 2-bit quants on amd hip (#5105)  [Engininja2]
* cuda : fix 2-bit quants on amd hip
* use __low2float intrinsic function for new quants

2024-01-24  nix-shell: use addToSearchPath  [Michael Hueschen]
thx to @SomeoneSerge for the suggestion!

2024-01-24  nix: add cc to devShell LD_LIBRARY_PATH  [Michael Hueschen]
this fixes the error I encountered when trying to run the convert.py script in a venv:

```
$ nix develop
[...]$ source .venv/bin/activate
(.venv) [...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
  File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
    from sentencepiece import SentencePieceProcessor
  File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
    from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```

however, I am not sure this is the cleanest way to address this linker issue...

2024-01-24  llama : pre-allocate input tensors in a separate buffer (#5100)  [slaren]

2024-01-23  metal : disable support for MUL_MAT F32 x F16  [Georgi Gerganov]

2024-01-23  Additional KL-divergence statistics (#5081)  [Kawrakow]
* perplexity: add top-token probability
* perplexity: add additional KL-divergence statistics
* perplexity: a better organized KL-divergence statistics output

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

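For context, per-token KL divergence between a reference model and a quantized model can be computed from their softmax distributions over the vocabulary. The sketch below is illustrative only; the names `probs_base` and `probs_quant` are hypothetical and this is not the actual code added to the perplexity tool.

```cpp
// Minimal sketch: KL divergence D(P||Q) for one token position, where P is
// the base model's distribution and Q the quantized model's distribution.
// Assumes both vectors are already softmax-normalized and the same size.
#include <cmath>
#include <cstdio>
#include <vector>

static double kl_divergence(const std::vector<double> & p, const std::vector<double> & q) {
    double kl = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0 && q[i] > 0.0) {
            kl += p[i] * std::log(p[i] / q[i]);
        }
    }
    return kl;
}

int main() {
    // Toy 4-token vocabulary distributions (hypothetical values).
    std::vector<double> probs_base  = {0.70, 0.20, 0.05, 0.05};
    std::vector<double> probs_quant = {0.60, 0.25, 0.10, 0.05};
    std::printf("KL(P||Q) = %.6f nats\n", kl_divergence(probs_base, probs_quant));
    // "Top-token probability" in the sense of the commit: the probability the
    // quantized model assigns to the base model's most likely token (index 0 here).
    std::printf("top-token prob = %.2f\n", probs_quant[0]);
    return 0;
}
```
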
2024-01-23  CUDA: more info when no device code (#5088)  [Johannes Gäßler]

2024-01-23  minor : clean-up some warnings and style (#5094)  [Georgi Gerganov]
* minor : clean-up some warnings and style
  ggml-ci
* ggml : add comment

2024-01-23  devops : add intel oneapi dockerfile (#5068)  [Xuan Son Nguyen]
Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>

2024-01-23  llama.vim : added api key support (#5090)  [Michael Coppola]
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>

2024-01-22  llama : fix not enough space in buffer with Qwen (#5086)  [slaren]

2024-01-22  KL-divergence (#5076)  [Kawrakow]
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-01-22  ggml : parallelize FP32 conversion when using BLAS (#5045)  [Reinforce-II]
* allow the GGML_TASK_INIT phase to run multithreaded
* multithreaded dequantize in mul_mat when using a BLAS library
* minor fixes
* update outdated comment
* fix coding style
* simplify code
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

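To illustrate the idea of multithreaded dequantization ahead of a BLAS matmul, here is a rough sketch. The block format and the function names (`dequantize_row`, `dequantize_matrix`) are simplified placeholders and not ggml's actual API; the point is only that rows are converted to FP32 by several threads before the dense buffer is handed to a GEMM.

```cpp
// Rough sketch: dequantize the rows of a quantized matrix to FP32 in
// parallel, so the resulting dense buffer can be passed to a BLAS sgemm.
#include <cstdint>
#include <thread>
#include <vector>

static void dequantize_row(const uint8_t * src, float * dst, int n_cols) {
    for (int j = 0; j < n_cols; ++j) {
        dst[j] = (static_cast<int>(src[j]) - 128) * 0.01f; // toy 8-bit scheme
    }
}

static void dequantize_matrix(const uint8_t * src, float * dst,
                              int n_rows, int n_cols, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            // Each thread handles an interleaved subset of rows.
            for (int i = t; i < n_rows; i += n_threads) {
                dequantize_row(src + (size_t)i * n_cols, dst + (size_t)i * n_cols, n_cols);
            }
        });
    }
    for (auto & w : workers) w.join();
    // dst now holds the FP32 matrix for the BLAS call.
}

int main() {
    const int rows = 8, cols = 16;
    std::vector<uint8_t> q(rows * cols, 130);
    std::vector<float>   f(rows * cols);
    dequantize_matrix(q.data(), f.data(), rows, cols, 4);
    return 0;
}
```
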
2024-01-22  llava : MobileVLM support (#4954)  [XiaotaoChen]
* MobileVLM native implementation
* delete the depthwise_conv_2d and permute_cpy related code, replace them with existing functions, optimize the ldp definition, and support the LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks

---------

Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>

2024-01-22  flake.nix: add a comment about flakes vs nix  [Someone Serge]

2024-01-22  nix: add a comment on the many nixpkgs-with-cuda instances  [Someone Serge]

2024-01-22  nix: add a comment about makeScope  [Someone Serge]

2024-01-22  nix: refactor the cleanSource rules  [Someone Serge]

2024-01-22  workflows: nix-ci: drop the redundant "paths" filter  [Someone Serge]

2024-01-22  workflows: nix-build-aarch64: rate limit  [Someone Serge]

2024-01-22  workflows: nix-ci: rebuild on flake.lock updates  [Someone Serge]

2024-01-22  imatrix : keep intermediate imatrix results (#5077)  [Kawrakow]
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-01-22  llama : support StableLM 2 1.6B (#5052)  [compilade]
* llama : support StableLM 2 1.6B
* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]
* convert : refactor Qwen's set_vocab to use it for StableLM 2 too
* nix : add tiktoken to llama-python-extra
* convert : use presence of tokenizer.json to determine StableLM tokenizer loader
  It's a less arbitrary heuristic than the vocab size.

2024-01-22  finetune : print sample-start/include-sample-start (#5072)  [Daniel Bevenius]
This commit adds `--sample-start` and `--include-sample-start` to the output from the main function in finetune.cpp.

The motivation is that it is useful to have these values printed out: if one forgets to set them via the command line, it is possible to go through the whole training process before realizing that the values are not what one expected.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

2024-01-22  llama : add Q3_K_XS (#5060)  [Kawrakow]
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K
  Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K, with an 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

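A minimal sketch of the layer-mixing idea described above: give the first 1/8 of the ffn_down tensors a higher-precision type and the rest a lower one. The enum values and selection function here are illustrative placeholders, not llama.cpp's actual quantization logic.

```cpp
// Illustrative only: choose a per-layer quant type for ffn_down so that the
// first 1/8 of layers get the higher-precision Q4_K and the rest get Q3_K.
#include <cstdio>

enum quant_type { QUANT_Q3_K, QUANT_Q4_K };

static quant_type pick_ffn_down_type(int i_layer, int n_layers) {
    // The first layers are the most sensitive, so spend more bits there.
    return (i_layer < n_layers / 8) ? QUANT_Q4_K : QUANT_Q3_K;
}

int main() {
    const int n_layers = 80; // e.g. LLaMA-v2-70B
    for (int i = 0; i < n_layers; ++i) {
        if (pick_ffn_down_type(i, n_layers) == QUANT_Q4_K) {
            std::printf("layer %2d: ffn_down -> Q4_K\n", i);
        }
    }
    return 0;
}
```
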
2024-01-22  ci : fix Windows CI by updating Intel SDE version (#5053)  [bobqianic]

2024-01-22  llama : add more qwen2 models (#5071)  [Shijie]

2024-01-21  Revert LLAMA_NATIVE to OFF in flake.nix (#5066)  [iSma]

2024-01-21  add safetensors support to convert-lora-to-ggml.py (#5062)  [kuronekosaiko]
* add safetensors support to convert-lora-to-ggml.py
* Update convert-lora-to-ggml.py
  Remove white space in line 69.

2024-01-21  add `#include <string>` to unicode.h (#5051)  [bobqianic]
Co-authored-by: Jared Van Bortel <jared@nomic.ai>

2024-01-21  Add ability to evaluate multiple choice tasks (#5047)  [Kawrakow]
* TruthfulQA: 1st attempt, does not look like it is working
  The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right.
* TruthfulQA: works but the result is bad
  I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with a random-chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leaderboard results for these two models are 42.2% and 68.3%, respectively.
* TruthfulQA: fix random sample
* TruthfulQA: prepare tasks in parallel for large test datasets
* Rename truthful_qa to multiple_choice
* Make MSVC happy
  I had forgotten that MSVC does not make constexpr's available inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

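Multiple-choice scoring of this kind typically appends each candidate answer to the question, sums the model's log-probabilities over the answer tokens, and picks the answer with the highest total. The sketch below illustrates that selection step on precomputed log-probs; the data layout is hypothetical and not the binary task format referenced in the commit.

```cpp
// Illustrative sketch: pick the answer whose tokens get the highest total
// log-probability under the model, then compare against the labelled answer.
#include <cstdio>
#include <vector>

struct mc_task {
    std::vector<std::vector<double>> answer_token_logprobs; // one vector per candidate answer
    int label;                                              // index of the correct answer
};

static int predict(const mc_task & task) {
    int    best       = 0;
    double best_score = -1e300;
    for (size_t a = 0; a < task.answer_token_logprobs.size(); ++a) {
        double score = 0.0;
        for (double lp : task.answer_token_logprobs[a]) score += lp; // sum over answer tokens
        if (score > best_score) { best_score = score; best = (int)a; }
    }
    return best;
}

int main() {
    mc_task t;
    t.answer_token_logprobs = {{-1.2, -0.7}, {-0.3, -0.4}, {-2.5, -1.9}};
    t.label = 1;
    std::printf("predicted %d, correct: %s\n", predict(t), predict(t) == t.label ? "yes" : "no");
    return 0;
}
```
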
2024-01-21  Slightly faster imatrix (#5050)  [Kawrakow]
* imatrix: speedup by avoiding unnecessary allocations and copies
* imatrix: add --no-ppl option to skip PPL calculations altogether

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-01-21  flake.lock: Update (#5054)  [Georgi Gerganov]
Flake lock file updates:

• Updated input 'nixpkgs':
  'github:NixOS/nixpkgs/9b19f5e77dd906cb52dade0b7bd280339d2a1f3d' (2024-01-13)
  → 'github:NixOS/nixpkgs/bbe7d8f876fbbe7c959c90ba2ae2852220573261' (2024-01-19)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

2024-01-20  convert : partially revert PR #4818 (#5041)  [Jared Van Bortel]

2024-01-20  perplexity : fix MSVC build after #5020 (#5043)  [Jared Van Bortel]
* perplexity : fix MSVC build after #5020
* try a different fix

2024-01-20  llama : run all KQV ops on the CPU with no KV offload (#5049)  [slaren]
ggml-ci

2024-01-20  cmake : add support for ccache (#5002)  [Herman Semenov]
* Added ccache support to speed up recompilation
* cmake : option to disable ccache

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-20  Add a dart/flutter binding to README.md (#4882)  [adel boussaken]

2024-01-20  cuda : fix compile error in jetson platform (#4975)  [Kylin]
* cuda: fix compile error in jetson platform
* cuda: update comment in ggml-cuda.cu
* cuda: update ggml-cuda.cu comment

2024-01-19  finetune : fix ggml_allocr lifetimes (tmp workaround) (#5033)  [Uzo Nweke]
* Fix issue with alloc causing max_compute_size to be calculated
* remove ggml_allocr_free as suggested in issue #4791

2024-01-19  imatrix : add README.md  [Georgi Gerganov]

2024-01-19  llama : support upcoming Qwen2 (#5037)  [Shijie]

2024-01-19  py : fix flake8 lint  [Georgi Gerganov]

2024-01-19  winogrande: evaluate log-probs in parallel (#5036)  [Kawrakow]
This is a relatively minor performance tweak resulting in ~10% speedup on my system.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

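As a rough illustration of evaluating log-probs in parallel, the sketch below splits token positions across threads and computes the log-softmax probability of the observed token at each position. The layout and names are hypothetical, not the winogrande code itself.

```cpp
// Illustrative sketch: compute log P(observed token) at many positions in
// parallel, given one row of logits per position.
#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

static double token_logprob(const std::vector<float> & logits, int token) {
    // log-softmax of the observed token: logit - log(sum(exp(logits))), with max-shift for stability
    float max_l = logits[0];
    for (float l : logits) max_l = std::max(max_l, l);
    double sum = 0.0;
    for (float l : logits) sum += std::exp(l - max_l);
    return (logits[token] - max_l) - std::log(sum);
}

int main() {
    const int n_pos = 8, n_vocab = 16, n_threads = 4;
    std::vector<std::vector<float>> logits(n_pos, std::vector<float>(n_vocab, 0.1f));
    std::vector<int>    observed(n_pos, 3);
    std::vector<double> logprobs(n_pos);

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = t; i < n_pos; i += n_threads) {
                logprobs[i] = token_logprob(logits[i], observed[i]);
            }
        });
    }
    for (auto & w : workers) w.join();
    return 0;
}
```
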
2024-01-19  llama : add CodeShell support (#5016)  [chiranko]
* llama: add codeshell support
* llama.cpp: fix codeshell with NeoX rope
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-19  perplexity: avoid unnecessary allocations and logit copies (#5035)  [Kawrakow]
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-01-19  perplexity : faster Winogrande via batching (#5024)  [Georgi Gerganov]
* perplexity : faster Winogrande via batching
  ggml-ci
* perplexity : remove unused function
* perplexity : only tokenize selected tasks for Winogrande

2024-01-19  llama : fix falcon arch for tied output embeddings (#4978)  [John]
* falcon arch fix for tied output embeddings
* Update llama.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
* Update llama.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-18  cmake : add ggml public headers (#5011)  [Georgi Gerganov]

2024-01-18  server : defer tasks when "slot unavailable" (#5018)  [Xuan Son Nguyen]
* server: defer task when no slot is available
* remove unnecessary log

---------

Co-authored-by: Xuan Son Nguyen <xuanson.nguyen@snowpack.eu>

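The deferral pattern described in this commit can be sketched as follows: when a request arrives and every slot is busy, push it onto a deferred queue and retry it when a slot frees up. This is an assumption-laden illustration of the pattern, not the server's actual implementation.

```cpp
// Illustrative sketch of deferring tasks when no slot is available.
#include <cstdio>
#include <deque>
#include <vector>

struct task { int id; };
struct slot { bool busy = false; int task_id = -1; };

struct server {
    std::vector<slot> slots;
    std::deque<task>  deferred;

    explicit server(int n_slots) : slots(n_slots) {}

    void submit(const task & t) {
        for (auto & s : slots) {
            if (!s.busy) { s.busy = true; s.task_id = t.id; return; }
        }
        // No free slot: defer instead of rejecting the request.
        deferred.push_back(t);
    }

    void on_slot_released(int i) {
        slots[i].busy = false;
        if (!deferred.empty()) {           // re-queue the oldest deferred task
            task t = deferred.front();
            deferred.pop_front();
            submit(t);
        }
    }
};

int main() {
    server srv(2);
    srv.submit({1});
    srv.submit({2});
    srv.submit({3});                       // deferred, both slots busy
    std::printf("deferred: %zu\n", srv.deferred.size());
    srv.on_slot_released(0);               // task 3 now takes the freed slot
    std::printf("deferred: %zu\n", srv.deferred.size());
    return 0;
}
```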