Age  Commit message  Author
2023-08-25  ci : pip install gguf in editable mode (#2782)  (Georgi Gerganov)
ggml-ci
2023-08-25  gguf : export objects to user code (#2780)  (M. Yusuf Sarıgöz)
* gguf export more objects to user code
* gguf export all objects to user code for now
* gguf : bump version
2023-08-25  ROCm Port (#1087)  (Henri Vasserman)
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP
---------
Co-authored-by: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Co-authored-by: ardfork <134447697+ardfork@users.noreply.github.com>
Co-authored-by: funnbot <22226942+funnbot@users.noreply.github.com>
Co-authored-by: Engininja2 <139037756+Engininja2@users.noreply.github.com>
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Co-authored-by: jammm <2500920+jammm@users.noreply.github.com>
Co-authored-by: jdecourval <7315817+jdecourval@users.noreply.github.com>
2023-08-25  cuda : add RoPE kernel for mode == 2 (NeoX) (#2760)  (Georgi Gerganov)
* cuda : add RoPE kernel for mode == 2 (NeoX)
* falcon : do not offload the embeddings layer
2023-08-25  gguf : make gguf pip-installable  (M. Yusuf Sarıgöz)
* gitignore : add dist and rm pyproject.toml
* gguf: prepare as Pip package
* gguf: prepare as Pip package
* gguf : fix line endings
* requirements : add gguf
* gguf : update readme with build notes
* gguf : update readme with build notes
* gguf : add notes for tests
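For reference, a minimal sketch of writing a GGUF file with the pip-installed package. The GGUFWriter calls mirror how the repo's later conversion scripts use it; the file name, tensor, and exact method names are illustrative assumptions and may differ from the package version at this point in history.

```python
# Hypothetical usage of the gguf Python package (API assumed from the
# conversion scripts; adjust names to match the installed version).
import numpy as np
import gguf

writer = gguf.GGUFWriter("example.gguf", "llama")   # output path is made up
writer.add_name("tiny-example")                     # optional metadata key
writer.add_tensor("tok_embeddings.weight",
                  np.zeros((16, 8), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```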
2023-08-25  ggml-alloc : enlarge size of parse_seq (#2776)  (Shouzheng Liu)
Since we also store barriers in this array, we need to double its size.
2023-08-24  Added `enum` to `llama_token_get_type` return type (#2774)  (Marcus Dunn)
2023-08-24  convert.py : try to determine n_ctx automatically for CodeLlama (#2770)  (slaren)
2023-08-24  gguf : add rope_freq_base parameter for CodeLlama (#2769)  (slaren)
2023-08-24  falcon : write file type  (Georgi Gerganov)
2023-08-24  metal : bug-fix when enabling ggml-alloc (#2757)  (Shouzheng Liu)
* metal: better memory alloc w/ concurrency dispatch
  The ggml-alloc should only free tensors at memory barriers.
* ggml-alloc: avoid returning silently
  In certain cases, the allocate_node() function may silently return without performing any memory allocation.
2023-08-24  convert : auto-determine model name based on dir + scripts update  (Georgi Gerganov)
2023-08-24  Fix for main example getting stuck when -n -2 and --interactive (#2767)  (Kerfuffle)
* Fix for main example getting stuck when -n -2 and --interactive
* Add a comment so future generations may suffer less.
2023-08-24  fix convert.py for codellama, add llama 34B to the list of recognized models (#2768)  (slaren)
2023-08-24  Tag release with build number (#2732)  (DannyDaemonic)
* Modified build.yml to use build number for release
* Add the short hash back into the tag
* Prefix the build number with b
2023-08-24  metal : add Q8_0 support (#2763)  (Georgi Gerganov)
* metal : add dequantize_q8_0 kernel
* metal : add mul_mat_q8_0_f32 kernel
* metal : add Q8_0 mul_mm kernel
2023-08-24  llama : escape all U+2581 in a string (#2750)  (Georgi Gerganov)
2023-08-24  llama : fix grammar sometimes generating null char (#2756)  (Evan Jones)
2023-08-23  readme : fix link  (Georgi Gerganov)
2023-08-23  minor : fix trailing whitespace  (Georgi Gerganov)
2023-08-23  readme : update hot topics  (Georgi Gerganov)
2023-08-23  llm : add Falcon support (#2717)  (Georgi Gerganov)
* llama : refactor GGUF constants into static maps
* llama : check if model architecture is known
* llama : refactor llama_model_load_internal()
* gguf : add KV constant maps
* llm : read arch-specific KVs
* convert : add dummy scores + types
* falcon : load tensor data (CPU only)
* llama : fix loading progress bar
* llama : add arch member to llama_model
* falcon : CPU inference working
* falcon : support non-40B models
* falcon : minor
* llama : minor updates
  ggml-ci
* convert-falcon-hf-to-gguf.py : fix special token mapping
* llama.cpp : llama default UNK token = id 0
* llama.cpp : fix bpe tokenizer
* llama.cpp : fix the fix of bpe tokenizer
* ggml : pass eps to ggml_norm
* metal : implement RoPE (mode = 2) + avoid ggml_repeat
* ggml : ggml_repeat always creates new tensor
* falcon : copy-paste self-attention from LLaMA
* metal : print extra compute pipeline info
* falcon : minor changes (still chasing the Metal problem)
* llama.cpp : fix linefeed token
* metal : fix GELU kernel numerical stability by using precise::tanh
* metal : temporary workaround for the concurrency optimization bug
* falcon : add CUDA offloading (#2739)
* llama : better model naming and size reporting
* llama : prep new tokenizer support
* llama : advanced BPE tokenizer based on ggllm.cpp implementation
* llama : remove obsolete comment
  ggml-ci
* common : remove obsolete BPE API + disable test-tokenizer-1
* llama : revert BPE special-case in llama_byte_to_token()
* cuda : add TODOs for RoPE NeoX implementation
* llama : default special tokens based on vocab type
* perplexity : add log for start of tokenization
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
2023-08-23  minor : fix trailing whitespace  (Georgi Gerganov)
2023-08-23  examples : restore the functionality to import llama2.c models (#2685)  (Olivier Chafik)
* Fix import of llama2.c models that don't share weights between embedding layers
* llama2c: reinstate ggmlv3 conversion output + update readme w/ gguf conv
* llama2.c: comment out legacy "load from ggml model" logic
* llama2.c: convert special-cased "<0xXX>" single byte tokens from tokenizer.bin
2023-08-23  fix convert-lora-to-ggml.py (#2738)  (slaren)
2023-08-23  main : insert bos if no tokens (#2727)  (klosax)
* main.cpp : insert bos if no tokens
* Update examples/main/main.cpp
* Update examples/main/main.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-08-23  gitignore : fix for windows (#2729)  (akawrykow)
2023-08-23  chmod : make scripts executable (#2675)  (Cebtenzzre)
2023-08-23  devops : RPM Specs (#2723)  (JohnnyB)
* Create llama-cpp.srpm
* Rename llama-cpp.srpm to llama-cpp.srpm.spec
  Correcting extension.
* Tested spec success.
* Update llama-cpp.srpm.spec
* Create lamma-cpp-cublas.srpm.spec
* Create lamma-cpp-clblast.srpm.spec
* Update lamma-cpp-cublas.srpm.spec
  Added BuildRequires
* Moved to devops dir
2023-08-23  Fix values shown in the quantize tool help (#2735)  (Kawrakow)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23  Strided perplexity (#2714)  (Kawrakow)
* Implementing strided computation of perplexity
* Alternative way to output PPL results
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-23  Fix ggml to gguf conversion on Windows (#2733)  (IgnacioFDM)
This fixes `RuntimeWarning: overflow encountered in long_scalars` Credit: anon (not mine)
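A small illustration of the class of overflow being fixed here, not the converter's actual code: NumPy's default integer scalars are 32-bit on Windows, so size arithmetic can wrap around, and converting to Python ints before multiplying avoids it.

```python
# Illustrative only: 32-bit scalars overflow on large size products, which is
# the kind of warning this fix addresses on Windows.
import numpy as np

n_elements = np.int32(100_000)
element_size = np.int32(40_000)

# n_elements * element_size wraps around in 32-bit arithmetic; casting to
# Python int first gives an exact, arbitrary-precision result.
n_bytes = int(n_elements) * int(element_size)
print(n_bytes)  # 4000000000
```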
2023-08-23  server : allow json array in prompt or content for direct token input (#2306)  (Xiao-Yong Jin)
* server: allow json array in prompt or content
  We accept an array of strings and numbers representing tokens, in addition to the current string-valued prompt or content. This allows direct token input, so that any special tokens can be processed and added to the json data at the frontend before sending to the server, and the server does not need to know or parse special tokens from textual input. With this, we can use the EOS and BOS tokens used in llama-2-chat models.
* server: use tokenizePrompt(json) and default "" if empty prompt
* server: fix prompt check
* server: tokenize endpoint no longer adds BOS
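A sketch of what direct token input looks like from a client, assuming the examples/server /completion endpoint with its prompt and n_predict fields; the BOS id of 1 is the usual LLaMA value and is used here only for illustration.

```python
# Hypothetical client: the prompt mixes a raw token id (BOS) with plain text,
# which this change accepts in place of a single string.
import json
import urllib.request

payload = {
    "prompt": [1, "[INST] Write a haiku about llamas. [/INST]"],
    "n_predict": 64,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```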
2023-08-22  docs : add grammar docs (#2701)  (Evan Jones)
* docs : add grammar docs
* tweaks to grammar guide
* rework GBNF example to be a commented grammar
2023-08-22  Improve handling of special tokens in GGML to GGUF converter (#2725)  (Kerfuffle)
* Improve UNK, BOS, EOS token handling when converting without metadata.
* Allow importing as a module.
* Remove some obsolete code and minor cleanups.
* Set default UNK token mapping from -1 to 0 in llama.cpp
* Try to handle overflow due to buggy Windows Python with a better error message
2023-08-23  llama : fix whitespace escaping in tokenizer (#2724)  (goerch)
2023-08-22  CUDA: use mul_mat_q kernels by default (#2683)  (Johannes Gäßler)
2023-08-22  convert.py : clarifying error message (#2718)  (Alex Petenchea)
2023-08-22  Fix CUDA softmax by subtracting max value before exp (#2665)  (Jiahao Li)
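The fix applies the standard max-subtraction trick for numerical stability; a NumPy illustration of the idea (not the CUDA kernel itself):

```python
# Numerically stable softmax: subtracting the max before exp() prevents
# overflow without changing the result mathematically.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # finite, sums to 1
```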
2023-08-22  gguf : add ftype meta info to the model (#2710)  (Georgi Gerganov)
* llama : add ftype meta info to the model
  ggml-ci
* convert.py : add ftype when converting (does not work)
* convert.py : fix Enum to IntEnum
  ggml-ci
2023-08-22  Quantization improvements for k_quants (#2707)  (Kawrakow)
* Improve LLaMA-2 2-, 3- and 4-bit quantization
* Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of attention.wv and feed_forward.w2
  This leads to a slight model size increase as follows:
    Q2_K  : 2.684G vs 2.670G
    Q3_K_S: 2.775G vs 2.745G
    Q3_K_M: 3.071G vs 3.057G
    Q4_K_S: 3.592G vs 3.563G
  LLaMA-2 PPL for context 512 changes as follows:
    Q2_K  : 6.6691 vs 6.8201
    Q3_K_S: 6.2129 vs 6.2584
    Q3_K_M: 6.0387 vs 6.1371
    Q4_K_S: 5.9138 vs 6.0041
  There are improvements for LLaMA-1 as well, but they are way smaller than the above.
* Minor 4-bit quantization improvement
  For the same model size as the previous commit, we get PPL = 5.9069 vs 5.9138.
* Some more fine tuning
* Adding make_qkx2_quants
  With it, we get PPL = 5.8828 for L2-7B Q4_K_S.
* Another minor improvement
* Q2_K improvement
  Smaller model, lower perplexity.
    7B : file size = 2.632G, PPL = 6.3772 vs original 2.670G, PPL = 6.8201
    13B: file size = 5.056G, PPL = 5.4577 vs original 5.130G, PPL = 5.7178
  It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk, which are Q2_K
* Iterating
* Revert Q5_K back to make_qkx1_quants
* Better Q6_K
* make_qkx2_quants is better for Q5_K after all
* Fix after rebasing on master
* Fix for changed tensor names
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2023-08-22  embedding : evaluate prompt in batches (#2713)  (slaren)
2023-08-22  ggml-cuda : use graph allocator (#2684)  (slaren)
use a different function for no_alloc to avoid breaking backwards compat, fixes lora
remove 512 n_batch limit
fixed 2048 batch size
cleanup
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2023-08-22  ggml : sync latest (SAM + SD operators, CUDA alibi) (#2709)  (Georgi Gerganov)
* ggml : sync latest (SAM + SD operators, CUDA alibi)
  ggml-ci
* ggml : fix tabs
2023-08-22  llama-bench : minor fixes (#2695)  (slaren)
2023-08-22  ggml : support CUDA's half type for aarch64 (#1455) (#2670)  (Kylin)
* ggml: support CUDA's half type for aarch64 (#1455)
  support CUDA's half type for aarch64 in ggml_fp16_t definition
* ggml: use __CUDACC__ to recognise nvcc compiler
2023-08-22  metal : add missing barriers for mul-mat (#2699)  (Shouzheng Liu)
2023-08-22  server : fallback to default if client param is null (#2688)  (Jhen-Jie Hong)
* server : fallback to default if client param is null
* server : do not overwrite 404 if status is 500 from exception_handler
2023-08-21  Fix convert-llama-ggmlv3-to-gguf.py vocab conversion (#2698)  (Kerfuffle)
When converting without metadata, the hex values for byte entries weren't zero-padded to 2 digits.
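A one-line illustration of the padding issue, not the converter's actual code: formatting a byte without a fixed width drops the leading zero, while "02x" always yields two hex digits.

```python
b = 0x0A
print(f"{b:x}")    # "a"  -> loses the leading zero
print(f"{b:02x}")  # "0a" -> always 2 hex digits, as the vocab entries require
```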
2023-08-21  py : remove obsolete script  (Georgi Gerganov)