Date | Commit message | Author
2023-11-05 | Allow common process_escapes to handle \x sequences (#3928) | Kerfuffle
* Allow common process_escapes to handle \x sequences
* Fix edge case when second hex digit is NUL
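A minimal sketch of the kind of \x handling described in this change, including the edge case where the second hex digit is missing, might look as follows. This is an illustration only; the helper name decode_hex_escapes_sketch is hypothetical and this is not the actual process_escapes code in common/common.cpp.

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Sketch: expand "\xNN" escapes, handling the edge case where only one
    // hex digit is present (e.g. "\x7" followed by a non-hex character or
    // the end of the string).
    static std::string decode_hex_escapes_sketch(const std::string & input) {
        std::string out;
        for (std::size_t i = 0; i < input.size(); ++i) {
            if (input[i] == '\\' && i + 2 < input.size() && input[i + 1] == 'x'
                && std::isxdigit((unsigned char) input[i + 2])) {
                int value = 0;
                std::size_t j = i + 2;
                // consume at most two hex digits; a missing second digit must
                // not read past the escape (the NUL / end-of-string edge case)
                for (int d = 0; d < 2 && j < input.size()
                     && std::isxdigit((unsigned char) input[j]); ++d, ++j) {
                    const char c = (char) std::tolower((unsigned char) input[j]);
                    value = value * 16 + (c >= 'a' ? c - 'a' + 10 : c - '0');
                }
                out += (char) value;
                i = j - 1;
            } else {
                out += input[i];
            }
        }
        return out;
    }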
2023-11-05 | server : fix typo for --alias shortcut from -m to -a (#3958) | Thái Hoàng Tâm
2023-11-05 | cuda : fix disabling device with --tensor-split 1,0 (#3951) | Jared Van Bortel
Co-authored-by: slaren <slarengh@gmail.com>
2023-11-05 | llama : mark LLM_ARCH_STARCODER as full offload supported (#3945) | Meng Zhang
as done in https://github.com/ggerganov/llama.cpp/pull/3827
2023-11-05 | cmake : MSVC instruction detection (fixed up #809) (#3923) | Eve
* Add detection code for AVX
* Only check hardware when option is ON
* Modify per code review suggestions
* Local builds will detect the CPU
* Fix CMake style to use lowercase like everywhere else
* cleanup
* fix merge
* linux/gcc version for testing
* MSVC combines AVX2 and FMA into /arch:AVX2, so check for both
* cleanup
* MSVC-only version
* style
* Update FindSIMD.cmake
Co-authored-by: Howard Su <howard0su@gmail.com>
Co-authored-by: Jeremy Dunn <jeremydunn123@gmail.com>
2023-11-05 | ci : use intel sde when ci cpu doesn't support avx512 (#3949) | Eve
2023-11-05 | cuda : revert CUDA pool stuff (#3944) | slaren
* Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918)"
  This reverts commit 629f917cd6b96ba1274c49a8aab163b1b189229d.
* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)"
  This reverts commit d6069051de7165a4e06662c89257f5d2905bb156.
ggml-ci
2023-11-04 | gguf-py: Support 01.AI Yi models (#3943) | Kerfuffle
2023-11-03 | metal : round up to 16 to fix MTLDebugComputeCommandEncoder assertion (#3938) | Peter Sugihara
2023-11-03 | ggml-metal: fix yarn rope (#3937) | Xiao-Yong Jin
2023-11-03 | ggml-cuda : move row numbers to x grid dim in mmv kernels (#3921) | slaren
2023-11-03 | speculative : change default p_accept to 0.5 + CLI args (#3919) | Georgi Gerganov
ggml-ci
2023-11-03 | common : YAYF (yet another YARN fix) (#3925) | Georgi Gerganov
ggml-ci
2023-11-03 | llama : change yarn_ext_factor placeholder to -1 (#3922) | cebtenzzre
2023-11-02 | cuda : add ROCM aliases for CUDA pool stuff (#3918) | Kerfuffle
2023-11-02 | cmake : fix relative path to git submodule index (#3915) | Andrei
2023-11-02 | readme : add notice about #3912 | Georgi Gerganov
2023-11-02 | cuda : fix const ptrs warning causing ROCm build issues (#3913) | Georgi Gerganov
2023-11-02 | cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903) | Oleksii Maryshchenko
* Using CUDA memory pools for async alloc/dealloc.
* If the CUDA device doesn't support memory pools, then use the old implementation.
* Removed redundant cublasSetStream
Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
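For context, a rough sketch of the pattern described above: prefer the CUDA stream-ordered allocator when the device advertises memory pool support, and fall back to plain cudaMalloc/cudaFree otherwise. The helper names are hypothetical and this is not the actual ggml-cuda code (which was reverted later in #3944).

    #include <cuda_runtime.h>

    // Sketch: allocate/free scratch memory, preferring the stream-ordered
    // (memory-pool-backed) allocator when the device supports it.
    static void * pool_alloc_sketch(size_t size, int device, cudaStream_t stream) {
        int pools_supported = 0;
        cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);

        void * ptr = nullptr;
        if (pools_supported) {
            // async allocation served from the device's default memory pool
            cudaMallocAsync(&ptr, size, stream);
        } else {
            // old path: synchronous allocation
            cudaMalloc(&ptr, size);
        }
        return ptr;
    }

    static void pool_free_sketch(void * ptr, int device, cudaStream_t stream) {
        int pools_supported = 0;
        cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);

        if (pools_supported) {
            cudaFreeAsync(ptr, stream);
        } else {
            cudaFree(ptr);
        }
    }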
2023-11-02 | gguf : print error for GGUFv1 files (#3908) | Georgi Gerganov
2023-11-02 | cmake : disable LLAMA_NATIVE by default (#3906) | slaren
2023-11-02 | gguf : remove special-case code for GGUFv1 (#3901) | Georgi Gerganov
ggml-ci
2023-11-02 | llm : prevent 1-D tensors from being GPU split (#3697) | Georgi Gerganov
2023-11-02 | build : link against build info instead of compiling against it (#3879) | cebtenzzre
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
  Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
2023-11-02 | cuda : check if this fixes Pascal card regression (#3882) | Georgi Gerganov
2023-11-02 | metal : fix build errors and kernel sig after #2268 (#3898) | Georgi Gerganov
2023-11-02 | cuda : fix RoPE after #2268 (#3897) | cebtenzzre
2023-11-01 | llama : fix llama_context_default_params after #2268 (#3893) | cebtenzzre
2023-11-01 | ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel (#3891) | slaren
* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
2023-11-01 | llama : implement YaRN RoPE scaling (#2268) | cebtenzzre
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
2023-11-01 | llm : fix llm_build_kqv taking unused tensor (benign, #3837) | Georgi Gerganov
2023-11-01 | llm : fix falcon norm after refactoring (#3837) | Georgi Gerganov
2023-11-01 | metal : multi-simd softmax (#3710) | Georgi Gerganov
ggml-ci
2023-11-01 | common : minor (#3715) | Georgi Gerganov
2023-11-01 | llm : add llm_build_context (#3881) | Georgi Gerganov
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
  ggml-ci
* llm : cleanup + comments
2023-11-01 | common : allow caller to handle help/argument exceptions (#3715) | bandoti
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
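A sketch of the calling pattern this change enables, assuming gpt_params_parse_ex(argc, argv, params) reports bad arguments by throwing std::invalid_argument. The exact signature and exception types are assumptions here, so check common/common.h for the authoritative interface.

    #include <cstdio>
    #include <stdexcept>

    #include "common.h"

    int main(int argc, char ** argv) {
        gpt_params params;
        try {
            // assumed interface: throws instead of printing usage and exiting,
            // so the caller decides how to react to --help or bad arguments
            gpt_params_parse_ex(argc, argv, params);
        } catch (const std::invalid_argument & e) {
            fprintf(stderr, "error: %s\n", e.what());
            return 1;
        }
        // ... run with params ...
        return 0;
    }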
2023-11-01 | log : make generating separate log files optional (#3787) | staviq
* impl --log-new, --log-append
* Update common/log.h
  Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h
  Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review
  Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 | sampling : null grammar field after reset (#3885) | l3utterfly
2023-11-01 | ggml : fix UNUSED macro (#3762) | Georgi Gerganov
2023-11-01 | finetune : add -ngl parameter (#3762) | Andrew Godfrey
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
  When I tried CUDA offloading during finetuning following the readme, I got an assert here. This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora
* Add 'finetune.sh', which currently fails when using GPU
  "error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
2023-11-01 | scripts : add server-llm.sh (#3868) | Georgi Gerganov
* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe
2023-11-01 | server : re-enable completion and embedded at the same time (#3876) | Adrian Hesketh
2023-11-01 | llama : refactor graph build code (#3837) | Georgi Gerganov
* llama : factor out ggml-alloc from graph build functions
  ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
  ggml-ci
* llama : offload rest of the models
  ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
  ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function
  ggml-ci
* llama : add llm_build_ffn helper function (#3849)
  ggml-ci
* llama : add llm_build_k_shift helper
  ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
  ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
  ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
  ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
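To illustrate the kind of helper this refactor introduces, here is a rough sketch of what an llm_build_norm-style function could look like on top of the ggml API. The signature, the helper name, and the RMS-vs-standard-norm gating are assumptions for illustration, not the exact code merged in #3837.

    #include "ggml.h"

    // Hypothetical sketch of a norm-building helper: apply RMS norm or
    // standard layer norm, then scale/shift by optional weight and bias.
    enum llm_norm_type_sketch { LLM_NORM_SKETCH, LLM_NORM_RMS_SKETCH };

    static struct ggml_tensor * llm_build_norm_sketch(
            struct ggml_context * ctx,
            struct ggml_tensor  * cur,
            struct ggml_tensor  * norm_w,   // may be nullptr
            struct ggml_tensor  * norm_b,   // may be nullptr
            llm_norm_type_sketch  type,
            float                 eps) {
        cur = (type == LLM_NORM_RMS_SKETCH)
            ? ggml_rms_norm(ctx, cur, eps)
            : ggml_norm(ctx, cur, eps);

        if (norm_w) {
            cur = ggml_mul(ctx, cur, norm_w);   // elementwise scale
        }
        if (norm_b) {
            cur = ggml_add(ctx, cur, norm_b);   // elementwise shift
        }
        return cur;
    }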
2023-10-31 | samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841) | kalomaze
* Introduce the new Min-P sampler by @kalomaze
  The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
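A small standalone sketch of the Min-P idea described above: keep only tokens whose probability is at least min_p times the probability of the most likely token. This illustrates the technique and is not the llama.cpp sampler implementation itself.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct candidate {
        int   id;
        float p;   // normalized probability of this token
    };

    // Min-P filtering: with min_p = 0.05, any token whose probability is
    // below 5% of the top token's probability is discarded (but always keep
    // at least min_keep candidates).
    static void min_p_filter(std::vector<candidate> & cands, float min_p, std::size_t min_keep = 1) {
        if (cands.empty() || min_p <= 0.0f) {
            return;
        }
        const float p_max = std::max_element(
            cands.begin(), cands.end(),
            [](const candidate & a, const candidate & b) { return a.p < b.p; })->p;

        const float threshold = min_p * p_max;

        std::vector<candidate> kept;
        for (const candidate & c : cands) {
            if (c.p >= threshold) {
                kept.push_back(c);
            }
        }
        if (kept.size() >= min_keep) {
            cands = std::move(kept);
        }
    }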
2023-10-31 | flake.nix: fix for rocm 5.7 (#3853) | Tungsten842
2023-10-30 | ggml : move FP16 <-> FP32 code to ggml-impl.h (#3861) | Georgi Gerganov
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
  ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
  ggml-ci
* ggml-impl : move extern "C" to start of file
2023-10-29 | Extend llama_kv_cache_seq_rm to allow matching any sequence (#3843) | Kerfuffle
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
  Use llama_kv_cache_clear for cache clearing. Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality.
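A short sketch of how the extended API might be used, assuming llama_kv_cache_seq_rm(ctx, seq_id, p0, p1) treats a negative seq_id as "match any sequence". That convention and the helper below are assumptions for illustration; check llama.h for the authoritative signatures.

    #include "llama.h"

    // Sketch: drop cached tokens in positions [p0, p1) regardless of which
    // sequence they belong to, then clear the whole cache for a fresh start.
    static void prune_and_reset(struct llama_context * ctx, llama_pos p0, llama_pos p1) {
        // negative seq_id assumed to mean "match any sequence"
        llama_kv_cache_seq_rm(ctx, /*seq_id=*/-1, p0, p1);

        // full clear, replacing the removed llama_kv_cache_tokens_rm
        llama_kv_cache_clear(ctx);
    }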
2023-10-29 | make : remove unnecessary dependency on build-info.h (#3842) | cebtenzzre
2023-10-29 | llama : fix kv shift bug (#3835) | Georgi Gerganov
ggml-ci
2023-10-29 | ggml : quantization refactoring (#3833) | Georgi Gerganov
* ggml : factor all quantization code in ggml-quants
  ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
  ggml-ci
* quantize : --pure option for disabling k-quant mixtures
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>