* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl (see the sketch below)
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
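
A minimal sketch of the two-layer parsing pattern this commit describes, assuming a throw-based inner parser; the actual exception types, supported flags, and signatures in common/common.h may differ:
```c++
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

struct gpt_params {
    std::string model = "models/7B/ggml-model.gguf";
};

// Inner parser: throws instead of printing usage and exiting, so callers can
// handle help requests and argument errors themselves.
static bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
    for (int i = 1; i < argc; i++) {
        const std::string arg = argv[i];
        if (arg == "-h" || arg == "--help") {
            throw std::invalid_argument("help requested");
        } else if (arg == "-m" || arg == "--model") {
            if (++i >= argc) {
                throw std::invalid_argument("missing value for " + arg);
            }
            params.model = argv[i];
        } else {
            throw std::invalid_argument("unknown argument: " + arg);
        }
    }
    return true;
}

// Outer wrapper keeps the old behavior: print usage (prefixed with a newline)
// and exit instead of returning false.
static bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
    try {
        return gpt_params_parse_ex(argc, argv, params);
    } catch (const std::invalid_argument & e) {
        fprintf(stderr, "error: %s\n", e.what());
        fprintf(stderr, "\nusage: %s [-m MODEL]\n", argv[0]);
        exit(1);
    }
}
```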
* impl --log-new, --log-append
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning by following the README, I hit an assert here.
This probably isn't an important case, because inference later prints a warning saying you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32 (see the sketch below)
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
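
The mixed-precision add comes down to converting the f16 operand per element before accumulating. A standalone toy version follows; the helper names here are illustrative, and ggml itself uses its own GGML_FP16_TO_FP32 conversion and table-based lookups rather than this scalar routine:
```c++
#include <cmath>
#include <cstdint>

// Reference scalar fp16 -> fp32 conversion (normal, subnormal, inf/nan cases).
static float fp16_bits_to_fp32(uint16_t h) {
    const float sign = (h & 0x8000) ? -1.0f : 1.0f;
    const int   exp  = (h >> 10) & 0x1f;
    const int   man  =  h        & 0x3ff;
    if (exp == 0)  return sign * std::ldexp((float) man, -24);    // zero and subnormals
    if (exp == 31) return man ? NAN : sign * INFINITY;            // inf / nan
    return sign * std::ldexp((float) (man + 1024), exp - 25);     // normal values
}

// dst[i] = src0[i] (f16) + src1[i] (f32), with an f32 destination, loosely
// mirroring the "add_f16_f32_f32_cuda" naming; the CPU path converts per element.
static void add_f16_f32_f32(const uint16_t * src0, const float * src1, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; i++) {
        dst[i] = fp16_bits_to_fp32(src0[i]) + src1[i];
    }
}
```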
* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe
* llama : factor out ggml-alloc from graph graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function (see the sketch after this list)
ggml-ci
* llama : add llm_build_ffn helper function (#3849)
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
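
As an illustration of what these helpers factor out, here is a rough sketch in the spirit of llm_build_norm: apply RMS norm or classic norm, then the optional scale and bias weights. The real helper in llama.cpp carries extra context (offload callbacks, per-layer tensor naming), so treat this as a sketch rather than the actual code:
```c++
#include "ggml.h"

enum llm_norm_type { LLM_NORM, LLM_NORM_RMS };

static struct ggml_tensor * llm_build_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * mw,   // norm weight (may be NULL)
        struct ggml_tensor  * mb,   // norm bias   (may be NULL)
        enum llm_norm_type    type,
        float                 eps) {
    switch (type) {
        case LLM_NORM:     cur = ggml_norm    (ctx, cur, eps); break;
        case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, eps); break;
    }
    if (mw) { cur = ggml_mul(ctx, cur, mw); }   // element-wise scale
    if (mb) { cur = ggml_add(ctx, cur, mb); }   // optional bias
    return cur;
}
```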
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token (see the sketch below).
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
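
A minimal standalone sketch of the filtering rule described above: keep a token only if its probability is at least *p* times the probability of the most likely token. The real implementation works on llama.cpp's llama_token_data_array and has more bells and whistles; the plain structs here are just for illustration:
```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct token_prob { int32_t id; float p; };

static void min_p_filter(std::vector<token_prob> & cands, float p, size_t min_keep = 1) {
    if (cands.empty() || p <= 0.0f) {
        return;
    }
    float max_p = 0.0f;
    for (const auto & c : cands) {
        max_p = std::max(max_p, c.p);
    }
    // e.g. p = 0.05 keeps tokens within 5% of the top token's probability
    const float threshold = p * max_p;
    std::vector<token_prob> kept;
    for (const auto & c : cands) {
        if (c.p >= threshold) {
            kept.push_back(c);
        }
    }
    if (kept.size() >= min_keep) {
        cands = std::move(kept);
    }
}
```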
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing.
Change calls to llama_kv_cache_tokens_rm that delete by position to use llama_kv_cache_seq_rm instead.
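
A hedged usage sketch of the resulting API (signatures as of this change; consult llama.h for the current ones):
```c++
#include "llama.h"

static void kv_cache_examples(struct llama_context * ctx) {
    // Wipe the entire KV cache (replaces llama_kv_cache_tokens_rm for full clears):
    llama_kv_cache_clear(ctx);

    // Remove positions [10, 20): a negative seq_id now matches any sequence,
    // which covers the old position-based removal use case.
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/ -1, /*p0=*/ 10, /*p1=*/ 20);
}
```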
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts) (#3797)
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Allow quantizing k-quants to fall back when the tensor size is incompatible (#3747)
* quantize : add a warning when tensors are incompatible with k-quants
Clean up k-quants state passing a bit
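
An illustrative sketch of the fallback rule: k-quants operate on 256-wide super-blocks (QK_K), so a tensor whose row size is not a multiple of 256 cannot use them and gets a compatible non-k type instead. The specific type mapping below is only an example; the actual choices in llama.cpp's quantizer are more nuanced:
```c++
#include <cstdint>

enum qtype { Q4_K, Q5_K, Q6_K, Q4_1, Q5_1, Q8_0 };

static const int64_t QK_K = 256;   // k-quant super-block size

static qtype pick_quant_type(qtype wanted, int64_t ne0, bool * did_fallback) {
    *did_fallback = false;
    if (ne0 % QK_K != 0) {          // row size incompatible with k-quants
        *did_fallback = true;       // caller can emit the warning mentioned above
        switch (wanted) {
            case Q4_K: return Q4_1;
            case Q5_K: return Q5_1;
            case Q6_K: return Q8_0;
            default:   break;
        }
    }
    return wanted;
}
```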
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
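
A standalone toy of the convention described above, not the llama.cpp sampler itself: temperature exactly 0.0 means plain greedy sampling with no probabilities computed, while a negative temperature means greedy sampling with the softmax probabilities still filled in for callers to inspect:
```c++
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct candidate { int id; float logit; float p; };

static int sample_greedy(std::vector<candidate> & cands, float temp) {
    if (cands.empty()) {
        return -1;
    }
    if (temp < 0.0f) {
        // negative temperature: still greedy, but compute softmax probs
        float max_logit = cands[0].logit;
        for (const auto & c : cands) {
            max_logit = std::max(max_logit, c.logit);
        }
        double sum = 0.0;
        for (auto & c : cands) {
            c.p = std::exp(c.logit - max_logit);
            sum += c.p;
        }
        for (auto & c : cands) {
            c.p = (float) (c.p / sum);
        }
    }
    // temp == 0.0 (or negative): pick the highest-logit token either way
    size_t best = 0;
    for (size_t i = 1; i < cands.size(); i++) {
        if (cands[i].logit > cands[best].logit) {
            best = i;
        }
    }
    return cands[best].id;
}
```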
(#3823)
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
* speculative: Ensure draft and target model vocabs match
* Tolerate small differences when checking the draft vs. target vocab
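
A hedged sketch of the idea: compare the draft and target vocabularies but tolerate small differences instead of requiring an exact match. Plain std::string vectors stand in for the llama.cpp API, and the limits and function name are illustrative, not the exact ones used in examples/speculative:
```c++
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if the draft vocab is "close enough" to the target vocab:
// sizes may differ slightly, and a handful of token strings may differ.
static bool vocabs_compatible(const std::vector<std::string> & tgt,
                              const std::vector<std::string> & dft,
                              size_t max_size_diff  = 100,
                              size_t max_token_diff = 5) {
    const size_t a = tgt.size(), b = dft.size();
    if ((a > b ? a - b : b - a) > max_size_diff) {
        return false;
    }
    size_t n_mismatch = 0;
    for (size_t i = 0; i < std::min(a, b); i++) {
        if (tgt[i] != dft[i] && ++n_mismatch > max_token_diff) {
            return false;
        }
    }
    return true;
}
```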
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
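
A rough sketch of the dispatch policy described above: quantized mat-muls go through the custom MMQ kernels for small batches (or always, when forced at build time via GGML_CUDA_FORCE_MMQ), and through cuBLAS with tensor cores for larger batches on Volta and newer. The cutoff value and the function name are illustrative, not the actual ggml-cuda code:
```c++
// Decide between the MMQ kernels and the cuBLAS/tensor-core path.
static bool use_mmq(int batch_size, int compute_capability) {
#ifdef GGML_CUDA_FORCE_MMQ
    (void) batch_size; (void) compute_capability;
    return true;                       // build option: always use MMQ kernels
#else
    const int MMQ_MAX_BATCH = 32;      // illustrative threshold
    if (compute_capability < 700) {    // pre-Volta: no usable tensor cores
        return true;
    }
    return batch_size <= MMQ_MAX_BATCH;
#endif
}
```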
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
* server : do not block system prompt update
* server : update state machine logic to process system prompts
* server : minor
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCm. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside of what the pull request changed. This is also my first attempt at a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases (see the sketch below)
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
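
A hedged host-side sketch of the strided-batched path: when the batch of src0/src1 matrices is laid out contiguously (no broadcasting), one cublasGemmStridedBatchedEx call replaces a loop of plain GEMMs. The wrapper name is hypothetical, all matrices are column-major, inputs and accumulation are FP16 (roughly matching the attention-ops use case), the compute-type enum assumes CUDA 11+, and error checking is omitted:
```c++
#include <cublas_v2.h>
#include <cuda_fp16.h>

static void gemm_batched_f16(cublasHandle_t handle,
                             const half * A,   // [m x k] per batch, column-major
                             const half * B,   // [k x n] per batch, column-major
                             half       * C,   // [m x n] per batch, column-major
                             int m, int n, int k, int batch_count) {
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    // One call computes C_i = A_i * B_i for i = 0 .. batch_count-1.
    cublasGemmStridedBatchedEx(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        A, CUDA_R_16F, /*lda=*/ m, /*strideA=*/ (long long) m * k,
        B, CUDA_R_16F, /*ldb=*/ k, /*strideB=*/ (long long) k * n,
        &beta,
        C, CUDA_R_16F, /*ldc=*/ m, /*strideC=*/ (long long) m * n,
        batch_count,
        CUBLAS_COMPUTE_16F,
        CUBLAS_GEMM_DEFAULT);
}
```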
* Add more tokenizer tests
* Add starcoder
* Update test vocab files
* Restrict bpe tokenizer tests to unicode planes
* Update comment
* Comment cosmetics
* Remove bloom vocab/test
This reverts commit 96981f37b1e3f450d9e63e571514217bf60f0a7f.
See:
https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
tokenizers (#3746)
We still have the heads-up in `README.md` regarding `bpe` tokenizers, and this patch is needed for
- a couple of tokenizer tests
- some more `special` and `non-special` added-token handling (as far as I understand it)
* Update special token handling
* Add mpt