2024-01-09  server : update readme about token probs (#4777)  (Behnam M)
  * Updated the server README to reflect the gg/server-token-probs-4088 commit: added an explanation of the API's completion result, which now includes `completion_probabilities`, plus a JSON schema showing its type/structure.
  * Simplified the `completion_probabilities` JSON schema so its structure is easier to understand.
  * minor : fix trailing whitespace
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
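A rough sketch of the response shape being documented (field names inferred from the server API of that period; the server README is the authoritative schema, and the exact keys may differ):

```json
{
  "content": "...",
  "completion_probabilities": [
    {
      "content": "Hello",
      "probs": [
        { "tok_str": "Hello", "prob": 0.92 },
        { "tok_str": "Hi",    "prob": 0.05 }
      ]
    }
  ]
}
```

Each generated token carries the candidate strings considered at that step together with their probabilities.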
2024-01-09  server : add api-key flag to documentation (#4832)  (Zsapi)
  Document the `--api-key` flag added to the server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-09  ggml : fix vld1q_s8_x4 32-bit compat (#4828)  (Georgi Gerganov)
  * ggml : fix vld1q_s8_x4 32-bit compat (ggml-ci)
  * ggml : fix 32-bit ARM compat (cont) (ggml-ci)
2024-01-09  CUDA: faster softmax via shared memory + fp16 math (#4742)  (Johannes Gäßler)
2024-01-08  common : fix the short form of `--grp-attn-w`, not `-gat` (#4825)  (howlger)
  See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57
2024-01-08  readme : add link to SOTA models  (Georgi Gerganov)
2024-01-08  SOTA 2-bit quants (#4773)  (Kawrakow)
  * iq2_xxs: basics
  * iq2_xxs: scalar and AVX2 dot products. Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later; for now this is what we have.
  * iq2_xxs: ARM_NEON dot product. Somehow strangely slow (112 ms/token).
  * iq2_xxs: WIP Metal. Dequantize works, something is still wrong with the dot product.
  * iq2_xxs: Metal dot product now works. We have PP-512 = 475 t/s, TG-128 = 47.3 t/s. Not the greatest performance, but not complete garbage either.
  * iq2_xxs: slightly faster dot product. TG-128 is now 48.4 t/s.
  * iq2_xxs: slightly faster dot product. TG-128 is now 50.9 t/s.
  * iq2_xxs: even faster Metal dot product. TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory.
  * iq2_xxs: dequantize CUDA kernel - fix conflict with master
  * iq2_xxs: quantized CUDA dot product (MMVQ). We get TG-128 = 153.1 t/s.
  * iq2_xxs: slightly faster CUDA dot product. TG-128 is now at 155.1 t/s.
  * iq2_xxs: add to llama ftype enum
  * iq2_xxs: fix MoE on Metal
  * Fix missing MMQ ops when on hipBLAS. I had put the ggml_supports_mmq call at the wrong place.
  * Fix bug in quantize_row_iq2_xxs. The 0.25f factor was missing. Great detective work by @ggerganov!
  * Fixing tests
  * PR suggestion
  Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
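The Q8_K range change mentioned above (quants in -127...127 rather than -128...127) keeps the quantized range symmetric around zero, which simplifies SIMD sign handling. A minimal Python sketch of the idea follows; it is not the actual ggml Q8_K code, which additionally splits values into blocks of 256 and stores per-block scales and sums:

```python
# Sketch of symmetric 8-bit quantization clamped to [-127, 127].
# NOT the real ggml Q8_K implementation, just the core idea: a shared
# scale derived from the absolute maximum, with a symmetric range so
# that -x always quantizes to the negation of x's quant.

def quantize_symmetric_i8(values):
    """Map floats to int8 quants in [-127, 127] with a shared scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0] * len(values), 0.0
    scale = amax / 127.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return quants, scale

def dequantize(quants, scale):
    """Recover approximate floats from quants and the shared scale."""
    return [q * scale for q in quants]

quants, scale = quantize_symmetric_i8([0.5, -1.0, 0.25])
# The absolute-maximum value maps exactly to +/-127; the rest are
# rounded to the nearest step of `scale`.
```

With -128 in play, the negative end would have one extra step and the symmetry above would break, which is what made the IQ2_XXS AVX path awkward.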
2024-01-08  swift : exclude ggml-metal.metal from the package (#4822)  (Georgi Gerganov)
2024-01-08  llama.swiftui : update readme  (Georgi Gerganov)
2024-01-08  main : add self-extend support (#4815)  (Georgi Gerganov)
  * examples : add passkey test
  * passkey : better prints
  * passkey : select pass key pos from CLI
  * passkey : simplify n_past logic
  * llama : "self-extend"-like context extension
  * passkey : add comment
  * main : add Self-Extend support
  * llama : add comment about llama_kv_cache_seq_div
2024-01-08  examples : add passkey test (#3856)  (Georgi Gerganov)
  * examples : add passkey test
  * passkey : better prints
  * passkey : select pass key pos from CLI
  * passkey : simplify n_past logic
  * make : add passkey target
  * passkey : add "self-extend"-like context extension (#4810)
  * llama : "self-extend"-like context extension
  * passkey : add comment
  * passkey : add readme
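The "self-extend"-like context extension referenced in these commits remaps token positions so that distant tokens share grouped (floor-divided) positions, letting a model attend beyond its trained context (cf. the llama_kv_cache_seq_div comment above). A minimal sketch of the position arithmetic, assuming a group size and neighbor window in the spirit of the `--grp-attn-n` / `--grp-attn-w` flags; the real implementation shifts and divides KV-cache ranges incrementally rather than computing per-pair distances:

```python
# Sketch of the Self-Extend effective relative position between a query
# and a key. Nearby tokens keep exact relative positions; distant tokens
# fall back to grouped (floor-divided) positions, so the model never
# sees a relative distance larger than it was trained on.

def self_extend_rel_pos(q_pos, k_pos, group_size, window):
    if q_pos - k_pos < window:
        # Neighbor attention: exact relative distance.
        return q_pos - k_pos
    # Grouped attention: compress both positions by group_size, then
    # shift the query side so the two regimes meet at the window edge.
    g_q = q_pos // group_size + window - window // group_size
    g_k = k_pos // group_size
    return g_q - g_k
```

For example, with group_size=4 and window=8, a key 100 tokens behind the query is seen at an effective distance of only 31, while keys within the window keep their true distances.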
2024-01-07  readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)  (Lars Grammel)
2024-01-07  llama-bench : add no-kv-offload parameter (#4812)  (slaren)
2024-01-07  CUDA: fixed redundant value dequantization (#4809)  (Johannes Gäßler)
2024-01-07  llama : remove unused vars (#4796)  (Georgi Gerganov)
2024-01-07  llama : remove redundant GQA check (#4796)  (Georgi Gerganov)
2024-01-07  llama.swiftui : use llama.cpp as SPM package (#4804)  (Alex Azarov)
2024-01-07  llama : print tensor meta for debugging  (Georgi Gerganov)
2024-01-07  llama.swiftui : add visionOS target (#4805)  (Alex Azarov)
2024-01-07  ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (#4787)  (Konstantin Zhuravlyov)
2024-01-07  server : fix n_predict check (#4798)  (Georgi Gerganov)
2024-01-06  llama.swiftui : use correct pointer for llama_token_eos (#4797)  (Daniel Illescas Romero)
2024-01-06  examples : improve base-translate.sh script (#4783)  (Georgi Gerganov)
2024-01-05  cmake : check for openblas64 (#4134)  (a-n-n-a-l-e-e)
  The OpenBLAS v0.3.22 64-bit pkg-config file is named openblas64.pc; see https://github.com/OpenMathLib/OpenBLAS/issues/3790
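An illustrative sketch of how such a check can look in CMake (the actual change in #4134 may be structured differently): probe for the 64-bit-index `openblas64.pc` first, then fall back to the plain `openblas.pc`.

```cmake
# Prefer the 64-bit interface (openblas64.pc), falling back to the
# 32-bit-index openblas.pc if it is not present.
find_package(PkgConfig REQUIRED)
pkg_check_modules(OPENBLAS openblas64)
if (NOT OPENBLAS_FOUND)
    pkg_check_modules(OPENBLAS REQUIRED openblas)
endif()
```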
2024-01-05  flake.nix : fix typo (#4700)  (Ikko Eltociear Ashimine)
  betwen -> between
2024-01-05  metal : switch back to default.metallib (ggml/681)  (Georgi Gerganov)
  ggml-ci
2024-01-05  ggml : fix q2_k bpw in comments (ggml/680)  (Georgi Gerganov)
2024-01-05  ggml : add error handling to graph_compute (whisper/1714)  (Finn Voorhees)
2024-01-05  ggml : do not sched_yield when calling BLAS (#4761)  (Georgi Gerganov)
  * ggml : do not sched_yield when calling BLAS (ggml-ci)
  * ggml : fix do_yield logic (ggml-ci)
  * ggml : simplify do_yield logic (ggml-ci)
2024-01-05  examples : add few-shot translation example (#4783)  (Georgi Gerganov)
2024-01-04  finetune : remove unused includes (#4756)  (Daniel Bevenius)
  This commit removes unused includes from finetune.cpp.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-04  server : send token probs for "stream == false" (#4714)  (Georgi Gerganov)
2024-01-04  Print backend name on test-backend-ops failure (#4751)  (Johannes Gäßler)
2024-01-04  llama.swiftui : support loading custom model from file picker (#4767)  (singularity)
  * swiftui : support loading a model from the file picker
  * swiftui : remove trailing whitespace
2024-01-04  server : fix options in README.md (#4765)  (Michael Coppola)
  * fix examples/server/README.md
  * minor : fix whitespace
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04  ggml : include stdlib.h before intrin.h (#4736)  (Georgi Gerganov)
2024-01-04  llama.swiftui : fix build of ggml.metallib (#4754)  (singularity)
  * metal : fix Metal backend init failure in swiftui
  * metal : build ggml.metallib instead of copying the source
  * llama.swift : remove debug flags from metallib build
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03  train : fix typo in overlapping-samples help msg (#4758)  (Daniel Bevenius)
  This commit fixes a typo in the help message for the --overlapping-samples option.
  Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-03  swift : update Package.swift to use ggml as dependency (#4691)  (Ashraful Islam)
  * Updates Package.swift to use ggml as a dependency
  * Changes the ggml package URL source to ggerganov
2024-01-03  cuda : simplify expression  (Georgi Gerganov)
  Co-authored-by: slaren <slarengh@gmail.com>
2024-01-03  cuda : mark I16 and I32 ops as unsupported  (Georgi Gerganov)
  ggml-ci
2024-01-03  sync : ggml  (Georgi Gerganov)
  ggml-ci
2024-01-03  metal : add kernel_get_rows_i32  (Georgi Gerganov)
  ggml-ci
2024-01-03  scripts : fix sync order + metal sed  (Georgi Gerganov)
2024-01-03  ggml : extend ggml_get_rows, ggml_repeat, ggml_concat (ggml/639)  (Guillaume Wenzek)
  * add more int ops
  * ggml_compute_forward_dup_bytes
  * add tests
  * PR comments
  * tests : minor indentations
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03  server : throw an error when `slot unavailable` (#4741)  (Justin Parker)
2024-01-02  metal : optimize ggml_mul_mat_id (faster Mixtral PP) (#4725)  (Georgi Gerganov)
  * ggml : disable fast-math for Metal (cmake build only) (ggml-ci)
  * metal : fix Metal API debug warnings
  * cmake : add -fno-inline for Metal build (#4545)
  * metal : fix API debug warnings
  * metal : fix compile warnings
  * metal : use uint64_t for strides
  * cmake : rename option to LLAMA_METAL_SHADER_DEBUG
  * metal : fix mat-vec Q8_0 kernel for BS > 1
  * metal : normalize mat-vec kernel signatures
  * cmake : respect LLAMA_QKK_64 option
  * metal : fix mat-vec Q4_K kernel for QK_K == 64
  * metal : optimizing ggml_mul_mat_id (wip)
  * metal : minor fix
  * metal : opt mul_mm_id
2024-01-02  server : add token counts to html footer (#4738)  (Phil H)
  * server : add token counts to stats
  * server : generate hpp
  Co-authored-by: phiharri <ph@got-root.co.uk>
2024-01-02  llama : llama_model_desc print number of experts  (Georgi Gerganov)
2024-01-02  llama : replace all API facing `int`'s with `int32_t` (#4577)  (Marcus Dunn)
  * Replaced all API-facing `int`'s with `int32_t`
  * Formatting, and a missed `int` in `llama_token_to_piece`