2024-01-13  convert : update phi-2 to latest HF repo (#4903)  (Georgi Gerganov)
* convert : update phi-2 to latest HF repo
  ggml-ci
* py : try to fix flake stuff

2024-01-12  sync : ggml  (Georgi Gerganov)

2024-01-12  ggml : fix 32-bit ARM compat for IQ2_XS (whisper/1758)  (Georgi Gerganov)
* ggml : fix 32-bit ARM compat
* ggml : fix fix
* ggml : fix fix fix

2024-01-12  backend_sched : fix assignments  (slaren)
ggml-ci

2024-01-12  examples : add pydantic models to GBNF grammar generator (#4883)  (Maximilian Winter)
* Create pydantic-models-to-grammar.py
* Added some comments for usage
* Refactored Grammar Generator
  Added example and usage instruction.
* Update pydantic_models_to_grammar.py
* Update pydantic-models-to-grammar-examples.py
* Renamed module and imported it.
* Update pydantic-models-to-grammar.py
* Renamed file and fixed grammar generator issue.

2024-01-12  CUDA: faster q8_0 -> f16 dequantization (#4895)  (Johannes Gäßler)

2024-01-12  llama : ggml-backend integration (#4766)  (slaren)
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
  ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
  ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
  ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
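
For reference, a hedged sketch of how the new split mode might be exercised from the command line. Only the `--split-mode` / `-sm` flag and its none/layer/row values come from this commit; the model path, `-ngl` value, and prompt below are placeholders.

```
# Sketch: offload all layers and split the model by rows across the available GPUs.
./main -m models/7B/ggml-model-q4_0.gguf -ngl 99 --split-mode row -p "Hello"
```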

2024-01-12  llama : remove redundant assert for StableLM (#4901)  (Georgi Gerganov)

2024-01-12  export-lora : use LLAMA_FILE_MAGIC_GGLA (#4894)  (Daniel Bevenius)
This commit replaces the magic number used in export-lora.cpp with the one defined in llama.h, which is indirectly included via common.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
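
As a rough illustration of what the change amounts to (a sketch under assumptions, not the actual export-lora.cpp code): validate the adapter file's magic against the constant from llama.h instead of a bare hex literal.

```
#include <cstdint>
#include <cstdio>

#include "llama.h"  // provides LLAMA_FILE_MAGIC_GGLA

// Sketch, not the real export-lora.cpp: read the 4-byte magic of a LoRA
// adapter file and compare it against the named constant instead of a
// hard-coded number.
static bool lora_magic_ok(FILE * fp) {
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, fp) != 1) {
        return false;
    }
    return magic == LLAMA_FILE_MAGIC_GGLA;
}
```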

2024-01-12  llama.swiftui : update models layout (#4826)  (Zay)
* Updated Models Layout
  - Added a models drawer
  - Added downloading directly from Hugging Face
  - Load custom models from local folder
  - Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout

2024-01-12  gitignore : imatrix  (Georgi Gerganov)

2024-01-12  CUDA: fix softmax compile for old CUDA versions (#4862)  (Johannes Gäßler)

2024-01-12  llama : fix typo "imp_embd" -> "inp_embd"  (Georgi Gerganov)

2024-01-12  common : streamline the formatting of help (#4890)  (howlger)
* common : streamline the formatting of help
  - Separate alternative parameters by a comma
  - Do not indent `--version` differently
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-12  py : fix lint (#4889)  (Georgi Gerganov)

2024-01-12  llama : fix llm_build_k_shift to use correct n_rot (#4889)  (Georgi Gerganov)
* llama : fix llm_build_k_shift to use correct n_rot
  ggml-ci
* llama : always use hparams.n_rot for ggml_rope_custom
  ggml-ci
* convert : fix persimmon conversion to write correct n_rot

2024-01-12  Importance Matrix calculation (#4861)  (Kawrakow)
* imatrix: 1st version
* imatrix: WIP
* Cleanup
* Update examples/imatrix/imatrix.cpp
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
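
A hedged usage sketch for the new example; the flag names and output file below are assumptions rather than something stated in this commit.

```
# Sketch (assumed flags): compute an importance matrix over calibration text,
# to be consumed later by the quantization tooling.
./imatrix -m models/7B/ggml-model-f16.gguf -f calibration.txt -o imatrix.dat
```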

2024-01-11  server : fix infill when prompt is empty (#4833)  (Georgi Gerganov)

2024-01-11  main : better name for variable n_print (#4874)  (Georgi Gerganov)

2024-01-11  main : disable token count by default (#4874)  (Georgi Gerganov)

2024-01-11  swift : track ggml release branch (#4867)  (Georgi Gerganov)

2024-01-11  llama : restore intended k-quants mixes for MoE models (#4872)  (Kawrakow)
* Restore intended k-quants quantization mixes for MoE models
* Update Q2_K_S values in the quantize tool
  Still using LLaMA-v1 PPL values in the quant description today does not make much sense. But let's leave this update for another PR.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-11  ggml : SOTA 2-bit quants (add IQ2_XS) (#4856)  (Kawrakow)
* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product
  We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when running on the CPU.
* iq2_xs: faster AVX2 dot product
  21.4 t/s for TG-128, 59.2 t/s for PP-512. The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-01-11  swift : pin ggml commit + remove ggml.h from spm-headers (#4878)  (Georgi Gerganov)
ggml-ci

2024-01-11  server : implement credentialed CORS (#4514)  (Laura)
* Implement credentialed CORS according to MDN
* Fix syntax error
* Move validate_api_key up so it is defined before its first usage
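
A minimal sketch of what credentialed CORS per MDN implies for the response headers, assuming the cpp-httplib server the example is built on; this is an assumed handler shape, not the code from this PR. When credentials are allowed, `Access-Control-Allow-Origin` must echo the request's specific `Origin` instead of `*`, together with `Access-Control-Allow-Credentials: true`.

```
#include "httplib.h"  // cpp-httplib, which the server example already uses

// Sketch (assumed shape, not the actual server.cpp): preflight handler for
// credentialed CORS. Browsers reject "*" as the allowed origin whenever
// Access-Control-Allow-Credentials is set, so the Origin header is echoed back.
static void install_cors_preflight(httplib::Server & svr) {
    svr.Options(R"(/.*)", [](const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin",      req.get_header_value("Origin"));
        res.set_header("Access-Control-Allow-Credentials", "true");
        res.set_header("Access-Control-Allow-Methods",     "GET, POST, OPTIONS");
        res.set_header("Access-Control-Allow-Headers",     "content-type, authorization");
    });
}
```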

2024-01-11  server : support for multiple api keys (#4864)  (Michael Coppola)
* server: added support for multiple api keys, added loading api keys from file
* minor: fix whitespace
* added file error handling to --api-key-file, changed code to better reflect current style
* server: update README.md for --api-key-file
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
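
A hedged usage sketch; only the `--api-key-file` flag comes from this commit, while the file layout (one key per line), model path, port, and header shown are assumptions.

```
# Sketch: start the server with a file of accepted API keys, assumed one per line.
./server -m models/7B/ggml-model-q4_0.gguf --port 8080 --api-key-file api_keys.txt

# Clients then present one of the listed keys (bearer header assumed):
curl -H "Authorization: Bearer my-secret-key" \
     -d '{"prompt": "Hello", "n_predict": 8}' http://localhost:8080/completion
```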

2024-01-11  server : add `LOG_INFO` when model is successfully loaded (#4881)  (Behnam M)
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* used LOG_INFO after successful model loading

2024-01-11  ci: nix-flake-update: new token with pr permissions (#4879)  (Someone)
* ci: nix-flake-update: new token with pr permissions
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-11  main : print total token count and tokens consumed so far (#4874)  (pudepiedj)
* Token count changes
* Add show token count
* Updating before PR
* Two requested changes
* Move param def posn

2024-01-11  server : fix typo in model name (#4876)  (Isaac McFadyen)

2024-01-11  metal : put encoder debug group behind a define (#4873)  (Paul Tsochantaris)

2024-01-11  sync : ggml  (Georgi Gerganov)

2024-01-11  metal : fix deprecation warning (ggml/690)  (Georgi Gerganov)

2024-01-11  ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)  (Timothy Cronin)

2024-01-11  metal : wrap each operation in debug group (ggml/690)  (Jack Mousseau)

2024-01-11  ggml : change GGML_MAX_NAME at compile time (ggml/682)  (leejet)
* change GGML_MAX_NAME to 128
* allow controlling the value of GGML_MAX_NAME through external macro definitions
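
"External macro definitions" usually means the header only provides a guarded default, so the limit can be raised from the compiler command line without patching the source; a minimal sketch assuming that pattern:

```
// Sketch of the assumed guard in ggml.h: the default is 128, but building with
// e.g. -DGGML_MAX_NAME=256 overrides it without editing the header.
#ifndef GGML_MAX_NAME
#define GGML_MAX_NAME 128
#endif
```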

2024-01-11  Fix execlp call (ggml/689)  (Halalaluyafail3)
NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined because of an incorrect type being passed to the variable arguments.
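
A minimal illustration of the undefined behavior being fixed (not the ggml call site itself): `execlp`'s variadic argument list must be terminated by a null pointer of type `char *`, which a bare `NULL` that expands to the integer `0` does not guarantee.

```
#include <unistd.h>

// Illustration only: the sentinel terminating execlp()'s argument list must be
// a null pointer, so it is cast explicitly instead of relying on NULL being
// pointer-typed.
int run_editor(const char * path) {
    // Wrong:   execlp("vi", "vi", path, NULL);       // NULL may be a plain int 0
    return execlp("vi", "vi", path, (char *) NULL);   // correct sentinel
}
```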

2024-01-11  fix : cuda order of synchronization when setting a buffer (ggml/679)  (Erik Scholz)
* fix : cuda order of synchronization when setting a buffer
* also sync before memcpy
Co-authored-by: slaren <slarengh@gmail.com>

2024-01-11  server : update readme to document the new `/health` endpoint (#4866)  (Behnam M)
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too

2024-01-11  server : fix build + rename enums (#4870)  (Georgi Gerganov)

2024-01-10  server : add a `/health` endpoint (#4860)  (Behnam M)
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
  When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
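
A hedged usage sketch of the endpoint; the three states come from the commit message, while the JSON bodies and port shown are illustrative assumptions.

```
# Sketch: poll the server while it starts up (responses are illustrative).
curl http://localhost:8080/health
#   while loading:       {"status": "loading model"}
#   if loading failed:   {"status": "error"}
#   once ready:          {"status": "ok"}
```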

2024-01-10  llama : add additional suffixes for model params (#4834)  (Brian)
* llm_load_print_meta: Add additional suffixes for model params
* Update llama.cpp model param log
  Remove unneeded comments and convert from > to >=

2024-01-10  llama : recognize 1B phi models (#4847)  (Austin)
This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.
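
A self-contained sketch of what the categorization amounts to; the enum and helper below are illustrative, not the actual llama.cpp code, and only the 24-layer -> MODEL_1B mapping comes from this commit.

```
// Illustrative only: classify a phi checkpoint by its layer count so that
// 24-layer models report as 1B-class, as described in this commit.
enum class phi_model_size { MODEL_1B, MODEL_LARGER, MODEL_UNKNOWN };

static phi_model_size classify_phi(int n_layer) {
    switch (n_layer) {
        case 24: return phi_model_size::MODEL_1B;   // phi-1 / phi-1.5 sized
        default: return n_layer > 24 ? phi_model_size::MODEL_LARGER
                                     : phi_model_size::MODEL_UNKNOWN;
    }
}
```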

2024-01-10  clip : support more quantization types (#4846)  (John)
Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants. This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get the best quality at low quantization levels. In a few tests nothing noticeably broke, and a Q6_K ViT works almost as well as Q8_0 at 3 times the inference speed.

2024-01-10  Python script to compare commits with llama-bench (#4844)  (Johannes Gäßler)

2024-01-09  convert.py : fix vanilla LLaMA model conversion (#4818)  (Austin)
* Update Imports and Add Notes for Future Reference
  - Updated import statements in `convert.py`.
  - Added import for `AutoTokenizer` from `transformers` module.
  - Added conditional import for `gguf` from the local directory.
  - Added comments and notes for future reference.
  Additional Notes:
  - Noted removal of a redundant `TypeAlias` import.
  - Noted the removal of a `gguf` debug statement.
  - Commented on the presence of `ARCH` and `NDArray` definitions.
  - Commented on cleaning up and refactoring data type definitions.
* Refine Model Hyperparameters and Params Class
  - Updated type annotations to use `Optional` for clarity.
  - Improved method names and attribute consistency.
  - Removed unnecessary variables for better code readability.
  Additional Notes:
  - Highlighted the use of `Optional` for clearer intent.
  - Ensured backward and forward compatibility.
* Restore BpeVocab and SentencePieceVocab classes
  - Restored the BpeVocab class for handling BPE tokenization.
  - Restored the SentencePieceVocab class for SentencePiece tokenization.
  These classes are essential for maintaining the original behavior of the codebase.
* refactor: Standardize vocabulary handling with HfVocab
  - Replaced VocabLoader with HfVocab, aligning vocabulary handling across classes.
  - Updated initialization of HfVocab with local_files_only=True for AutoTokenizer.
  - Introduced optional parameter fname_added_tokens for flexible added token management.
  - Streamlined added token handling for clarity and conciseness.
  - Maintained special tokens and IDs, enhancing token management.
  - Simplified token processing methods for improved readability.
  - Added a placeholder for score computation with a default value of -1000.0.
  - Optimized newline token check for efficiency.
  - Updated __repr__ function for clarity in representation.
  - Adjusted type alias Vocab to include BpeVocab, SentencePieceVocab, and HfVocab.
  - Removed redundant code related to special token handling, reverse vocabulary mapping, and vocabulary file detection.
  This refactoring promotes a standardized and modular approach to vocabulary management, facilitating future integration with a VocabFactory and improving code maintainability and scalability.
* refactor: Enhance readability, functionality, and code quality
  - Improved code formatting and readability for better maintainability.
  - Refactored LazyUnpickler's CLASSES dictionary for clarity.
  - Added print statements and warnings in check_vocab_size for user feedback.
  - Removed find_vocab_file_path, as it's superseded by VocabFactory.
  - Preparatory changes for upcoming classes: OutputFile and VocabFactory.
  - Overall focus on code quality, error handling, and consistency.
  These changes reflect a continuous effort to refine the codebase, ensuring it meets best practices and prepares for future enhancements, such as the VocabFactory.
* refactor: Update OutputFile class for enhanced model vocabulary management
  - Restructured the constructor for improved readability.
  - Updated `add_meta_arch` method for flexible model name determination.
  - Introduced `handle_tokenizer_model` for mapping vocab types to supported tokenizer models.
  - Streamlined vocabulary extraction with `extract_vocabulary_from_model`.
  - Simplified vocabulary metadata addition using `add_meta_vocab`.
  - Refactored `add_tensor_info` for clarity and consistency.
  - Improved error handling for better user feedback.
  These changes signify the development of a versatile and comprehensive `OutputFile` class, enabling efficient management of model conversion output, metadata, vocabulary, and tensor information.
* feat: Introduce VocabFactory for flexible vocabulary management in model conversion
  - The VocabFactory class is added to facilitate modular vocabulary handling.
  - The constructor initializes a directory path and detects vocabulary-related files.
  - The _select_file method provides file paths based on vocabulary type (e.g., BPE, SentencePiece).
  - _create_special_vocab generates special vocabularies, accommodating different types.
  - The load_vocab method loads vocabularies, handling BPE, SentencePiece, and Hugging Face Fast Tokenizer.
  - Error handling and logging enhance debugging and user feedback.
  - The modular and flexible design simplifies vocabulary management and supports future extensions.
  The VocabFactory class enhances code modularity and maintainability, allowing versatile vocabulary handling in the model conversion process.
* refactor: Improve code organization, argument parsing, and user interface
  - Renamed 'default_outfile' to 'default_output_file' for clarity.
  - Refactored argument parser setup into 'get_argument_parser' function.
  - Introduced descriptive comments for each argument in the parser.
  - Added '--vocab-type' argument with choices ["spm", "bpe", "hfft"] for vocabulary processing.
  - Improved flag naming consistency: '--outfile' to '--out-file' and '--bigendian' to '--big-endian'.
  - Enhanced error handling to prevent overwriting input data in 'default_output_file'.
  - Made 'argv' in 'main' an optional parameter for flexibility.
  - Introduced dynamic import for 'awq.apply_awq' based on 'args.awq_path' for conditional dependency.
  These changes enhance code clarity, organization, and the user interface of the script, aligning it with Python best practices and improving maintainability.
* refactor: Further refine functionality, improve user interaction, and streamline vocabulary handling
  - Renamed command-line arguments for clarity and consistency.
  - Improved path resolution and import adjustments for robustness.
  - Thoughtfully handled 'awq-path' and conditional logic for the weighted model.
  - Enhanced model and vocabulary loading with the 'VocabFactory' class for structured and adaptable loading.
  - Strengthened error handling and user feedback for a more user-friendly experience.
  - Structured output file handling with clear conditions and defaults.
  - Streamlined and organized the 'main' function for better logic flow.
  - Passed 'sys.argv[1:]' to 'main' for adaptability and testability.
  These changes solidify the script's functionality, making it more robust, user-friendly, and adaptable. The use of the 'VocabFactory' class is a notable enhancement in efficient vocabulary handling, reflecting a thoughtful and iterative approach to script development.
* chore: Apply ruff formatting to convert.py
  Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
* Revert to commit 0614c33
* chore: Apply flake8 formatting rules
  Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
* refactor: Revise `check_vocab_size` for Enhanced Clarity and Correctness
  - Resolved an unreachable branch issue by reorganizing the conditional structure.
  - Moved the special case check for `params.n_vocab == -1` to the top for immediate assertion.
  - Flattened the conditional logic for improved clarity and predictability of the function's behavior.
  These changes enhance the readability and functional correctness of the `check_vocab_size` function without altering its intended functionality.
* py : fix outfile and outtype
* py : suggest hint for missing vocab size
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-01-09  llava-cli : don't crash if --image flag is invalid (#4835)  (Justine Tunney)
This change fixes an issue where supplying `--image missing-file` would result in a segfault due to a null pointer being dereferenced. This can result in distracting info being printed if robust crash analysis tools are being used.

2024-01-09  metal : improve dequantize precision to match CPU (#4836)  (Georgi Gerganov)
ggml-ci

2024-01-09  scripts : improve get-pg.sh (#4838)  (Georgi Gerganov)

2024-01-09  readme : add 3rd party collama reference to UI list (#4840)  (iohub)
Add a VSCode extension for llama.cpp reference to UI list