Age | Commit message | Author
2024-01-10clip : support more quantization types (#4846)John
Uses ggml functions instead of hardcoded names and adds support for quantizing into the modern Q-K variants. This is just the bare minimum to get k-types working; a more refined choice of types would be needed to get the best quality at low quantization levels. I ran a few tests and it doesn't break anything I could notice, and a Q6_K ViT works almost as well as Q8_0 but at 3 times the inference speed.
2024-01-10Python script to compare commits with llama-bench (#4844)Johannes Gäßler
2024-01-09convert.py : fix vanilla LLaMA model conversion (#4818)Austin
* Update Imports and Add Notes for Future Reference

  - Updated import statements in `convert.py`.
  - Added import for `AutoTokenizer` from the `transformers` module.
  - Added conditional import for `gguf` from the local directory.
  - Added comments and notes for future reference.

  Additional Notes:

  - Noted removal of a redundant `TypeAlias` import.
  - Noted the removal of a `gguf` debug statement.
  - Commented on the presence of `ARCH` and `NDArray` definitions.
  - Commented on cleaning up and refactoring data type definitions.

* Refine Model Hyperparameters and Params Class

  - Updated type annotations to use `Optional` for clarity.
  - Improved method names and attribute consistency.
  - Removed unnecessary variables for better code readability.

  Additional Notes:

  - Highlighted the use of `Optional` for clearer intent.
  - Ensured backward and forward compatibility.

* Restore BpeVocab and SentencePieceVocab classes

  - Restored the BpeVocab class for handling BPE tokenization.
  - Restored the SentencePieceVocab class for SentencePiece tokenization.

  These classes are essential for maintaining the original behavior of the codebase.

* refactor: Standardize vocabulary handling with HfVocab

  - Replaced VocabLoader with HfVocab, aligning vocabulary handling across classes.
  - Updated initialization of HfVocab with local_files_only=True for AutoTokenizer.
  - Introduced optional parameter fname_added_tokens for flexible added token management.
  - Streamlined added token handling for clarity and conciseness.
  - Maintained special tokens and IDs, enhancing token management.
  - Simplified token processing methods for improved readability.
  - Added a placeholder for score computation with a default value of -1000.0.
  - Optimized newline token check for efficiency.
  - Updated __repr__ function for clarity in representation.
  - Adjusted type alias Vocab to include BpeVocab, SentencePieceVocab, and HfVocab.
  - Removed redundant code related to special token handling, reverse vocabulary mapping, and vocabulary file detection.

  This refactoring promotes a standardized and modular approach to vocabulary management, facilitating future integration with a VocabFactory and improving code maintainability and scalability.

* refactor: Enhance readability, functionality, and code quality

  - Improved code formatting and readability for better maintainability.
  - Refactored LazyUnpickler's CLASSES dictionary for clarity.
  - Added print statements and warnings in check_vocab_size for user feedback.
  - Removed find_vocab_file_path, as it's superseded by VocabFactory.
  - Preparatory changes for upcoming classes: OutputFile and VocabFactory.
  - Overall focus on code quality, error handling, and consistency.

  These changes reflect a continuous effort to refine the codebase, ensuring it meets best practices and prepares for future enhancements, such as the VocabFactory.

* refactor: Update OutputFile class for enhanced model vocabulary management

  - Restructured the constructor for improved readability.
  - Updated `add_meta_arch` method for flexible model name determination.
  - Introduced `handle_tokenizer_model` for mapping vocab types to supported tokenizer models.
  - Streamlined vocabulary extraction with `extract_vocabulary_from_model`.
  - Simplified vocabulary metadata addition using `add_meta_vocab`.
  - Refactored `add_tensor_info` for clarity and consistency.
  - Improved error handling for better user feedback.

  These changes signify the development of a versatile and comprehensive `OutputFile` class, enabling efficient management of model conversion output, metadata, vocabulary, and tensor information.

* feat: Introduce VocabFactory for flexible vocabulary management in model conversion

  - The VocabFactory class is added to facilitate modular vocabulary handling.
  - The constructor initializes a directory path and detects vocabulary-related files.
  - The _select_file method provides file paths based on vocabulary type (e.g., BPE, SentencePiece).
  - _create_special_vocab generates special vocabularies, accommodating different types.
  - The load_vocab method loads vocabularies, handling BPE, SentencePiece, and Hugging Face Fast Tokenizer.
  - Error handling and logging enhance debugging and user feedback.
  - The modular and flexible design simplifies vocabulary management and supports future extensions.

  The VocabFactory class enhances code modularity and maintainability, allowing versatile vocabulary handling in the model conversion process.

* refactor: Improve code organization, argument parsing, and user interface

  - Renamed 'default_outfile' to 'default_output_file' for clarity.
  - Refactored argument parser setup into 'get_argument_parser' function.
  - Introduced descriptive comments for each argument in the parser.
  - Added '--vocab-type' argument with choices ["spm", "bpe", "hfft"] for vocabulary processing.
  - Improved flag naming consistency: '--outfile' to '--out-file' and '--bigendian' to '--big-endian'.
  - Enhanced error handling to prevent overwriting input data in 'default_output_file'.
  - Made 'argv' in 'main' an optional parameter for flexibility.
  - Introduced dynamic import for 'awq.apply_awq' based on 'args.awq_path' for conditional dependency.

  These changes enhance code clarity, organization, and the user interface of the script, aligning it with Python best practices and improving maintainability.

* refactor: Further refine functionality, improve user interaction, and streamline vocabulary handling

  - Renamed command-line arguments for clarity and consistency.
  - Improved path resolution and import adjustments for robustness.
  - Thoughtfully handled 'awq-path' and conditional logic for the weighted model.
  - Enhanced model and vocabulary loading with the 'VocabFactory' class for structured and adaptable loading.
  - Strengthened error handling and user feedback for a more user-friendly experience.
  - Structured output file handling with clear conditions and defaults.
  - Streamlined and organized the 'main' function for better logic flow.
  - Passed 'sys.argv[1:]' to 'main' for adaptability and testability.

  These changes solidify the script's functionality, making it more robust, user-friendly, and adaptable. The use of the 'VocabFactory' class is a notable enhancement in efficient vocabulary handling, reflecting a thoughtful and iterative approach to script development.

* chore: Apply ruff formatting to convert.py
* Revert to commit 0614c33
* chore: Apply flake8 formatting rules
* refactor: Revise `check_vocab_size` for Enhanced Clarity and Correctness

  - Resolved an unreachable branch issue by reorganizing the conditional structure.
  - Moved the special case check for `params.n_vocab == -1` to the top for immediate assertion.
  - Flattened the conditional logic for improved clarity and predictability of the function's behavior.

  These changes enhance the readability and functional correctness of the `check_vocab_size` function without altering its intended functionality.

* py : fix outfile and outtype
* py : suggest hint for missing vocab size

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09llava-cli : don't crash if --image flag is invalid (#4835)Justine Tunney
This change fixes an issue where supplying `--image missing-file` would cause a segfault due to a null pointer dereference. Such crashes can also produce distracting output when robust crash-analysis tools are in use.
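The fix boils down to checking the load result before use. A minimal standalone sketch of the pattern (the function name and I/O details are illustrative, not the actual llava-cli code):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative sketch of the fix, not the llava-cli implementation: return
// NULL on a missing file and let the caller report the bad path, instead of
// handing a NULL pointer to code that will dereference it.
static unsigned char * load_image_bytes(const char * path, long * n_bytes) {
    FILE * f = fopen(path, "rb");
    if (f == NULL) {
        return NULL; // caller must check this before using the buffer
    }
    fseek(f, 0, SEEK_END);
    *n_bytes = ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char * buf = malloc(*n_bytes > 0 ? (size_t) *n_bytes : 1);
    if (buf != NULL) {
        fread(buf, 1, (size_t) *n_bytes, f);
    }
    fclose(f);
    return buf;
}
```

A caller can then print a clear error to stderr and exit with a nonzero status rather than crashing.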
2024-01-09metal : improve dequantize precision to match CPU (#4836)Georgi Gerganov
ggml-ci
2024-01-09scripts : improve get-pg.sh (#4838)Georgi Gerganov
2024-01-09readme : add 3rd party collama reference to UI list (#4840)iohub
Add a VSCode extension for llama.cpp reference to UI list
2024-01-09scripts : script to get Paul Graham essays in txt format (#4838)Georgi Gerganov
2024-01-09server : update readme about token probs (#4777)Behnam M
* updated server readme to reflect the gg/server-token-probs-4088 commit

  Added explanation for the API's completion result, which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.

* simplified the `completion_probabilities` JSON schema

  It's now easier to understand what the structure of `completion_probabilities` looks like.

* minor : fix trailing whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09server : add api-key flag to documentation (#4832)Zsapi
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-09ggml : fix vld1q_s8_x4 32-bit compat (#4828)Georgi Gerganov
* ggml : fix vld1q_s8_x4 32-bit compat
* ggml : fix 32-bit ARM compat (cont)

ggml-ci
2024-01-09CUDA: faster softmax via shared memory + fp16 math (#4742)Johannes Gäßler
2024-01-08common : fix the short form of `--grp-attn-w`, not `-gat` (#4825)howlger
See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57
2024-01-08readme : add link to SOTA modelsGeorgi Gerganov
2024-01-08SOTA 2-bit quants (#4773)Kawrakow
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products

  Needed to change Q8_K to have quants in the -127...127 range, else the IQ2_XXS AVX implementation becomes very awkward. The alternative would have been to use Q8_0 instead. Perhaps I'll change later, for now this is what we have.

* iq2_xxs: ARM_NEON dot product

  Somehow strangely slow (112 ms/token).

* iq2_xxs: WIP Metal

  Dequantize works, something is still wrong with the dot product.

* iq2_xxs: Metal dot product now works

  We have PP-512 = 475 t/s, TG-128 = 47.3 t/s. Not the greatest performance, but not complete garbage either.

* iq2_xxs: slightly faster dot product

  TG-128 is now 48.4 t/s.

* iq2_xxs: slightly faster dot product

  TG-128 is now 50.9 t/s.

* iq2_xxs: even faster Metal dot product

  TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table into shared memory has a bigger impact than the grid values being in shared memory.

* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)

  We get TG-128 = 153.1 t/s.

* iq2_xxs: slightly faster CUDA dot product

  TG-128 is now at 155.1 t/s.

* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS

  I had put the ggml_supports_mmq call at the wrong place.

* Fix bug in quantize_row_iq2_xxs

  The 0.25f factor was missing. Great detective work by @ggerganov!

* Fixing tests
* PR suggestion

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-01-08swift : exclude ggml-metal.metal from the package (#4822)Georgi Gerganov
2024-01-08llama.swiftui : update readmeGeorgi Gerganov
2024-01-08main : add self-extend support (#4815)Georgi Gerganov
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* llama : "self-extend"-like context extension
* passkey : add comment
* main : add Self-Extend support
* llama : add comment about llama_kv_cache_seq_div
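The context extension hinges on dividing cached KV positions by a group factor, which is what `llama_kv_cache_seq_div` does for a position range. A simplified standalone sketch of that position arithmetic (not the llama.cpp implementation; the helper name is hypothetical):

```c
#include <assert.h>

// Simplified sketch of the position arithmetic behind "self-extend": every
// cached position in [p0, p1) is divided by the group size g, folding a long
// context into a smaller range of effective positions. This mirrors the idea
// of llama_kv_cache_seq_div but is not the llama.cpp code.
static void seq_div_positions(int * pos, int n, int p0, int p1, int g) {
    for (int i = 0; i < n; i++) {
        if (pos[i] >= p0 && pos[i] < p1) {
            pos[i] = p0 + (pos[i] - p0) / g;
        }
    }
}
```

With g = 2, positions 0..5 collapse to 0,0,1,1,2,2, so n_past effectively shrinks while neighboring tokens share a position.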
2024-01-08examples : add passkey test (#3856)Georgi Gerganov
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* make : add passkey target
* passkey : add "self-extend"-like context extension (#4810)
* llama : "self-extend"-like context extension
* passkey : add comment
* passkey : add readme
2024-01-07readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)Lars Grammel
2024-01-07llama-bench : add no-kv-offload parameter (#4812)slaren
2024-01-07CUDA: fixed redundant value dequantization (#4809)Johannes Gäßler
2024-01-07llama : remove unused vars (#4796)Georgi Gerganov
2024-01-07llama : remove redundant GQA check (#4796)Georgi Gerganov
2024-01-07llama.swiftui : use llama.cpp as SPM package (#4804)Alex Azarov
2024-01-07llama : print tensor meta for debuggingGeorgi Gerganov
2024-01-07llama.swiftui : add visionOS target (#4805)Alex Azarov
2024-01-07ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (#4787)Konstantin Zhuravlyov
2024-01-07server : fix n_predict check (#4798)Georgi Gerganov
2024-01-06llama.swiftui : use correct pointer for llama_token_eos (#4797)Daniel Illescas Romero
2024-01-06examples : improve base-translate.sh script (#4783)Georgi Gerganov
2024-01-05cmake : check for openblas64 (#4134)a-n-n-a-l-e-e
openblas v0.3.22 64-bit pkg-config file is named openblas64.pc https://github.com/OpenMathLib/OpenBLAS/issues/3790
2024-01-05flake.nix : fix typo (#4700)Ikko Eltociear Ashimine
betwen -> between
2024-01-05metal : switch back to default.metallib (ggml/681)Georgi Gerganov
ggml-ci
2024-01-05ggml : fix q2_k bpw in comments (ggml/680)Georgi Gerganov
2024-01-05ggml : add error handling to graph_compute (whisper/1714)Finn Voorhees
2024-01-05ggml : do not sched_yield when calling BLAS (#4761)Georgi Gerganov
* ggml : do not sched_yield when calling BLAS
* ggml : fix do_yield logic
* ggml : simplify do_yield logic

ggml-ci
2024-01-05examples : add few-shot translation example (#4783)Georgi Gerganov
2024-01-04finetune : remove unused includes (#4756)Daniel Bevenius
This commit removes unused includes from finetune.cpp.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-04server : send token probs for "stream == false" (#4714)Georgi Gerganov
2024-01-04Print backend name on test-backend-ops failure (#4751)Johannes Gäßler
2024-01-04llama.swiftui : support loading custom model from file picker (#4767)singularity
* swiftui: support load model from file picker
* swiftui: remove trailing whitespace
2024-01-04server : fix options in README.md (#4765)Michael Coppola
* fix examples/server/README.md
* minor : fix whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-04ggml : include stdlib.h before intrin.h (#4736)Georgi Gerganov
2024-01-04llama.swiftui : fix build of ggml.metallib (#4754)singularity
* metal: fix metal backend init failure in swiftui
* metal: build ggml.metallib instead of copy src
* llama.swift : remove debug flags from metallib build

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-03train : fix typo in overlapping-samples help msg (#4758)Daniel Bevenius
This commit fixes a typo in the help message for the --overlapping-samples option.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-01-03swift : update Package.swift to use ggml as dependency (#4691)Ashraful Islam
* updates the package.swift to use ggml as dependency
* changes the ggml package url src to ggerganov
2024-01-03cuda : simplify expressionGeorgi Gerganov
Co-authored-by: slaren <slarengh@gmail.com>
2024-01-03cuda : mark I16 and I32 ops as unsupportedGeorgi Gerganov
ggml-ci
2024-01-03sync : ggmlGeorgi Gerganov
ggml-ci