Age  Commit message  Author
2024-01-11  metal : put encoder debug group behind a define (#4873)  [Paul Tsochantaris]
2024-01-11  sync : ggml  [Georgi Gerganov]
2024-01-11  metal : fix deprecation warning (ggml/690)  [Georgi Gerganov]
2024-01-11  ggml : remove ggml_cpy_inplace and ggml_cont_inplace (ggml/693)  [Timothy Cronin]
2024-01-11  metal : wrap each operation in debug group (ggml/690)  [Jack Mousseau]
2024-01-11  ggml : change GGML_MAX_NAME at compile time (ggml/682)  [leejet]
* change GGML_MAX_NAME to 128
* allow controlling the value of GGML_MAX_NAME through external macro definitions
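
The pattern here is a standard overridable default; a minimal sketch of what the ggml.h side looks like (the default of 128 comes from the commit, the surrounding header context is assumed):

```cpp
// keep a built-in default, but let the build override it,
// e.g. with -DGGML_MAX_NAME=256 on the compiler command line
#ifndef GGML_MAX_NAME
#define GGML_MAX_NAME 128
#endif
```
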
2024-01-11  Fix execlp call (ggml/689)  [Halalaluyafail3]
NULL can be an integer constant expression with the value zero; in that case the behavior would be undefined because a value of the wrong type is passed as a variadic argument.
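
For context, the bug class being fixed: on platforms where `NULL` expands to plain `0`, the exec sentinel is passed as an `int` where a `char *` is required. A minimal sketch of the defined-behavior spelling (the actual call site in ggml may differ):

```cpp
#include <unistd.h>

int main() {
    // NULL can expand to the integer constant 0; variadic arguments get no
    // implicit conversion, so spell the sentinel as a pointer explicitly
    execlp("echo", "echo", "hello", (char *) NULL);
    return 1; // only reached if the exec failed
}
```
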
2024-01-11  fix : cuda order of synchronization when setting a buffer (ggml/679)  [Erik Scholz]
* fix : cuda order of synchronization when setting a buffer
* also sync before memcpy

Co-authored-by: slaren <slarengh@gmail.com>
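
A minimal sketch of the ordering being described, assuming a plain cudaMemcpy-based setter (the real code operates on ggml backend buffers):

```cpp
#include <cuda_runtime.h>

// wait for in-flight device work that may still touch the buffer,
// and only then overwrite it from the host
void set_device_buffer(void * dev_ptr, const void * host_data, size_t size) {
    cudaDeviceSynchronize();                                      // sync first ...
    cudaMemcpy(dev_ptr, host_data, size, cudaMemcpyHostToDevice); // ... then copy
}
```
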
2024-01-11  server : update readme to document the new `/health` endpoint (#4866)  [Behnam M]
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
  When the model is being loaded, the server state is `LOADING_MODEL`. If
  model-loading fails, the server state becomes `ERROR`, otherwise it becomes
  `READY`. The `/health` endpoint provides more granular messages now
  according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
2024-01-11  server : fix build + rename enums (#4870)  [Georgi Gerganov]
2024-01-10  server : add a `/health` endpoint (#4860)  [Behnam M]
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
  When the model is being loaded, the server state is `LOADING_MODEL`. If
  model-loading fails, the server state becomes `ERROR`, otherwise it becomes
  `READY`. The `/health` endpoint provides more granular messages now
  according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
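
The three states come straight from the commit message; here is a sketch of how such a handler can map them to responses (the HTTP codes and JSON bodies are assumptions, not a quote of server.cpp):

```cpp
#include <atomic>
#include <string>
#include <utility>

enum class ServerState { LOADING_MODEL, READY, ERROR };

static std::atomic<ServerState> server_state{ServerState::LOADING_MODEL};

// returns {HTTP status, JSON body} for GET /health
static std::pair<int, std::string> handle_health() {
    switch (server_state.load()) {
        case ServerState::LOADING_MODEL: return {503, R"({"status": "loading model"})"};
        case ServerState::ERROR:         return {500, R"({"status": "error"})"};
        default:                         return {200, R"({"status": "ok"})"};
    }
}
```
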
2024-01-10  llama : add additional suffixes for model params (#4834)  [Brian]
* llm_load_print_meta: Add additional suffixes for model params
* Update llama.cpp model param log: remove unneeded comments and convert from > to >=
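
A plausible reading of the `>` to `>=` change, with a hypothetical helper (llama.cpp's actual formatting code may differ): `>=` makes counts that land exactly on a threshold pick the larger suffix.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// hypothetical helper illustrating the suffix selection; with >=, exactly
// 1e9 params formats as "1.00 B" instead of falling through to "M"
static std::string format_model_params(uint64_t n) {
    char buf[32];
    if      (n >= 1000000000000ULL) snprintf(buf, sizeof(buf), "%.2f T", n / 1.0e12);
    else if (n >= 1000000000ULL)    snprintf(buf, sizeof(buf), "%.2f B", n / 1.0e9);
    else if (n >= 1000000ULL)       snprintf(buf, sizeof(buf), "%.2f M", n / 1.0e6);
    else                            snprintf(buf, sizeof(buf), "%.2f K", n / 1.0e3);
    return buf;
}
```
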
2024-01-10  llama : recognize 1B phi models (#4847)  [Austin]
This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.
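
A hedged sketch of the heuristic (enum names follow llama.cpp conventions; the mapping for the pre-existing Phi-2 case is an assumption):

```cpp
enum e_model { MODEL_UNKNOWN, MODEL_1B, MODEL_3B };

static e_model phi_model_type(int n_layer) {
    switch (n_layer) {
        case 24: return MODEL_1B;  // the new case: 1B phi variants
        case 32: return MODEL_3B;  // existing phi-2 path, unchanged
        default: return MODEL_UNKNOWN;
    }
}
```
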
2024-01-10  clip : support more quantization types (#4846)  [John]
Uses ggml functions instead of hardcoded names and adds support for quantizing into the modern Q-K variants. This is just the bare minimum to get k-types working; a more refined choice of types would be needed to get the best quality at low quantization levels. I ran a few tests and it doesn't break anything I could notice, and a Q6_K ViT works almost as well as Q8_0 but at 3 times the inference speed.
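
"Uses ggml functions instead of hardcoded names" presumably means deriving type information from the type enum; a small sketch (the clip.cpp call sites are assumptions):

```cpp
#include "ggml.h"
#include <cstdio>

// query ggml for the properties of a quantization type instead of keeping a
// parallel, hand-maintained table of names
static void print_quant_info(enum ggml_type type) {
    printf("quantizing to %s (block size %d)\n",
           ggml_type_name(type), (int) ggml_blck_size(type));
}
```
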
2024-01-10  Python script to compare commits with llama-bench (#4844)  [Johannes Gäßler]
2024-01-09  convert.py : fix vanilla LLaMA model conversion (#4818)  [Austin]
* Update Imports and Add Notes for Future Reference
  - Updated import statements in `convert.py`.
  - Added import for `AutoTokenizer` from `transformers` module.
  - Added conditional import for `gguf` from the local directory.
  - Added comments and notes for future reference.

  Additional Notes:
  - Noted removal of a redundant `TypeAlias` import.
  - Noted the removal of a `gguf` debug statement.
  - Commented on the presence of `ARCH` and `NDArray` definitions.
  - Commented on cleaning up and refactoring data type definitions.

* Refine Model Hyperparameters and Params Class
  - Updated type annotations to use `Optional` for clarity.
  - Improved method names and attribute consistency.
  - Removed unnecessary variables for better code readability.

  Additional Notes:
  - Highlighted the use of `Optional` for clearer intent.
  - Ensured backward and forward compatibility.

* Restore BpeVocab and SentencePieceVocab classes
  - Restored the BpeVocab class for handling BPE tokenization.
  - Restored the SentencePieceVocab class for SentencePiece tokenization.

  These classes are essential for maintaining the original behavior of the codebase.

* refactor: Standardize vocabulary handling with HfVocab
  - Replaced VocabLoader with HfVocab, aligning vocabulary handling across classes.
  - Updated initialization of HfVocab with local_files_only=True for AutoTokenizer.
  - Introduced optional parameter fname_added_tokens for flexible added token management.
  - Streamlined added token handling for clarity and conciseness.
  - Maintained special tokens and IDs, enhancing token management.
  - Simplified token processing methods for improved readability.
  - Added a placeholder for score computation with a default value of -1000.0.
  - Optimized newline token check for efficiency.
  - Updated __repr__ function for clarity in representation.
  - Adjusted type alias Vocab to include BpeVocab, SentencePieceVocab, and HfVocab.
  - Removed redundant code related to special token handling, reverse vocabulary mapping, and vocabulary file detection.

  This refactoring promotes a standardized and modular approach to vocabulary management, facilitating future integration with a VocabFactory and improving code maintainability and scalability.

* refactor: Enhance readability, functionality, and code quality
  - Improved code formatting and readability for better maintainability.
  - Refactored LazyUnpickler's CLASSES dictionary for clarity.
  - Added print statements and warnings in check_vocab_size for user feedback.
  - Removed find_vocab_file_path, as it's superseded by VocabFactory.
  - Preparatory changes for upcoming classes: OutputFile and VocabFactory.
  - Overall focus on code quality, error handling, and consistency.

  These changes reflect a continuous effort to refine the codebase, ensuring it meets best practices and prepares for future enhancements, such as the VocabFactory.

* refactor: Update OutputFile class for enhanced model vocabulary management
  - Restructured the constructor for improved readability.
  - Updated `add_meta_arch` method for flexible model name determination.
  - Introduced `handle_tokenizer_model` for mapping vocab types to supported tokenizer models.
  - Streamlined vocabulary extraction with `extract_vocabulary_from_model`.
  - Simplified vocabulary metadata addition using `add_meta_vocab`.
  - Refactored `add_tensor_info` for clarity and consistency.
  - Improved error handling for better user feedback.

  These changes signify the development of a versatile and comprehensive `OutputFile` class, enabling efficient management of model conversion output, metadata, vocabulary, and tensor information.

* feat: Introduce VocabFactory for flexible vocabulary management in model conversion
  - The VocabFactory class is added to facilitate modular vocabulary handling.
  - The constructor initializes a directory path and detects vocabulary-related files.
  - The _select_file method provides file paths based on vocabulary type (e.g., BPE, SentencePiece).
  - _create_special_vocab generates special vocabularies, accommodating different types.
  - The load_vocab method loads vocabularies, handling BPE, SentencePiece, and Hugging Face Fast Tokenizer.
  - Error handling and logging enhance debugging and user feedback.
  - The modular and flexible design simplifies vocabulary management and supports future extensions.

  The VocabFactory class enhances code modularity and maintainability, allowing versatile vocabulary handling in the model conversion process.

* refactor: Improve code organization, argument parsing, and user interface
  - Renamed 'default_outfile' to 'default_output_file' for clarity.
  - Refactored argument parser setup into 'get_argument_parser' function.
  - Introduced descriptive comments for each argument in the parser.
  - Added '--vocab-type' argument with choices ["spm", "bpe", "hfft"] for vocabulary processing.
  - Improved flag naming consistency: '--outfile' to '--out-file' and '--bigendian' to '--big-endian'.
  - Enhanced error handling to prevent overwriting input data in 'default_output_file'.
  - Made 'argv' in 'main' an optional parameter for flexibility.
  - Introduced dynamic import for 'awq.apply_awq' based on 'args.awq_path' for conditional dependency.

  These changes enhance code clarity, organization, and the user interface of the script, aligning it with Python best practices and improving maintainability.

* refactor: Further refine functionality, improve user interaction, and streamline vocabulary handling
  - Renamed command-line arguments for clarity and consistency.
  - Improved path resolution and import adjustments for robustness.
  - Thoughtfully handled 'awq-path' and conditional logic for the weighted model.
  - Enhanced model and vocabulary loading with the 'VocabFactory' class for structured and adaptable loading.
  - Strengthened error handling and user feedback for a more user-friendly experience.
  - Structured output file handling with clear conditions and defaults.
  - Streamlined and organized the 'main' function for better logic flow.
  - Passed 'sys.argv[1:]' to 'main' for adaptability and testability.

  These changes solidify the script's functionality, making it more robust, user-friendly, and adaptable. The use of the 'VocabFactory' class is a notable enhancement in efficient vocabulary handling, reflecting a thoughtful and iterative approach to script development.

* chore: Apply ruff formatting to convert.py
  Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* Revert to commit 0614c33

* chore: Apply flake8 formatting rules
  Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>

* refactor: Revise `check_vocab_size` for Enhanced Clarity and Correctness
  - Resolved an unreachable branch issue by reorganizing the conditional structure.
  - Moved the special case check for `params.n_vocab == -1` to the top for immediate assertion.
  - Flattened the conditional logic for improved clarity and predictability of the function's behavior.

  These changes enhance the readability and functional correctness of the `check_vocab_size` function without altering its intended functionality.

* py : fix outfile and outtype
* py : suggest hint for missing vocab size

Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09  llava-cli : don't crash if --image flag is invalid (#4835)  [Justine Tunney]
This change fixes an issue where supplying `--image missing-file` would result in a segfault due to a null pointer being dereferenced. Such a crash can also print distracting information when robust crash-analysis tools are in use.
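
The shape of the fix is an ordinary null check before use; a sketch with a hypothetical loader (the real code goes through the clip/llava helpers):

```cpp
#include <cstdio>
#include <cstdlib>

struct clip_image;  // opaque stand-in type

// hypothetical loader, stubbed here; returns nullptr when the file is missing
static clip_image * load_image(const char * /*fname*/) { return nullptr; }

static void process_image(const char * fname) {
    clip_image * img = load_image(fname);
    if (img == nullptr) {
        fprintf(stderr, "error: failed to load image '%s'\n", fname);
        exit(1);  // fail cleanly instead of dereferencing a null pointer
    }
    // ... run the multimodal pipeline on img ...
}
```
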
2024-01-09  metal : improve dequantize precision to match CPU (#4836)  [Georgi Gerganov]
ggml-ci
2024-01-09  scripts : improve get-pg.sh (#4838)  [Georgi Gerganov]
2024-01-09  readme : add 3rd party collama reference to UI list (#4840)  [iohub]
Add a reference to collama, a VSCode extension for llama.cpp, to the UI list.
2024-01-09  scripts : script to get Paul Graham essays in txt format (#4838)  [Georgi Gerganov]
2024-01-09  server : update readme about token probs (#4777)  [Behnam M]
* updated server readme to reflect the gg/server-token-probs-4088 commit
  Added an explanation of the API's completion result, which now includes
  `completion_probabilities`, and a JSON schema that shows the type/structure
  of `completion_probabilities`.
* simplified the `completion_probabilities` JSON schema
  It's now easier to understand what the structure of
  `completion_probabilities` looks like.
* minor : fix trailing whitespace

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-01-09  server : add api-key flag to documentation (#4832)  [Zsapi]
Document the api-key flag added to server in https://github.com/ggerganov/llama.cpp/pull/4441
2024-01-09  ggml : fix vld1q_s8_x4 32-bit compat (#4828)  [Georgi Gerganov]
* ggml : fix vld1q_s8_x4 32-bit compat

  ggml-ci

* ggml : fix 32-bit ARM compat (cont)

  ggml-ci
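
The usual shape of this kind of compat fix is a hand-rolled fallback for toolchains whose `<arm_neon.h>` lacks the `x4` intrinsic; a sketch (the guard condition and helper name are assumptions):

```cpp
#include <arm_neon.h>

// synthesize the 4-register load from four plain vld1q_s8 loads; on 32-bit
// ARM toolchains vld1q_s8_x4 is often unavailable
static inline int8x16x4_t ggml_vld1q_s8_x4(const int8_t * ptr) {
    int8x16x4_t r;
    r.val[0] = vld1q_s8(ptr +  0);
    r.val[1] = vld1q_s8(ptr + 16);
    r.val[2] = vld1q_s8(ptr + 32);
    r.val[3] = vld1q_s8(ptr + 48);
    return r;
}
```
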
2024-01-09  CUDA: faster softmax via shared memory + fp16 math (#4742)  [Johannes Gäßler]
2024-01-08  common : fix the short form of `--grp-attn-w`, not `-gat` (#4825)  [howlger]
See https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp#L230C53-L230C57
2024-01-08  readme : add link to SOTA models  [Georgi Gerganov]
2024-01-08  SOTA 2-bit quants (#4773)  [Kawrakow]
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
  Needed to change Q8_K to have quants in the -127...127 range, else the
  IQ2_XXS AVX implementation becomes very awkward. The alternative would have
  been to use Q8_0 instead. Perhaps I'll change later; for now this is what we
  have.
* iq2_xxs: ARM_NEON dot product
  Somehow strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
  Dequantize works, something is still wrong with the dot product.
* iq2_xxs: Metal dot product now works
  We have PP-512 = 475 t/s, TG-128 = 47.3 t/s. Not the greatest performance,
  but not complete garbage either.
* iq2_xxs: slightly faster dot product
  TG-128 is now 48.4 t/s
* iq2_xxs: slightly faster dot product
  TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
  TG-128 is now 54.1 t/s. Strangely enough, putting the signs lookup table
  into shared memory has a bigger impact than the grid values being in shared
  memory.
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
  We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
  TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
  I had put the ggml_supports_mmq call at the wrong place.
* Fix bug in quantize_row_iq2_xxs
  The 0.25f factor was missing. Great detective work by @ggerganov!
* Fixing tests
* PR suggestion

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
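
For reference, the storage layout that makes this a "2-bit" quant, as I understand the format (field names are assumptions; the arithmetic is the point):

```cpp
#include <cstdint>

typedef uint16_t ggml_fp16_t;  // fp16 stored as raw bits, as in ggml

struct block_iq2_xxs {
    ggml_fp16_t d;             //  2 bytes: per-super-block scale
    uint16_t    qs[256 / 8];   // 64 bytes: packed grid indices, scales, signs
};
// 66 bytes per 256 weights -> 66 * 8 / 256 = 2.0625 bits per weight
```
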
2024-01-08  swift : exclude ggml-metal.metal from the package (#4822)  [Georgi Gerganov]
2024-01-08  llama.swiftui : update readme  [Georgi Gerganov]
2024-01-08  main : add self-extend support (#4815)  [Georgi Gerganov]
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* llama : "self-extend"-like context extension
* passkey : add comment
* main : add Self-Extend support
* llama : add comment about llama_kv_cache_seq_div
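
A sketch of the self-extend step as I read the commit (adapted; the KV-cache calls use the API names of this revision, and the exact offsets may differ): whenever `n_past` reaches the end of the current group-attention window of width `ga_w`, `ga_n` consecutive positions are merged into one by shifting and dividing sequence positions in the KV cache.

```cpp
#include "llama.h"

// ga_n: group factor, ga_w: window width (multiple of ga_n), ga_i: window start
static void self_extend_step(llama_context * ctx, int & n_past, int & ga_i,
                             const int ga_n, const int ga_w) {
    while (n_past >= ga_i + ga_w) {
        const int ib = (ga_n * ga_i) / ga_w;
        const int bd = (ga_w / ga_n) * (ga_n - 1);
        const int dd = (ga_w / ga_n) - ib * bd - ga_w;

        llama_kv_cache_seq_shift(ctx, 0, ga_i,                  n_past,                ib * bd);
        llama_kv_cache_seq_div  (ctx, 0, ga_i + ib * bd,        ga_i + ib * bd + ga_w, ga_n);
        llama_kv_cache_seq_shift(ctx, 0, ga_i + ib * bd + ga_w, n_past + ib * bd,      dd);

        n_past -= bd;           // positions were compressed by bd
        ga_i   += ga_w / ga_n;  // advance the window
    }
}
```
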
2024-01-08  examples : add passkey test (#3856)  [Georgi Gerganov]
* examples : add passkey test
* passkey : better prints
* passkey : select pass key pos from CLI
* passkey : simplify n_past logic
* make : add passkey target
* passkey : add "self-extend"-like context extension (#4810)
* llama : "self-extend"-like context extension
* passkey : add comment
* passkey : add readme
2024-01-07  readme : add lgrammel/modelfusion JS/TS client for llama.cpp (#4814)  [Lars Grammel]
2024-01-07  llama-bench : add no-kv-offload parameter (#4812)  [slaren]
2024-01-07  CUDA: fixed redundant value dequantization (#4809)  [Johannes Gäßler]
2024-01-07  llama : remove unused vars (#4796)  [Georgi Gerganov]
2024-01-07  llama : remove redundant GQA check (#4796)  [Georgi Gerganov]
2024-01-07  llama.swiftui : use llama.cpp as SPM package (#4804)  [Alex Azarov]
2024-01-07  llama : print tensor meta for debugging  [Georgi Gerganov]
2024-01-07  llama.swiftui : add visionOS target (#4805)  [Alex Azarov]
2024-01-07  ggml : use __builtin_amdgcn_sudot4 in __dp4a for gfx11 (#4787)  [Konstantin Zhuravlyov]
2024-01-07  server : fix n_predict check (#4798)  [Georgi Gerganov]
2024-01-06  llama.swiftui : use correct pointer for llama_token_eos (#4797)  [Daniel Illescas Romero]
2024-01-06  examples : improve base-translate.sh script (#4783)  [Georgi Gerganov]
2024-01-05  cmake : check for openblas64 (#4134)  [a-n-n-a-l-e-e]
The OpenBLAS v0.3.22 64-bit pkg-config file is named openblas64.pc; see https://github.com/OpenMathLib/OpenBLAS/issues/3790
2024-01-05  flake.nix : fix typo (#4700)  [Ikko Eltociear Ashimine]
betwen -> between
2024-01-05  metal : switch back to default.metallib (ggml/681)  [Georgi Gerganov]
ggml-ci
2024-01-05  ggml : fix q2_k bpw in comments (ggml/680)  [Georgi Gerganov]
2024-01-05  ggml : add error handling to graph_compute (whisper/1714)  [Finn Voorhees]
2024-01-05  ggml : do not sched_yield when calling BLAS (#4761)  [Georgi Gerganov]
* ggml : do not sched_yield when calling BLAS

  ggml-ci

* ggml : fix do_yield logic

  ggml-ci

* ggml : simplify do_yield logic

  ggml-ci
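
A hedged sketch of the do_yield idea as the commit titles describe it (the predicate is hypothetical; the real logic lives in ggml's graph-compute worker loop): threads spinning on the node counter skip the `sched_yield()` while the current node is being executed by BLAS, so they resume promptly when the next node is ready.

```cpp
#include <sched.h>

// hypothetical predicate: true if the node currently being computed is
// handled by the BLAS path (one thread does the work, the others wait)
static void wait_iteration(bool current_node_uses_blas) {
    const bool do_yield = !current_node_uses_blas;
    if (do_yield) {
        sched_yield();  // give up the time slice only on the non-BLAS path
    }
}
```
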