2024-05-16  rpc : get available mem for the CPU backend  (Radoslav Gerganov)

    This can be overridden with the -m command line option.
    ref: #7293

2024-05-16  rpc : add command line arg for specifying backend memory  (Radoslav Gerganov)

    ref: #7293

2024-05-16  convert : get general.name from model dir, not its parent (#5615)  (Jared Van Bortel)

    Co-authored-by: Brian <mofosyne@gmail.com>

2024-05-16  grammar, json, llama : replace push_back with emplace_back where possible (#7273)  (Herman Semenov)

2024-05-16  doc : add references to the Hugging Face GGUF-my-repo quantisation web tool (#7288)  (Vaibhav Srivastav)

    * chore: add references to the quantisation space
    * fix grammar
    * Update README.md (Co-authored-by: Julien Chaumond)
    * Update README.md (Co-authored-by: Georgi Gerganov)

    Co-authored-by: Julien Chaumond <julien@huggingface.co>
    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-05-16  ci : fix bin/Release path for windows-arm64 builds (#7317)  (Max Krasnyansky)

    Switch to the Ninja Multi-Config CMake generator to resurrect the
    bin/Release path that broke artifact packaging in CI.

2024-05-16  Add support for properly optimized Windows ARM64 builds with LLVM and MSVC (#7191)  (Max Krasnyansky)

    * logging: add proper checks for clang to avoid errors and warnings with VA_ARGS
    * build: add CMake Presets and toolchain files for Windows ARM64
    * matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings
    * ci: add support for optimized Windows ARM64 builds with MSVC and LLVM
    * matmul-int8: fix typos in q8_0_q8_0 matmuls
    * matmul-int8: remove unnecessary casts in q8_0_q8_0

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-05-15  readme : remove stray double quote (#7310)  (Daniel Bevenius)

    Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>

2024-05-15  ggml : use dynamic thread scheduling for matrix multiplication (#6915)  (kunnis)

    * reorder some structs and add calls to mm_pause
    * pass the scheduling state around; rename and move variables
    * extract the chunking logic into its own function and move variable
      definitions into it
    * add the current_chunk counter and the looping structure based on the
      chunk configuration; the yield shouldn't be necessary
    * add re-chunking and make it much more likely to re-chunk
    * disable resizing if NUMA is enabled
    * update comments with what we've learned; formatting and style fixes

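The idea behind this commit — threads claiming the next unprocessed chunk from a shared counter instead of being handed a fixed static slice, so fast threads keep working while slow ones finish — can be sketched as follows. This is an illustrative Python model under assumed names (`parallel_matmul_rows`, `work`), not the actual C code in ggml.c:

```python
import threading

def parallel_matmul_rows(work, n_chunks, n_threads):
    # Shared "next chunk" counter: this is the dynamic-scheduling part.
    # Each thread atomically grabs the next chunk index until none remain.
    counter = {"next": 0}
    lock = threading.Lock()
    results = [None] * n_chunks

    def worker():
        while True:
            with lock:
                chunk = counter["next"]
                counter["next"] += 1
            if chunk >= n_chunks:
                return
            # In ggml this would compute one chunk of output rows of the
            # matrix product; here `work` stands in for that computation.
            results[chunk] = work(chunk)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a static split, one slow core stalls the whole multiplication; with the shared counter, the remaining chunks flow to whichever threads are free.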
2024-05-15  Avoid unnecessarily disabling CUDA graphs (#7302)  (agray3)

    As discussed in PR #6766, CUDA graphs were being disabled in the presence
    of long prompts. Fix this by preventing the consecutive-update counter
    from incrementing unnecessarily for tokens where CUDA graphs are disabled
    due to batch size > 1.

2024-05-15  ggml : tag ggml_tensor::backend as deprecated (#7290)  (slaren)

2024-05-15  Add missing " (#7303)  (AidanBeltonS)

2024-05-15  embedding : free the batch after execution (#7297)  (dm4)

2024-05-15  sync : ggml  (Georgi Gerganov)

2024-05-15  ggml : add `ggml_upscale_ext` (ggml/814)  (John Balis)

    * CPU implementation of upscale-to-shape, with tests
    * patch the ggml_upscale CUDA op to handle non-contiguous tensors and add
      a test for the non-contiguous behavior
    * ggml : metal impl + cleanup + sycl dev warnings
    * metal : fix upscale op to support nb00 + style

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

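Upscale-to-shape as described here — a destination tensor of arbitrary size where each destination element maps back to a source element — can be sketched in a few lines of NumPy. This is a simplified 2D nearest-neighbor model under an assumed name (`upscale_to_shape`), not the ggml operator itself:

```python
import numpy as np

def upscale_to_shape(src, dst_shape):
    # Nearest-neighbor upscale of a 2D array to an arbitrary target shape:
    # each destination index maps to a source index by integer scaling.
    dst_h, dst_w = dst_shape
    src_h, src_w = src.shape
    rows = (np.arange(dst_h) * src_h) // dst_h
    cols = (np.arange(dst_w) * src_w) // dst_w
    # np.ix_ builds the cross product of row and column indices,
    # gathering one source element per destination element.
    return src[np.ix_(rows, cols)]
```

Because the mapping is index-based rather than stride-based, the same scheme extends naturally to non-contiguous sources, which is what the CUDA fix in this commit addresses.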
2024-05-16  server bench : fix bench not waiting for model load (#7284)  (Johannes Gäßler)

2024-05-14  script : sync ggml-rpc  (Georgi Gerganov)

2024-05-14  metal : support FA without mask + add asserts (#7278)  (Georgi Gerganov)

    * ggml : FA without mask + add asserts
    * metal : support non-contiguous KV

2024-05-14  sync : ggml  (Georgi Gerganov)

2024-05-14  metal : tune soft_max number of threads (whisper/0)  (Georgi Gerganov)

2024-05-14  ggml : try fix ppc64 (whisper/0)  (Georgi Gerganov)

2024-05-14  ggml : expose SSE3 and SSSE3 for MSVC when AVX is available (whisper/2128)  (Przemysław Pawełczyk)

2024-05-14  ggml : optimize for ppc64le using VSX intrinsics (ggml/784)  (Hong Bo PENG)

    * optimize for ppc64le using VSX intrinsics
    * code cleanup: remove comments about overflow concern
    * fix typo in the suffix of scaling for QK_K <> 256

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

2024-05-14  server : free sampling contexts on exit (#7264)  (Steve Grubb)

    This cleans up the last leak found by the address sanitizer.

2024-05-14  Revert "move ndk code to a new library (#6951)" (#7282)  (Brian)

    This reverts commit efc8f767c8c8c749a245dd96ad4e2f37c164b54c.

2024-05-14  ggml : add RPC backend (#6829)  (Radoslav Gerganov)

    The RPC backend proxies all operations to a remote server which runs a
    regular backend (CPU, CUDA, Metal, etc.).

    * set TCP_NODELAY
    * add CI workflows
    * implement llama_max_devices() for RPC
    * wrap sockfd into a struct
    * implement get_alignment, get_max_size and get_device_memory
    * win32 support and fixes; add README

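The transport underneath a proxying backend like this is a stream of length-prefixed request/response messages over TCP, with Nagle's algorithm disabled (the TCP_NODELAY item above) because many small round trips suffer badly from send coalescing. A minimal Python sketch of that framing — illustrative only, with assumed names and pickle standing in for the backend's real wire format:

```python
import pickle
import socket
import struct

def send_msg(sock, obj):
    # Length-prefixed framing: 4-byte big-endian size, then the payload,
    # so the receiver knows exactly how many bytes make up one message.
    data = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_exact(sock, n):
    # TCP is a byte stream: a single recv() may return fewer bytes than
    # asked for, so loop until the full message has arrived.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def recv_msg(sock):
    (n,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, n))

def connect(host, port):
    sock = socket.create_connection((host, port))
    # Disable Nagle's algorithm: flush each small request immediately
    # instead of waiting to coalesce it with later writes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

A client built on this would serialize each graph operation, send it with `send_msg`, and block on `recv_msg` for the result, while the server loops doing the reverse against a local backend.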
2024-05-14  llama : disable pipeline parallelism with nkvo (#7265)  (slaren)

2024-05-14  move ndk code to a new library (#6951)  (Elton Kola)

2024-05-14  Add left recursion check: quit early instead of going into an infinite loop (#7083)  (Haggai Nuchi)

    * rename the left recursion check and move it to the "grammar internal"
      section; handle the edge case where a leftmost nonterminal may be empty
    * remove an unnecessary declaration

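A grammar rule is left-recursive when a nonterminal can derive itself again in the leftmost position without consuming input, which sends a recursive-descent matcher into an infinite loop. The detection is a cycle check over leftmost symbols, with the edge case named above: a leftmost nonterminal that can derive the empty string lets recursion continue through the symbol after it. A simplified Python sketch (assumed representation, not the C++ code from the PR — nonterminals are dict keys, alternatives are symbol lists, anything not a key is a terminal):

```python
def has_left_recursion(grammar):
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on the stack / done
    color = {nt: WHITE for nt in grammar}

    def nullable(nt, seen=frozenset()):
        # Can this nonterminal derive the empty string?
        if nt in seen:
            return False
        return any(
            all(s in grammar and nullable(s, seen | {nt}) for s in alt)
            for alt in grammar[nt]
        )

    def visit(nt):
        color[nt] = GRAY
        for alt in grammar[nt]:
            for sym in alt:
                if sym not in grammar:    # terminal stops the leftmost descent
                    break
                if color[sym] == GRAY:    # back to a nonterminal on the stack:
                    return True           # left recursion
                if color[sym] == WHITE and visit(sym):
                    return True
                if not nullable(sym):     # sym must consume input, so the next
                    break                 # symbol is no longer leftmost
        color[nt] = BLACK
        return False

    return any(color[nt] == WHITE and visit(nt) for nt in grammar)
```

Running this once at grammar-load time lets the parser reject a pathological grammar up front instead of hanging at match time.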
2024-05-14  docs : fix typo and update description for --embeddings flag (#7026)  (Ryuei)

    - Change '--embedding' to '--embeddings' in the README
    - Update the description to match the latest --help output
    - Add a caution about defining physical batch size

2024-05-13  convert-hf : support direct Q8_0 conversion (#7234)  (compilade)

    * convert-hf : support q8_0 conversion
    * convert-hf : add missing ftype, including for Baichuan and Xverse;
      it was messing with the checksums otherwise

2024-05-13  llama : less KV padding when FA is off (#7257)  (Georgi Gerganov)

2024-05-14  llava-cli : fix base64 prompt (#7248)  (k.h.lai)

2024-05-13  perplexity : add BF16 vs. FP16 results (#7150)  (Johannes Gäßler)

2024-05-13  [SYCL] rm wait() (#7233)  (Neo Zhang)

2024-05-13  llama : rename jina tokenizers to v2 (#7249)  (Joan Fontanals)

    * refactor: rename jina tokenizers to v2
    * refactor: keep refactoring non-breaking

2024-05-13  convert.py : outfile default name change and additional metadata support (#4858)  (Brian)

    * convert.py: outfile default name change and additional metadata support
    * convert.py: don't stringify Metadata load method output
    * convert.py: typo fix
    * convert.py: fix metadata format to sync with LLM_KV_NAMES in llama.cpp

2024-05-13  change default temperature of OAI compat API from 0 to 1 (#7226)  (Benjamin Findley)

    * change the default temperature of the OAI-compatible API from 0 to 1
    * make tests explicitly send temperature to the OAI API

2024-05-13  [SYCL] Add oneAPI runtime DLL files to win release package (#7241)  (Neo Zhang)

    Co-authored-by: Zhang <jianyu.zhang@intel.com>

2024-05-13  [SYCL] update CI with oneAPI 2024.1 (#7235)  (Neo Zhang)

    Co-authored-by: Zhang <jianyu.zhang@intel.com>

2024-05-12  CUDA: add FP32 FlashAttention vector kernel (#7188)  (Johannes Gäßler)

2024-05-12  cmake : fix version cmp (#7227)  (Georgi Gerganov)

2024-05-12  remove convert-lora-to-ggml.py (#7204)  (slaren)

2024-05-11  metal : fix warnings (skipme) (#0)  (Georgi Gerganov)

2024-05-11  sync : ggml  (Georgi Gerganov)

2024-05-11  metal : fix indent (ggml/0)  (Georgi Gerganov)

2024-05-11  ggml : resolve merge (ggml/0)  (Georgi Gerganov)

2024-05-12  Scripting & documenting debugging one test without anything else in the loop (#7096)  (Josh Ramer)

    * add documentation sharing quick tips for working in the repository
    * add a script that shows a menu of tests to pick from and runs the
      debugger on the chosen one
    * debug-test.sh: refactor the CLI help message and update the documentation

    Authored-by: Josh Ramer <ubuntu@ip-172-31-32-53.ec2.internal>
    Assisted-by: brian khuu <mofosyne@gmail.com>

2024-05-11  fix system prompt handling (#7153)  (Xuan Son Nguyen)

2024-05-11  convert-hf : support bfloat16 conversion (#7158)  (compilade)

    * convert-hf : support bfloat16 conversion; gguf-py flake8 fixes
    * convert-hf : get bit-exact same output as ./quantize
      (the quantization version was missing)
    * convert-hf : don't round bf16 NaNs
    * convert-hf : save some memory with np.int16 intermediate bf16 weights
    * convert-hf : more closely match llama.cpp in which weights to keep in f32
    * convert-hf : add --outtype auto (initially named auto-f16), for model
      quantizers who want an initial GGUF with the most fidelity to the
      original model while still using a 16-bit float type instead of
      32-bit floats
    * convert-hf : support outtype templating in the outfile name
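The core of an fp32-to-bf16 conversion is keeping the top 16 bits of each float, rounding the discarded mantissa bits to nearest-even, and leaving NaNs untouched so rounding cannot carry a NaN payload into the infinity bit pattern — the concern behind the "don't round bf16 NaNs" item above. A vectorized NumPy sketch under an assumed name (`fp32_to_bf16`), illustrating the technique rather than reproducing the convert-hf code:

```python
import numpy as np

def fp32_to_bf16(x):
    # bf16 is the top 16 bits of an IEEE-754 fp32 value.
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    # NaN: exponent all ones and a nonzero mantissa.
    nan = (u & 0x7FFFFFFF) > 0x7F800000
    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit,
    # so a tie rounds toward an even bf16 mantissa.
    rounded = u + (0x7FFF + ((u >> 16) & 1))
    # Truncate NaNs instead of rounding them, to preserve NaN-ness.
    return (np.where(nan, u, rounded) >> 16).astype(np.uint16)
```

The uint16 results are the raw bf16 bit patterns, ready to be written to a tensor buffer; using a 16-bit integer intermediate rather than widening to fp32 copies is also where the "save some memory" item above comes from.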