Age | Commit message (Collapse) | Author |
|
|
|
ggml-ci
|
|
|
|
ggml-ci
|
|
|
|
|
|
* add magika inference example
* ggml : fix unaligned accesses in custom ops
* ggml : fix FP32 GELU for values that exceed the FP16 range
* use ggml_pool_1d
* add README
* Update README.md
* pad inputs if the files are too small
* cleanup
ggml-ci
|
|
* Introduce backend GUIDs
Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)
Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID
* Remove redundant functions
Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion
* Add spaces to match style
Co-authored-by: slaren <slarengh@gmail.com>
* Fix brace style to match
Co-authored-by: slaren <slarengh@gmail.com>
* Add void to () in function signature
Co-authored-by: slaren <slarengh@gmail.com>
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
* add guids to all backends
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* server: twice ctrl+C to exit
* std::atomic_flag
* sigint: message
* sigint: stderr
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|
|
This reverts a single line from #5475
|
|
* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|
|
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|
|
* Add "/chat/completions" as alias for "/v1/chat/completions"
* merge to upstream master
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* WIP: make i-quants work for QK_K = 64
* iq2_xs: attempt to fix AVX dot product for QK_K = 64
Tests pass, but I get gibberish.
* QK_K = 64 tests pass on ARM_NEON and Metal
Sadly, that does not mean it actually works.
* Make CUDA compile with QK_K = 64
Tests don't pass, plus we get misaligned access
* Q2_K: fixed bug in imatrix quantization for QK_K = 64
* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation
As usual, Metal / Apple Silicon don't like my quants.
* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v
PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.
* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
|
|
|
|
* llama : fix defrag bugs + enable by default
ggml-ci
* llama : add defrag_thold parameter
ggml-ci
* llama : cont
* llama : disable log message
ggml-ci
* llama : fix graph size check during defrag
|
|
* make: use arch variable for cublas
* fix UNAME_M
* check opt first
---------
Co-authored-by: lindeer <le.chang118@gmail.com>
|
|
|
|
range (#5721)
* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
* Add LLMFarm (ui for iOS) to list
|
|
* Add support for bias
* Update pre-processor
* rm commented code
* fix format
* fix CI
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
|
|
|
|
|
|
|
|
|
|
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
→ 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
|
|
* server: tests - longer inference timeout for CI
|
|
* server: docs - refresh and tease a little bit more the http server
* Rephrase README.md server doc
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/server/README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* llama : refactor k-shift implementation
ggml-ci
* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add
* llama : cont k-shift refactoring + normalize type names
ggml-ci
* minor : fix MPI builds
* llama : reuse n_rot from the build context
ggml-ci
* llama : revert enum name changes from this PR
ggml-ci
* llama : update llama_rope_type
* llama : add comment about rope values
* llama : fix build
* passkey : apply kv cache updates explicitly
ggml-ci
* llama : change name to llama_kv_cache_update()
* llama : add llama_kv_cache_seq_pos_max()
* passkey : fix llama_kv_cache_seq_pos_max() usage
* llama : some llama_kv_cell simplifications
* llama : add llama_kv_cache_compress (EXPERIMENTAL)
* llama : add alternative KV cache merging (EXPERIMENTAL)
* llama : add llama_kv_cache_defrag
* llama : comments
* llama : remove llama_kv_cache_compress
will add in a separate PR
ggml-ci
* llama : defragment via non-overlapping moves
* llama : ggml_graph based defrag implementation
ggml-ci
* llama : switch the loop order in build_defrag
* llama : add comments
|
|
The system prompt is now decoded in batches.
* server : fix off-by-one n_past when start of prompt matches whole cache
The tokens right after the matching part would otherwise skip a pos value.
|
|
* [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility
vqtbl1q_u8 is not part of arm v7 neon library
* [android-example] Remove abi filter after arm v7a fix
* [github-workflows] Do not skip Android armeabi-v7a build
|
|
fix nvcc version is empty
|
|
|
|
* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id
* server : skip GH copilot requests from logging
* server : change message format of server_log()
* server : no need to repeat log in comment
* server : log style consistency
* server : fix compile warning
* server : fix tests regex patterns on M2 Ultra
* server: logs: PR feedback on log level
* server: logs: allow to choose log format in json or plain text
* server: tests: output server logs in text
* server: logs switch init logs to server logs macro
* server: logs ensure value json value does not raised error
* server: logs reduce level VERBOSE to VERB to max 4 chars
* server: logs lower case as other log messages
* server: logs avoid static in general
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified
* server: metrics - move to a dedicated struct
|
|
|
|
* coda : normalize enum names
ggml-ci
* code : cont
* code : cont
|
|
* Fix issues during StableLM models conversion
* Fix hard coded layer_norm_eps
* Support layer_norm_eps for LlavaStableLM
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Add missing parenthesis
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Support rotary_factor for LlavaStableLM
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* fix typo
* Add StableLMEpochForCausalLM for safety
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
* Add StableLMEpochForCausalLM for safety 2
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: compilade <113953597+compilade@users.noreply.github.com>
|
|
* server: #5655 - continue to update other slots on embedding concurrent request.
* server: tests: add multi users embeddings as fixed
* server: tests: adding OAI compatible embedding concurrent endpoint
* server: tests: adding OAI compatible embedding with multiple inputs
|
|
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scaler dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs
After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iiq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal
Performance is very similar to Q3_K_S
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64
Pleasent surprise: the coding was super-block size independent,
so all it took was to delete some QK_K == 256 guards.
* Will this fix ROCm?
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* server: tests: init scenarios
- health and slots endpoints
- completion endpoint
- OAI compatible chat completion requests w/ and without streaming
- completion multi users scenario
- multi users scenario on OAI compatible endpoint with streaming
- multi users with total number of tokens to predict exceeds the KV Cache size
- server wrong usage scenario, like in Infinite loop of "context shift" #3969
- slots shifting
- continuous batching
- embeddings endpoint
- multi users embedding endpoint: Segmentation fault #5655
- OpenAI-compatible embeddings API
- tokenize endpoint
- CORS and api key scenario
* server: CI GitHub workflow
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
|
|
|
|
|
|
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type
|
|
* py : add gemma conversion from HF models
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update convert-hf-to-gguf.py
Co-authored-by: Aarni Koskela <akx@iki.fi>
* Update convert-hf-to-gguf.py
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
---------
Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
|