Age | Commit message | Author |
|
operators. (ggml/747)
* cuda: fix group_norm
* cuda: add batch inference support for ggml_pad/ggml_upscale
* add ggml_arange
* add ggml_timestep_embedding
* update ggml_arange/ggml_timestep_embedding tests
* cuda: fix im2col
* add ggml_arange/ggml_timestep_embedding support for metal backend
* fix some bugs
* fix some bugs
* Update ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-metal.m
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml-metal.metal
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* modify according to the review comments
* ggml : fix compile warnings + code style
* ggml : normalize compute_forward calls + fix seg fault in debug
* minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
|
* Support special tokens as reverse/anti prompt.
* Tokenize antiprompts only once.
* main : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
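
Where the first two bullets above say antiprompts may be special tokens and are tokenized only once, the idea is: tokenize each antiprompt up front with special-token parsing enabled, remember the ones that map to a single token id, and compare that id against the most recently sampled token. A minimal sketch under assumed names (antiprompt_state and tokenize_fn are illustrative, not the actual main.cpp code):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    using llama_token = int32_t;

    // Tokenizer passed in as a callable to keep the sketch self-contained; in
    // llama.cpp this role is played by llama_tokenize() with special-token
    // parsing enabled (the bool argument here).
    using tokenize_fn = std::function<std::vector<llama_token>(const std::string &, bool)>;

    struct antiprompt_state {
        // ids of antiprompts that tokenize to exactly one (special) token
        std::vector<llama_token> single_token_ids;

        // done once at startup instead of re-tokenizing per generated token
        void init(const std::vector<std::string> & antiprompts, const tokenize_fn & tokenize) {
            for (const auto & s : antiprompts) {
                const auto ids = tokenize(s, /*parse_special=*/true);
                if (ids.size() == 1) {
                    single_token_ids.push_back(ids[0]);
                }
            }
        }

        // cheap per-token check against the most recently sampled token
        bool is_antiprompt(llama_token last) const {
            for (const llama_token id : single_token_ids) {
                if (id == last) {
                    return true;
                }
            }
            return false;
        }
    };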
|
* allow for user specified pooling type
* llama : use enum types over int
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
Co-authored-by: Black_Fox <radekliska@gmail.com>
|
The row size of the saved states was based on kv_self.head while
it should be based on llama_kv_cache_cell_max.
Existing session files should still work.
* llama : fix llama_kv_cache_cell_max inability to return 1
I've also changed its return type to uint32_t,
because this function is always used to set the value of uint32_t variables,
and because the index already has this type.
* llama : fix state size calculation
Some bytes in the state were unaccounted for in llama_get_state_size.
Since the logits reserve so much space, it did not cause problems.
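
Reading the bullets above, llama_kv_cache_cell_max evidently returns the count of KV cells up to and including the last one still in use, and the saved-state row size should come from that rather than from kv_self.head. A rough sketch of such a scan under assumed data structures (a cells vector whose unused entries have pos == -1); the real llama.cpp layout differs:

    #include <cstdint>
    #include <vector>

    struct kv_cell {
        int32_t pos = -1;   // -1 marks an unused cell
    };

    struct kv_cache {
        std::vector<kv_cell> cells;
        uint32_t head = 0;  // next insertion point - not the number of used cells
    };

    // One past the index of the last used cell, usable directly as a row count.
    // Returning uint32_t matches the variables this value is assigned to.
    static uint32_t kv_cache_cell_max(const kv_cache & kv) {
        for (uint32_t i = (uint32_t) kv.cells.size(); i > 0; --i) {
            if (kv.cells[i - 1].pos >= 0) {
                return i;   // correctly returns 1 when only cell 0 is in use
            }
        }
        return 0;
    }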
|
Flake lock file updates:
• Updated input 'flake-parts':
'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
→ 'github:hercules-ci/flake-parts/f7b3c975cf067e56e7cda6cb098ebe3fb4d74ca2' (2024-03-01)
• Updated input 'flake-parts/nixpkgs-lib':
'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
→ 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8?dir=lib' (2024-02-29)
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
→ 'github:NixOS/nixpkgs/1536926ef5621b09bba54035ae2bb6d806d72ac8' (2024-02-29)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
|
* using abort_callback from ggml to stop llama computation
* format fix
* a brief explaining comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
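
ggml's abort callback is a bool (*)(void * data) that is polled while a graph is being computed and stops the computation when it returns true; this commit threads such a callback through llama so generation can be interrupted. A minimal sketch of a callback watching an atomic flag; the eval_params wiring below is hypothetical, only the callback shape follows ggml's convention:

    #include <atomic>

    // return true to stop computation - the shape of ggml's abort callback
    typedef bool (*abort_callback_t)(void * data);

    static std::atomic<bool> g_should_stop{false};

    static bool should_abort(void * /*data*/) {
        // polled periodically while the graph is being evaluated
        return g_should_stop.load(std::memory_order_relaxed);
    }

    // Hypothetical wiring: the callback and its user-data pointer travel with
    // the context/eval parameters so the backend can consult them during eval.
    struct eval_params {
        abort_callback_t abort_callback      = nullptr;
        void *           abort_callback_data = nullptr;
    };

    static eval_params make_params() {
        eval_params p;
        p.abort_callback      = should_abort;
        p.abort_callback_data = nullptr;
        return p;
    }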
|
ggml-ci
|
exist (#5821)
|
* iq3_s: somewhat faster AVX2 dot product
On a Ryzen 7950X, TG-128 increases to 16 t/s from 15.5 t/s using
16 threads. For 8 threads it is 13.85 t/s vs 11.75 t/s.
PP-512 increases to 28.5 t/s from 23.8 t/s.
* iq3_s: somewhat faster ARM_NEON dot product
Still dog slow - 10.7 t/s up from 9.9 t/s.
* iq3_s: another small ARM_NEON improvement
10.7 -> 11.0 t/s. Using vmulq_s8 is faster than the xor-sub trick
that works best on AVX2.
* iq3_s: minor improvement on Metal
49.4 t/s -> 50.3 t/s
* iq3_s: PPL improvement
E.g., for a context of 4096 LLaMA-v2-7B goes to 5.1340 from 5.1653.
* iq3_s: use new grid everywhere
* Fix ARM_NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
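
The ARM_NEON bullet above contrasts two ways of applying a per-lane sign to int8 values: the xor-sub trick that suits AVX2 (which has no cheap byte multiply) and a direct vmulq_s8 by +-1, which measured faster on NEON. A small illustration of the two equivalent idioms, assuming the signs arrive as a 0x00/0xFF mask; this is not the actual iq3_s kernel:

    // guarded so the snippet is a no-op on non-ARM targets
    #if defined(__ARM_NEON)
    #include <arm_neon.h>

    // mask lanes are 0x00 (keep the value) or 0xFF (negate it)

    // AVX2-style idiom expressed in NEON: (x ^ m) - m == x when m == 0x00,
    // and ~x + 1 == -x when m == 0xFF.
    static inline int8x16_t apply_signs_xor_sub(int8x16_t x, int8x16_t mask) {
        return vsubq_s8(veorq_s8(x, mask), mask);
    }

    // NEON-friendly idiom from the commit: turn the mask into +-1 and multiply.
    static inline int8x16_t apply_signs_mul(int8x16_t x, int8x16_t mask) {
        const int8x16_t signs = vorrq_s8(mask, vdupq_n_s8(1)); // 0x00 -> +1, 0xFF -> -1
        return vmulq_s8(x, signs);
    }
    #endif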
|
* llama : fix segfault from unknown model arch name
* llama : make all LLM maps const
This also requires using `std::map::at` instead of its `operator[]`
which does not exist for const maps.
* llama : name LLM_ARCH_UNKNOWN to "(unknown)"
This avoids errors from `std::map::at` when
getting the general name of the model architecture.
Using "(unknown)" instead of an empty string as per suggestion
https://github.com/ggerganov/llama.cpp/pull/5820#issuecomment-1973735284
* llama : remove redundant inner const for LLM_TENSOR_NAMES
The extra const won't do anything here as const maps
return const references to values.
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* llama : remove redundant nullptr check in llm_arch_from_string
Since LLM_ARCH_NAMES is a const map, no spurious elements
with a NULL name are inserted anymore, so this check is dead code.
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
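
The reasoning here is standard library behavior: on a non-const std::map, operator[] silently inserts a value-initialized element for a missing key (which is how an unknown architecture name could yield an empty entry), and it is not available on a const map at all, whereas std::map::at works on const maps and throws std::out_of_range instead. A self-contained illustration:

    #include <iostream>
    #include <map>
    #include <stdexcept>
    #include <string>

    int main() {
        std::map<std::string, const char *> names = {
            { "llama", "llama" },
        };

        // operator[] on a non-const map inserts a default value (nullptr here)
        // for a missing key instead of failing - the root of the segfault.
        const char * bad = names["unknown-arch"];
        std::cout << "operator[] inserted: " << (bad ? bad : "(null)") << "\n";

        const auto & cnames = names;  // a const map has no operator[]
        try {
            std::cout << cnames.at("another-unknown-arch") << "\n";
        } catch (const std::out_of_range &) {
            std::cout << "at() throws instead of inserting\n";
        }
        return 0;
    }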
|
* support multiple cards: split-mode - layer|row
* rm warning
* rebase with master, support two new OPs, close feature for -sm=row, fix for unit test
* update news
* fix merge error
* update according to review comments
|
Reduces peak tmpfs usage and should prevent the check from failing due to
running out of space.
Fixes the 'No space left on device' issue mentioned in #5703.
|
* build(nix): Introduce flake.formatter for `nix fmt`
* chore: Switch to pkgs.nixfmt-rfc-style
|
* Add support for starcoder2
* handle rope type
* skip rope freq and rotary embeddings from being serialized
* resolve comments
* Update llama.cpp
* remove redundant changes
* handle `rope-theta`
* llama : change starcoder2 rope type
* address comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
* cleanup unused --no-mul-mat-q, -nommq, -mmq, --mul-mat-q, mul_mat_q
* remove: mul_mat_q in compare llama bench and usage
* update llama-bench
---------
Co-authored-by: slaren <slarengh@gmail.com>
|
* switch to multimap based nfd_map due to compile time issues
* simplify multimap keys
* don't construct a new locale every time
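
For context on what the multimap holds: canonical (NFD) decomposition maps a precomposed codepoint to one or more codepoints, and accent stripping keeps only the first (base) one. A rough sketch with a two-entry illustrative table, not the real decomposition data or the actual llama.cpp structure:

    #include <cstdint>
    #include <map>
    #include <vector>

    // codepoint -> decomposed codepoints, stored as flat pairs in a multimap
    // (the representation this commit switched to because of compile-time issues)
    static const std::multimap<uint32_t, uint32_t> nfd_map = {
        { 0x00E9, 0x0065 }, { 0x00E9, 0x0301 },   // é -> e + combining acute
        { 0x00FC, 0x0075 }, { 0x00FC, 0x0308 },   // ü -> u + combining diaeresis
    };

    static std::vector<uint32_t> strip_accents(const std::vector<uint32_t> & cps) {
        std::vector<uint32_t> out;
        out.reserve(cps.size());
        for (const uint32_t cp : cps) {
            const auto range = nfd_map.equal_range(cp);
            if (range.first != range.second) {
                out.push_back(range.first->second);  // keep only the base codepoint
            } else {
                out.push_back(cp);                   // no decomposition: keep as-is
            }
        }
        return out;
    }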
|
* Use batched mul_mat pathway
* rm extra line
* Explicitly state scaled data type
---------
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
|
* server: normalize naming
* fix spacing
|
ggml-ci
|
ggml-ci
|
* add magika inference example
* ggml : fix unaligned accesses in custom ops
* ggml : fix FP32 GELU for values that exceed the FP16 range
* use ggml_pool_1d
* add README
* Update README.md
* pad inputs if the files are too small
* cleanup
ggml-ci
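
On the "fix FP32 GELU for values that exceed the FP16 range" bullet: ggml's fast path looks GELU up in a table keyed by the FP16 representation of the input, and an FP32 value outside the FP16 range cannot be represented that way, so it has to be computed directly. A sketch of the guard only; the constant, names and the stand-in for the table path are illustrative, not ggml's code:

    #include <cmath>

    // reference GELU (tanh approximation) in plain FP32
    static float gelu_f32(float x) {
        const float c = 0.79788456080286535588f;  // sqrt(2/pi)
        return 0.5f * x * (1.0f + tanhf(c * (x + 0.044715f * x * x * x)));
    }

    static float gelu_guarded(float x) {
        const float FP16_MAX = 65504.0f;  // largest finite half-precision value
        if (std::fabs(x) > FP16_MAX) {
            return gelu_f32(x);  // out of FP16 range: compute directly
        }
        // in range: a real implementation would index a table by the FP16 bits
        // of x; the direct computation stands in for that lookup here.
        return gelu_f32(x);
    }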
|
* Introduce backend GUIDs
Initial proposed implementation of backend GUIDs
(Discussed in https://github.com/ggerganov/ggml/pull/741)
Hardcoded CPU backend GUID (for now)
Change ggml_backend_is_cpu logic to use GUID
* Remove redundant functions
Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion
* Add spaces to match style
Co-authored-by: slaren <slarengh@gmail.com>
* Fix brace style to match
Co-authored-by: slaren <slarengh@gmail.com>
* Add void to () in function signature
Co-authored-by: slaren <slarengh@gmail.com>
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
* add guids to all backends
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
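
The thrust of the bullets above: identify a backend by a 16-byte GUID rather than by comparing names or interface pointers, keep the CPU GUID as a local static inside an accessor function, and have ggml_backend_is_cpu compare GUIDs. A self-contained sketch with simplified types and made-up byte values, not ggml's actual definitions:

    #include <cstdint>
    #include <cstring>

    // simplified stand-ins for ggml's guid types
    typedef uint8_t   backend_guid[16];
    typedef uint8_t * backend_guid_t;

    struct backend {
        backend_guid_t guid;
        // ... interface, context, etc.
    };

    static bool guid_matches(const backend_guid_t a, const backend_guid_t b) {
        return std::memcmp(a, b, sizeof(backend_guid)) == 0;
    }

    // The GUID lives as a local static inside the accessor, so there is a single
    // definition and nothing global to keep in sync. Byte values are made up.
    static backend_guid_t cpu_backend_guid(void) {
        static backend_guid guid = {
            0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80,
            0x90, 0xa0, 0xb0, 0xc0, 0xd0, 0xe0, 0xf0, 0x00,
        };
        return guid;
    }

    // identity check now compares GUIDs instead of function pointers or names
    static bool backend_is_cpu(const backend * b) {
        return b != nullptr && guid_matches(b->guid, cpu_backend_guid());
    }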
|
* server: twice ctrl+C to exit
* std::atomic_flag
* sigint: message
* sigint: stderr
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
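
The bullets above describe a two-stage SIGINT handler: the first Ctrl+C sets an std::atomic_flag, prints a note to stderr and lets the server shut down gracefully; a second Ctrl+C terminates immediately. A minimal, self-contained sketch of that pattern (not the server.cpp code itself):

    #include <atomic>
    #include <csignal>
    #include <cstdio>
    #include <cstdlib>

    static std::atomic_flag  g_sigint_seen = ATOMIC_FLAG_INIT; // first Ctrl+C already handled?
    static std::atomic<bool> g_running{true};

    static void sigint_handler(int /*sig*/) {
        if (g_sigint_seen.test_and_set()) {
            // second Ctrl+C: give up on a graceful shutdown and exit right away
            std::_Exit(130);
        }
        // first Ctrl+C: ask the main loop to stop; the message goes to stderr
        // (stdio in a signal handler is not strictly async-signal-safe - kept for brevity)
        std::fprintf(stderr, "\nreceived SIGINT, press Ctrl+C again to force exit\n");
        g_running.store(false);
    }

    int main() {
        std::signal(SIGINT, sigint_handler);
        while (g_running.load()) {
            // ... serve requests ...
        }
        std::fprintf(stderr, "shutting down cleanly\n");
        return 0;
    }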
|
This reverts a single line from #5475
|
* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
|