* Update README.md
* missing space
* llama3 !
* llama : check that all the tensor data is in the model file
* also check for unsigned overflow
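As a rough illustration of the kind of check described above (names and layout are hypothetical, not the actual llama.cpp code), the overflow guard matters because `offs + size` can wrap around in unsigned arithmetic and make an out-of-range tensor look valid:

```cpp
#include <cstdint>

// Hypothetical sketch: verify that a tensor's byte range [offs, offs + size)
// lies entirely inside the model file.
static bool tensor_data_in_file(uint64_t offs, uint64_t size, uint64_t file_size) {
    if (offs + size < offs) {
        return false;   // unsigned overflow: the range cannot be valid
    }
    return offs + size <= file_size;
}
```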
This commit renames the lerp (linear interpolation) function in clip.cpp
to avoid a conflict with the lerp function in the <cmath> standard C++
library when using C++20.
The motivation for this change is to enable projects that use C++20 to
compile clip.cpp without having to resort to patching it. The lerp
function was added to <cmath> in C++20 (202002L), which is why this is
not causing any issue at the moment, as llama.cpp currently uses
C++11/C++17.
I realize that llama.cpp uses either C++11 (or C++17 in the case of
SYCL), but wanted to ask if this would be an acceptable change just the
same.
Refs: https://en.cppreference.com/w/cpp/numeric/lerp
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
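To make the conflict concrete, here is a minimal sketch (not the actual clip.cpp code; the renamed identifier below is illustrative) of how a file-local lerp helper runs into the C++20 <cmath> declaration:

```cpp
#include <cmath>   // in C++20 this declares std::lerp, and many standard
                   // library implementations also expose ::lerp globally

// clip.cpp used to carry a helper along these lines, which can now collide
// with the <cmath> declaration when compiled as C++20:
//   static float lerp(float s, float e, float t) { return s + (e - s) * t; }
//
// Renaming it side-steps the conflict without changing behaviour:
static float clip_lerp(float s, float e, float t) {
    return s + (e - s) * t;
}

int main() {
    return clip_lerp(0.0f, 2.0f, 0.5f) == 1.0f ? 0 : 1;
}
```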
ggml-ci
* tests : minor bash stuff
ggml-ci
* llama : fix build
ggml-ci
* tests : fix CUR_DIR -> ROOT_DIR
ggml-ci
* tests : fix fname
ggml-ci
* Implement '--keep-split' to quantize model into several shards
* Add test script
* Update examples/quantize/quantize.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Split model correctly even if tensor id is out-of-order
* Update llama_model_quantize_params
* Fix preci failures
---------
Co-authored-by: z5269887 <z5269887@unsw.edu.au>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add llama_get_pooling_type function
* fix argument name, move with ctx funcs
* fix: revert showing control tokens by default
* feat: revert changes to default behavior of llama_token_to_piece; provide an overloaded declaration that receives a "bool special" param to toggle showing control tokens
* feat: use the overloaded declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses
* common : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
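A minimal sketch of the overload pattern described above, using hypothetical stand-in types rather than the real llama.h/common.cpp declarations: the extra `special` flag controls whether control tokens are rendered, and the chat completion path passes `false`.

```cpp
#include <string>

// Hypothetical stand-ins for the real llama context/token types.
struct ctx_stub {};
using token_id = int;

static std::string token_to_piece_impl(const ctx_stub &, token_id, bool special) {
    // A real implementation would detokenize; here we only illustrate the toggle.
    return special ? "<|im_end|>" : "";
}

// The one-argument form delegates with whatever default the project chooses
// (true in this sketch), so existing call sites keep compiling unchanged...
static std::string token_to_piece(const ctx_stub & ctx, token_id t) {
    return token_to_piece_impl(ctx, t, /*special =*/ true);
}

// ...while the overload lets chat completion pass false so control tokens are
// not echoed back in responses.
static std::string token_to_piece(const ctx_stub & ctx, token_id t, bool special) {
    return token_to_piece_impl(ctx, t, special);
}
```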
* Server: add tests for consistent results
* sampling: separate rng per sampling context
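A hedged sketch of the idea (names are illustrative, not the actual llama_sampling_context fields): giving each sampling context its own seeded RNG, instead of sharing a global one, lets concurrent server requests produce reproducible, independent results.

```cpp
#include <cstdint>
#include <random>

// Illustrative only: a per-context RNG instead of a process-global one.
struct sampling_ctx {
    std::mt19937 rng;   // each sampling context owns its generator

    explicit sampling_ctx(uint32_t seed) : rng(seed) {}

    // Draw a uniform index from [0, n_probs); a real sampler would draw from
    // the token probability distribution instead.
    int sample_uniform(int n_probs) {
        std::uniform_int_distribution<int> dist(0, n_probs - 1);
        return dist(rng);
    }
};
```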
ggml-ci
* Add phi 3 chat template & tests
* test : fix chat template result
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add support for codeqwen due to its tokenizer
* override load_hparams
* fix typo
* fix load_params
* convert : fix whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add explicit phi3 support
* add explicit phi3 support
* remove unused code
* convert : add BOS token
* llama : match EOT token <|end|>
* llama : minor / style
* llama : tabs -> spaces
* convert : fix lint checks
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
activated (#6767)
* Fix FP32/FP16 build instructions
* Fix typo
* Recommended build instruction
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Recommended build instruction
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Recommended build instruction
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* Add comments in Intel GPU linux
---------
Co-authored-by: Anas Ahouzi <112881240+aahouzi-intel@users.noreply.github.com>
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
* llamafile : improve sgemm.cpp
- Re-enable by default
- Fix issue described in #6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases
* Address review comments
* Help clang produce fma instructions
* Address review comments
Latest gcc complains here:
/home/airlied/devel/llama.cpp/ggml-alloc.c: In function ‘ggml_gallocr_new_n’:
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: warning: ‘calloc’ sizes specified with ‘sizeof’ in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
374 | ggml_gallocr_t galloc = (ggml_gallocr_t)calloc(sizeof(struct ggml_gallocr), 1);
| ^~~~~~
/home/airlied/devel/llama.cpp/ggml-alloc.c:374:59: note: earlier argument should specify number of elements, later size of each element
and a bunch more.
calloc is specified to take nmemb first, then size, so realign the code.
In a couple of places there was a `* x, 1`, so I fixed those to use calloc properly.
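In other words, the fix is just to put the element count in the first argument. A sketch only, using a stand-in struct rather than the real ggml type:

```cpp
#include <cstdlib>

struct gallocr_stub { int dummy; };   // stand-in for struct ggml_gallocr

int main() {
    // Before (triggers -Wcalloc-transposed-args): the sizeof() sits in the
    // nmemb slot:  calloc(sizeof(gallocr_stub), 1);
    // After: number of elements first, size of each element second.
    void * galloc = std::calloc(1, sizeof(gallocr_stub));
    std::free(galloc);
    return 0;
}
```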
Flake lock file updates:
• Updated input 'nixpkgs':
'github:NixOS/nixpkgs/1042fd8b148a9105f3c0aca3a6177fd1d9360ba5?narHash=sha256-3sbWO1mbpWsLepZGbWaMovSO7ndZeFqDSdX0hZ9nVyw%3D' (2024-04-10)
→ 'github:NixOS/nixpkgs/5c24cf2f0a12ad855f444c30b2421d044120c66f?narHash=sha256-XtTSSIB2DA6tOv%2Bl0FhvfDMiyCmhoRbNB%2B0SeInZkbk%3D' (2024-04-19)
* `build`: generate hex dumps of server assets on the fly
* build: workaround lack of -n on gnu xxd
* build: don't use xxd in cmake
* build: don't call xxd from build.zig
* build: more idiomatic hexing
* build: don't use xxd in Makefile (od hackery instead)
* build: avoid exceeding max cmd line limit in makefile hex dump
* build: hex dump assets at cmake build time (not config time)
* make : fix common dep on llama.h
* llama : add option to render special tokens
* readme : add API change notice
ggml-ci
* swift : fix build
* Added llama-3 chat template
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update llama.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
* Added EOS stop sequence according to https://github.com/ggerganov/llama.cpp/pull/6751#issuecomment-2065602862
* Removed adding of BOS token before first message
* Removed bos token from expected output from llama-3
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Update tests/test-chat-template.cpp
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
* Added <|end_of_text|> as another stop token
* Reverted last change of adding the end_of_text stop word for llama 3
---------
Co-authored-by: Wouter Tichelaar <tichelaarw@spar.net>
Co-authored-by: Samuel Tallet <36248671+SamuelTallet@users.noreply.github.com>
Co-authored-by: Rene Leonhardt <65483435+reneleonhardt@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
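For reference, the Llama 3 chat format that the new template targets wraps each message in header and end-of-turn markers. A rough sketch of the formatting (illustrative, not the code added to llama.cpp):

```cpp
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

// Rough sketch of Llama 3 style formatting: each turn is
//   <|start_header_id|>role<|end_header_id|>\n\ncontent<|eot_id|>
// and the prompt ends with an empty assistant header. BOS handling is left to
// the tokenizer, matching the change above that stops prepending it here.
static std::string format_llama3(const std::vector<chat_msg> & msgs) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<|start_header_id|>" + m.role + "<|end_header_id|>\n\n" + m.content + "<|eot_id|>";
    }
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n";
    return out;
}
```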
* added Fedora to the list of distros that may need the package (the packages have the same name on Fedora)
* how to add clblast, which is available in the Fedora repos
This change removes printf() logging so llava-cli is shell scriptable.
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
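A hedged usage sketch of the new helper (assuming a declaration along the lines of `bool llama_token_is_eog(const struct llama_model *, llama_token)`): generation loops can stop on any end-of-generation token, e.g. EOS or `<|end|>`/`<|eot_id|>`-style EOT tokens, instead of comparing against the single EOS id.

```cpp
#include "llama.h"

// Sketch: previously a loop would typically compare against one id,
//   if (new_token == llama_token_eos(model)) break;
// llama_token_is_eog() generalises this to any end-of-generation token.
static bool generation_finished(const struct llama_model * model, llama_token new_token) {
    return llama_token_is_eog(model, new_token);
}
```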
* common : disable get_math_cpu_count() until Android CI gets fixed
* common : another try
(#6748)
* implement olmo architecture
* remove unused variable
* remove unused moe branch
* remove check for weight
* remove superfluous moe, bias and rope tensors
* clarified comment
* fix clamp_kqv setting
* remove obsolete parameter name filter
* llama : make general.name optional
* train: Add 'general.name' to model metadata
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Support converting models with multiple chat templates
Adds the following metadata:
* tokenizer.chat_templates
* tokenizer.chat_template.<name1>
* tokenizer.chat_template.<name2>
* tokenizer.chat_template.<...>
Here `tokenizer.chat_templates` is an array of the template names (except `default`); the `default` template is still stored in the regular `tokenizer.chat_template` key.
* replace filtered characters with underscore
* New script to add/modify/remove metadata
This script creates a copy of a GGUF file and allows you to add/modify/remove metadata in the process.
Most importantly, this allows you to update chat templates, either as a string or directly from an updated tokenizer_config.json file.
* Add files via upload
add new script to project/readme
* flake--
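A hedged sketch of how a reader of such a GGUF file might enumerate the named templates via the gguf C API in ggml.h (assuming gguf_find_key, gguf_get_arr_n, gguf_get_arr_str and gguf_get_val_str behave as their names suggest; the key names follow the list above):

```cpp
#include <cstdio>
#include <string>
#include "ggml.h"   // gguf_* API

// Sketch: list the named chat templates stored as tokenizer.chat_templates
// plus one tokenizer.chat_template.<name> key per entry.
static void print_chat_templates(const struct gguf_context * ctx) {
    const int list_id = gguf_find_key(ctx, "tokenizer.chat_templates");
    if (list_id < 0) {
        printf("no named templates; only tokenizer.chat_template (if any)\n");
        return;
    }
    const int n = (int) gguf_get_arr_n(ctx, list_id);
    for (int i = 0; i < n; ++i) {
        const std::string name = gguf_get_arr_str(ctx, list_id, i);
        const std::string key  = "tokenizer.chat_template." + name;
        const int tmpl_id = gguf_find_key(ctx, key.c_str());
        if (tmpl_id >= 0) {
            printf("%s:\n%s\n", name.c_str(), gguf_get_val_str(ctx, tmpl_id));
        }
    }
}
```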
* build : sgemm.o only when needed
ggml-ci
* llamafile : tmp disable due to MoE bug
ggml-ci
* Update README.md
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix autoawq quantized gemma model convert error
Using autoawq to quantize a gemma model includes an lm_head.weight tensor in model-00001-of-00002.safetensors. This results in a situation where convert-hf-to-gguf.py can't map lm_head.weight. Skipping this tensor prevents the error.
* change code to full string match and print necessary message
Change the code to a full string match and print a short message to inform users that lm_head.weight has been skipped.
---------
Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>
ggml-ci