gguf-split : improve --split and --merge logic (#9619)
* make sure params --split and --merge are not specified at the same time
* update gguf-split params parse logic
* Update examples/gguf-split/gguf-split.cpp
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
---------
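A minimal sketch of the mutual-exclusion check described in the commit above. The enum `split_op` and the function `parse_operation` are hypothetical names used for illustration; they are not taken from gguf-split.cpp, and the real parser may be structured differently.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical operation type; illustrative only.
enum class split_op { none, split, merge };

// Scan the arguments for an operation flag and reject a command line
// that requests both --split and --merge.
split_op parse_operation(const std::vector<std::string> & args) {
    split_op op = split_op::none;
    for (const std::string & arg : args) {
        split_op requested = split_op::none;
        if (arg == "--split") {
            requested = split_op::split;
        } else if (arg == "--merge") {
            requested = split_op::merge;
        } else {
            continue; // not an operation flag
        }
        if (op != split_op::none && op != requested) {
            throw std::invalid_argument("--split and --merge are mutually exclusive");
        }
        op = requested;
    }
    return op; // split_op::none when neither flag was given
}
```

Calling `parse_operation({"--split", "--merge"})` throws, while a command line with only one of the two flags returns the matching operation.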
gguf-split : add basic checks (#9499)
* gguf-split : do not overwrite existing files when merging
* gguf-split : error when too many arguments are passed
Authored-by: slaren <slarengh@gmail.com>
|
|
llama-llava-cli, etc... (#7809)
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew
* server: update refs -> llama-server
gitignore llama-server
* server: simplify nix package
* main: update refs -> llama
fix examples/main ref
* main/server: fix targets
* update more names
* Update build.yml
* rm accidentally checked in bins
* update straggling refs
* Update .gitignore
* Update server-llm.sh
* main: target name -> llama-cli
* Prefix all example bins w/ llama-
* fix main refs
* rename {main->llama}-cmake-pkg binary
* prefix more cmake targets w/ llama-
* add/fix gbnf-validator subfolder to cmake
* sort cmake example subdirs
* rm bin files
* fix llama-lookup-* Makefile rules
* gitignore /llama-*
* rename Dockerfiles
* rename llama|main -> llama-cli; consistent RPM bin prefixes
* fix some missing -cli suffixes
* rename dockerfile w/ llama-cli
* rename(make): llama-baby-llama
* update dockerfile refs
* more llama-cli(.exe)
* fix test-eval-callback
* rename: llama-cli-cmake-pkg(.exe)
* address gbnf-validator unused fread warning (switched to C++ / ifstream)
* add two missing llama- prefixes
* Updating docs for eval-callback binary to use new `llama-` prefix.
* Updating a few lingering doc references for rename of main to llama-cli
* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.
* Updating documentation references for lookup-merge and export-lora
* Updating two small `main` references missed earlier in the finetune docs.
* Update apps.nix
* update grammar/README.md w/ new llama-* names
* update llama-rpc-server bin name + doc
* Revert "update llama-rpc-server bin name + doc"
This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.
* add hot topic notice to README.md
* Update README.md
* Update README.md
* rename gguf-split & quantize bins refs in **/tests.sh
---------
Co-authored-by: HanClinto <hanclinto@gmail.com>
|
|
|
|
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remove --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
|
|
|
|
* tests : minor bash stuff
ggml-ci
* llama : fix build
ggml-ci
* tests : fix CUR_DIR -> ROOT_DIR
ggml-ci
* tests : fix fname
ggml-ci
|
|
* Fix --split-max-size
Byte size calculation was done on int and overflowed.
* add tests.sh
* add examples test scripts to ci run
Will autodiscover examples/*/tests.sh scripts and run them.
* move WORK_PATH to a subdirectory
* clean up before and after test
* explicitly define which scripts to run
* add --split-max-size to readme
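The overflow fixed in the commit above can be illustrated with a short sketch. The helper `split_size_to_bytes` is a hypothetical name, and the decimal multipliers for the `M` and `G` suffixes are an assumption; the point is only that the value is widened to 64 bits before multiplying.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: convert a size such as 4096 'M' or 4 'G' into bytes.
uint64_t split_size_to_bytes(int value, char unit) {
    if (value <= 0) {
        throw std::invalid_argument("size must be positive");
    }
    // Widen first. Writing `value * 1000 * 1000` would multiply in int
    // and overflow for inputs such as 4096.
    const uint64_t v = static_cast<uint64_t>(value);
    switch (unit) {
        case 'M': return v * 1000 * 1000;          // assumed decimal megabytes
        case 'G': return v * 1000 * 1000 * 1000;   // assumed decimal gigabytes
        default:  throw std::invalid_argument("unit must be M or G");
    }
}
```

With the widening cast, `split_size_to_bytes(4096, 'M')` yields 4096000000, which does not fit in a 32-bit int.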
|
|
* split by max size
* clean up arg parse
* split: ok
* add dry run option
* error on 0 tensors
* be positive
* remove next_metadata_size
|
|
* llama: llama_split_prefix: fix strncpy not writing the string terminator
common: llama_load_model_from_url:
- fix case-sensitive header name matching
- support downloading additional split in parallel
- hide password in url
* common: EOL EOF
* common: remove redundant LLAMA_CURL_MAX_PATH_LENGTH definition
* common: change max url length
* common: minor comment
* server: support HF URL options
* llama: llama_model_loader fix log
* common: use a constant for max url length
* common: clean up curl if file cannot be loaded in gguf
* server: tests: add split tests, and HF options params
* common: move llama_download_hide_password_in_url inside llama_download_file as a lambda
* server: tests: re-enable Release test on PR
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* spacing
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
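The strncpy issue mentioned in the commit above is a general C pitfall: when the source holds at least as many characters as the requested count, `strncpy` copies them without appending a terminator. A minimal sketch follows, using a hypothetical helper `copy_prefix` that is not the actual `llama_split_prefix` implementation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy the first prefix_len characters of src into dest,
// where dest can hold maxlen bytes including the terminator.
bool copy_prefix(char * dest, size_t maxlen, const char * src, size_t prefix_len) {
    if (maxlen == 0 || prefix_len >= maxlen) {
        return false; // no room left for the terminator
    }
    std::strncpy(dest, src, prefix_len);
    dest[prefix_len] = '\0'; // strncpy does not add this when src is long enough
    return true;
}
```

Without the explicit terminator, a later `strlen` or `printf("%s")` on `dest` could read past the copied prefix.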
|
|
* split: support in llama_model_loader
* avoid copying the entire vector
Co-authored-by: slaren <slarengh@gmail.com>
* split: move llama_tensor_offset to llama_model_loader
* llama_model_loader: PR feedback:
- use only one gguf_context for metadata only
- store all ggml_context in a vector as the files and mappings
- store all weights in a vector along with the source tensor
- rename ctx_gguf to meta
- rename ctx_meta to contexts
* avoid copying the entire vector
* Simplify this by making these optional, switch some layer creation tensor optional
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Handle optional tensors
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama_model_loader: fail if backend cannot allocate buffer
* fix mmap buffer management
* llama_model_loader: only map the file to the backend buffer if the allocation succeeds
* llama_model_loader: only map tensors included in the context
* llama_model_loader: minor, use same variable name for consistency, fix spacing in types cast
* llama_model_loader: fail if any backend buffer cannot be allocated
* spacing
Co-authored-by: slaren <slarengh@gmail.com>
* fix loop over pointer
Co-authored-by: slaren <slarengh@gmail.com>
* llama_model_loader: if the declared n_tensors does not equal the number of tensors loaded from the split, throw an exception instead of asserting
* llama_model_loader: ensure mappings vector has the expected size
* llama_model_loader: use at instead of operator[] if this should never add to the map.
* llama_model_loader: immediately add the backend buffer to the model buffers in order to free them if an error occurs in the next allocation. Reserve the expected size.
* llama_model_loader: make sure the model mappings have enough capacity before allocating the backend buffer
* llama_model_loader: fix map -> unordered map
* llama_split_prefix: use a clearer version; pass the dest max len instead of the split path len.
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* llama : minor
ggml-ci
* llama : introduce some typedef helpers
* docs: add model shard in hot topic
* llama_model_loader: put mapping in a unique_ptr from the moment it is allocated
Co-authored-by: slaren <slarengh@gmail.com>
* fix llama_split_prefix
---------
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
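The "use at instead of operator[]" item in the commit above rests on a standard-library detail: `operator[]` on a map inserts a default-constructed value for a missing key, while `at` throws `std::out_of_range`. A small sketch, with hypothetical function names and a hypothetical name-to-index map that are not taken from llama_model_loader:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical lookup: at() never modifies the map and throws
// std::out_of_range when the key is missing.
int find_file_index(const std::unordered_map<std::string, int> & index,
                    const std::string & name) {
    return index.at(name);
}

// Demonstrates the side effect of operator[]: works on a copy of the map
// and returns its size after subscripting with the given key.
size_t size_after_subscript(std::unordered_map<std::string, int> index,
                            const std::string & name) {
    (void) index[name]; // inserts {name, 0} if name is absent
    return index.size();
}
```

Using `at` turns an unexpected missing key into an error at the lookup site rather than a silently added entry.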
|
|
|
|
* gguf-split: split and merge gguf files per tensor
* gguf-split: build with make toolchain
* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split
* split : minor style + fix compile warnings
* gguf-split: remove unimplemented --upload
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|