Age | Commit message (Collapse) | Author |
|
* llama: add codeshell support
* llama.cpp: fix codeshell with NeoX rope
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* convert : update phi-2 to latest HF repo
ggml-ci
* py : try to fix flake stuff
|
|
* llama : fix llm_build_k_shift to use correct n_rot
ggml-ci
* llama : always use hparams.n_rot for ggml_rope_custom
ggml-ci
* convert : fix persimmon conversion to write correct n_rot
|
|
|
|
* update: awq support llama-7b model
* update: change order
* update: benchmark results for llama2-7b
* update: mistral 7b v1 benchmark
* update: support 4 models
* fix: Readme
* update: ready for PR
* update: readme
* fix: readme
* update: change order import
* black
* format code
* update: work for bot mpt and awqmpt
* update: readme
* Rename to llm_build_ffn_mpt_awq
* Formatted other files
* Fixed params count
* fix: remove code
* update: more detail for mpt
* fix: readme
* fix: readme
* update: change folder architecture
* fix: common.cpp
* fix: readme
* fix: remove ggml_repeat
* update: cicd
* update: cicd
* uppdate: remove use_awq arg
* update: readme
* llama : adapt plamo to new ffn
ggml-ci
---------
Co-authored-by: Trần Đức Nam <v.namtd12@vinai.io>
Co-authored-by: Le Hoang Anh <v.anhlh33@vinai.io>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* add plamo mock
* add tensor loading
* plamo convert
* update norm
* able to compile
* fix norm_rms_eps hparam
* runnable
* use inp_pos
* seems ok
* update kqv code
* remove develop code
* update README
* shuffle attn_q.weight and attn_output.weight for broadcasting
* remove plamo_llm_build_kqv and use llm_build_kqv
* fix style
* update
* llama : remove obsolete KQ_scale
* plamo : fix tensor names for correct GPU offload
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* phi2 implementation
* fix breaking change
* phi-2 : various fixes
* phi-2 : use layer norm eps
* py : whitespaces
* llama : fix meta KV override bug
* convert : phi don't add BOS token
* convert : revert "added_tokens_decoder" change
* phi-2 : scale Q instead of KQ for better precision
* ggml : fix NeoX rope to rotate just first n_dims
* cuda : less diff in the rope_neox kernel
* ggml : add ggml_mul_mat_set_prec
ggml-ci
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* Update ggml-cuda.cu
Co-authored-by: slaren <slarengh@gmail.com>
* cuda : ggml_cuda_op_mul_mat_cublas support F32 precision
* cuda : remove oboslete comment
---------
Co-authored-by: Ebey Abraham <ebeyabraham@microsoft.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
|
|
* convert : support Mixtral as LLAMA arch
* convert : fix n_ff typo
* llama : model loading
* ggml : sync latest ggml_mul_mat_id
* llama : update graph to support MoE
* llama : fix cur -> cur_expert
* llama : first working version
* llama : fix expert weighting in the FFN
* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
* ggml : add n_as argument to ggml_mul_mat_id
* ggml : fix ggml_get_rows to take into account ne02 / ne11
* metal : add more general support for ggml_get_rows + tests
* llama : add basic support for offloading moe with CUDA
* metal : add/mul/div use general kernel when src1 not cont
* metal : reduce the kernel launches for ggml_mul_mat_id
* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D
* ggml : update get_rows f16 and q
* cuda : support non-contiguous src1 in get_rows
* llama : offload missing ffn_moe_silu
* metal : fix ggml_get_rows to work with non-cont src1
* metal : add indirect mat-vec kernels for all quantization types
* llama : do not quantize expert gating tensors
* llama : add n_expert and n_expert_used to hparams + change quants
* test-backend-ops : add moe test
* cuda : fix get_rows when ncols is odd
* convert : determine n_ctx correctly
* metal : fix ggml_mul_mat_id for F32
* test-backend-ops : make experts more evenly probable (test_moe)
* test-backend-ops : cleanup, add moe test for batches
* test-backend-ops : add cpy from f32 -> all types test
* test-backend-ops : fix dequantize block offset
* llama : fix hard-coded number of experts
* test-backend-ops : simplify and disable slow tests to avoid CI timeout
* test-backend-ops : disable MOE test with thread sanitizer
* cuda : fix mul_mat_id with multi gpu
* convert : use 1e6 rope_freq_base for mixtral
* convert : fix style
* convert : support safetensors format
* gguf-py : bump version
* metal : add cpy f16 -> f32 kernel
* metal : fix binary ops for ne10 % 4 != 0
* test-backend-ops : add one more sum_rows test
* ggml : do not use BLAS with ggml_mul_mat_id
* convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct
* convert : use sentencepiece tokenizer for Mixtral-instruct
* convert : make flake8 happy
* metal : fix soft_max kernels
ref: https://github.com/ggerganov/ggml/pull/621/commits/1914017863d2f9ab8ecc0281cc2a56d683668b92
* metal : limit kernels to not use more than the allowed threads
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
|
|
* enable qwen to llama.cpp
* llama : do not GPU split bias tensors
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* gguf-py: Refactor and add file reading support
* Replay changes from #3871
Credit to @cebtenzzre for that pull
* Various type annotation fixes.
* sort imports with isort (again)
* Fix missing return statement in add_tensor
* style cleanup with flake8
* fix NamedTuple and Enum usage
* Fix an issue with state init in GGUFReader
Move examples to an examples/ directory
Clean up examples
Add an example of modifying keys in a GGUF file
Update documentation with info on examples
Try to support people importing gguf/gguf.py directly
* Damagage is not a word.
* Clean up gguf-py/examples/modify_gguf.py whitespace
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update gguf-py/examples/modify_gguf.py formatting
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update gguf-py/gguf/gguf_reader.py type hint
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Make examples executable, formatting changes
* Add more information to GGUFReader and examples comments
* Include a gguf Python package version bump
* Add convert-gguf-endian.py script
* cleanup
* gguf-py : bump minor version
* Reorganize scripts
* Make GGUFReader endian detection less arbitrary
* Add JSON dumping support to gguf-dump.py
Which I kind of regret now
* A few for gguf-dump.py cleanups
* Murder accidental tuple in gguf-py/scripts/gguf-dump.py
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* cleanup
* constants : remove unneeded type annotations
* fix python 3.8 compat
* Set up gguf- scripts in pyproject.toml
* And include scripts/__init__.py, derp
* convert.py: We can't currently support Q8_0 on big endian.
* gguf-py: SpecialVocab: Always try available sources for special token ids
gguf-py: SpecialVocab: Try to load merges from merges.txt if not in tokenizer.json
gguf-py: SpecialVocab: Add 'add_bos_token' type bools to GGUF metadata
u
* cleanup
* Promote add_X_token to GGUF metadata for BOS and EOS
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|