* ggml-cuda : compute ptrs for cublasGemmBatchedEx in a kernel
* fix warnings
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
* llm : add llm_build_context
* llm : deduce norm eps based on type + explicit max_alibi_bias, clamp_kqv
* llm : restore the non-graph llm_build_ functional API
ggml-ci
* llm : cleanup + comments
* Allow caller to handle help/argument exceptions
* Prepend newline to usage output
* Add new gpt_params_parse_ex function to hide arg-parse impl (see the sketch below)
* Fix issue blocking success case
* exit instead of returning false
* Update common/common.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
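
A minimal sketch of the two-layer parsing pattern this commit describes, assuming a throw-based inner parser; the actual exception types, supported flags, and signatures in common/common.h may differ:
```c++
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <string>

struct gpt_params {
    std::string model = "models/7B/ggml-model.gguf";
};

// Inner parser: throws instead of printing usage and exiting, so callers can
// handle help requests and argument errors themselves.
static bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
    for (int i = 1; i < argc; i++) {
        const std::string arg = argv[i];
        if (arg == "-h" || arg == "--help") {
            throw std::invalid_argument("help requested");
        } else if (arg == "-m" || arg == "--model") {
            if (++i >= argc) {
                throw std::invalid_argument("missing value for " + arg);
            }
            params.model = argv[i];
        } else {
            throw std::invalid_argument("unknown argument: " + arg);
        }
    }
    return true;
}

// Outer wrapper keeps the old behavior: print usage (prefixed with a newline)
// and exit instead of returning false.
static bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
    try {
        return gpt_params_parse_ex(argc, argv, params);
    } catch (const std::invalid_argument & e) {
        fprintf(stderr, "error: %s\n", e.what());
        fprintf(stderr, "\nusage: %s [-m MODEL]\n", argv[0]);
        exit(1);
    }
}
```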
* impl --log-new, --log-append
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Update common/log.h
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Apply suggestions from code review
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning by following the README, I hit an assert here.
This probably isn't an important case, because inference later prints a warning saying you should use f16 or f32 instead when using LoRA.
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32 (see the sketch below)
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
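
The mixed-precision add comes down to converting the f16 operand per element before accumulating. A standalone toy version follows; the helper names here are illustrative, and ggml itself uses its own GGML_FP16_TO_FP32 conversion and table-based lookups rather than this scalar routine:
```c++
#include <cmath>
#include <cstdint>

// Reference scalar fp16 -> fp32 conversion (normal, subnormal, inf/nan cases).
static float fp16_bits_to_fp32(uint16_t h) {
    const float sign = (h & 0x8000) ? -1.0f : 1.0f;
    const int   exp  = (h >> 10) & 0x1f;
    const int   man  =  h        & 0x3ff;
    if (exp == 0)  return sign * std::ldexp((float) man, -24);    // zero and subnormals
    if (exp == 31) return man ? NAN : sign * INFINITY;            // inf / nan
    return sign * std::ldexp((float) (man + 1024), exp - 25);     // normal values
}

// dst[i] = src0[i] (f16) + src1[i] (f32), with an f32 destination, loosely
// mirroring the "add_f16_f32_f32_cuda" naming; the CPU path converts per element.
static void add_f16_f32_f32(const uint16_t * src0, const float * src1, float * dst, int64_t n) {
    for (int64_t i = 0; i < n; i++) {
        dst[i] = fp16_bits_to_fp32(src0[i]) + src1[i];
    }
}
```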
* scripts : add deploy-server.sh
* scripts : rename to server-llm.sh
* scripts : working curl pipe
* llama : factor out ggml-alloc from graph graph build functions
ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip)
ggml-ci
* llama : offload rest of the models
ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment
ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function (see the sketch after this list)
ggml-ci
* llama : add llm_build_ffn helper function (#3849)
ggml-ci
* llama : add llm_build_k_shift helper
ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper
ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper
ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names
ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
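
As an illustration of what these helpers factor out, here is a rough sketch in the spirit of llm_build_norm: apply RMS norm or classic norm, then the optional scale and bias weights. The real helper in llama.cpp carries extra context (offload callbacks, per-layer tensor naming), so treat this as a sketch rather than the actual code:
```c++
#include "ggml.h"

enum llm_norm_type { LLM_NORM, LLM_NORM_RMS };

static struct ggml_tensor * llm_build_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * mw,   // norm weight (may be NULL)
        struct ggml_tensor  * mb,   // norm bias   (may be NULL)
        enum llm_norm_type    type,
        float                 eps) {
    switch (type) {
        case LLM_NORM:     cur = ggml_norm    (ctx, cur, eps); break;
        case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, eps); break;
    }
    if (mw) { cur = ggml_mul(ctx, cur, mw); }   // element-wise scale
    if (mb) { cur = ggml_add(ctx, cur, mb); }   // optional bias
    return cur;
}
```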
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token (see the sketch below).
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
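
A minimal standalone sketch of the filtering rule described above: keep a token only if its probability is at least *p* times the probability of the most likely token. The real implementation works on llama.cpp's llama_token_data_array and has more bells and whistles; the plain structs here are just for illustration:
```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct token_prob { int32_t id; float p; };

static void min_p_filter(std::vector<token_prob> & cands, float p, size_t min_keep = 1) {
    if (cands.empty() || p <= 0.0f) {
        return;
    }
    float max_p = 0.0f;
    for (const auto & c : cands) {
        max_p = std::max(max_p, c.p);
    }
    // e.g. p = 0.05 keeps tokens within 5% of the top token's probability
    const float threshold = p * max_p;
    std::vector<token_prob> kept;
    for (const auto & c : cands) {
        if (c.p >= threshold) {
            kept.push_back(c);
        }
    }
    if (kept.size() >= min_keep) {
        cands = std::move(kept);
    }
}
```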
* ggml : move FP16 <-> FP32 stuff to ggml-impl.h
ggml-ci
* tests : fix ARM build
* ggml : explicitly initialize deprecated type traits
* ggml : add math.h to ggml-impl.h
* ggml : remove duplicate static assert macros
* ggml : prefix lookup tables with ggml_
ggml-ci
* ggml-impl : move extern "C" to start of file
* Extend llama_kv_cache_seq_rm to allow matching any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing.
Change calls to llama_kv_cache_tokens_rm that delete by position to use llama_kv_cache_seq_rm instead.
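
A hedged usage sketch of the resulting API (signatures as of this change; consult llama.h for the current ones):
```c++
#include "llama.h"

static void kv_cache_examples(struct llama_context * ctx) {
    // Wipe the entire KV cache (replaces llama_kv_cache_tokens_rm for full clears):
    llama_kv_cache_clear(ctx);

    // Remove positions [10, 20): a negative seq_id now matches any sequence,
    // which covers the old position-based removal use case.
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/ -1, /*p0=*/ 10, /*p1=*/ 20);
}
```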
* ggml : factor all quantization code in ggml-quants
ggml-ci
* ggml-quants : fix Zig and Swift builds + quantize tool
ggml-ci
* quantize : --pure option for disabling k-quant mixtures
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* flake : update flake.lock for newer transformers version + provide extra dev shell with torch and transformers (for most convert-xxx.py scripts) (#3797)
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Allow quantizing k-quants to fall back when the tensor size is incompatible (#3747)
* quantize : add a warning when tensors are incompatible with k-quants
Clean up k-quants state passing a bit
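
An illustrative sketch of the fallback rule: k-quants operate on 256-wide super-blocks (QK_K), so a tensor whose row size is not a multiple of 256 cannot use them and gets a compatible non-k type instead. The specific type mapping below is only an example; the actual choices in llama.cpp's quantizer are more nuanced:
```c++
#include <cstdint>

enum qtype { Q4_K, Q5_K, Q6_K, Q4_1, Q5_1, Q8_0 };

static const int64_t QK_K = 256;   // k-quant super-block size

static qtype pick_quant_type(qtype wanted, int64_t ne0, bool * did_fallback) {
    *did_fallback = false;
    if (ne0 % QK_K != 0) {          // row size incompatible with k-quants
        *did_fallback = true;       // caller can emit the warning mentioned above
        switch (wanted) {
            case Q4_K: return Q4_1;
            case Q5_K: return Q5_1;
            case Q6_K: return Q8_0;
            default:   break;
        }
    }
    return wanted;
}
```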
* llama : add option for greedy sampling with probs
* llama : add comment about llama_sample_token_greedy() missing probs
* sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs
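
A standalone toy of the convention described above, not the llama.cpp sampler itself: temperature exactly 0.0 means plain greedy sampling with no probabilities computed, while a negative temperature means greedy sampling with the softmax probabilities still filled in for callers to inspect:
```c++
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct candidate { int id; float logit; float p; };

static int sample_greedy(std::vector<candidate> & cands, float temp) {
    if (cands.empty()) {
        return -1;
    }
    if (temp < 0.0f) {
        // negative temperature: still greedy, but compute softmax probs
        float max_logit = cands[0].logit;
        for (const auto & c : cands) {
            max_logit = std::max(max_logit, c.logit);
        }
        double sum = 0.0;
        for (auto & c : cands) {
            c.p = std::exp(c.logit - max_logit);
            sum += c.p;
        }
        for (auto & c : cands) {
            c.p = (float) (c.p / sum);
        }
    }
    // temp == 0.0 (or negative): pick the highest-logit token either way
    size_t best = 0;
    for (size_t i = 1; i < cands.size(); i++) {
        if (cands[i].logit > cands[best].logit) {
            best = i;
        }
    }
    return cands[best].id;
}
```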
(#3823)
* starcoder : do not GPU split 1D bias tensors
* starcoder : offload layers to GPU
ggml-ci
* speculative: Ensure draft and target model vocabs match
* Tolerate small differences when checking the draft vs. target vocab
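
A hedged sketch of the idea: compare the draft and target vocabularies but tolerate small differences instead of requiring an exact match. Plain std::string vectors stand in for the llama.cpp API, and the limits and function name are illustrative, not the exact ones used in examples/speculative:
```c++
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if the draft vocab is "close enough" to the target vocab:
// sizes may differ slightly, and a handful of token strings may differ.
static bool vocabs_compatible(const std::vector<std::string> & tgt,
                              const std::vector<std::string> & dft,
                              size_t max_size_diff  = 100,
                              size_t max_token_diff = 5) {
    const size_t a = tgt.size(), b = dft.size();
    if ((a > b ? a - b : b - a) > max_size_diff) {
        return false;
    }
    size_t n_mismatch = 0;
    for (size_t i = 0; i < std::min(a, b); i++) {
        if (tgt[i] != dft[i] && ++n_mismatch > max_token_diff) {
            return false;
        }
    }
    return true;
}
```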
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
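
A rough sketch of the dispatch policy described above: quantized mat-muls go through the custom MMQ kernels for small batches (or always, when forced at build time via GGML_CUDA_FORCE_MMQ), and through cuBLAS with tensor cores for larger batches on Volta and newer. The cutoff value and the function name are illustrative, not the actual ggml-cuda code:
```c++
// Decide between the MMQ kernels and the cuBLAS/tensor-core path.
static bool use_mmq(int batch_size, int compute_capability) {
#ifdef GGML_CUDA_FORCE_MMQ
    (void) batch_size; (void) compute_capability;
    return true;                       // build option: always use MMQ kernels
#else
    const int MMQ_MAX_BATCH = 32;      // illustrative threshold
    if (compute_capability < 700) {    // pre-Volta: no usable tensor cores
        return true;
    }
    return batch_size <= MMQ_MAX_BATCH;
#endif
}
```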
Co-authored-by: Michael Coppola <m18coppola@gmail.com>
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
* server : do not block system prompt update
* server : update state machine logic to process system prompts
* server : minor
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCm. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside of what the pull request changed. This is also my first attempt at a multi-part review, so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases (see the sketch below)
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
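
A hedged host-side sketch of the strided-batched path: when the batch of src0/src1 matrices is laid out contiguously (no broadcasting), one cublasGemmStridedBatchedEx call replaces a loop of plain GEMMs. The wrapper name is hypothetical, all matrices are column-major, inputs and accumulation are FP16 (roughly matching the attention-ops use case), the compute-type enum assumes CUDA 11+, and error checking is omitted:
```c++
#include <cublas_v2.h>
#include <cuda_fp16.h>

static void gemm_batched_f16(cublasHandle_t handle,
                             const half * A,   // [m x k] per batch, column-major
                             const half * B,   // [k x n] per batch, column-major
                             half       * C,   // [m x n] per batch, column-major
                             int m, int n, int k, int batch_count) {
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    // One call computes C_i = A_i * B_i for i = 0 .. batch_count-1.
    cublasGemmStridedBatchedEx(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        A, CUDA_R_16F, /*lda=*/ m, /*strideA=*/ (long long) m * k,
        B, CUDA_R_16F, /*ldb=*/ k, /*strideB=*/ (long long) k * n,
        &beta,
        C, CUDA_R_16F, /*ldc=*/ m, /*strideC=*/ (long long) m * n,
        batch_count,
        CUBLAS_COMPUTE_16F,
        CUBLAS_GEMM_DEFAULT);
}
```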
* Add more tokenizer tests
* Add starcoder
* Update test vocab files
* Restrict bpe tokenizer tests to unicode planes
* Update comment
* Comment cosmetics
* Remove bloom vocab/test
This reverts commit 96981f37b1e3f450d9e63e571514217bf60f0a7f.
See:
https://github.com/ggerganov/llama.cpp/pull/2482#issuecomment-1775975866
tokenizers (#3746)
We still have the heads-up in `README.md` regarding `bpe` tokenizers, and this patch is needed for
- a couple of tokenizer tests
- some more `special` and `non-special` added-token handling (as far as I understand it)
* Update special token handling
* Add mpt