ik_llama.cpp.git

author	Kawrakow <iwankawrakow@gmail.com>	2025-03-09 16:53:55 +0200
committer	GitHub <noreply@github.com>	2025-03-09 16:53:55 +0200
commit	b096a5de7a9bdf516bb20729d5d0a3b2a12cba2f (patch)
tree	5063aaad36537b062c5e6ee580854247542816cc /examples
parent	81748fb55e474ef1ddb3c64c14f7c378f0f6cd8b (diff)

This works on CUDA, but (#247)

PP speed is great, almost on par with standard FA. But TG speed is pathetic. The strangest thing is that the slowdown is not due to FA, but due to the ffn_gate_exps GEMM, which somehow becomes very slow. WTF?

As I'm unable to resolve the slow ffn_gate_exps GEMM mystery, for now TG goes via mla=2 and PP goes via FA.

Also discovered the ggml_cast op, so we don't need the aux tensors that I had added to the KV cache.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

Diffstat (limited to 'examples')

0 files changed, 0 insertions, 0 deletions