author Kawrakow <iwankawrakow@gmail.com> 2025-03-09 16:53:55 +0200
committer GitHub <noreply@github.com> 2025-03-09 16:53:55 +0200
commit b096a5de7a9bdf516bb20729d5d0a3b2a12cba2f
tree 5063aaad36537b062c5e6ee580854247542816cc /examples
parent 81748fb55e474ef1ddb3c64c14f7c378f0f6cd8b
This works on CUDA, but (#247)
PP speed is great, almost on par with standard FA. But TG speed is pathetic. The strangest thing is that the slowdown is not due to FA, but due to the ffn_gate_exps GEMM, which somehow becomes very slow. WTF?

As I'm unable to resolve the slow ffn_gate_exps GEMM mystery, for now TG goes via mla=2 and PP goes via FA.

Also discovered the ggml_cast op, so we don't need the aux tensors that I had added to the KV cache.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples')
0 files changed, 0 insertions, 0 deletions