author | Kawrakow <iwankawrakow@gmail.com> | 2025-03-09 16:53:55 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-03-09 16:53:55 +0200 |
commit | b096a5de7a9bdf516bb20729d5d0a3b2a12cba2f (patch) | |
tree | 5063aaad36537b062c5e6ee580854247542816cc /examples | |
parent | 81748fb55e474ef1ddb3c64c14f7c378f0f6cd8b (diff) |
This works on CUDA, but (#247)
PP speed is great, almost on par with standard FA.
But TG speed is pathetic. The strangest thing is that
the slowdown is not due to FA, but due to the ffn_gate_exps
GEMM, which somehow becomes very slow. WTF?
As I'm unable to resolve the slow ffn_gate_exps GEMM mystery,
for now TG goes via mla=2, PP is via FA.
Also discovered the ggml_cast op, so we don't need the aux
tensors that I had added to the KV cache.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Diffstat (limited to 'examples')
0 files changed, 0 insertions, 0 deletions