| field | value | date |
|---|---|---|
| author | ubergarm <leimgrub@gmail.com> | 2025-04-26 11:34:04 -0400 |
| committer | GitHub <noreply@github.com> | 2025-04-26 17:34:04 +0200 |
| commit | baeefb4731fb24cdace168f6dbc74516d470efc0 (patch) | |
| tree | af5314fee78b2ffd037b3dc0fa13dc29ed5384e5 /convert_hf_to_gguf.py | |
| parent | 9e846f0eb196ed543cb29753bfd6a21a936a5138 (diff) | |
Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support
  Based on zRzRzRzRzRzR's PR on mainline llama.cpp. Some things still do not
  work (a sketch of the architecture wiring follows below):
  * offloading >= 60 layers to the GPU
  * flash attention
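For context, adding a new architecture to llama.cpp generally starts with registering an architecture enum value and its name string, alongside the tensor mappings and a graph builder. The snippet below is only a minimal sketch of that registration, assuming the mainline naming (LLM_ARCH_GLM4, "glm4"); it is not the actual patch.

```cpp
#include <map>

// Minimal sketch of architecture registration, following mainline naming.
enum llm_arch {
    LLM_ARCH_LLAMA,   // stands in for the existing architectures
    LLM_ARCH_GLM4,    // new: GLM-4-0414 family
    LLM_ARCH_UNKNOWN,
};

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_GLM4,  "glm4"  },  // must match general.architecture written by convert_hf_to_gguf.py
};
```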
* Remove seemingly unused llm_tensor enums
  Both enums appear unused, and the existing LLM_TENSOR_ATTN_POST_NORM
  already covers the same role; they do not appear to be referenced from the
  Python conversion code either. Removed as likely cruft (the tensor-name
  mapping that remains is sketched below):
  * LLM_TENSOR_POST_ATTN_NORM
  * LLM_TENSOR_POST_MLP_NORM
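For reference, these per-layer tensor enums only matter through the enum-to-name mapping used when loading GGUF tensors. The fragment below is a hypothetical sketch of that mapping with the surviving LLM_TENSOR_ATTN_POST_NORM entry; the name strings follow mainline conventions and may differ in this fork.

```cpp
#include <map>

// Hypothetical sketch: per-layer tensor enums resolve to GGUF tensor name
// patterns. LLM_TENSOR_ATTN_POST_NORM already provides the post-attention
// norm slot, so the duplicate-looking LLM_TENSOR_POST_ATTN_NORM /
// LLM_TENSOR_POST_MLP_NORM entries were dropped as cruft.
enum llm_tensor {
    LLM_TENSOR_ATTN_NORM,
    LLM_TENSOR_ATTN_POST_NORM,   // kept: post-attention norm
    LLM_TENSOR_FFN_NORM,
    LLM_TENSOR_FFN_POST_NORM,    // kept: post-FFN norm
};

static const std::map<llm_tensor, const char *> TENSOR_NAMES = {
    { LLM_TENSOR_ATTN_NORM,      "blk.%d.attn_norm"           },
    { LLM_TENSOR_ATTN_POST_NORM, "blk.%d.post_attention_norm" },
    { LLM_TENSOR_FFN_NORM,       "blk.%d.ffn_norm"            },
    { LLM_TENSOR_FFN_POST_NORM,  "blk.%d.post_ffw_norm"       },
};
```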
* Set flash attention precision to f32 on the GLM4 arch
* Set non-flash-attention precision to f32 on GLM4 (see the sketch below)
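The precision fix follows the pattern llama.cpp already uses for architectures whose attention overflows in f16: force f32 accumulation on the KQ matmul (non-flash-attention path) and on the fused flash-attention op. Below is a hedged sketch using ggml's ggml_mul_mat_set_prec and ggml_flash_attn_ext_set_prec; the helper function itself is made up for illustration.

```cpp
#include "ggml.h"

// Illustrative helper (not the actual patch): force f32 precision for GLM4
// attention on both attention paths.
static void glm4_force_f32_attn_prec(struct ggml_tensor * kq, struct ggml_tensor * fa_out) {
    if (kq != NULL) {
        // non-flash-attention path: KQ = K^T * Q accumulates in f32
        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
    }
    if (fa_out != NULL) {
        // flash-attention path: the fused ggml_flash_attn_ext node runs in f32
        ggml_flash_attn_ext_set_prec(fa_out, GGML_PREC_F32);
    }
}
```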
* Remove reshape_3d() for Vcur in build_glm4()
  This fixes non-flash-attention inference on both CPU and CUDA (see the
  sketch below).
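The fragment below sketches what the Vcur change looks like in a GLM4-style graph builder; variable names such as ctx0, model.layers[il].wv, and n_embd_gqa follow llama.cpp conventions, but the surrounding code is illustrative rather than the exact diff.

```cpp
// V projection: Vcur comes out as a 2D tensor of shape [n_embd_gqa, n_tokens]
struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur);

// Previously Vcur was reshaped to 3D here:
//   Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head_v, n_head_kv, n_tokens);
// Dropping that reshape keeps the 2D layout the non-flash-attention
// KV-cache store expects, which fixes inference on CPU and CUDA.
```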
Diffstat (limited to 'convert_hf_to_gguf.py')
0 files changed, 0 insertions, 0 deletions