path: root/convert_hf_to_gguf.py
author     ubergarm <leimgrub@gmail.com>  2025-04-26 11:34:04 -0400
committer  GitHub <noreply@github.com>    2025-04-26 17:34:04 +0200
commit     baeefb4731fb24cdace168f6dbc74516d470efc0 (patch)
tree       af5314fee78b2ffd037b3dc0fa13dc29ed5384e5 /convert_hf_to_gguf.py
parent     9e846f0eb196ed543cb29753bfd6a21a936a5138 (diff)
Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support

  Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp. Still some issues where it doesn't work:
  * offloading >=60 layers to GPU
  * no flash attention

* Remove seemingly unused llm_tensor enums

  Both of these seem unused, and LLM_TENSOR_ATTN_POST_NORM already existed, which seems pretty similar; they don't appear to be used in the Python code either. Removed as likely just cruft:
  * LLM_TENSOR_POST_ATTN_NORM
  * LLM_TENSOR_POST_MLP_NORM

* Set flash attention precision to f32 on GLM4 arch

* Set non-flash-attention precision to f32 on GLM4

* Remove reshape_3d() for Vcur in build_glm4()

  This fixes non-flash-attention inferencing on both CPU and CUDA.
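The two "precision to f32" items above amount to forcing F32 accumulation for the GLM4 attention-score matmul (and its flash-attention counterpart). What follows is a minimal standalone C++ sketch of that pattern against the ggml API, for illustration only: it assumes ggml_mul_mat_set_prec()/GGML_PREC_F32 are the mechanism the commit message refers to, the tensor shapes, fill values, and single-threaded compute call are made up for the example, and header/API placement can differ between ggml versions. It is not the actual build_glm4() graph from this commit.

// Sketch: force F32 accumulation for an attention-score style matmul.
// Assumptions are noted in the text above; shapes and values are arbitrary.
#include "ggml.h"
#include <cstdio>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // stand-ins for one head's Q and K (head_dim = 128, n_tokens = 32)
    struct ggml_tensor * q = ggml_set_f32(ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 32), 0.01f);
    struct ggml_tensor * k = ggml_set_f32(ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 32), 0.02f);

    // attention scores; on GLM4 the change described above forces this matmul
    // to accumulate in F32 instead of the default (possibly lower) precision
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, kq);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 1);

    printf("kq[0] = %f\n", ggml_get_f32_1d(kq, 0));
    ggml_free(ctx);
    return 0;
}

Presumably the motivation is numerical: accumulating the attention scores in F32 avoids precision problems on this arch, which would explain why both the flash-attention and non-flash-attention paths receive the override.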
Diffstat (limited to 'convert_hf_to_gguf.py')
0 files changed, 0 insertions, 0 deletions