| field | value | date |
|---|---|---|
| author | ubergarm <leimgrub@gmail.com> | 2025-04-26 11:34:04 -0400 |
| committer | GitHub <noreply@github.com> | 2025-04-26 17:34:04 +0200 |
| commit | baeefb4731fb24cdace168f6dbc74516d470efc0 (patch) | |
| tree | af5314fee78b2ffd037b3dc0fa13dc29ed5384e5 /convert_hf_to_gguf.py | |
| parent | 9e846f0eb196ed543cb29753bfd6a21a936a5138 (diff) | |
Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support
  Based on zRzRzRzRzRzR's PR on mainline llama.cpp. Some things still do not
  work (a sketch of the architecture wiring follows below):
  * offloading >= 60 layers to the GPU
  * flash attention
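For context, adding a new architecture to llama.cpp generally starts with registering an architecture enum value and its name string, alongside the tensor mappings and a graph builder. The snippet below is only a minimal sketch of that registration, assuming the mainline naming (LLM_ARCH_GLM4, "glm4"); it is not the actual patch.

```cpp
#include <map>

// Minimal sketch of architecture registration, following mainline naming.
enum llm_arch {
    LLM_ARCH_LLAMA,   // stands in for the existing architectures
    LLM_ARCH_GLM4,    // new: GLM-4-0414 family
    LLM_ARCH_UNKNOWN,
};

static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_GLM4,  "glm4"  },  // must match general.architecture written by convert_hf_to_gguf.py
};
```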
* Remove seemingly unused llm_tensor enums
  Both enums appear unused, and the existing LLM_TENSOR_ATTN_POST_NORM
  already covers the same role; they do not appear to be referenced from the
  Python conversion code either. Removed as likely cruft (the tensor-name
  mapping that remains is sketched below):
  * LLM_TENSOR_POST_ATTN_NORM
  * LLM_TENSOR_POST_MLP_NORM
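For reference, these per-layer tensor enums only matter through the enum-to-name mapping used when loading GGUF tensors. The fragment below is a hypothetical sketch of that mapping with the surviving LLM_TENSOR_ATTN_POST_NORM entry; the name strings follow mainline conventions and may differ in this fork.

```cpp
#include <map>

// Hypothetical sketch: per-layer tensor enums resolve to GGUF tensor name
// patterns. LLM_TENSOR_ATTN_POST_NORM already provides the post-attention
// norm slot, so the duplicate-looking LLM_TENSOR_POST_ATTN_NORM /
// LLM_TENSOR_POST_MLP_NORM entries were dropped as cruft.
enum llm_tensor {
    LLM_TENSOR_ATTN_NORM,
    LLM_TENSOR_ATTN_POST_NORM,   // kept: post-attention norm
    LLM_TENSOR_FFN_NORM,
    LLM_TENSOR_FFN_POST_NORM,    // kept: post-FFN norm
};

static const std::map<llm_tensor, const char *> TENSOR_NAMES = {
    { LLM_TENSOR_ATTN_NORM,      "blk.%d.attn_norm"           },
    { LLM_TENSOR_ATTN_POST_NORM, "blk.%d.post_attention_norm" },
    { LLM_TENSOR_FFN_NORM,       "blk.%d.ffn_norm"            },
    { LLM_TENSOR_FFN_POST_NORM,  "blk.%d.post_ffw_norm"       },
};
```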
* Set flash attention precision to f32 on the GLM4 arch
* Set non-flash-attention precision to f32 on GLM4 (see the sketch below)
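The precision fix follows the pattern llama.cpp already uses for architectures whose attention overflows in f16: force f32 accumulation on the KQ matmul (non-flash-attention path) and on the fused flash-attention op. Below is a hedged sketch using ggml's ggml_mul_mat_set_prec and ggml_flash_attn_ext_set_prec; the helper function itself is made up for illustration.

```cpp
#include "ggml.h"

// Illustrative helper (not the actual patch): force f32 precision for GLM4
// attention on both attention paths.
static void glm4_force_f32_attn_prec(struct ggml_tensor * kq, struct ggml_tensor * fa_out) {
    if (kq != NULL) {
        // non-flash-attention path: KQ = K^T * Q accumulates in f32
        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
    }
    if (fa_out != NULL) {
        // flash-attention path: the fused ggml_flash_attn_ext node runs in f32
        ggml_flash_attn_ext_set_prec(fa_out, GGML_PREC_F32);
    }
}
```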
* Remove reshape_3d() for Vcur in build_glm4()
  This fixes non-flash-attention inference on both CPU and CUDA (see the
  sketch below).
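The fragment below sketches what the Vcur change looks like in a GLM4-style graph builder; variable names such as ctx0, model.layers[il].wv, and n_embd_gqa follow llama.cpp conventions, but the surrounding code is illustrative rather than the exact diff.

```cpp
// V projection: Vcur comes out as a 2D tensor of shape [n_embd_gqa, n_tokens]
struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur);

// Previously Vcur was reshaped to 3D here:
//   Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head_v, n_head_kv, n_tokens);
// Dropping that reshape keeps the 2D layout the non-flash-attention
// KV-cache store expects, which fixes inference on CPU and CUDA.
```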
Diffstat (limited to 'convert_hf_to_gguf.py')
0 files changed, 0 insertions, 0 deletions