Age | Commit message | Author
2024-12-03 | Q8_0_R4 (#120) | Kawrakow
2024-12-02 | Q4_0_R4 (#119) | Kawrakow
2024-12-02 | IQ4_NL_X4 (#118) | Kawrakow
2024-11-21 | Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K/Q5_K (#116) | Nexes the Elder
2024-11-21 | MMQ for Q6_0 (#115) | Kawrakow
2024-10-31 | Faster MoE inference (#112) | Kawrakow
2024-10-26 | Use fused mul - unary op also for MoE models (#111) | Kawrakow
2024-10-26 | Bitnet: use the fused mul-silu in the FFN network (#110) | Kawrakow
2024-10-26 | Bitnet CUDA improvements (#109) | Kawrakow
2024-10-26 | Improve Bitnet PP on Metal (#108) | Kawrakow
2024-10-26 | Faster IQ1_BN Metal implementation (#107) | Kawrakow
2024-10-25 | Remove forgotten IQ1_TN, IQ2_TN enum values | Iwan Kawrakow
2024-10-25 | Bitnet changes (#106) | Kawrakow
2024-10-24 | Fix quantized k-cache without FA (#105) | Kawrakow
2024-10-22 | Add support for Granite and GraniteMoE models (#102) | Kawrakow
2024-10-22 | Enable q6_0 for flash attention (#101) | Kawrakow
2024-10-21 | Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99) | Kawrakow
2024-10-20 | Avoid rebuild of GGML graph for each token (#98) | agray3
2024-10-19 | Bitnet: make the scale tensors optional (#97) | Kawrakow
2024-10-19 | Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96) | Nexes the Elder
2024-10-19 | Attempt to blindly fix Windows build failure (#93) | Kawrakow
2024-10-18 | CLI - Specify GGML_TYPE to quantize for the main tensors. (#91) | Nexes the Elder
2024-10-16 | Adding IQ4_KSS: 4.0 bpw quants (#89) | Kawrakow
2024-10-16 | iq4_ks: faster dot product on Metal (#90) | Kawrakow
2024-10-14 | Minor iq3_k tweak | Iwan Kawrakow
2024-10-14 | iq3_k: fix and optimize Metal dot product (#87) | Kawrakow
2024-10-13 | Fix and optimize iq2k Metal implementation (#86) | Kawrakow
2024-10-13 | IQ2_KS: 2.1875 bpw non-linear quantization (#85) | Kawrakow
2024-10-11 | Minor: printf -> LLAMA_LOG_INFO | Iwan Kawrakow
2024-10-10 | Better model info (#84) | Kawrakow
2024-10-09 | New SOTA quantization: 4.25 bpw IQ4_KS (#83) | Kawrakow
2024-10-04 | Fix compiler warnings | Iwan Kawrakow
2024-10-04 | Move scale fudge factors to quantization (#81) | Kawrakow
2024-10-04 | Move to c++17 projectwide (#80) | Kawrakow
2024-10-04 | Do not quantize activations if not necessary (#79) | Kawrakow
2024-10-02 | q6_0: Slightly faster Zen4/AVX2 (#78) | Kawrakow
2024-10-02 | Fused unary(x)*y (#70) | Kawrakow
2024-10-02 | Adding Q6_0 (#77) | Kawrakow
2024-10-02 | iq4_nl: faster quantization (#76) | Kawrakow
2024-10-01 | Fix Q5_0 flash attention (#75) | Kawrakow
2024-10-01 | Fix last commit | Iwan Kawrakow
2024-10-01 | IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON) (#74) | Kawrakow
2024-10-01 | CUDA: faster float -> iq4_nl conversion (#73) | Kawrakow
2024-10-01 | iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72) | Kawrakow
2024-10-01 | iqk_mul_mat: better strategy when nrc_y not divisible by ny (#71) | Kawrakow
2024-09-29 | Allow bf16 kv-cache (#69) | Kawrakow
2024-09-28 | Time to fix replace_all (#68) | Kawrakow
2024-09-28 | CUDA non-contiguous RoPE (#66) | Kawrakow
2024-09-28 | Adding SWIGLU unary op (#65) | Kawrakow
2024-09-28 | Better sub-3-bit quantization mixes with a qkv tensor (#64) | Kawrakow
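
Several entries above (#65, #70, #110) concern fusing the FFN's unary activation with the multiply that follows it, so silu(gate) * up is produced in one pass rather than as separate SILU and MUL ops. A minimal scalar sketch of what such a fused SWIGLU op computes is below; the function name and signature are illustrative, not the repository's actual API.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative scalar reference (not the repo's kernel):
// out[i] = silu(gate[i]) * up[i], computed in a single pass so the
// intermediate silu(gate) tensor never has to be written to memory
// between two separate ops.
static void fused_swiglu_ref(const float *gate, const float *up,
                             float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float g    = gate[i];
        const float silu = g / (1.0f + std::exp(-g)); // silu(g) = g * sigmoid(g)
        out[i] = silu * up[i];
    }
}
```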
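
As a quick sanity check on the bits-per-weight figures advertised for the new quants (2.1875 bpw for IQ2_KS in #85, 4.25 bpw for IQ4_KS in #83): assuming the usual ggml super-block of 256 weights (an assumption; see the quant implementations for the actual layouts), the per-block storage works out as follows.

```cpp
#include <cstdio>

int main() {
    const int    block_size = 256;          // assumed ggml super-block size
    const double bpw[]  = {2.1875, 4.25};   // IQ2_KS (#85), IQ4_KS (#83)
    const char  *name[] = {"IQ2_KS", "IQ4_KS"};
    for (int i = 0; i < 2; ++i) {
        // bits per block divided by 8 gives bytes per block:
        // 70 bytes for IQ2_KS, 136 bytes for IQ4_KS
        printf("%s: %.4f bpw -> %.0f bytes per %d-weight block\n",
               name[i], bpw[i], bpw[i] * block_size / 8, block_size);
    }
    return 0;
}
```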