path: root/iqk-quantize.cpp
Age | Commit message | Author
2024-06-22 | bitnet: put the scale in a separate tensor | Iwan Kawrakow
and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Committing so I can go and test on other platforms.
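A minimal, hypothetical sketch of the pattern described above, assuming a ggml-style graph build: the quantized Bitnet weight is stored without its scale, the scale lives in its own tensor, and the graph applies it with an extra op after the matrix multiplication. The helper name is invented, and a broadcast ggml_mul is used here for the scale even though the commit mentions an extra ggml_mul_mat; the actual code in the repo may differ.

```cpp
#include "ggml.h"

// Hypothetical helper, not the actual ik_llama.cpp code.
static struct ggml_tensor * build_bitnet_matmul(
        struct ggml_context * ctx,
        struct ggml_tensor  * w_quant,   // quantized weights, stored without their scale
        struct ggml_tensor  * w_scale,   // the scale, kept in a separate (1-element f32) tensor
        struct ggml_tensor  * x) {       // activations
    struct ggml_tensor * y = ggml_mul_mat(ctx, w_quant, x); // unscaled result
    y = ggml_mul(ctx, y, w_scale);                          // extra graph op applying the separate scale
    return y;
}
```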
2024-06-22 | Bitnet(1.75 bpw): higher precision fp8 scale | Iwan Kawrakow
Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL the same as fp16 (though the previous version, with 4 bits each for the exponent and the mantissa, was already good enough for any practical purpose).
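As an illustration of such an 8-bit scale format, here is a self-contained sketch of an unsigned fp8 encoding with a 3-bit exponent and a 5-bit mantissa. The exponent bias, rounding, and zero handling are assumptions chosen for the example, not necessarily what iqk-quantize.cpp does.

```cpp
#include <cmath>
#include <cstdint>

// Illustrative e3m5 format for a positive scale: value = (1 + frac/32) * 2^(exp - kBias).
// kBias = 4 is an assumed bias, not taken from the repo.
static constexpr int kBias = 4;

static uint8_t fp8_e3m5_encode(float x) {
    if (!(x > 0.0f)) return 0;                   // simplified: non-positive collapses to the smallest code
    int e; float f = std::frexp(x, &e);          // x = f * 2^e with f in [0.5, 1)
    int exp  = (e - 1) + kBias;                  // exponent of the normalized form (2f) * 2^(e-1)
    int frac = (int)std::lround((2.0f*f - 1.0f) * 32.0f);
    if (frac == 32) { frac = 0; ++exp; }         // rounding carried into the exponent
    if (exp < 0) return 0;                       // underflow
    if (exp > 7) { exp = 7; frac = 31; }         // overflow -> largest representable value
    return (uint8_t)((exp << 5) | frac);
}

static float fp8_e3m5_decode(uint8_t q) {
    return (1.0f + (q & 31)/32.0f) * std::ldexp(1.0f, (q >> 5) - kBias);
}
```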
2024-06-22 | Bitnet: 2.25 bpw version | Iwan Kawrakow
Just scalar and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on the Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.
2024-06-22 | bitnet: add 2 bpw quantization | Iwan Kawrakow
The scalar dot product already achieves 37 t/s for TG!
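For intuition, a hypothetical scalar sketch of a 2-bpw (ternary) dot product against int8 activations: four 2-bit codes per byte, with code values {0,1,2} mapping to weights {-1,0,+1}. The packing layout and the code-to-weight mapping are assumptions for illustration; the actual kernel in iqk-quantize.cpp (and its block scales) will differ.

```cpp
#include <cstdint>

// Hypothetical scalar kernel, not the repo's implementation. n must be a multiple of 4.
static int32_t dot_ternary_q8(const uint8_t * packed, const int8_t * q8, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i += 4) {
        const uint8_t byte = packed[i/4];
        for (int j = 0; j < 4; ++j) {
            const int w = ((byte >> (2*j)) & 3) - 1;  // 2-bit code {0,1,2} -> weight {-1,0,+1}
            sum += w * q8[i + j];
        }
    }
    return sum;
}
```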
2024-06-22 | Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice | Iwan Kawrakow
2024-06-22 | bitnet: fix scalar dot product | Iwan Kawrakow
I had forgotten to adjust for the change to q8_K64. On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22 | bitnet: python + llama | Iwan Kawrakow