ik_llama.cpp.git

author	Shouzheng Liu <lshzh.hi@gmail.com>	2023-08-16 16:07:04 -0400
committer	GitHub <noreply@github.com>	2023-08-16 23:07:04 +0300
commit	bf83bff6742c0f1795b4c18695a13a34ac7adf62
tree	1f1d4e77bf04c459686961540d3e359e8aceb519 /examples
parent	b5ffb2849d23afe73647f68eec7b68187af09be6

metal : matrix-matrix multiplication kernel (#2615)

* metal: matrix-matrix multiplication kernel

  This commit removes MPS and uses custom matrix-matrix multiplication kernels for all quantization types. This commit also adds grouped-query attention to support llama2 70B.

* metal: fix performance degradation from gqa

  Integers are slow on the GPU, and 64-bit divides are extremely slow. In the context of GQA, we introduce a 64-bit divide that cannot be optimized out by the compiler, which results in a decrease of ~8% in inference performance. This commit fixes that issue by calculating a part of the offset with a 32-bit divide. Naturally, this limits the size of a single matrix to ~4GB. However, this limitation should suffice for the near future.

* metal: fix bugs for GQA and perplexity test.

  I mixed up ne02 and nb02 in previous commit.
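
The 32-bit divide described in the commit message can be illustrated with a minimal host-side C++ sketch. It is not the commit's Metal code: the function and parameter names (`head_offset_64`, `head_offset_32`, `q_head`, `head_stride`) are hypothetical, and it assumes that the offset itself is held in 32 bits, which is one way the stated ~4GB limit could arise. Under grouped-query attention, several query heads share one KV head, so the KV head index is the query head index divided by the group size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this is an illustration, not the kernel source.

// Reference version: every operand is 64-bit, so the divide is a 64-bit divide.
constexpr uint64_t head_offset_64(uint64_t q_head, uint64_t n_q_heads,
                                  uint64_t n_kv_heads, uint64_t head_stride) {
    const uint64_t gqa = n_q_heads / n_kv_heads;  // query heads per KV head
    return (q_head / gqa) * head_stride;          // offset of the shared KV head
}

// Assumed cheaper version: the divide and the resulting offset are 32-bit.
// The result wraps modulo 2^32, so it is only valid while the true offset
// stays below 2^32.
constexpr uint32_t head_offset_32(uint32_t q_head, uint32_t n_q_heads,
                                  uint32_t n_kv_heads, uint32_t head_stride) {
    const uint32_t gqa = n_q_heads / n_kv_heads;
    return (q_head / gqa) * head_stride;
}

// Example with 64 query heads and 8 KV heads (the Llama 2 70B head counts).
inline void demo_gqa_offsets() {
    // Query head 17 falls in group 17 / 8 = 2, so the offset is 2 * stride.
    assert(head_offset_64(17, 64, 8, 1000) == 2000);
    assert(head_offset_32(17, 64, 8, 1000) == 2000);
    // With a stride of 2^31 the true offset is 2^32, which no longer fits.
    assert(head_offset_64(17, 64, 8, 0x80000000u) == 0x100000000ull);
    assert(head_offset_32(17, 64, 8, 0x80000000u) == 0u);
}
```

The two versions agree whenever the offset fits in 32 bits. The last pair of assertions shows the wraparound that a 32-bit offset would impose on very large matrices.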

Diffstat (limited to 'examples')

0 files changed, 0 insertions, 0 deletions

