# AWQ: Activation-aware Weight Quantization for LLMs (llama.cpp port)
[[Paper](https://arxiv.org/abs/2306.00978)][[Original Repo](https://github.com/mit-han-lab/llm-awq)][[Easy-to-use Repo](https://github.com/casper-hansen/AutoAWQ)]

**Supported models:**

- [x] LLaMA
- [x] LLaMA 2
- [x] MPT
- [x] Mistral AI v0.1
- [ ] Bloom
- [ ] Mixtral MoE

**TODO:**
- [x] Update the conversion scripts to work with both MPT and MPT-AWQ models
- [ ] Add OPT model
- [ ] Add Bloom model
- [ ] Add Mixtral MoE
- [ ] Support w3, w2


## Contents

- [Install](#install)
- [Convert](#convert)
- [Quantize](#quantize)
- [Test](#test)
- [Benchmark](#benchmark)
- [Results](#results)

## Install
Install the requirements:
```bash
pip install -r requirements.txt
```
Get the pre-computed AWQ search results for multiple model families, including LLaMA, LLaMA 2, MPT, and OPT:
```bash
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
```

## Convert
Examples for converting the supported models to FP16 GGUF files, using the pre-computed AWQ search results:
```bash
# For LLaMA 7B and LLaMA 2 models
python convert.py models/llama-7b/ --awq-path awq_cache/llama-7b-w4-g128.pt --outfile models/llama_7b_fp16.gguf
# For Mistral and MPT models
python convert-hf-to-gguf.py models/mpt-7b/ --awq-path awq_cache/mpt-7b-w4-g128.pt --outfile models/mpt_7b_fp16.gguf
```
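For Mistral, the same HF converter applies; a minimal sketch, assuming a matching scale file exists in `awq_cache` (the cache file name below is a placeholder, check the cache for the actual one):
```bash
# Placeholder AWQ cache file name -- check awq_cache/ for the actual file.
python convert-hf-to-gguf.py models/mistral-7b-v0.1/ --awq-path awq_cache/mistral-7b-w4-g128.pt --outfile models/mistral_7b_fp16.gguf
```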

## Quantize
```bash
# Only the q4_0, q4_1, and q2_k quantization types have been benchmarked and confirmed.
./quantize models/llama_7b_fp16.gguf models/llama_7b_q4_0.gguf q4_0
```
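The other two confirmed types are produced the same way; only the type argument (and the output file name) changes, for example:
```bash
# Same command, different quantization type; output names are just examples.
./quantize models/llama_7b_fp16.gguf models/llama_7b_q4_1.gguf q4_1
./quantize models/llama_7b_fp16.gguf models/llama_7b_q2_k.gguf q2_k
```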

## Test
```bash
# For all models.
./build/bin/main -m models/llama_7b_q4_0.gguf -n 128 --prompt "Once upon a time"
```
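If llama.cpp was built with cuBLAS, layers can additionally be offloaded to the GPU with `-ngl`; a sketch, with the layer count chosen as an example:
```bash
# Offload up to 33 layers to the GPU; adjust for your model size and VRAM.
./build/bin/main -m models/llama_7b_q4_0.gguf -n 128 -ngl 33 --prompt "Once upon a time"
```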

## Benchmark
The perplexity measurements in the tables below are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with a context length of 512.
```bash
# For LLaMA, LLaMA 2, and Mistral models.
./perplexity -m models/llama_7b_q4_0.gguf -f datasets/wikitext-2-raw/wiki.test.raw
```
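If the dataset is not present locally, one way to fetch `wikitext-2-raw` is the historical download archive (a sketch; the URL may have moved, so substitute a current mirror if it is stale):
```bash
# Download and unpack wikitext-2-raw; the URL below is the historical S3 mirror and may have moved.
mkdir -p datasets
wget -P datasets https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
unzip datasets/wikitext-2-raw-v1.zip -d datasets
```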

## Results
Results were obtained with OpenBLAS (CPU) and cuBLAS (GPU) builds for a fair comparison.
We evaluate three llama.cpp quantization types with our version: q4_0, q4_1, and q2_k.
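For reference, the bits/weight figures follow from the llama.cpp block formats: q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit quants plus an fp16 scale) and q4_1 into 20 bytes (an fp16 minimum is added), while the values in the tables are whole-model averages. A quick back-of-the-envelope check:
```bash
# Per-block bits/weight for the two 4-bit formats; the table values are
# whole-model averages, so they can differ slightly from the per-block figure.
echo "scale=2; 18 * 8 / 32" | bc   # q4_0 -> 4.50 bits/weight
echo "scale=2; 20 * 8 / 32" | bc   # q4_1 -> 5.00 bits/weight
```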

### Llama 7B (built with OpenBLAS)

| Model       | Measure      | F16    | Q4_0   | Q4_1   | Q2_K   |
|------------:|--------------|-------:|-------:|-------:|-------:|
|Llama 7B     | perplexity   | 5.9066 | 6.1214 | 6.0643 | 6.5808 |
|Llama 7B     | file size    |  12.9G |   3.5G |   3.9G |   2.7G |
|Llama 7B     | bits/weight  |   16.0 |    4.5 |    5.0 |    2.6 |
|AWQ-Llama 7B | perplexity   | 5.9175 | 6.0252 | 5.9987 | 6.3692 |
|AWQ-Llama 7B | file size    |  12.9G |   3.5G |   3.9G |   2.7G |
|AWQ-Llama 7B | bits/weight  |   16.0 |    4.5 |    5.0 |    2.6 |


### Llama2 7B (built with cuBLAS)

| Model        | Measure      | F16    | Q4_0   | Q4_1   | Q2_K   |
|-------------:|--------------|-------:|-------:|-------:|-------:|
|Llama2 7B     | perplexity   | 5.8664 | 6.0260 | 6.0656 | 6.4496 |
|Llama2 7B     | file size    |  12.9G |   3.5G |   3.9G |   2.7G |
|Llama2 7B     | bits/weight  |   16.0 |    4.5 |    5.0 |    2.6 |
|AWQ-Llama2 7B | perplexity   | 5.8801 | 6.0054 | 5.9849 | 6.3650 |
|AWQ-Llama2 7B | file size    |  12.9G |   3.5G |   3.9G |   2.7G |
|AWQ-Llama2 7B | bits/weight  |   16.0 |    4.5 |    5.0 |    2.6 |


### Mistral 7B v0.1 (built with cuBLAS)

| Model         | Measure      | F16    | Q4_0   | Q4_1   | Q2_K   |
|--------------:|--------------|-------:|-------:|-------:|-------:|
|Mistral 7B     | perplexity   | 5.6931 | 5.8202 | 5.8268 | 6.1645 |
|Mistral 7B     | file size    |  14.5G |   4.1G |   4.5G |   3.1G |
|Mistral 7B     | bits/weight  |   16.0 |    4.5 |    5.0 |    2.6 |
|AWQ-Mistral 7B | perplexity   | 5.6934 | 5.8020 | 5.7691 | 6.0426 |
|AWQ-Mistral 7B | file size    |  14.5G |   4.1G |   4.5G |   3.1G |
|AWQ-Mistral 7B | bits/weight  |   16.0 |    4.5 |    5.0 |    2.6 |

### MPT 7B (built with OpenBLAS)

| Model      | Measure      | F16    | Q4_0   | Q4_1   | Q2_K    |
|-----------:|--------------|-------:|-------:|-------:|--------:|
|MPT 7B      | perplexity   | 8.4369 | 8.7956 | 8.6265 | 11.4913 |
|MPT 7B      | file size    |  13.7G |   3.9G |   4.3G |    2.8G |
|MPT 7B      | bits/weight  |   16.0 |    4.5 |    5.0 |     2.6 |
|AWQ-MPT 7B  | perplexity   | 8.4944 | 8.7053 | 8.6750 | 10.2873 |
|AWQ-MPT 7B  | file size    |  13.7G |   3.9G |   4.3G |    2.8G |
|AWQ-MPT 7B  | bits/weight  |   16.0 |    4.5 |    5.0 |     2.6 |