|
Current status: Working, except for the latest GPTQ-for-LLaMa format
that includes `g_idx`. This turns out to require changes to GGML, so
for now it only works if you use the `--outtype` option to dequantize it
back to f16 (which is pointless except for debugging).
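For context, dequantizing this format back to f16 boils down to something like the following (a hedged sketch: the tensor names `qweight`, `qzeros`, `scales`, and `g_idx` follow GPTQ-for-LLaMa checkpoints, but the exact packing order and zero-point conventions vary between revisions, so this is the shape of the math rather than the script's actual code):
```python
import numpy as np

def dequantize_gptq(qweight, qzeros, scales, g_idx):
    """Expand packed 4-bit GPTQ tensors back to float16.

    Assumed layout (varies between GPTQ-for-LLaMa revisions):
      qweight: int32 (in_features // 8, out_features), 8 nibbles packed per int32
      qzeros:  int32 (n_groups, out_features // 8),    8 nibbles packed per int32
      scales:  float (n_groups, out_features)
      g_idx:   int32 (in_features,) mapping each input row to its group
    """
    shifts = np.arange(0, 32, 4, dtype=np.uint32)

    # Unpack the weights along the row axis.
    w = (qweight.astype(np.uint32)[:, None, :] >> shifts[None, :, None]) & 0xF
    w = w.reshape(-1, qweight.shape[1]).astype(np.float32)       # (in, out)

    # Unpack the zero points along the column axis.
    z = (qzeros.astype(np.uint32)[:, :, None] >> shifts[None, None, :]) & 0xF
    z = z.reshape(qzeros.shape[0], -1).astype(np.float32)        # (n_groups, out)

    # Each input row i uses the scale/zero of its group g_idx[i].
    return ((w - z[g_idx]) * scales[g_idx]).astype(np.float16)
```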
I also included some cleanup for the C++ code.
This script is meant to replace all the existing conversion scripts
(including the ones that convert from older GGML formats), while also
adding support for some new formats. Specifically, I've tested with:
- [x] `LLaMA` (original)
- [x] `llama-65b-4bit`
- [x] `alpaca-native`
- [x] `alpaca-native-4bit`
- [x] LLaMA converted to 'transformers' format using
`convert_llama_weights_to_hf.py`
- [x] `alpaca-native` quantized with `--true-sequential --act-order
--groupsize 128` (dequantized only)
- [x] same as above plus `--save_safetensors`
- [x] GPT4All
- [x] stock unversioned ggml
- [x] ggmh
There's enough overlap in the logic needed to handle these different
cases that it seemed best to move to a single script.
I haven't tried this with Alpaca-LoRA because I don't know where to find
it.
Useful features:
- Uses multiple threads for a speedup in some cases (though the Python
GIL limits the gain, and sometimes it's disk-bound anyway).
- Combines split models into a single file (both the intra-tensor split
of the original and the inter-tensor split of 'transformers' format
files). Single files are more convenient to work with and more
friendly to future changes to use memory mapping on the C++ side. To
accomplish this without increasing memory requirements, it has some
custom loading code which avoids loading whole input files into memory
at once (see the sketch after this list).
- Because of the custom loading code, it no longer depends on PyTorch,
which might make installing dependencies slightly easier or faster...
although it still depends on NumPy and sentencepiece, so I don't know
if there's any meaningful difference. In any case, I also added a
requirements.txt file to lock the dependency versions in case of any
future breaking changes.
- Type annotations checked with mypy.
- Some attempts to be extra user-friendly:
- The script tries to be forgiving with arguments, e.g. you can
specify either the model file itself or the directory containing
it.
- The script doesn't depend on config.json / params.json, just in
case the user downloaded files individually and doesn't have those
handy. But you still need tokenizer.model and, for Alpaca,
added_tokens.json.
- The script tries to give a helpful error message if
added_tokens.json is missing.
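To illustrate the lazy-loading idea from the split-model bullet above, here's a simplified sketch (the `LazyTensor` name, the `merge_shards` helper, and the on-disk layout are placeholders, not the script's real classes):
```python
import mmap
from dataclasses import dataclass

import numpy as np


@dataclass
class LazyTensor:
    """A tensor that stays on disk until load() is called."""
    path: str
    offset: int        # byte offset of the tensor's data within the file
    shape: tuple
    dtype: np.dtype

    def load(self) -> np.ndarray:
        # Map the file rather than read() it: only the pages we actually
        # touch get pulled into memory, and nothing is copied up front.
        with open(self.path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        count = int(np.prod(self.shape))
        return np.frombuffer(mm, dtype=self.dtype, count=count,
                             offset=self.offset).reshape(self.shape)


def merge_shards(shards: list, axis: int) -> np.ndarray:
    """Concatenate one logical tensor's pieces from several shard files."""
    return np.concatenate([t.load() for t in shards], axis=axis)
```
The writer can then process one tensor at a time, so peak memory stays near the size of the largest single tensor rather than the whole model.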
|
|
- use f-strings where possible
- drop first param of encode/decode functions since "utf-8" is the default
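For example (a trivial before/after illustration, not lines taken from the actual diff):
```python
path = "models/7B/ggml-model.bin"   # placeholder

# before
print("Loading %s" % path)
data = "tokenizer text".encode("utf-8")

# after
print(f"Loading {path}")
data = "tokenizer text".encode()    # "utf-8" is already the default
```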
|
|
This is a breaking change that's going to give you three benefits:
1. Your inference commands should load 100x faster
2. You may be able to safely load models 2x larger
3. You can run many concurrent inference processes
This was accomplished by changing the file format so we can mmap()
weights directly into memory without having to read() or copy them.
This ensures, first, that the kernel can make its file cache pages
directly accessible to our inference processes, and second, that those
pages are much less likely to get evicted (which would force loads to
hit disk), because they're no longer competing with memory pages that
were needlessly created by gigabytes of standard I/O.
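In Python terms, the difference looks roughly like this (a sketch of the idea only; the real change is in the C++ loader, and the path, offset, and shape below are made up):
```python
import mmap

import numpy as np

MODEL = "ggml-model-f16.bin"   # placeholder path

# Old approach: read() copies every byte out of the kernel's page cache
# into private process memory, so the model effectively exists twice.
with open(MODEL, "rb") as f:
    blob = f.read()

# New approach: mmap() exposes the page cache directly; nothing is copied,
# startup is nearly instant, and concurrent processes share the same pages.
with open(MODEL, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# A tensor is then just a zero-copy view at some offset (values made up):
tok_embeddings = np.frombuffer(weights, dtype=np.float16,
                               count=32000 * 4096, offset=256).reshape(32000, 4096)
```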
The new file format supports single-file models like LLaMA 7b, and
it also supports multi-file models like LLaMA 13B. Our Python tool
now merges the foo.1, foo.2, etc. files back into a single file so
that the C++ code which maps it doesn't need to reshape data every
time. That's made llama.cpp so much simpler. Much of its load code
has now been deleted.
Furthermore, this change ensures that tensors are aligned properly
on a 32-byte boundary. That opens the door to seeing if we can get
additional performance gains on some microprocessors, by using ops
that require memory alignment.
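The alignment itself is just padding when the file is written, along these lines (a sketch; the real logic lives in the conversion tool and the C++ loader):
```python
ALIGNMENT = 32

def write_tensor(out, header: bytes, data: bytes) -> None:
    """Write a tensor header, then pad so its data starts on a 32-byte boundary."""
    out.write(header)
    pad = -out.tell() % ALIGNMENT   # bytes needed to reach the next multiple of 32
    out.write(b"\x00" * pad)
    out.write(data)                 # aligned data: safe for aligned SIMD loads once mmap'd
```
Since mmap() returns page-aligned addresses, an offset that is a multiple of 32 in the file is also 32-byte aligned in memory.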
Lastly, note that both POSIX and Windows platforms are supported.
Fixes #91
|
|
* Fix GPTQ converter
* Fix comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* [WIP, broken] Importer for GPTQ quantized LLaMA models
Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
Current status: Something is busted. The output starts out decent, but
quickly degrades into gibberish. This doesn't happen with either the
original GPTQ-for-LLaMa using the same weights, or llama.cpp when using
weights quantized by its own quantizer. Is there a bug in the
conversion script that somehow only comes into play with a large context
size?
I did notice one potential issue. It's clearly not the main cause of
the gibberish, since it doesn't happen when using q4_1 weights quantized
by llama.cpp itself, but it seems concerning. When doing a matrix
multiplication of f16 * f32 => f32 or q4_1 * f32 => f32, at least when
the multiplication is not done with BLAS, the intermediate results are
stored in the smaller format rather than f32. This seems like an
unnecessary waste of precision, especially in the q4_1 case.
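To see why that costs precision, here's a small, generic illustration of accumulating a dot product through f16 versus f32 (nothing ggml-specific):
```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))   # high-precision reference
acc32 = np.float32(0.0)
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc32 = np.float32(acc32 + x * y)
    acc16 = np.float16(acc16 + x * y)   # partial sums squeezed through f16

print(ref, acc32, acc16)   # the f16 accumulator drifts much further from the reference
```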
I was originally hoping to validate the results by matching the Python
implementation's output exactly, but precision and non-associativity
issues make this very difficult, including when performing matrix
multiplications and, especially, computing norms.
Anyway, design details:
The models being imported store per-layer weights in essentially q4_1
format, although the addend and scale are shared across an entire row
rather than every group of 32 weights. This script duplicates the
addend and scale to match ggml's expectations, at the cost of wasting
some memory.
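Concretely, the duplication amounts to something like this (a sketch; the function name and layout are placeholders, and the actual q4_1 packing happens elsewhere):
```python
import numpy as np

QK = 32   # ggml's q4_1 quantizes values 32 at a time along a row

def expand_row_params(scale_per_row: np.ndarray, addend_per_row: np.ndarray,
                      row_len: int):
    """Turn one (scale, addend) pair per row into one pair per 32-value group."""
    n_groups = row_len // QK
    scales = np.repeat(scale_per_row[:, None], n_groups, axis=1)    # (rows, groups)
    addends = np.repeat(addend_per_row[:, None], n_groups, axis=1)  # (rows, groups)
    return scales, addends
```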
However, there are two differences, which I accommodated by changing the
output format (and adding corresponding support to main.cpp) rather than
having the script match the existing one:
- The tok_embeddings and output weights (i.e. the weights that aren't
per-layer) are f16 instead of q4_1. They could be converted to q4_1,
and the impact of the loss of precision would probably be low, but
this would rule out exactly matching the Python implementation's
output for validation.
- There is no sharding, since the input doesn't have it, and for a
CPU-only implementation it seems more useful to avoid having to deal
with multiple files.
The new format is differentiated from the existing q4_1 format by changing
the 'f16' header flag to a new value, 4. That said, I think a cleaner
approach would be to change main.cpp to support loading each tensor with
an arbitrary sharding configuration and type rather than hardcoding
specific combinations of types. So far I've wasted too much time
debugging to try implementing this...
* Add missing permutation. Now it works.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|