From 8f43e551038af2547b5c01d0e9edd641c0e4bd29 Mon Sep 17 00:00:00 2001
From: Kawrakow <48489457+ikawrakow@users.noreply.github.com>
Date: Mon, 12 Aug 2024 15:14:32 +0200
Subject: Merge mainline - Aug 12 2024 (#17)

* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
---
 docs/backend/SYCL.md | 145 +++++++++++++++++++++++++++++++++++++--------------
 docs/build.md        |  19 ++++++-
 2 files changed, 124 insertions(+), 40 deletions(-)

(limited to 'docs')

diff --git a/docs/backend/SYCL.md b/docs/backend/SYCL.md
index d36ac0a1..59a39fbb 100644
--- a/docs/backend/SYCL.md
+++ b/docs/backend/SYCL.md
@@ -80,7 +80,14 @@ The following release is verified with good quality:
 
 ### Intel GPU
 
-**Verified devices**
+The SYCL backend supports the following Intel GPU families:
+
+- Intel Data Center Max Series
+- Intel Flex Series, Arc Series
+- Intel Built-in Arc GPU
+- Intel iGPU in Core CPU (11th Generation Core CPU and newer, refer to [oneAPI supported GPU](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html#inpage-nav-1-1)).
+
+#### Verified devices
 
 | Intel GPU                     | Status  | Verified Model                        |
 |-------------------------------|---------|---------------------------------------|
@@ -88,7 +95,7 @@ The following release is verified with good quality:
 | Intel Data Center Flex Series | Support | Flex 170                              |
 | Intel Arc Series              | Support | Arc 770, 730M, Arc A750               |
 | Intel built-in Arc GPU        | Support | built-in Arc GPU in Meteor Lake       |
-| Intel iGPU                    | Support | iGPU in i5-1250P, i7-1260P, i7-1165G7 |
+| Intel iGPU                    | Support | iGPU in i7-13700K, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 
@@ -237,6 +244,13 @@ Similarly, user targeting Nvidia GPUs should expect at least one SYCL-CUDA devic
 ### II. Build llama.cpp
 
 #### Intel GPU
+
+```
+./examples/sycl/build.sh
+```
+
+or
+
 ```sh
 # Export relevant ENV variables
 source /opt/intel/oneapi/setvars.sh
@@ -276,23 +290,26 @@ cmake --build build --config Release -j -v
 
 ### III. Run the inference
 
-1. Retrieve and prepare model
+#### Retrieve and prepare model
 
 You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
 
-2. Enable oneAPI running environment
+##### Check device
+
+1. Enable oneAPI running environment
 
 ```sh
 source /opt/intel/oneapi/setvars.sh
 ```
 
-3. List devices information
+2. List devices information
 
 Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
 
 ```sh
 ./build/bin/llama-ls-sycl-device
 ```
+
 This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
 ```
 found 2 SYCL devices:
@@ -304,12 +321,37 @@ found 2 SYCL devices:
 | 1|[level_zero:gpu:1]|                    Intel(R) UHD Graphics 770|       1.3|         32|     512|     32|    53651849216|
 ```
 
+#### Choose level-zero devices
+
+|Chosen Device ID|Setting|
+|-|-|
+|0|`export ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action|
+|1|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
+|0 & 1|`export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
+
+#### Execute
+
+Choose one of the following methods to run.
+
+1. Script
+
+- Use device 0:
+
+```sh
+./examples/sycl/run_llama2.sh 0
+```
+- Use multiple devices:
+
+```sh
+./examples/sycl/run_llama2.sh
+```
 
-4. Launch inference
+2. Command line
+Launch inference
 
 There are two device selection modes:
 
-- Single device: Use one device target specified by the user.
+- Single device: Use one device assigned by the user. The default device ID is 0.
 - Multiple devices: Automatically choose the devices with the same backend.
 
 In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR.
@@ -326,11 +368,6 @@ Examples:
 ```sh
 ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
 ```
-or run by script:
-
-```sh
-./examples/sycl/run_llama2.sh 0
-```
 
 - Use multiple devices:
 
@@ -338,12 +375,6 @@ or run by script:
 ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
 ```
 
-Otherwise, you can run the script:
-
-```sh
-./examples/sycl/run_llama2.sh
-```
-
 *Notes:*
 
 - Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
@@ -390,7 +421,7 @@ c. Verify installation
 In the oneAPI command line, run the following to print the available SYCL devices:
 
 ```
-sycl-ls
+sycl-ls.exe
 ```
 
 There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:
@@ -411,6 +442,18 @@ b. The new Visual Studio will install Ninja as default. (If not, please install
 
 ### II. Build llama.cpp
 
+You can download the release package for Windows directly; it includes the binary files and the oneAPI DLL files they depend on.
+
+Alternatively, choose one of the following methods to build from source code.
+
+1. Script
+
+```sh
+.\examples\sycl\win-build-sycl.bat
+```
+
+2. CMake
+
 On the oneAPI command line window, step into the llama.cpp main directory and run the following:
 
 ```
@@ -425,12 +468,8 @@ cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPI
 cmake --build build --config Release -j
 ```
 
-Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
-```sh
-.\examples\sycl\win-build-sycl.bat
-```
-
 Or, use CMake presets to build:
+
 ```sh
 cmake --preset x64-windows-sycl-release
 cmake --build build-x64-windows-sycl-release -j --target llama-cli
@@ -442,7 +481,9 @@ cmake --preset x64-windows-sycl-debug
 cmake --build build-x64-windows-sycl-debug -j --target llama-cli
 ```
 
-Or, you can use Visual Studio to open llama.cpp folder as a CMake project. Choose the sycl CMake presets (`x64-windows-sycl-release` or `x64-windows-sycl-debug`) before you compile the project.
+3. Visual Studio
+
+You can use Visual Studio to open llama.cpp folder as a CMake project. Choose the sycl CMake presets (`x64-windows-sycl-release` or `x64-windows-sycl-debug`) before you compile the project.
 
 *Notes:*
 
@@ -450,23 +491,25 @@ Or, you can use Visual Studio to open llama.cpp folder as a CMake project. Choos
 
 ### III. Run the inference
 
-1. Retrieve and prepare model
+#### Retrieve and prepare model
 
-You can refer to the general [*Prepare and Quantize*](README#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
+You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or simply download the [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as an example.
 
-2. Enable oneAPI running environment
+##### Check device
+
+1. Enable oneAPI running environment
 
 On the oneAPI command line window, run the following and step into the llama.cpp directory:
 ```
 "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
 ```
 
-3. List devices information
+2. List devices information
 
 Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
 
 ```
-build\bin\ls-sycl-device.exe
+build\bin\llama-ls-sycl-device.exe
 ```
 
 This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
@@ -478,10 +521,28 @@ found 2 SYCL devices:
 | 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
 | 1|[level_zero:gpu:1]|                    Intel(R) UHD Graphics 770|       1.3|         32|     512|     32|    53651849216|
 
+```
+#### Choose level-zero devices
+
+|Chosen Device ID|Setting|
+|-|-|
+|0|`set ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action|
+|1|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
+|0 & 1|`set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
+
+#### Execute
+
+Choose one of the following methods to run.
+
+1. Script
+
+```
+examples\sycl\win-run-llama2.bat
 ```
 
+2. Command line
 
-4. Launch inference
+Launch inference
 
 There are two device selection modes:
 
@@ -508,11 +569,7 @@ build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website ca
 ```
 build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
 ```
-Otherwise, run the following wrapper script:
 
-```
-.\examples\sycl\win-run-llama2.bat
-```
 
 Note:
 
@@ -526,17 +583,18 @@ Or
 use 1 SYCL GPUs: [0] with Max compute units:512
 ```
 
+
 ## Environment Variable
 
 #### Build
 
 | Name               | Value                             | Function                                    |
 |--------------------|-----------------------------------|---------------------------------------------|
-| GGML_SYCL          | ON (mandatory)                    | Enable build with SYCL code path.           |
+| GGML_SYCL          | ON (mandatory)                    | Enable build with SYCL code path.<br>FP32 path - recommended for better performance than FP16 on quantized models|
 | GGML_SYCL_TARGET   | INTEL *(default)* \| NVIDIA       | Set the SYCL target device type.            |
 | GGML_SYCL_F16      | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path.      |
-| CMAKE_C_COMPILER   | icx                               | Set *icx* compiler for SYCL code path.      |
-| CMAKE_CXX_COMPILER | icpx *(Linux)*, icx *(Windows)*   | Set `icpx/icx` compiler for SYCL code path. |
+| CMAKE_C_COMPILER   | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path.      |
+| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)*   | Set `icpx/icx` compiler for SYCL code path. |
 
 #### Runtime
 
@@ -572,9 +630,18 @@ use 1 SYCL GPUs: [0] with Max compute units:512
   ```
   Otherwise, please double-check the GPU driver installation steps.
 
+- Can I report an Ollama issue on Intel GPU to the llama.cpp SYCL backend?
+
+  No. We can't support Ollama issues directly, because we aren't familiar with Ollama.
+
+  We suggest reproducing the problem on llama.cpp and reporting it to llama.cpp as a similar issue. We will support it.
+
+  The same applies to other projects that include the llama.cpp SYCL backend.
+
+
 ### **GitHub contribution**:
 Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.
 
 ## TODO
 
-- Support row layer split for multiple card runs.
+- NA
diff --git a/docs/build.md b/docs/build.md
index d9d12c46..8b16d1a3 100644
--- a/docs/build.md
+++ b/docs/build.md
@@ -178,7 +178,11 @@ For Jetson user, if you have Jetson Orin, you can try this: [Offical Support](ht
   cmake --build build --config Release
   ```
 
-The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
+The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used.
+
+The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory on Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. On Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.
+
+The following compilation options are also available to tweak performance:
 
 | Option                        | Legal values           | Default | Description                                                                                                                                                                                                                                                                             |
 |-------------------------------|------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -192,6 +196,19 @@ The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/c
 | GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer       | 128     | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial.                                                                         |
 | GGML_CUDA_FA_ALL_QUANTS       | Boolean                | false   | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer.                                                                                                  |
 
+### MUSA
+
+- Using `make`:
+  ```bash
+  make GGML_MUSA=1
+  ```
+- Using `CMake`:
+
+  ```bash
+  cmake -B build -DGGML_MUSA=ON
+  cmake --build build --config Release
+  ```
+
 ### hipBLAS
 
 This provides BLAS acceleration on HIP-supported AMD GPUs.
-- 
cgit v1.2.3