author     JidongZhang-THU <1119708529@qq.com>   2024-01-31 21:10:15 +0800
committer  GitHub <noreply@github.com>           2024-01-31 15:10:15 +0200
commit     15606309a05ccf7fadbaad5538cb7c32acb1e06b (patch)
tree       aae8b8e0977922438c1e514e961f7c8bea2dcb9a /examples
parent     b2b9f025e7821e78bd501d75d01838c26de07a57 (diff)
llava : add MobileVLM support (#5132)
* New Feature:
  1. Sum_Rows: fix CUDA kernel overflow; fix block shape error when nrows is too big
  2. Im2Col: support batch in CUDA; support f32 to f32 in both CPU and CUDA
  3. DepthWiseConv: supported via Im2Col && MulMat
  4. Pool_2d: support avg pooling in CUDA
  5. HardSigmoid: implemented in CUDA
  6. HardSwish: implemented in CUDA
* fix tabs instead of spaces
* code clean
* CUDA POOL2D
* add POOL2D test case in test-backend-ops.cpp
* code clean
* fix pool2d_kernel nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad nits
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
---------
Co-authored-by: slaren <slarengh@gmail.com>
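Of the new operators above, HardSigmoid and HardSwish are simple element-wise functions. A minimal sketch of what such CUDA kernels look like (not the actual ggml-cuda code; kernel names and launch configuration here are assumptions):

```cuda
// hardsigmoid(x) = clamp(x/6 + 1/2, 0, 1)
// hardswish(x)   = x * hardsigmoid(x)
__global__ void hardsigmoid_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) return;
    dst[i] = fminf(1.0f, fmaxf(0.0f, x[i] / 6.0f + 0.5f));
}

__global__ void hardswish_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= k) return;
    dst[i] = x[i] * fminf(1.0f, fmaxf(0.0f, x[i] / 6.0f + 0.5f));
}

// launch sketch: hardswish_f32<<<(k + 255) / 256, 256>>>(x, dst, k);
```

Per the notes above, the depthwise convolution is not a dedicated kernel: it is expressed as Im2Col followed by a matrix multiplication (MulMat).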
Diffstat (limited to 'examples')
-rw-r--r--  examples/llava/MobileVLM-README.md  58
1 file changed, 56 insertions(+), 2 deletions(-)
diff --git a/examples/llava/MobileVLM-README.md b/examples/llava/MobileVLM-README.md
index c6258eba..9eba791d 100644
--- a/examples/llava/MobileVLM-README.md
+++ b/examples/llava/MobileVLM-README.md
@@ -111,17 +111,71 @@ llama_print_timings: eval time = 1279.03 ms / 18 runs ( 71.06 m
llama_print_timings: total time = 34570.79 ms
```
+## Compile and run on Jetson Orin
+### Compile
+```sh
+make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_87 LLAMA_CUDA_F16=1 -j 32
+```
+
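+Optionally, you can first verify that the board's GPU reports compute capability 8.7 (`sm_87`). This check is not part of the original instructions; it is a minimal sketch assuming a single visible CUDA device:
+```cuda
+// check_sm.cu: build with `nvcc check_sm.cu -o check_sm`
+#include <cstdio>
+#include <cuda_runtime.h>
+
+int main() {
+    cudaDeviceProp prop;
+    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
+        fprintf(stderr, "no CUDA device found\n");
+        return 1;
+    }
+    printf("%s: sm_%d%d\n", prop.name, prop.major, prop.minor); // Orin: sm_87
+    return 0;
+}
+```
+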
+### Run on Orin
+#### Case 1
+**input**
+```sh
+./llava-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ --image /data/local/tmp/demo.jpeg \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:" \
+ --n-gpu-layers 999
+```
+**output**
+```sh
+
+encode_image_with_clip: image encoded in 296.62 ms by CLIP ( 2.06 ms per image patch)
+
+ Susan Wise Bauer
+
+llama_print_timings: load time = 1067.64 ms
+llama_print_timings: sample time = 1.53 ms / 6 runs ( 0.25 ms per token, 3934.43 tokens per second)
+llama_print_timings: prompt eval time = 306.84 ms / 246 tokens ( 1.25 ms per token, 801.72 tokens per second)
+llama_print_timings: eval time = 91.50 ms / 6 runs ( 15.25 ms per token, 65.58 tokens per second)
+llama_print_timings: total time = 1352.63 ms / 252 tokens
+```
+
+#### Case 2
+**input**
+```sh
+./llava-cli \
+ -m /data/local/tmp/ggml-model-q4_k.gguf \
+ --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \
+ --n-gpu-layers 999
+
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 302.15 ms by CLIP ( 2.10 ms per image patch)
+
+ The image features a cat lying in the grass.
+
+llama_print_timings: load time = 1057.07 ms
+llama_print_timings: sample time = 3.27 ms / 11 runs ( 0.30 ms per token, 3360.83 tokens per second)
+llama_print_timings: prompt eval time = 213.60 ms / 232 tokens ( 0.92 ms per token, 1086.14 tokens per second)
+llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 ms per token, 66.01 tokens per second)
+llama_print_timings: total time = 1365.47 ms / 243 tokens
+```
+
## Minor shortcomings
The `n_patch` of the output in `ldp` is 1/4 of that of the input. For a quick implementation, we uniformly modified the `clip_n_patches` function to return a quarter of the original patch count. When measuring time consumption, the reported per-patch time is therefore 4 times bigger than the real cost; for example, the 2.06 ms per image patch in case 1 above corresponds to roughly 0.52 ms per actual input patch.
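A small self-contained sketch of this accounting (the helper below is hypothetical; the real `clip_n_patches` in `clip.cpp` takes a `clip_ctx` and reads the sizes from the model, and the 336-pixel image with 14-pixel patches is an assumption about the CLIP encoder used by MobileVLM):

```cpp
#include <cstdio>

// counted patches = (image_size / patch_size)^2, quartered for the LDP
// projector as described above (2x2 downsampling keeps 1/4 of the patches)
static int n_patches(int image_size, int patch_size, bool has_ldp) {
    int n = (image_size / patch_size) * (image_size / patch_size); // 24*24 = 576
    if (has_ldp) {
        n /= 4; // the "quarter" modification -> 144
    }
    return n;
}

int main(void) {
    // 296.62 ms / 2.06 ms per patch in case 1 above is ~144 counted patches
    printf("%d patches\n", n_patches(336, 14, true));
    return 0;
}
```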
## TODO
-- [ ] Support non-CPU backend for the new operators, such as `depthwise`, `hardswish`, `hardsigmoid`
+- [x] Support non-CPU backend for the new operators, such as `depthwise`, `hardswish`, `hardsigmoid`
- [ ] Optimize LDP projector performance
- Optimize the structure definition to avoid unnecessary memory rearrangements and reduce the use of `ggml_permute_cpy`;
- Optimize operator implementations (ARM CPU/NVIDIA GPU), such as depthwise conv, hardswish, hardsigmoid, etc.
-- [ ] run MobileVLM on `Jetson Orin`
+- [x] run MobileVLM on `Jetson Orin`
- [ ] Support more model variants, such as `MobileVLM-3B`.