Age | Commit message (Collapse) | Author |
|
* Add RPC backend in device list to override tensors.
* rpc : prevent crashes on invalid input (#9040)
Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : print error message when failed to connect endpoint (#9042)
* Fix RPC error
* Add vulkan, sycl to rpc backend
* add thread in rpc cpu backend
* add cache folder and other improvement in rpc
* add header file
* support for models with non-512 aligned tensors
* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
# Conflicts:
# ggml/src/ggml-rpc.cpp
* fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
- `create_node` now checks for `nullptr` return values from
`deserialize_tensor` and its recursive calls, propagating
`nullptr` upwards on failure. Uses `find` instead of `at`
for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
and sets the response status to failure if deserialization
or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from
`create_node` and returns failure status correctly. The final
return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
`create_node` returns nullptr, correctly identifying failures
versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
nullptr unambiguously on failure during recursion.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
# examples/rpc/rpc-server.cpp
* rpc : avoid uninitialized memory in serialize_tensor (#13210)
Zero out the name and padding buffers.
* fix merge error
* Add hello command in RPC
* bug fix
* add rpc header
* fix bug for missing rpc names
* add tpc no delay for rpc
* add back webui
* fix rpc function not found error
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
|
|
This reverts commit 8a5f8573aefc23282200041abbfa12886083334a.
|
|
* Add RPC backend in device list to override tensors.
* rpc : prevent crashes on invalid input (#9040)
Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : print error message when failed to connect endpoint (#9042)
* Fix RPC error
* Add vulkan, sycl to rpc backend
* add thread in rpc cpu backend
* add cache folder and other improvement in rpc
* add header file
* support for models with non-512 aligned tensors
* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
# Conflicts:
# ggml/src/ggml-rpc.cpp
* fix(rpc): Improve input validation and error handling (#13069)
* fix(rpc): Improve input validation and error handling
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
- `create_node` now checks for `nullptr` return values from
`deserialize_tensor` and its recursive calls, propagating
`nullptr` upwards on failure. Uses `find` instead of `at`
for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
and sets the response status to failure if deserialization
or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from
`create_node` and returns failure status correctly. The final
return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
`create_node` returns nullptr, correctly identifying failures
versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
nullptr unambiguously on failure during recursion.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
# ggml/src/ggml-rpc.cpp
* rpc : fix cache directory initialization (#13188)
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
# examples/rpc/rpc-server.cpp
* rpc : avoid uninitialized memory in serialize_tensor (#13210)
Zero out the name and padding buffers.
* fix merge error
* Add hello command in RPC
* bug fix
* add rpc header
* fix bug for missing rpc names
* add tpc no delay for rpc
* add back webui
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
|
|
* update webui
* add token/s in webui
* add webui files
* fix webui first message disappear in some browser
* add missing html files
---------
Co-authored-by: firecoperana <firecoperana>
|
|
* Merge mainline
* Fix after merge
* Remove CI check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
|
|
* server : Smart selection of available slot using Longest Common Substring
* add usage
* remove trailing whitespaces
* Use Longest Common Prefix (LCP) instead of LCS
* Rename argument
|
|
* common : gpt_params_parse do not print usage
* common : rework usage print (wip)
* common : valign
* common : rework print_usage
* infill : remove cfg support
* common : reorder args
* server : deduplicate parameters
ggml-ci
* common : add missing header
ggml-ci
* common : remote --random-prompt usages
ggml-ci
* examples : migrate to gpt_params
ggml-ci
* batched-bench : migrate to gpt_params
* retrieval : migrate to gpt_params
* common : change defaults for escape and n_ctx
* common : remove chatml and instruct params
ggml-ci
* common : passkey use gpt_params
|
|
* change default temperature of OAI compat API from 0 to 1
* make tests explicitly send temperature to OAI API
|
|
|
|
|
|
* Support Llama 3 conversion
The tokenizer is BPE.
* style
* Accept suggestion
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
* llama : add llama_token_is_eog()
ggml-ci
* llama : auto-detect more EOT tokens when missing in KV data
* convert : replacing EOS token is a hack
* llama : fix codegemma EOT token + add TODOs
* llama : fix model type string for 8B model
---------
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
usage in stream OAI response (#6495)
* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode
* ci: bench: README.md EOL
* ci: bench: remove total pp and tg as it is not accurate
* ci: bench: fix case when there is no token generated
* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics
* ci: bench: fix finish reason rate
|
|
Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>
|
|
* server: clean up oai parsing function
* fix response_format
* fix empty response_format
* minor fixes
* add TODO for logprobs
* update docs
|
|
|
|
* json: ordered json in server/schema converter to respect orig order
* json: ws nits
* json: support non-string const / enums
|
|
* json: fix arrays (disallow `[,1]`)
* json: support tuple types (`[number, string]`)
* json: support additionalProperties (`{[k: string]: [string,number][]}`)
* json: support required / optional properties
* json: add support for pattern
* json: resolve $ref (and support https schema urls)
* json: fix $ref resolution
* join: support union types (mostly for nullable types I think)
* json: support allOf + nested anyOf
* json: support any (`{}` or `{type: object}`)
* json: fix merge
* json: temp fix for escapes
* json: spaces in output and unrestricted output spaces
* json: add typings
* json:fix typo
* Create ts-type-to-grammar.sh
* json: fix _format_literal (json.dumps already escapes quotes)
* json: merge lit sequences and handle negatives
{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}
* json: handle pattern repetitions
* Update json-schema-to-grammar.mjs
* Create regex-to-grammar.py
* json: extract repeated regexp patterns to subrule
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* json: handle schema from pydantic Optional fields
* Update json-schema-to-grammar.py
* Update json-schema-to-grammar.py
* Update ts-type-to-grammar.sh
* Update ts-type-to-grammar.sh
* json: simplify nullable fields handling
* json: accept duplicate identical rules
* json: revert space to 1 at most
* json: reuse regexp pattern subrules
* json: handle uuid string format
* json: fix literal escapes
* json: add --allow-fetch
* json: simplify range escapes
* json: support negative ranges in patterns
* Delete commit.txt
* json: custom regex parser, adds dot support & JS-portable
* json: rm trailing spaces
* Update json-schema-to-grammar.mjs
* json: updated server & chat `( cd examples/server && ./deps.sh )`
* json: port fixes from mjs to python
* Update ts-type-to-grammar.sh
* json: support prefixItems alongside array items
* json: add date format + fix uuid
* json: add date, time, date-time formats
* json: preserve order of props from TS defs
* json: port schema converter to C++, wire in ./server
* json: nits
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* Update json-schema-to-grammar.cpp
* json: fix mjs implementation + align outputs
* Update json-schema-to-grammar.mjs.hpp
* json: test C++, JS & Python versions
* json: nits + regen deps
* json: cleanup test
* json: revert from c++17 to 11
* json: nit fixes
* json: dirty include for test
* json: fix zig build
* json: pass static command to std::system in tests (fixed temp files)
* json: fix top-level $refs
* json: don't use c++20 designated initializers
* nit
* json: basic support for reserved names `{number:{number:{root:number}}}`
* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)
* json: re-ran server deps.sh
* json: simplify test
* json: support mix of additional props & required/optional
* json: add tests for some expected failures
* json: fix type=const in c++, add failure expectations for non-str const&enum
* json: test (& simplify output of) empty schema
* json: check parsing in test + fix value & string refs
* json: add server tests for OAI JSON response_format
* json: test/fix top-level anyOf
* json: improve grammar parsing failures
* json: test/fix additional props corner cases
* json: fix string patterns (was missing quotes)
* json: ws nit
* json: fix json handling in server when there's no response_format
* json: catch schema conversion errors in server
* json: don't complain about unknown format type in server if unset
* json: cleaner build of test
* json: create examples/json-schema-pydantic-example.py
* json: fix date pattern
* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common
* json: indent 4 spaces
* json: fix naming of top-level c++ function (+ drop unused one)
* json: avoid using namespace std
* json: fix zig build
* Update server.feature
* json: iostream -> fprintf
* json: space before & refs for consistency
* json: nits
|
|
|
|
* use multitask for embd endpoint
* specify types
* remove redundant {"n_predict", 0}
|
|
* server: format error to json
* server: do not crash on grammar error
* fix api key test case
* revert limit max n_predict
* small fix
* correct coding style
* update completion.js
* launch_slot_with_task
* update docs
* update_slots
* update webui
* update readme
|
|
* server: maintain chat completion id for streaming responses
* Update examples/server/utils.hpp
* Update examples/server/utils.hpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
* server : refactoring (wip)
* server : remove llava/clip objects from build
* server : fix empty prompt handling + all slots idle logic
* server : normalize id vars
* server : code style
* server : simplify model chat template validation
* server : code style
* server : minor
* llama : llama_chat_apply_template support null buf
* server : do not process embedding requests when disabled
* server : reorganize structs and enums + naming fixes
* server : merge oai.hpp in utils.hpp
* server : refactor system prompt update at start
* server : disable cached prompts with self-extend
* server : do not process more than n_batch tokens per iter
* server: tests: embeddings use a real embeddings model (#5908)
* server, tests : bump batch to fit 1 embedding prompt
* server: tests: embeddings fix build type Debug is randomly failing (#5911)
* server: tests: embeddings, use different KV Cache size
* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings
* server: tests: embeddings, no need to wait for server idle as it can timout
* server: refactor: clean up http code (#5912)
* server : avoid n_available var
ggml-ci
* server: refactor: better http codes
* server : simplify json parsing + add comment about t_last
* server : rename server structs
* server : allow to override FQDN in tests
ggml-ci
* server : add comments
---------
Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
|
|
* server: tests: add models endpoint scenario
* server: /v1/models add some metadata
* server: tests: add debug field in context before scenario
* server: tests: download model from HF, add batch size
* server: tests: add passkey test
* server: tests: add group attention params
* server: do not truncate prompt tokens if self-extend through group attention is enabled
* server: logs: do not truncate log values
* server: tests - passkey - first good working value of nga
* server: tests: fix server timeout
* server: tests: fix passkey, add doc, fix regex content matching, fix timeout
* server: tests: fix regex content matching
* server: tests: schedule slow tests on master
* server: metrics: fix when no prompt processed
* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
* server: tests: increase timeout for completion
* server: tests: keep only the PHI-2 test
* server: tests: passkey add a negative test
|
|
* server: normalize naming
* fix spacing
|
|
* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id
* server : skip GH copilot requests from logging
* server : change message format of server_log()
* server : no need to repeat log in comment
* server : log style consistency
* server : fix compile warning
* server : fix tests regex patterns on M2 Ultra
* server: logs: PR feedback on log level
* server: logs: allow to choose log format in json or plain text
* server: tests: output server logs in text
* server: logs switch init logs to server logs macro
* server: logs ensure value json value does not raised error
* server: logs reduce level VERBOSE to VERB to max 4 chars
* server: logs lower case as other log messages
* server: logs avoid static in general
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint
* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified
* server: metrics - move to a dedicated struct
|
|
* server: health: fix race condition on slots data using tasks queue
* server: health:
* include_slots only if slots_endpoint
* fix compile warning task.target_id not initialized.
|
|
* server: use llama_chat_apply_template
* server: remove trailing space
* server: fix format_chat
* server: fix help message
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: fix formatted_chat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
|
This updates the server queue to support graceful shutdown of the server on signals.
|
|
* server: add mistral chat template
* server: fix typo
* server: rename template mistral to llama2
* server: format_llama2: remove BOS
* server: validate "--chat-template" argument
* server: clean up using_chatml variable
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
|
|
|
|
* server: add llama_server_queue struct
* server: add llama_server_response_event
* server: add comments
* server: move all mutexes away from server.cpp
* server: correct multitask response
* server: only add back deferred tasks when one slot is available
* server: fix a race condition cause by "request_completion"
|