Extend Inference Runtimes
Introduction
This document guides you step-by-step through adding new inference runtimes for serving
Large Language Models (LLMs) or other model types such as image classification,
object detection, and text classification.
Alauda AI ships with a built-in vLLM inference engine. With custom inference runtimes,
you can introduce additional inference engines such as
Seldon MLServer,
Triton Inference Server, and so on.
By introducing custom runtimes, you can expand the platform's support to a wider range of
model types and GPU types, and optimize performance for specific scenarios
to meet broader business needs.
In this section, we'll demonstrate extending the current AI platform with a custom
Xinference
serving runtime to deploy LLMs and serve an OpenAI-compatible API.
Scenarios
Consider extending your AI Platform inference service runtimes if you encounter any of the following situations:
- Support for New Model Types: Your model isn't natively supported by the current default inference runtime
vLLM.
- Compatibility with Other Accelerator Types: You need to perform LLM inference on hardware equipped with accelerators such as AMD GPUs or Huawei Ascend NPUs.
- Performance Optimization for Specific Scenarios: In certain inference scenarios, a new runtime (like Xinference) might offer better performance or resource utilization compared to existing runtimes.
- Custom Inference Logic: You need to introduce custom inference logic or dependent libraries that are difficult to implement within the existing default runtimes.
Prerequisites
Before you start, please ensure you meet these conditions:
- Your ACP cluster is deployed and running normally.
- Your AI Platform version is 1.3 or higher.
- You have the necessary inference runtime image(s) prepared. For example, for the Xinference runtime, images might look like
xprobe/xinference:v1.2.2 (for GPU) or xprobe/xinference:v1.2.2-cpu (for CPU).
- You have cluster administrator privileges (needed to create CRD instances).
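Optionally, run a quick sanity check that the KServe ClusterServingRuntime CRD is available in the cluster and that your account can create cluster-scoped runtime resources (the commands assume a standard KServe-based installation):
kubectl get crd clusterservingruntimes.serving.kserve.io
kubectl auth can-i create clusterservingruntimes.serving.kserve.io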
Standard Workflow (Example: Xinference)
Follow these steps to extend the platform. We use Xinference as a baseline example to demonstrate the standard process.
Create Inference Runtime Resources
You'll need to create the corresponding inference runtime ClusterServingRuntime resources based on your target hardware environment (GPU/CPU/NPU).
- Prepare the Runtime YAML Configuration:
Based on the type of runtime you want to add (e.g., Xinference) and your target hardware environment, prepare the appropriate YAML configuration file. Here are examples for the Xinference runtime across different hardware environments:
- GPU Runtime Example
# This is a sample YAML for Xinference GPU runtime
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
name: aml-xinference-cuda-12.1 # Name of the runtime resource
labels:
cpaas.io/runtime-class: xinference # required runtime type label
cpaas.io/accelerator-type: "nvidia"
cpaas.io/cuda-version: "12.1"
annotations:
cpaas.io/display-name: xinference-cuda-12.1 # Display name in the UI
spec:
containers:
- name: kserve-container
image: xprobe/xinference:v1.2.2 # Replace with your actual GPU runtime image
env:
# Required across all runtimes – path to the model directory
- name: MODEL_PATH
value: /mnt/models/{{ index .Annotations "aml-model-repo" }}
# The MODEL_UID parameter is required by the Xinference runtime; it can be omitted for other runtimes.
- name: MODEL_UID
value: '{{ index .Annotations "aml-model-repo" }}'
# The MODEL_ENGINE parameter is required by the Xinference runtime, while it can be omitted for other runtimes.
- name: MODEL_ENGINE
value: "transformers"
# Required by the Xinference runtime. Set it based on your model family, e.g., "llama", "chatglm", etc.
- name: MODEL_FAMILY
value: ""
command:
- bash
- -c
- |
set +e
if [ "${MODEL_PATH}" == "" ]; then
echo "Need to set MODEL_PATH!"
exit 1
fi
if [ "${MODEL_ENGINE}" == "" ]; then
echo "Need to set MODEL_ENGINE!"
exit 1
fi
if [ "${MODEL_UID}" == "" ]; then
echo "Need to set MODEL_UID!"
exit 1
fi
if [ "${MODEL_FAMILY}" == "" ]; then
echo "Need to set MODEL_FAMILY!"
exit 1
fi
xinference-local --host 0.0.0.0 --port 8080 &
PID=$!
while [ true ];
do
curl http://127.0.0.1:8080/docs
if [ $? -eq 0 ]; then
break
else
echo "waiting xinference-local server to become ready..."
sleep 1
fi
done
set -e
xinference launch --model_path ${MODEL_PATH} --model-engine ${MODEL_ENGINE} -u ${MODEL_UID} -n ${MODEL_FAMILY} -e http://127.0.0.1:8080 $@
xinference list -e http://127.0.0.1:8080
echo "model load succeeded, waiting server process: ${PID}..."
wait ${PID}
# Add this line to use $@ in the script:
# see: https://unix.stackexchange.com/questions/144514/add-arguments-to-bash-c
- bash
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
startupProbe:
httpGet:
path: /docs
port: 8080
scheme: HTTP
failureThreshold: 60
periodSeconds: 10
timeoutSeconds: 10
supportedModelFormats:
- name: transformers # The model format supported by the runtime
version: "1"
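- CPU Runtime Example (sketch)
The following is a minimal sketch of a CPU variant, based on the GPU example above and the CPU image listed in the prerequisites. The command, env, resources, and startupProbe sections are identical to the GPU example and are elided here; the NVIDIA/CUDA labels are omitted on the assumption that they are only needed for GPU scheduling, so confirm the label conventions your platform expects.
# This is a sketch of a Xinference CPU runtime; only the fields that differ from the GPU example are shown
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: aml-xinference-cpu # Name of the runtime resource
  labels:
    cpaas.io/runtime-class: xinference # required runtime type label
  annotations:
    cpaas.io/display-name: xinference-cpu # Display name in the UI
spec:
  containers:
    - name: kserve-container
      image: xprobe/xinference:v1.2.2-cpu # CPU image from the prerequisites
      # ... command, env, resources, and startupProbe as in the GPU example ...
  supportedModelFormats:
    - name: transformers # The model format supported by the runtime
      version: "1"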
- Tip: Make sure to replace the
image field value with the path to your actual prepared runtime image. You can also modify the annotations.cpaas.io/display-name field to customize the display name of the runtime in the AI Platform UI.
- Apply the YAML File to Create the Resource:
From a terminal with cluster administrator privileges, execute the following command to apply your YAML file and create the inference runtime resource:
kubectl apply -f your-xinference-runtime.yaml
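After the command completes, you can confirm that the runtime resource was created (the resource name below matches the GPU example; adjust it to the name you used):
kubectl get clusterservingruntime aml-xinference-cuda-12.1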
TIP
- Important Tip: Please refer to the examples above and create/configure the runtime based on your actual environment and inference needs. These examples are for reference only. You'll need to adjust parameters like the image, resource
limits, and requests to ensure the runtime is compatible with your model and hardware environment and runs efficiently (see the GPU resource sketch below).
- Note: You can only use this custom runtime on the inference service publishing page after the runtime resource has been created!
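For example, if your model needs a dedicated GPU, you would typically extend the resources section of the kserve-container as sketched below. The resource name nvidia.com/gpu assumes the standard NVIDIA device plugin; adjust it to whatever device plugin your cluster uses.
resources:
  limits:
    cpu: 2
    memory: 6Gi
    nvidia.com/gpu: 1 # assumes the NVIDIA device plugin resource name
  requests:
    cpu: 2
    memory: 6Gi
    nvidia.com/gpu: 1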
Publish Xinference Inference Service and Select the Runtime
Once the Xinference inference runtime resource is successfully created, you can select and configure it when publishing your LLM inference service on the AI Platform.
- Configure the Inference Framework for the Model:
On the model details page of the model repository you are about to publish from, select the appropriate framework through the File Management metadata editing function. The framework value chosen here must match one of the entries in the supportedModelFormats field of the inference runtime you created (you can verify this with the command below).
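If you want to double-check which formats a runtime accepts, you can read its supportedModelFormats directly (shown here for the Xinference runtime created earlier):
kubectl get clusterservingruntime aml-xinference-cuda-12.1 -o jsonpath='{.spec.supportedModelFormats[*].name}'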
- Navigate to the Inference Service Publishing Page:
Log in to the AI Platform and navigate to the "Inference Services" or "Model Deployment" modules, then click "Publish Inference Service."
- Select the Xinference Runtime:
In the inference service creation wizard, find the "Runtime" or "Inference Framework" option. From the dropdown menu or list, select the Xinference runtime you created in Step 1 (e.g., "Xinference CPU Runtime" or "Xinference GPU Runtime (CUDA)").
- Set Environment Variables:
The Xinference runtime requires specific environment variables to function correctly. On the inference service configuration page, locate the "Environment Variables" or "More Settings" section and add the variables the runtime expects. For the Xinference runtime defined above, MODEL_FAMILY is left empty in the ClusterServingRuntime and must be set here; MODEL_ENGINE and MODEL_UID already have defaults in the runtime definition.
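A minimal illustration, assuming a Llama-family model; the actual MODEL_FAMILY value depends on your model, and "llama" below is only an example:
# Environment variables to add on the publishing page (illustrative values)
MODEL_FAMILY=llama # required; the runtime exits at startup if this is empty
# MODEL_ENGINE=transformers # optional override; the runtime already defaults to "transformers"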
Specific Runtime Examples
Once you understand the standard workflow, refer to these examples for specific configurations related to other runtimes.
MLServer
The MLServer runtime is versatile and can be used on both NVIDIA GPUs and CPUs.
kind: ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
metadata:
annotations:
cpaas.io/display-name: mlserver-cuda11.6-x86-arm
labels:
cpaas.io/accelerator-type: nvidia
cpaas.io/cuda-version: "11.6"
cpaas.io/runtime-class: mlserver
name: aml-mlserver-cuda-11.6
spec:
containers:
- command:
- /bin/bash
- -lc
- |
if [ "$MODEL_TYPE" = "text-to-image" ]; then
MODEL_IMPL="mlserver_diffusers.StableDiffusionRuntime"
else
MODEL_IMPL="mlserver_huggingface.HuggingFaceRuntime"
fi
MODEL_DIR="${MLSERVER_MODEL_URI}/${MLSERVER_MODEL_NAME}"
# a. using git lfs storage initializer, model will be in /mnt/models/<model_name>
# b. using hf storage initializer, model will be in /mnt/models
if [ ! -d "${MODEL_DIR}" ]; then
MODEL_DIR="${MLSERVER_MODEL_URI}"
echo "[WARNING] Model directory ${MODEL_DIR}/${MLSERVER_MODEL_NAME} not found, using ${MODEL_DIR} instead"
fi
export MLSERVER_MODEL_IMPLEMENTATION=${MODEL_IMPL}
export MLSERVER_MODEL_EXTRA="{\"task\":\"${MODEL_TYPE}\",\"pretrained_model\":\"${MODEL_DIR}\"}"
mlserver start $MLSERVER_MODEL_URI $@
- bash
env:
- name: MLSERVER_MODEL_URI
value: /mnt/models
- name: MLSERVER_MODEL_NAME
value: '{{ index .Annotations "aml-model-repo" }}'
- name: MODEL_TYPE
value: '{{ index .Annotations "aml-pipeline-tag" }}'
image: alaudadockerhub/seldon-mlserver:1.6.0-cu116-v1.3.1
name: kserve-container
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
runAsUser: 1000
startupProbe:
failureThreshold: 60
httpGet:
path: /v2/models/{{ index .Annotations "aml-model-repo" }}/ready
port: 8080
scheme: HTTP
periodSeconds: 10
timeoutSeconds: 10
labels:
modelClass: mlserver_sklearn.SKLearnModel
supportedModelFormats:
- name: mlflow
version: "1"
- name: transformers
version: "1"
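Once an InferenceService using this runtime is running, you can check model readiness through the same V2 endpoint that the startup probe uses (the host and model name below are placeholders for your environment):
curl -s http://<inference-service-host>/v2/models/<model-name>/ready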
MindIE (Ascend NPU 310P)
MindIE is specifically designed for Huawei Ascend hardware. Its configuration differs significantly in resource management and metadata.
1. ClusterServingRuntime
# This is a sample YAML for Ascend NPU runtime
kind: ClusterServingRuntime
apiVersion: serving.kserve.io/v1alpha1
metadata:
annotations:
cpaas.io/display-name: mindie-2.2RC1
labels:
cpaas.io/accelerator-type: npu
cpaas.io/cann-version: 8.3.0
cpaas.io/runtime-class: mindie
name: mindie-2.2rc1-310p
spec:
containers:
- command:
- bash
- -c
- |
REAL_SCRIPT=$(echo "$RAW_SCRIPT" | sed 's/__LT__/\x3c/g')
echo "$REAL_SCRIPT" > /tmp/startup.sh
chmod +x /tmp/startup.sh
CONFIG_FILE="${MODEL_PATH}/config.json"
echo "Checking for file: ${CONFIG_FILE}"
ls -ld "${MODEL_PATH}"
chmod -R 755 "${MODEL_PATH}"
echo "Fixing MODEL_PATH permission..."
ls -ld "${MODEL_PATH}"
/tmp/startup.sh --model-name "${MODEL_NAME}" --model-path "${MODEL_PATH}" --ip "${MY_POD_IP}"
env:
- name: RAW_SCRIPT
value: |
#!/bin/bash
#
# Copyright 2024 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
#
##
# Script Instruction
##
### Name:
### run_mindie.sh - Use to Start MindIE Service given a specific model
###
### Usage:
### bash run_mindie.sh --model-name xxx --model-path /path/to/model
###
### Required:
### --model-name :Given a model name to identify MindIE Service.
### --model-path :Given a model path which contain necessary files such as yaml/conf.json/tokenizer/vocab etc.
### Options:
### --help :Show this message.
### --ip :The IP address bound to the MindIE Server business plane RESTful interface,default value: 127.0.0.1.
### --port :The port bound to the MindIE Server business plane RESTful interface,default value: 1025.
### --management-ip :The IP address bound to the MindIE Server management plane RESTful interface,default value: 127.0.0.2.
### --management-port :The port bound to the MindIE Server management plane RESTful interface,default value: 1026.
### --metrics-port :The port bound to the performance indicator monitoring interface,default value: 1027.
### --max-seq-len :Maximum sequence length,default value: 2560.
### --max-iter-times :The global maximum output length of the model,default value: 512.
### --max-input-token-len :The maximum length of the token id,default value: 2048.
### --max-prefill-tokens :Each time prefill occurs, the total number of input tokens in the current batch,default value: 8192
### --truncation :Whether to perform parameter rationalization check interception,default value: false.
### --template-type :Reasoning type,default value: "Standard"
### --max-preempt-count :The upper limit of the maximum number of preemptible requests in each batch,default value: 0.
### --support-select-batch :Batch selection strategy,default value: false.
### --npu-mem-size :This can be used to apply for the upper limit of the KV Cache size in the NPU,default value: 8.
### --max-prefill-batch-size :The maximum prefill batch size,default value: 50.
### --world-size :Enable several cards for inference.
### 1. If it is not set, the parallel config in the YAML file is obtained by default. Set worldsize = dp*mp*pp.
### 2. If set, modify the parallel config in the YAML file. set parallel config: dp:1 mp:worldSize pp:1
### --ms-sched-host :MS Scheduler IP address,default value: 127.0.0.1.
### --ms-sched-port :MS Scheduler port,default value: 8119.
### For more details about config description, please check MindIE homepage: https://www.hiascend.com/document/detail/zh/mindie/10RC3/mindiellm/llmdev/mindie_llm0004.html
help() {
awk -F'### ' '/^###/ { print $2 }' "$0"
}
if [[ $# == 0 ]] || [[ "$1" == "--help" ]]; then
help
exit 1
fi
##
# Get device info
##
total_count=$(npu-smi info -l | grep "Total Count" | awk -F ':' '{print $2}' | xargs)
if [[ -z "$total_count" ]]; then
echo "Error: Unable to retrieve device info. Please check if npu-smi is available for current user (id 1001), or if you are specifying an occupied device."
exit 1
fi
echo "$total_count device(s) detected!"
##
# Set toolkit envs
##
echo "Setting toolkit envs..."
if [[ -f "/usr/local/Ascend/ascend-toolkit/set_env.sh" ]];then
source /usr/local/Ascend/ascend-toolkit/set_env.sh
else
echo "ascend-toolkit package is incomplete please check it."
exit 1
fi
echo "Toolkit envs set succeeded!"
##
# Set MindIE envs
##
echo "Setting MindIE envs..."
if [[ -f "/usr/local/Ascend/mindie/set_env.sh" ]];then
source /usr/local/Ascend/mindie/set_env.sh
else
echo "mindie package is incomplete please check it."
exit 1
fi
echo "MindIE envs set succeeded!"
##
# Default MS envs
##
# Set PYTHONPATH
MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH
##
# Receive args and modify config.json
##
export MIES_INSTALL_PATH=/usr/local/Ascend/mindie/latest/mindie-service
CONFIG_FILE=${MIES_INSTALL_PATH}/conf/config.json
echo "MindIE Service config path:$CONFIG_FILE"
#default config
BACKEND_TYPE="atb"
MAX_SEQ_LEN=2560
MAX_PREFILL_TOKENS=8192
MAX_ITER_TIMES=512
MAX_INPUT_TOKEN_LEN=2048
TRUNCATION=false
HTTPS_ENABLED=false
MULTI_NODES_INFER_ENABLED=false
NPU_MEM_SIZE=8
MAX_PREFILL_BATCH_SIZE=50
TEMPLATE_TYPE="Standard"
MAX_PREEMPT_COUNT=0
SUPPORT_SELECT_BATCH=false
IP_ADDRESS="127.0.0.1"
PORT=8080
MANAGEMENT_IP_ADDRESS="127.0.0.2"
MANAGEMENT_PORT=1026
METRICS_PORT=1027
#modify config
while [[ "$#" -gt 0 ]]; do
case $1 in
--model-path) MODEL_WEIGHT_PATH="$2"; shift ;;
--model-name) MODEL_NAME="$2"; shift ;;
--max-seq-len) MAX_SEQ_LEN="$2"; shift ;;
--max-iter-times) MAX_ITER_TIMES="$2"; shift ;;
--max-input-token-len) MAX_INPUT_TOKEN_LEN="$2"; shift ;;
--max-prefill-tokens) MAX_PREFILL_TOKENS="$2"; shift ;;
--truncation) TRUNCATION="$2"; shift ;;
--world-size) WORLD_SIZE="$2"; shift ;;
--template-type) TEMPLATE_TYPE="$2"; shift ;;
--max-preempt-count) MAX_PREEMPT_COUNT="$2"; shift ;;
--support-select-batch) SUPPORT_SELECT_BATCH="$2"; shift ;;
--npu-mem-size) NPU_MEM_SIZE="$2"; shift ;;
--max-prefill-batch-size) MAX_PREFILL_BATCH_SIZE="$2"; shift ;;
--ip) IP_ADDRESS="$2"; shift ;;
--port) PORT="$2"; shift ;;
--management-ip) MANAGEMENT_IP_ADDRESS="$2"; shift ;;
--management-port) MANAGEMENT_PORT="$2"; shift ;;
--metrics-port) METRICS_PORT="$2"; shift ;;
--ms-sched-host) ENV_MS_SCHED_HOST="$2"; shift ;;
--ms-sched-port) ENV_MS_SCHED_PORT="$2"; shift ;;
*)
echo "Unknown parameter: $1"
echo "Please check your inputs."
exit 1
;;
esac
shift
done
if [ -z "$MODEL_WEIGHT_PATH" ] || [ -z "$MODEL_NAME" ]; then
echo "Error: Both --model-path and --model-name are required."
exit 1
fi
MODEL_NAME=${MODEL_NAME:-$(basename "$MODEL_WEIGHT_PATH")}
echo "MODEL_NAME is set to: $MODEL_NAME"
WORLD_SIZE=$total_count
NPU_DEVICE_IDS=$(seq -s, 0 $(($WORLD_SIZE - 1)))
#validate config
if [[ "$BACKEND_TYPE" != "atb" ]]; then
echo "Error: BACKEND must be 'atb'. Current value: $BACKEND_TYPE"
exit 1
fi
if [[ ! "$IP_ADDRESS" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] ||
[[ ! "$MANAGEMENT_IP_ADDRESS" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
echo "Error: IP_ADDRESS and MANAGEMENT_IP_ADDRESS must be valid IP addresses. Current values: IP_ADDRESS=$IP_ADDRESS, MANAGEMENT_IP_ADDRESS=$MANAGEMENT_IP_ADDRESS"
exit 1
fi
if [[ ! "$PORT" =~ ^[0-9]+$ ]] || (( PORT __LT__ 1025 || PORT > 65535 )) ||
[[ ! "$MANAGEMENT_PORT" =~ ^[0-9]+$ ]] || (( MANAGEMENT_PORT __LT__ 1025 || MANAGEMENT_PORT > 65535 )); then
echo "Error: PORT and MANAGEMENT_PORT must be integers between 1025 and 65535. Current values: PORT=$PORT, MANAGEMENT_PORT=$MANAGEMENT_PORT"
exit 1
fi
if [ "$MAX_PREFILL_TOKENS" -lt "$MAX_SEQ_LEN" ]; then
MAX_PREFILL_TOKENS=$MAX_SEQ_LEN
echo "MAX_PREFILL_TOKENS was less than MAX_SEQ_LEN. Setting MAX_PREFILL_TOKENS to $MAX_SEQ_LEN"
fi
MODEL_CONFIG_FILE="${MODEL_WEIGHT_PATH}/config.json"
if [ ! -f "$MODEL_CONFIG_FILE" ]; then
echo "Error: config.json file not found in $MODEL_WEIGHT_PATH."
exit 1
fi
chmod 600 "$MODEL_CONFIG_FILE"
#update config file
chmod u+w ${MIES_INSTALL_PATH}/conf/
sed -i "s/\"backendType\"\s*:\s*\"[^\"]*\"/\"backendType\": \"$BACKEND_TYPE\"/" $CONFIG_FILE
sed -i "s/\"modelName\"\s*:\s*\"[^\"]*\"/\"modelName\": \"$MODEL_NAME\"/" $CONFIG_FILE
sed -i "s|\"modelWeightPath\"\s*:\s*\"[^\"]*\"|\"modelWeightPath\": \"$MODEL_WEIGHT_PATH\"|" $CONFIG_FILE
sed -i "s/\"maxSeqLen\"\s*:\s*[0-9]*/\"maxSeqLen\": $MAX_SEQ_LEN/" "$CONFIG_FILE"
sed -i "s/\"maxPrefillTokens\"\s*:\s*[0-9]*/\"maxPrefillTokens\": $MAX_PREFILL_TOKENS/" "$CONFIG_FILE"
sed -i "s/\"maxIterTimes\"\s*:\s*[0-9]*/\"maxIterTimes\": $MAX_ITER_TIMES/" "$CONFIG_FILE"
sed -i "s/\"maxInputTokenLen\"\s*:\s*[0-9]*/\"maxInputTokenLen\": $MAX_INPUT_TOKEN_LEN/" "$CONFIG_FILE"
sed -i "s/\"truncation\"\s*:\s*[a-z]*/\"truncation\": $TRUNCATION/" "$CONFIG_FILE"
sed -i "s|\(\"npuDeviceIds\"\s*:\s*\[\[\)[^]]*\(]]\)|\1$NPU_DEVICE_IDS\2|" "$CONFIG_FILE"
sed -i "s/\"worldSize\"\s*:\s*[0-9]*/\"worldSize\": $WORLD_SIZE/" "$CONFIG_FILE"
sed -i "s/\"httpsEnabled\"\s*:\s*[a-z]*/\"httpsEnabled\": $HTTPS_ENABLED/" "$CONFIG_FILE"
sed -i "s/\"templateType\"\s*:\s*\"[^\"]*\"/\"templateType\": \"$TEMPLATE_TYPE\"/" $CONFIG_FILE
sed -i "s/\"maxPreemptCount\"\s*:\s*[0-9]*/\"maxPreemptCount\": $MAX_PREEMPT_COUNT/" $CONFIG_FILE
sed -i "s/\"supportSelectBatch\"\s*:\s*[a-z]*/\"supportSelectBatch\": $SUPPORT_SELECT_BATCH/" $CONFIG_FILE
sed -i "s/\"multiNodesInferEnabled\"\s*:\s*[a-z]*/\"multiNodesInferEnabled\": $MULTI_NODES_INFER_ENABLED/" "$CONFIG_FILE"
sed -i "s/\"maxPrefillBatchSize\"\s*:\s*[0-9]*/\"maxPrefillBatchSize\": $MAX_PREFILL_BATCH_SIZE/" "$CONFIG_FILE"
sed -i "s/\"ipAddress\"\s*:\s*\"[^\"]*\"/\"ipAddress\": \"$IP_ADDRESS\"/" "$CONFIG_FILE"
sed -i "s/\"port\"\s*:\s*[0-9]*/\"port\": $PORT/" "$CONFIG_FILE"
sed -i "s/\"managementIpAddress\"\s*:\s*\"[^\"]*\"/\"managementIpAddress\": \"$MANAGEMENT_IP_ADDRESS\"/" "$CONFIG_FILE"
sed -i "s/\"managementPort\"\s*:\s*[0-9]*/\"managementPort\": $MANAGEMENT_PORT/" "$CONFIG_FILE"
sed -i "s/\"metricsPort\"\s*:\s*[0-9]*/\"metricsPort\": $METRICS_PORT/" $CONFIG_FILE
sed -i "s/\"npuMemSize\"\s*:\s*-*[0-9]*/\"npuMemSize\": $NPU_MEM_SIZE/" "$CONFIG_FILE"
##
# Start service
##
echo "Current configurations are displayed as follows:"
cat $CONFIG_FILE
npu-smi info -m > ~/device_info
${MIES_INSTALL_PATH}/bin/mindieservice_daemon
- name: MODEL_NAME
value: '{{ index .Annotations "aml-model-repo" }}'
- name: MODEL_PATH
value: /mnt/models/{{ index .Annotations "aml-model-repo" }}
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
image: swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.2.RC1-300I-Duo-py311-openeuler24.03-lts
name: kserve-container
resources:
limits:
cpu: 2
memory: 6Gi
requests:
cpu: 2
memory: 6Gi
volumeMounts:
- mountPath: /dev/shm
name: dshm
startupProbe:
failureThreshold: 60
httpGet:
path: /v1/models
port: 8080
scheme: HTTP
periodSeconds: 10
timeoutSeconds: 180
supportedModelFormats:
- name: transformers
version: "1"
volumes:
- emptyDir:
medium: Memory
sizeLimit: 8Gi
name: dshm
2. Mandatory Annotations for InferenceService
Unlike other runtimes, MindIE must have annotations added to the InferenceService metadata during the final publishing step. This ensures the platform's scheduler correctly binds the NPU hardware to the service.
3. User Privileges (Root Access)
Due to the requirements of the Ascend driver and hardware abstraction layer, the MindIE image must run as the root user. Ensure your ClusterServingRuntime or InferenceService security context is configured accordingly:
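A minimal sketch using standard Kubernetes securityContext fields is shown below; whether you set this on the runtime container or on the InferenceService predictor depends on your deployment conventions.
securityContext:
  runAsUser: 0 # run as root, as required by the Ascend driver stack
  runAsNonRoot: false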
Note: The MindIE ClusterServingRuntime YAML example above does not specify a securityContext, which means the container runs with the default settings of the image (typically root). Unlike MLServer which explicitly sets runAsNonRoot: true and runAsUser: 1000, MindIE requires root privileges to access the NPU hardware.
Comparison of Runtime Configurations
Before proceeding, refer to this table to understand the specific requirements for different runtimes:
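The summary below is compiled from the runtime examples in this document; treat it as a quick orientation and verify the details against your actual ClusterServingRuntime definitions.
| Runtime | Accelerator | Key environment variables | supportedModelFormats | Run-as user | Notes |
| --- | --- | --- | --- | --- | --- |
| Xinference | NVIDIA GPU (cpaas.io/accelerator-type: nvidia) | MODEL_PATH, MODEL_ENGINE, MODEL_UID, MODEL_FAMILY (set at publish time) | transformers | Image default | Serves an OpenAI-compatible API |
| MLServer | NVIDIA GPU or CPU | MLSERVER_MODEL_URI, MLSERVER_MODEL_NAME, MODEL_TYPE | mlflow, transformers | Non-root (runAsUser: 1000) | V2 inference protocol endpoints |
| MindIE | Huawei Ascend NPU (cpaas.io/accelerator-type: npu) | MODEL_NAME, MODEL_PATH, MY_POD_IP | transformers | Root required | Mandatory InferenceService annotations; /dev/shm memory volume |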