Serve multiple generative AI models

A company wants to deploy multiple large language models (LLMs) to serve different workloads. For example, it might deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application, while ensuring optimal serving performance for both. With Gateway API Inference Extension, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an InferencePool, and then route requests based on the model name (such as "chatbot" and "recommender") and the Criticality property.
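
For reference, each model name and its Criticality are typically declared with an InferenceModel resource that maps the name to an InferencePool. Below is a minimal sketch for the "chatbot" model, assuming the v1alpha2 API; the pool name and criticality value are illustrative:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  # The model name that clients send in the request body ("model": "chatbot").
  modelName: chatbot
  # Criticality (for example Critical, Standard, or Sheddable) tells the
  # endpoint picker how to prioritize this workload under load; the value
  # here is illustrative.
  criticality: Critical
  # The InferencePool that serves this model (name is an assumption).
  poolRef:
    name: gemma3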

How

The following diagram illustrates how Gateway API Inference Extension routes requests to different models based on the model name.

[Diagram: Serving multiple generative AI models]

The following conceptual example shows how to use an HTTPRoute object to route requests for a model name such as "chatbot" or "recommender" to the corresponding InferencePool.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: routes-to-llms
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - headers:
      - type: Exact
        # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
        # copies the model name from the request body into this header.
        name: X-Gateway-Model-Name
        value: chatbot
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: gemma3
      kind: InferencePool
  - matches:
    - headers:
      - type: Exact
        # Body-Based Routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
        # copies the model name from the request body into this header.
        name: X-Gateway-Model-Name
        value: recommender
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: deepseek-r1
      kind: InferencePool
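
Each backendRef above points to an InferencePool that groups the Pods serving one model. Below is a minimal sketch of what the gemma3 pool might look like, assuming the v1alpha2 API; the Pod label, port, and endpoint-picker Service name are assumptions to adapt to your deployment:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: gemma3
spec:
  # Selects the Pods running the Gemma3 model server (label is an assumption).
  selector:
    app: gemma3-vllm
  # Port the model server listens on (vLLM's default; adjust as needed).
  targetPortNumber: 8000
  # Endpoint-picker extension that chooses the best Pod for each request
  # (Service name is hypothetical).
  extensionRef:
    name: gemma3-epp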

Try it out

  1. Get the gateway IP:
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
    PORT=80
    
  2. Send a few requests to model "chatbot" as follows:
    curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "chatbot",
    "prompt": "What is the color of the sky",
    "max_tokens": 100,
    "temperature": 0
    }'
    
  3. Send a few requests to model "recommender" as follows:
    curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "chatbot",
    "prompt": "Give me restaurant recommendations in Paris",
    "max_tokens": 100,
    "temperature": 0
    }'
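
  4. Optionally, confirm that a request landed in the intended pool by tailing the model server's logs. The label selector here is an assumption about how your DeepSeek Pods are labeled:
    kubectl logs -l app=deepseek-r1 --tail=20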