vLLM

Overview

vLLM is an OpenAI-compatible provider for self-hosted inference. Use it through DeepIntShield with the standard OpenAI-compatible request format. Key characteristics:

OpenAI compatibility - Chat, text completions, embeddings, rerank, transcriptions, and streaming
Self-hosted - Runs on your own server (e.g. https://vllm.example.com)
Optional authentication - API key often omitted for local instances
Responses API - Supported via chat completion fallback

Supported Operations

Operation	Non-Streaming	Streaming	Endpoint
Chat Completions	✅	✅	`/v1/chat/completions`
Responses API	✅	✅	`/v1/chat/completions`
Text Completions	✅	✅	`/v1/completions`
Embeddings	✅	-	`/v1/embeddings`
Rerank	✅	-	`/v1/rerank` (fallback: `/rerank`)
List Models	✅	-	`/v1/models`
Image Generation	❌	❌	-
Speech (TTS)	❌	❌	-
Transcriptions (STT)	✅	✅	`/v1/audio/transcriptions`
Files	❌	❌	-
Batch	❌	❌	-

Authentication

API key: Optional. For local vLLM instances, the key is often left empty.
When set, the key is sent as Authorization: Bearer <key>.
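
The optional-key rule above can be sketched in a few lines of Python. The helper name `build_upstream_headers` is hypothetical and only illustrates the described behavior (no header when the key is empty, a bearer token otherwise); it is not DeepIntShield's actual code.

```python
from typing import Dict, Optional


def build_upstream_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return request headers; add the bearer token only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # None or "" means the vLLM instance runs without authentication
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

Calling `build_upstream_headers("")` returns only the content type, which matches the common local setup where the key is left empty.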

Configuration

Base URL: Point this to your vLLM instance (e.g. https://vllm.example.com) via the provider network_config.base_url.
Model names: Depend on the models loaded in your vLLM instance (e.g. meta-llama/Llama-3.2-1B-Instruct, BAAI/bge-m3 for embeddings).

curl -X POST https://app.deepintshield.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: $DEEPINTSHIELD_VIRTUAL_KEY" \
  -d '{
    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
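
For reference, the same request can be built with Python's standard library. This is a minimal sketch that mirrors the curl call above; the function names `build_chat_request` and `send` are illustrative, and the virtual key is passed in by the caller.

```python
import json
import urllib.request

GATEWAY_URL = "https://app.deepintshield.com/v1/chat/completions"


def build_chat_request(model: str, prompt: str, virtual_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the gateway."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-bf-vk": virtual_key},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call would be `send(build_chat_request("vllm/meta-llama/Llama-3.2-1B-Instruct", "Hello", os.environ["DEEPINTSHIELD_VIRTUAL_KEY"]))`, after importing `os`.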

Getting started

Run a vLLM server (Docker or pip). Example with Docker:

docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct

Verify the server:
```
curl https://vllm.example.com/v1/models
```
Use DeepIntShield with model prefix vllm/<model_id> (e.g. vllm/meta-llama/Llama-3.2-1B-Instruct).
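
Because vLLM model IDs usually contain a slash themselves, it helps to see how the prefixed name decomposes. The sketch below assumes the provider prefix ends at the first slash; `split_model` is a hypothetical helper, not part of DeepIntShield.

```python
from typing import Tuple


def split_model(name: str) -> Tuple[str, str]:
    """Split 'vllm/<model_id>' into (provider, model_id) at the first slash."""
    provider, sep, model_id = name.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected '<provider>/<model_id>', got {name!r}")
    return provider, model_id
```

Under this assumption, everything after the first slash is passed through unchanged, so `meta-llama/Llama-3.2-1B-Instruct` keeps its own slash.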

1. Chat Completions

vLLM supports the standard OpenAI chat completion parameters. For the full parameter reference, see the OpenAI Chat Completions documentation. Message types, tools, and streaming behave the same way.
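
Streaming responses in the OpenAI format arrive as server-sent events: each `data:` line carries a JSON chunk, and `data: [DONE]` marks the end. The following is a minimal sketch of a client-side parser for such lines; `iter_stream_text` is an illustrative name, and it assumes the lines have already been read from the HTTP response.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments from OpenAI-style streaming chat chunks."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Joining the yielded fragments reconstructs the assistant message as it was generated.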

2. Responses API

Responses API requests are served through the chat completion fallback (`/v1/chat/completions`) and accept the same parameters as Chat Completions.

3. Text Completions

Parameter	Mapping
`prompt`	Sent as-is
`max_tokens`	`max_tokens`
`temperature`	`temperature`
`top_p`	`top_p`
`stop`	stop sequences
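
The mapping in the table can be illustrated by a small payload builder. This is a sketch only: `build_completion_payload` is a hypothetical helper, and it assumes unset parameters are simply omitted from the request body.

```python
from typing import Any, Dict, List, Optional


def build_completion_payload(
    model: str,
    prompt: str,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    stop: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a text completion body; parameters left as None are omitted."""
    payload: Dict[str, Any] = {"model": model, "prompt": prompt}
    optional = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```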

4. Embeddings

vLLM supports /v1/embeddings. Use model IDs exposed by your vLLM server (e.g. BAAI/bge-m3).
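
As a rough illustration, an embeddings request in the OpenAI format carries a `model` and an `input` list, and the response returns one vector per input under `data`. The helper names below are hypothetical, and the response shape is assumed to follow the OpenAI embeddings format.

```python
from typing import Any, Dict, List


def build_embedding_payload(model: str, texts: List[str]) -> Dict[str, Any]:
    """Build an OpenAI-style embeddings request body."""
    return {"model": model, "input": texts}


def extract_embeddings(response: Dict[str, Any]) -> List[List[float]]:
    """Return the vectors ordered by the 'index' field of each item."""
    items = sorted(response.get("data", []), key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sorting by `index` keeps the vectors aligned with the order of the input texts even if the items come back in a different order.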

5. List Models

Lists models from your vLLM instance via /v1/models. Available models depend on what is loaded on the server.
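
A short sketch of turning a `/v1/models` response into names usable with the `vllm/` prefix. It assumes the OpenAI list format (`data` holding objects with an `id`) and that the IDs come back without the provider prefix; `prefixed_model_ids` is an illustrative helper.

```python
from typing import Any, Dict, List


def prefixed_model_ids(response: Dict[str, Any], provider: str = "vllm") -> List[str]:
    """Collect model IDs from a list-models response and add the provider prefix."""
    ids = [item["id"] for item in response.get("data", []) if "id" in item]
    # Leave IDs alone if they already carry the prefix.
    return [i if i.startswith(provider + "/") else f"{provider}/{i}" for i in ids]
```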

6. Rerank

vLLM supports reranking for pooling/cross-encoder reranker models. DeepIntShield sends requests to /v1/rerank and automatically falls back to /rerank when required by your vLLM deployment.

curl -X POST https://app.deepintshield.com/v1/rerank \
  -H "Authorization: Bearer sk-bf-your-virtual-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      {"text": "Machine learning is a subset of AI."},
      {"text": "Python is a programming language."},
      {"text": "Deep learning uses neural networks."}
    ],
    "params": {
      "return_documents": true
    }
  }'
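
The endpoint fallback described above can be pictured with a small sketch. It assumes the fallback is triggered by an HTTP 404 on `/v1/rerank`, which the text does not spell out, and it takes the HTTP call as an injected `post` function so the logic stays independent of any client library. All names here are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

# post(url, payload) -> (status_code, decoded_json_body)
PostFn = Callable[[str, Dict[str, Any]], Tuple[int, Dict[str, Any]]]

RERANK_PATHS = ("/v1/rerank", "/rerank")


def rerank_with_fallback(
    post: PostFn, base_url: str, payload: Dict[str, Any]
) -> Tuple[str, Dict[str, Any]]:
    """Try each rerank path in order; move on only when the path is missing (404)."""
    for path in RERANK_PATHS:
        status, body = post(base_url.rstrip("/") + path, payload)
        if status != 404:
            # Any other status (success or error) is returned to the caller as-is.
            return path, body
    raise LookupError("no rerank endpoint found on the vLLM server")
```

The returned path tells the caller which endpoint the deployment actually exposes.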

Caveats

Base URL must be configured

Severity: Low
Behavior: The base URL points to your vLLM instance (e.g. https://vllm.example.com).
Impact: For remote or custom ports, set network_config.base_url in the provider config.

Error responses with HTTP 200

Severity: Low
Behavior: vLLM may return HTTP 200 with an error payload (e.g. {"error": {"code": 404, "message": "..."}}) instead of 4xx/5xx.
Impact: DeepIntShield normalizes these into standard error responses so clients see consistent error handling.
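
To make the caveat concrete, here is one way such normalization could look. This is an illustrative sketch, not DeepIntShield's implementation: `normalize_response` is a hypothetical function, and treating a missing or out-of-range `code` as 500 is an assumption.

```python
from typing import Any, Dict, Tuple


def normalize_response(status: int, body: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    """Promote an error payload hidden behind HTTP 200 to a real error status."""
    error = body.get("error") if isinstance(body, dict) else None
    if status == 200 and isinstance(error, dict):
        code = error.get("code")
        # Use the embedded code when it is a valid HTTP error status, else 500.
        effective = code if isinstance(code, int) and 400 <= code <= 599 else 500
        return effective, {
            "error": {"code": effective, "message": error.get("message", "")}
        }
    return status, body
```

With this approach a client only has to check the status code, regardless of how the upstream server reported the failure.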