Ollama

Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. Use it through DeepIntShield with the same OpenAI-compatible request format, plus Ollama’s own configuration requirements. Key characteristics:

  • Local-first deployment - Run models locally or on private infrastructure
  • OpenAI API compatibility - Identical request/response format
  • Full feature support - Chat, text, embeddings, and streaming
  • Tool calling - Complete function definition and execution
  • Self-hosted - No external API dependency required
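| Operation | Non-Streaming | Streaming | Endpoint |
| --- | --- | --- | --- |
| Chat Completions | Yes | Yes | /v1/chat/completions |
| Responses API | Yes | Yes | /v1/chat/completions |
| Text Completions | Yes | Yes | /v1/completions |
| Embeddings | Yes | - | /v1/embeddings |
| List Models | Yes | - | /v1/models |
| Image Generation | No | No | - |
| Speech (TTS) | No | No | - |
| Transcriptions (STT) | No | No | - |
| Files | No | No | - |
| Batch | No | No | - |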

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.

The following parameters are not supported by Ollama and are ignored: prompt_cache_key, verbosity, store, service_tier.
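To see ahead of time which fields of a request will have no effect, the list above can be checked on the client side. The following is a minimal sketch, not the gateway's implementation; the helper name `split_unsupported` is hypothetical and only the four parameter names come from this page.

```python
# Parameters this page lists as ignored for Ollama.
UNSUPPORTED_PARAMS = {"prompt_cache_key", "verbosity", "store", "service_tier"}


def split_unsupported(payload: dict) -> tuple[dict, list[str]]:
    """Return the payload without ignored keys, plus the names that were dropped.

    Hypothetical helper for illustration: it mirrors the documented behavior
    locally so a caller can log or warn about fields that will be ignored.
    """
    kept = {key: value for key, value in payload.items() if key not in UNSUPPORTED_PARAMS}
    dropped = sorted(key for key in payload if key in UNSUPPORTED_PARAMS)
    return kept, dropped
```

Calling `split_unsupported(request_body)` before sending lets an application warn when it relies on a parameter that Ollama will not honor.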

Ollama supports all standard OpenAI message types, tools, response formats, and streaming formats. For details on how each is handled, refer to OpenAI Chat Completions.


Ollama supports the Responses API with the same parameter support as Chat Completions.


Ollama supports the legacy text completion format:

| Parameter | Mapping |
| --- | --- |
| prompt | Direct pass-through |
| max_tokens | max_tokens |
| temperature, top_p | Direct pass-through |
| stop | Stop sequences |
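As an illustration of the parameters in the table, here is a hedged sketch that builds (but does not send) a text completion request. It assumes the gateway host used elsewhere on this page also serves `/v1/completions`; the function name and default values are hypothetical.

```python
import json
import urllib.request

# Assumption: same gateway host as the chat example, with the /v1/completions path.
COMPLETIONS_URL = "https://app.deepintshield.com/v1/completions"


def build_completion_request(virtual_key, prompt, max_tokens=128,
                             temperature=0.7, top_p=1.0, stop=None):
    """Build a POST request for the legacy text completion format."""
    body = {
        "model": "ollama/llama3.1:latest",
        "prompt": prompt,            # passed through unchanged
        "max_tokens": max_tokens,
        "temperature": temperature,  # passed through unchanged
        "top_p": top_p,              # passed through unchanged
    }
    if stop:
        body["stop"] = list(stop)    # stop sequences
    return urllib.request.Request(
        COMPLETIONS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-bf-vk": virtual_key},
        method="POST",
    )
```

The request object can then be passed to `urllib.request.urlopen` when a gateway and virtual key are available.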

Ollama supports text embeddings:

| Parameter | Notes |
| --- | --- |
| input | Text or array of texts |
| model | Embedding model name |
| encoding_format | "float" or "base64" |
| dimensions | Custom output dimensions (optional) |

The response contains the embedding vectors along with token usage.
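When `encoding_format` is "base64", each embedding arrives as a string rather than a list of numbers. The sketch below decodes such a string, assuming the OpenAI convention of packed little-endian 32-bit floats; this page does not state the byte layout, so treat that as an assumption to verify against a real response. The function name is hypothetical.

```python
import base64
import struct


def decode_base64_embedding(encoded: str) -> list[float]:
    """Decode a base64 embedding into floats.

    Assumption: the payload is a packed array of little-endian float32 values.
    """
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("embedding byte length is not a multiple of 4")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))
```

With "float" encoding no decoding step is needed, at the cost of a larger response body.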


The List Models endpoint (/v1/models) lists the models currently loaded in Ollama, along with their capabilities and context information.


| Feature | Reason |
| --- | --- |
| Speech/TTS | Not offered by Ollama API |
| Transcription/STT | Not offered by Ollama API |
| Batch Operations | Not offered by Ollama API |
| File Management | Not offered by Ollama API |

Ollama is self-hosted, so you must configure a BaseURL that points to your Ollama instance (e.g., https://ollama.example.com). Once it is configured, call Ollama through the gateway like any other provider:

curl -X POST https://app.deepintshield.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-bf-vk: $DEEPINTSHIELD_VIRTUAL_KEY" \
-d '{
"model": "ollama/llama3.1:latest",
"messages": [{"role": "user", "content": "Hello"}]
}'
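The same call can be made from application code. The following is a rough Python equivalent of the curl command using only the standard library; the function names are hypothetical and the 120-second timeout is an arbitrary illustrative value, not a documented default.

```python
import json
import urllib.request

CHAT_URL = "https://app.deepintshield.com/v1/chat/completions"


def build_chat_request(virtual_key, user_text, model="ollama/llama3.1:latest"):
    """Build the same POST request as the curl example, without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-bf-vk": virtual_key},
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A typical call would be `send(build_chat_request(os.environ["DEEPINTSHIELD_VIRTUAL_KEY"], "Hello"))` after importing `os`.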

Environment Setup:

  1. Install Ollama from https://ollama.ai
  2. Pull a model:
    ollama pull llama3.1
    ollama pull mistral
    ollama pull neural-chat
  3. Start the Ollama server:
    ollama serve
  4. Verify it’s running:
    curl https://ollama.example.com/api/tags
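The verification step returns a JSON document with a `models` array, where each entry has a `name` such as `llama3.1:latest`. The sketch below parses such a response body to confirm that the pulled models are present; the helper names are hypothetical, and it assumes a model pulled without a tag is stored under `:latest`.

```python
import json


def installed_models(tags_body: str) -> list[str]:
    """Extract model names from the body returned by /api/tags."""
    data = json.loads(tags_body)
    return [entry["name"] for entry in data.get("models", [])]


def missing_models(tags_body: str, wanted: list[str]) -> list[str]:
    """Return the wanted models that are not installed.

    Assumption: a name without an explicit tag refers to the ':latest' tag.
    """
    installed = set(installed_models(tags_body))
    missing = []
    for name in wanted:
        full_name = name if ":" in name else f"{name}:latest"
        if full_name not in installed:
            missing.append(name)
    return missing
```

An empty result from `missing_models` suggests every model from step 2 was pulled successfully.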

Streaming for Large Models: Large models can take a while to produce a complete response, so set stream to true to receive tokens as they are generated:

{
"model": "ollama/llama3.1:latest",
"messages": [...],
"stream": true
}
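With `stream` enabled, an OpenAI-compatible endpoint sends server-sent events: lines beginning with `data:` that each carry a JSON chunk, ending with `data: [DONE]`. The following sketch assembles the generated text from such lines. It assumes the standard OpenAI chunk shape (`choices[].delta.content`); the function name is hypothetical.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas from OpenAI-style streaming lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other event fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)
```

In an interactive application each piece would usually be displayed as it arrives rather than collected at the end.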

Token Context: Different models have different context windows:

  • Llama 3.1 70B: 128K tokens
  • Mistral 7B: 32K tokens
  • Neural Chat 7B: 8K tokens
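A quick pre-flight check can catch prompts that are clearly too long for the chosen model. The sketch below uses the window sizes listed above, rounded to thousands, together with a rough 4-characters-per-token heuristic; that heuristic is an assumption, not a real tokenizer, and the names are hypothetical.

```python
import math

# Approximate context windows from the list above, in tokens.
CONTEXT_WINDOWS = {"llama3.1": 128_000, "mistral": 32_000, "neural-chat": 8_000}
CHARS_PER_TOKEN = 4  # crude heuristic; actual tokenization varies by model


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of a piece of text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(model: str, prompt: str, max_output_tokens: int = 0) -> bool:
    """Check whether prompt plus requested output fits the model's window."""
    family = model.split("/")[-1].split(":")[0]  # "ollama/mistral:latest" -> "mistral"
    window = CONTEXT_WINDOWS[family]
    return estimate_tokens(prompt) + max_output_tokens <= window
```

Because the estimate is coarse, leave a safety margin instead of filling the window exactly.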

GPU Acceleration: Ollama automatically uses a GPU if one is available. On CPU-only hosts, generation is considerably slower, so make sure the request timeout is long enough for the model to finish.


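| Model | Size | Context | Speed |
| --- | --- | --- | --- |
| llama3.1:latest | Varies | 128K | Fast |
| mistral:latest | 7B | 32K | Very Fast |
| neural-chat:latest | 7B | 8K | Very Fast |
| orca-mini:latest | 3B | 3K | Very Fast |
| openchat:latest | 7B | 8K | Very Fast |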

BaseURL Configuration Required

  • Severity: High
  • Behavior: BaseURL must be explicitly configured; there is no default
  • Impact: Requests fail without proper configuration
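Because there is no default, it can help to validate the BaseURL value before saving it to the provider configuration. This is a minimal sketch of such a check; the function name is hypothetical and the accepted schemes are an assumption.

```python
from urllib.parse import urlparse


def validate_base_url(base_url: str) -> str:
    """Return a normalized BaseURL, or raise ValueError if it is unusable."""
    if not base_url or not base_url.strip():
        raise ValueError("BaseURL is required for Ollama; there is no default")
    cleaned = base_url.strip()
    parsed = urlparse(cleaned)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"BaseURL must be an http(s) URL with a host: {cleaned!r}")
    return cleaned.rstrip("/")  # avoid double slashes when paths are appended
```

For example, `validate_base_url("https://ollama.example.com/")` yields the URL without the trailing slash.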

Cache Control Stripped

  • Severity: Low
  • Behavior: Cache control directives are removed from messages
  • Impact: Prompt caching features don’t work
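To illustrate what stripping looks like, here is a hedged sketch that removes cache directives from a message list. The field name `cache_control` and its placement on messages or content blocks are assumptions based on common prompt-caching conventions; this page does not specify the exact shape, and this is not the gateway's code.

```python
import copy


def strip_cache_control(messages: list[dict]) -> list[dict]:
    """Return a copy of the messages with any cache_control fields removed.

    Assumption: directives appear as a 'cache_control' key on a message or on
    the individual blocks of a list-valued 'content' field.
    """
    cleaned = copy.deepcopy(messages)  # leave the caller's messages untouched
    for message in cleaned:
        message.pop("cache_control", None)
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict):
                    block.pop("cache_control", None)
    return cleaned
```

The message text itself is unchanged; only the caching hint is lost.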

Parameter Filtering

  • Severity: Low
  • Behavior: OpenAI-specific parameters are filtered out
  • Impact: Parameters such as prompt_cache_key, verbosity, and store are removed before the request reaches Ollama

User Field Size Limit

  • Severity: Low
  • Behavior: A user field longer than 64 characters is silently dropped
  • Impact: Longer user identifiers are lost
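One way to avoid losing long identifiers is to shorten them on the client before sending. The sketch below passes short identifiers through and replaces longer ones with a SHA-256 hex digest, which is exactly 64 characters. The helper name is hypothetical, and hashing is only one possible strategy.

```python
import hashlib

MAX_USER_LENGTH = 64  # limit described above


def safe_user_id(user_id: str) -> str:
    """Return a user identifier that fits within the 64-character limit."""
    if len(user_id) <= MAX_USER_LENGTH:
        return user_id
    # A SHA-256 hex digest is 64 characters and stable for the same input.
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()
```

The digest is not reversible, so keep a mapping on your side if you need to recover the original identifier.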