Ollama

Overview

Ollama is a local-first, OpenAI-compatible inference engine for running large language models on personal computers or servers. Use it through DeepIntShield with the same OpenAI-compatible request format, plus Ollama’s own configuration requirements. Key characteristics:

Local-first deployment - Run models locally or on private infrastructure
OpenAI API compatibility - Identical request/response format
Full feature support - Chat, text, embeddings, and streaming
Tool calling - Complete function definition and execution
Self-hosted - No external API dependency required

Supported Operations

Operation	Non-Streaming	Streaming	Endpoint
Chat Completions	✅	✅	`/v1/chat/completions`
Responses API	✅	✅	`/v1/chat/completions`
Text Completions	✅	✅	`/v1/completions`
Embeddings	✅	-	`/v1/embeddings`
List Models	✅	-	`/v1/models`
Image Generation	❌	❌	-
Speech (TTS)	❌	❌	-
Transcriptions (STT)	❌	❌	-
Files	❌	❌	-
Batch	❌	❌	-

1. Chat Completions

Request Parameters

Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see OpenAI Chat Completions.

The following parameters are not supported by Ollama and are ignored: prompt_cache_key, verbosity, store, service_tier.
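If you would rather not send the ignored fields at all, you can drop them client-side before calling the gateway. The following is a minimal sketch, not part of DeepIntShield or Ollama: the helper name `strip_unsupported` is hypothetical, and the parameter list is copied from the sentence above.

```python
from __future__ import annotations

# Parameters this page lists as unsupported by Ollama and ignored.
UNSUPPORTED_PARAMS = frozenset(
    {"prompt_cache_key", "verbosity", "store", "service_tier"}
)


def strip_unsupported(payload: dict) -> tuple[dict, list[str]]:
    """Return a copy of `payload` without the ignored keys,
    plus a sorted list of the keys that were dropped."""
    dropped = sorted(k for k in payload if k in UNSUPPORTED_PARAMS)
    cleaned = {k: v for k, v in payload.items() if k not in UNSUPPORTED_PARAMS}
    return cleaned, dropped


# Hypothetical request body, used only for illustration.
request_body = {
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}],
    "store": True,
    "verbosity": "low",
}
cleaned_body, dropped_keys = strip_unsupported(request_body)
print(dropped_keys)
```

Returning the dropped keys makes it easy to log a warning when a caller relies on a parameter that will have no effect.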

Message types, tools, and streaming formats follow the standard OpenAI conventions. For details on each, refer to OpenAI Chat Completions.

2. Responses API

Ollama supports the Responses API with the same parameter support as Chat Completions.

3. Text Completions

Ollama supports the legacy text completion format:

Parameter	Mapping
`prompt`	Direct pass-through
`max_tokens`	max_tokens
`temperature`, `top_p`	Direct pass-through
`stop`	Stop sequences
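As an illustration of the table above, the sketch below assembles a text completion body that uses only the listed parameters. The helper name `build_text_completion` and the model name `ollama/mistral:latest` are assumptions for the example; the model name simply follows the `ollama/<model>` pattern used in the Configuration section.

```python
from __future__ import annotations

import json


def build_text_completion(
    model: str,
    prompt: str,
    max_tokens: int | None = None,
    temperature: float | None = None,
    top_p: float | None = None,
    stop: list[str] | None = None,
) -> dict:
    """Build a legacy text completion body, omitting unset optional fields."""
    payload: dict = {"model": model, "prompt": prompt}
    optional = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# Hypothetical values, for illustration only.
body = build_text_completion(
    "ollama/mistral:latest",
    "Write a haiku about local inference.",
    max_tokens=64,
    temperature=0.7,
    stop=["\n\n"],
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a POST to `/v1/completions`.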

4. Embeddings

Ollama supports text embeddings:

Parameter	Notes
`input`	Text or array of texts
`model`	Embedding model name
`encoding_format`	"float" or "base64"
`dimensions`	Custom output dimensions (optional)

Response returns embedding vectors with token usage.
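The sketch below builds an embeddings request from the parameters above and decodes a returned vector in either encoding. It assumes that `base64` embeddings are packed little-endian 32-bit floats, as in OpenAI's API; verify this against your deployment. The helper names and the model name `ollama/nomic-embed-text` are placeholders.

```python
from __future__ import annotations

import base64
import struct


def build_embedding_request(
    model: str,
    texts: str | list[str],
    encoding_format: str = "float",
    dimensions: int | None = None,
) -> dict:
    """Build an embeddings request body from the documented parameters."""
    if encoding_format not in ("float", "base64"):
        raise ValueError("encoding_format must be 'float' or 'base64'")
    payload: dict = {
        "model": model,
        "input": texts,
        "encoding_format": encoding_format,
    }
    if dimensions is not None:
        payload["dimensions"] = dimensions
    return payload


def decode_embedding(value: str | list[float]) -> list[float]:
    """Return the vector as a list of floats for either encoding.

    Assumption: base64 payloads are little-endian float32 values.
    """
    if isinstance(value, str):
        raw = base64.b64decode(value)
        return list(struct.unpack(f"<{len(raw) // 4}f", raw))
    return [float(x) for x in value]


# Hypothetical model name, for illustration only.
request_body = build_embedding_request(
    "ollama/nomic-embed-text", ["first text", "second text"], "base64"
)
print(request_body)
```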

5. List Models

Lists models currently loaded in Ollama with capabilities and context information.

Unsupported Features

Feature	Reason
Speech/TTS	Not offered by Ollama API
Transcription/STT	Not offered by Ollama API
Batch Operations	Not offered by Ollama API
File Management	Not offered by Ollama API

Configuration

Ollama is self-hosted, so you must configure the BaseURL to point to your Ollama instance (e.g., https://ollama.example.com). Once configured, call it through the gateway like any other provider:

```
curl -X POST https://app.deepintshield.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: $DEEPINTSHIELD_VIRTUAL_KEY" \
  -d '{
    "model": "ollama/llama3.1:latest",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```
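The same call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl request: the URL, the `x-bf-vk` header, and the model name come from that example, while the function names and the 120-second timeout are illustrative assumptions.

```python
import json
import urllib.request

GATEWAY_URL = "https://app.deepintshield.com/v1/chat/completions"


def build_chat_request(
    virtual_key: str, model: str, messages: list
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the gateway."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json", "x-bf-vk": virtual_key},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 120.0) -> dict:
    """Send the request and return the parsed JSON response.

    The timeout value is an arbitrary placeholder; tune it for your hardware.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


# "vk-example" is a placeholder; read the real key from your environment.
request = build_chat_request(
    "vk-example",
    "ollama/llama3.1:latest",
    [{"role": "user", "content": "Hello"}],
)
print(request.get_method(), request.full_url)
```

Calling `send(request)` performs the network call. In an OpenAI-style response, the reply text is found at `choices[0].message.content`.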

Environment Setup:

Install Ollama from https://ollama.ai

Pull a model:

```
ollama pull llama3.1
ollama pull mistral
ollama pull neural-chat
```

Start Ollama server:
```
ollama serve
```

Verify it’s running:

```
curl https://ollama.example.com/api/tags
```
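To confirm that the models you pulled are actually available, you can inspect the JSON that `/api/tags` returns. The sketch below assumes the usual Ollama response shape, a top-level `models` array whose entries carry a `name` field. The sample body is made up for illustration and is not real server output.

```python
from __future__ import annotations

import json


def model_names(tags_body: str) -> list[str]:
    """Extract model names from an /api/tags JSON response body."""
    data = json.loads(tags_body)
    return [entry["name"] for entry in data.get("models", [])]


def missing_models(required: list[str], available: list[str]) -> list[str]:
    """Return the required models that are not in the available list."""
    return [name for name in required if name not in available]


# Illustrative sample only; real responses contain more fields per model.
sample_body = json.dumps(
    {"models": [{"name": "llama3.1:latest"}, {"name": "mistral:latest"}]}
)
available = model_names(sample_body)
print(missing_models(["llama3.1:latest", "neural-chat:latest"], available))
```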

Performance Considerations

Streaming for Large Models: Large models can take a long time to produce a full response. Set `stream` to true so that output is delivered as it is generated:

```
{
  "model": "llama3.1:latest",
  "messages": [...],
  "stream": true
}
```
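On the client side, a streamed response has to be reassembled from its chunks. The sketch below assumes the OpenAI-style server-sent events format: lines starting with `data:` that each carry a JSON chunk, ending with `data: [DONE]`. The function name and the sample lines are illustrative.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Hand-written sample lines, not captured server output.
sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```

In a real client, the lines would come from iterating over the HTTP response body instead of a list.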

Token Context: Different models have different context windows:

Llama 3.1 70B: 128K tokens
Mistral 7B: 32K tokens
Neural Chat 7B: 8K tokens

GPU Acceleration: Ollama automatically uses a GPU if one is available. On CPU-only hosts, generation is considerably slower, so make sure the request timeout is long enough for responses to complete.

Popular Models

Model	Size	Context	Speed
llama3.1:latest	Varies	128K	Fast
mistral:latest	7B	32K	Very Fast
neural-chat:latest	7B	8K	Very Fast
orca-mini:latest	3B	3K	Very Fast
openchat:latest	7B	8K	Very Fast

Caveats

BaseURL Configuration Required

Severity: High
Behavior: BaseURL must be explicitly configured - no default
Impact: Requests fail without proper configuration

Cache Control Stripped

Severity: Low
Behavior: Cache control directives are removed from messages
Impact: Prompt caching features don’t work

Parameter Filtering

Severity: Low
Behavior: OpenAI-specific parameters filtered out
Impact: prompt_cache_key, verbosity, store removed

User Field Size Limit

Severity: Low
Behavior: User field > 64 characters silently dropped
Impact: Longer user identifiers are lost
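One way to avoid losing long identifiers is to shorten them before sending. The sketch below is a possible client-side workaround, not something the gateway does: it replaces any `user` value longer than 64 characters with its SHA-256 hex digest, which is exactly 64 characters and stable for the same input. The helper name is hypothetical.

```python
import hashlib

MAX_USER_LEN = 64  # limit stated in the caveat above


def safe_user_id(user: str) -> str:
    """Return `user` unchanged if it fits, else a 64-character SHA-256 digest."""
    if len(user) <= MAX_USER_LEN:
        return user
    return hashlib.sha256(user.encode("utf-8")).hexdigest()


# Hypothetical long identifier, for illustration only.
long_id = "tenant-42/" + "x" * 80
print(len(safe_user_id(long_id)))
```

Hashing is one-way, so keep your own mapping from digest to original identifier if you need to trace requests back to users.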