Hugging Face

The Hugging Face provider lets you call models hosted across many inference backends (such as hf-inference, fal-ai, cerebras, and sambanova) through a single, unified DeepIntShield endpoint.

Through the Hugging Face provider you get:

  • Multiple inference backends: Route requests to 19+ different inference providers
  • Dynamic model aliasing: Address any backend model with a single composite model ID
  • Heterogeneous request formats: JSON, raw binary, and base64-encoded payloads are handled for you
  • Provider-specific constraints: Varying payload limits and format restrictions are enforced automatically

The table below lists the currently supported inference providers and their capabilities (as of December 2025):

Provider | Chat | Embedding | Speech (TTS) | Transcription (ASR) | Image Generation | Image Generation (stream) | Image Edit | Image Edit (stream)
hf-inference
cerebras
cohere
fal-ai
featherless-ai
fireworks
groq
hyperbolic
nebius
novita
nscale
ovhcloud-ai-endpoints
public-ai
replicate
sambanova
scaleway
together
z-ai

Unlike standard providers where model IDs are direct strings (e.g., gpt-4), Hugging Face models in DeepIntShield are identified by a composite key to route requests to the correct inference backend.

Format: huggingface/[inference_provider]/[model_id]

  • inference_provider: The backend service (e.g., hf-inference, fal-ai, cerebras).
  • model_id: The actual model identifier on Hugging Face Hub (e.g., meta-llama/Meta-Llama-3-8B-Instruct).

Example: huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct

DeepIntShield reads this model string to route each request to the correct inference backend automatically.
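
As a rough illustration of the composite format, the sketch below splits a model string into its inference provider and Hub model ID. The names `HFModelRef`, `parse_model`, and `build_model` are hypothetical helpers for client code, not part of DeepIntShield. Note that the Hub model ID may itself contain slashes, so only the first two separators are significant.

```python
from typing import NamedTuple


class HFModelRef(NamedTuple):
    """Hypothetical container for the two routing parts of a composite ID."""
    inference_provider: str
    model_id: str


def parse_model(model: str) -> HFModelRef:
    """Split 'huggingface/<inference_provider>/<model_id>' into its parts."""
    prefix, sep, rest = model.partition("/")
    if prefix != "huggingface" or not sep:
        raise ValueError(f"not a Hugging Face composite model ID: {model!r}")
    # Only the first remaining slash separates provider from model ID;
    # everything after it (including further slashes) is the Hub model ID.
    provider, _, model_id = rest.partition("/")
    if not provider or not model_id:
        raise ValueError(f"missing provider or model ID in {model!r}")
    return HFModelRef(provider, model_id)


def build_model(inference_provider: str, model_id: str) -> str:
    """Compose the model string used in DeepIntShield requests."""
    return f"huggingface/{inference_provider}/{model_id}"
```

For example, `parse_model("huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct")` yields the provider `hf-inference` and the model ID `meta-llama/Meta-Llama-3-8B-Instruct`.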

The Hugging Face provider handles various tasks (Chat, Speech, Transcription) which often require different request structures depending on the underlying inference provider.

Different inference providers have specific limitations and requirements:

The Hugging Face API enforces a 2 MB request body limit across all request types (Chat, Embedding, Speech, Transcription). This constraint applies to:

  • JSON request payloads
  • Raw audio bytes in transcription requests
  • Any other request body data

Impact: Large audio files, extensive chat histories, or bulk embedding requests may need to be split or compressed before sending.
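
One way to stay under the limit for bulk embeddings is to split the inputs client-side. The sketch below is a minimal example under stated assumptions: it reads "2 MB" conservatively as 2,000,000 bytes, and it assumes an OpenAI-style body with `model` and `input` fields. Neither detail is confirmed by this page, and `batch_embedding_inputs` is a hypothetical helper.

```python
import json

# Assumption: "2 MB" is interpreted conservatively as 2,000,000 bytes.
MAX_BODY_BYTES = 2_000_000


def body_size(payload: dict) -> int:
    """Size in bytes of the JSON-encoded request body."""
    return len(json.dumps(payload).encode("utf-8"))


def batch_embedding_inputs(model: str, texts: list[str],
                           limit: int = MAX_BODY_BYTES) -> list[list[str]]:
    """Greedily group texts so that each request body fits within `limit`."""
    batches: list[list[str]] = []
    current: list[str] = []
    for text in texts:
        # A single input that is too large can never be sent as-is.
        if body_size({"model": model, "input": [text]}) > limit:
            raise ValueError("a single input exceeds the request body limit")
        # Start a new batch when adding this text would overflow the current one.
        if current and body_size({"model": model, "input": current + [text]}) > limit:
            batches.append(current)
            current = []
        current.append(text)
    if current:
        batches.append(current)
    return batches
```

Each returned batch can then be sent as its own embedding request. The body is re-serialized on every step, which is simple rather than fast; for very large input lists you may prefer to track a running size instead.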

The fal-ai provider has strict audio format requirements:

  • Supported Format: Only MP3 (audio/mpeg) is accepted
  • Rejected Formats: WAV (audio/wav) and other formats are rejected with a clear error: fal-ai provider does not support audio/wav format; please use a different format like mp3 or ogg
  • Encoding: You can submit standard audio bytes; DeepIntShield base64-encodes them into the Data URI that fal-ai expects.
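
For orientation only, the sketch below shows the general shape of a base64 Data URI for MP3 audio. DeepIntShield performs this encoding for you, and the exact string it sends to fal-ai is not documented here; `to_data_uri` is a hypothetical helper that illustrates the format rather than DeepIntShield's implementation.

```python
import base64


def to_data_uri(audio: bytes, mime_type: str = "audio/mpeg") -> str:
    """Encode raw audio bytes as a 'data:<mime>;base64,<payload>' URI."""
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Because base64 expands data by roughly a third, an encoded payload is noticeably larger than the original audio file.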

Text-to-Speech (TTS) requests use the standard DeepIntShield /v1/audio/speech endpoint. You provide the text and a huggingface/{provider}/{model} model ID; DeepIntShield builds the backend-specific request for you. No special pipeline tagging is required on your side, even when the model is tagged as text-to-speech on the Hub.
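
A minimal sketch of preparing such a request with the Python standard library follows. The host and bearer-token header are taken from the curl examples on this page; the `input` field name for the text is an assumption based on OpenAI-style speech APIs, and `build_speech_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Host taken from the curl examples on this page.
BASE_URL = "https://app.deepintshield.com"


def build_speech_request(api_key: str, provider: str, model_id: str,
                         text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/audio/speech."""
    body = {
        "model": f"huggingface/{provider}/{model_id}",
        "input": text,  # assumption: OpenAI-style field name for the text
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the audio bytes from the response.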

Transcription (Automatic Speech Recognition)

For transcription, the request format that reaches the backend differs depending on the inference provider you target. DeepIntShield handles this for you - you always send your audio to the standard /v1/audio/transcriptions endpoint.

For the standard hf-inference backend, the audio is forwarded as raw bytes.

  • Audio Format: Send the audio with its native mime type (e.g., audio/mpeg).
  • Payload Limit: Maximum 2 MB for the raw audio bytes.
  • What you do: Submit your audio file to /v1/audio/transcriptions as usual - no special encoding required.

For the fal-ai backend, the audio is forwarded as a base64-encoded Data URI.

  • Audio Format Restriction: Only MP3 (audio/mpeg) is supported. WAV files are rejected with a clear error.
  • What you do: Submit your MP3 audio file as usual; DeepIntShield base64-encodes it into the Data URI that fal-ai requires.

In both cases you use the same DeepIntShield request - the backend-specific encoding (raw bytes vs. base64 Data URI) is applied for you based on the inference provider in your model ID.
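
Since the two backends reject different inputs, a client-side pre-flight check can catch problems before the upload. The sketch below encodes only the two rules stated above (MP3-only for fal-ai, 2 MB of raw audio for hf-inference). It assumes "2 MB" means 2,000,000 bytes, and `preflight_transcription` is a hypothetical helper.

```python
# Assumption: "2 MB" is interpreted conservatively as 2,000,000 bytes.
MAX_RAW_AUDIO_BYTES = 2_000_000


def preflight_transcription(provider: str, mime_type: str,
                            audio: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: list[str] = []
    if provider == "fal-ai" and mime_type != "audio/mpeg":
        problems.append(f"fal-ai accepts only audio/mpeg, got {mime_type}")
    if provider == "hf-inference" and len(audio) > MAX_RAW_AUDIO_BYTES:
        problems.append(
            f"audio is {len(audio)} bytes, above the {MAX_RAW_AUDIO_BYTES}-byte limit"
        )
    return problems
```

The function reports every problem it finds rather than raising on the first one, so a caller can show all issues at once.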

The Hugging Face provider supports image generation through multiple inference providers, each with different request formats and capabilities.

Provider | Non-Streaming | Streaming | Notes
hf-inference | Yes | No | Simple prompt-only format, returns raw image bytes
fal-ai | Yes | Yes | Full parameter support, supports streaming via Server-Sent Events
nebius | Yes | No | Uses Nebius-specific format with width/height, LoRA support
together | Yes | No | OpenAI-compatible format

Send a standard /v1/images/generations request with the model ID huggingface/{provider}/{model_id}. The supported parameters depend on the backend:

hf-inference is the simplest backend and requires only a prompt:

  • Supported Input: prompt only.
  • Response: A generated image, returned as base64 in b64_json.
  • Limitations: No size, quality, or other parameter support.

fal-ai is the most feature-rich backend, with extensive parameter support:

  • Supported Parameters: n, size (e.g., "1024x1024"), output_format ("jpg" is treated as "jpeg"), response_format, and moderation.
  • Extra Parameters: guidance_scale, acceleration, enable_prompt_expansion, enable_safety_checker.
  • Response: Images returned in data[] with url and/or b64_json.

nebius supports LoRAs (see the Nebius provider docs):

  • Supported Parameters: size (e.g., "1024x1024"), output_format, seed, negative_prompt.
  • Extra Parameters: num_inference_steps, guidance_scale, and loras (provided as a {"url": scale} map or [{"url": "...", "scale": ...}] array).
  • Response: Images returned in data[].
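
The two accepted `loras` shapes carry the same information. As an illustration, the sketch below converts either shape into the array form; `normalize_loras` is a hypothetical client-side helper, and DeepIntShield accepts both shapes directly, so this conversion is optional.

```python
from typing import Union

LoraMap = dict[str, float]
LoraList = list[dict]


def normalize_loras(loras: Union[LoraMap, LoraList]) -> LoraList:
    """Return LoRAs as a list of {"url": ..., "scale": ...} entries."""
    if isinstance(loras, dict):
        # Map form: {"<url>": <scale>, ...}
        return [{"url": url, "scale": float(scale)} for url, scale in loras.items()]
    # Array form: [{"url": "<url>", "scale": <scale>}, ...]
    return [{"url": item["url"], "scale": float(item["scale"])} for item in loras]
```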

together uses an OpenAI-compatible format:

  • Supported Parameters: prompt, model, response_format, size, n, and num_inference_steps.
  • Response: Images returned in data[].
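
Because the accepted parameters differ per backend, it can help to check a request against the lists above before sending it. The sketch below transcribes those lists into a lookup table. Including `prompt` for fal-ai and nebius, and `stream` for fal-ai, is an assumption drawn from the examples on this page, and `unsupported_params` is a hypothetical helper.

```python
# Parameter names transcribed from the per-backend lists above.
# Assumption: "prompt" is accepted by every backend, "stream" only by fal-ai.
SUPPORTED_PARAMS = {
    "hf-inference": {"prompt"},
    "fal-ai": {
        "prompt", "n", "size", "output_format", "response_format", "moderation",
        "guidance_scale", "acceleration", "enable_prompt_expansion",
        "enable_safety_checker", "stream",
    },
    "nebius": {
        "prompt", "size", "output_format", "seed", "negative_prompt",
        "num_inference_steps", "guidance_scale", "loras",
    },
    "together": {"prompt", "response_format", "size", "n", "num_inference_steps"},
}


def unsupported_params(provider: str, params: dict) -> list[str]:
    """Return the request keys that the given backend is not listed as accepting."""
    allowed = SUPPORTED_PARAMS[provider] | {"model"}  # "model" is always sent
    return sorted(set(params) - allowed)
```

For instance, passing a `size` to hf-inference is flagged, while the fal-ai request shown in the curl example on this page passes cleanly.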

Only fal-ai supports streaming for Hugging Face image generation. Streaming uses Server-Sent Events (SSE).

To stream, send the same image-generation request with "stream": true. The same parameters as non-streaming apply (prompt, response_format, n, size, etc.). You receive partial image chunks until a final completed chunk delivers the last image, as either a url or b64_json.

# fal-ai
curl -X POST https://app.deepintshield.com/v1/images/generations \
-H "Authorization: Bearer sk-bf-your-virtual-key" \
-H "Content-Type: application/json" \
-d '{
"model": "huggingface/fal-ai/fal-ai/flux/dev",
"prompt": "A futuristic cityscape at sunset",
"size": "1024x1024",
"n": 2,
"output_format": "png",
"response_format": "url"
}'
# Streaming (fal-ai only)
curl -X POST https://app.deepintshield.com/v1/images/generations \
-H "Authorization: Bearer sk-bf-your-virtual-key" \
-H "Content-Type: application/json" \
-d '{
"model": "huggingface/fal-ai/fal-ai/flux/dev",
"prompt": "A futuristic cityscape at sunset",
"size": "1024x1024",
"stream": true
}'

Provider-specific notes:

  • fal-ai:
    • output_format: "jpg" is treated as "jpeg"
    • Set moderation: "low" to disable the safety checker
  • nebius:
    • LoRAs can be provided as a {"url": scale} map or [{"url": "...", "scale": ...}] array
  • hf-inference:
    • Minimal format, only prompt supported; returns the image as base64
  • together:
    • OpenAI-compatible format

Only fal-ai supports image editing for Hugging Face. Image edit requests are routed to the fal-ai inference provider.

Request Parameters

Parameter | Type | Required | Notes
model | string | | Model identifier (must be huggingface/fal-ai/{model_id})
prompt | string | | Text description of the edit
image[] | binary | | Image file(s) to edit (supports multiple images for some models)
n | int | | Number of images to generate (1-10)
size | string | | Image size: "WxH" format (e.g., "1024x1024")
output_format | string | | Output format: "png", "webp", "jpeg" (note: "jpg" is normalized to "jpeg")
seed | int | | Seed for reproducibility (pass as an extra param)
num_inference_steps | int | | Number of inference steps (pass as an extra param)
guidance_scale | float | | Guidance scale (pass as an extra param)
acceleration | string | | Acceleration mode (pass as an extra param)
enable_safety_checker | bool | | Enable safety checker (pass as an extra param)
use_image_urls | bool | | Override automatic image field selection (pass as an extra param)

Behavior

  • Supported Backend: Only the fal-ai inference provider supports image edit. Requests targeting other backends return an unsupported-operation error.
  • Image Handling: Submit your image file(s) directly. DeepIntShield detects the format (image/jpeg, image/webp, image/png) and encodes them into the form the backend expects.
  • Single vs. Multiple Images: For multi-image models (e.g., fal-ai/flux-2/edit, fal-ai/flux-2-pro/edit) you can pass several images; single-image models (e.g., fal-ai/flux-pro/kontext, fal-ai/flux/dev/image-to-image) accept one. The correct field is selected automatically, and you can override it with the use_image_urls extra param.
  • Normalization: output_format: "jpg" is normalized to "jpeg"; size is given in "WxH" form (e.g., "1024x1024").
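
The normalization and range rules above can be mirrored in client code to validate input early. The sketch below is an illustration of those rules, not DeepIntShield's implementation; the function names are hypothetical.

```python
def normalize_output_format(output_format: str) -> str:
    """Map "jpg" to "jpeg"; leave every other value unchanged."""
    return "jpeg" if output_format == "jpg" else output_format


def parse_size(size: str) -> tuple[int, int]:
    """Parse a "WxH" string such as "1024x1024" into (width, height)."""
    width, sep, height = size.partition("x")
    if not sep or not width.isdecimal() or not height.isdecimal():
        raise ValueError(f"size must look like '1024x1024', got {size!r}")
    return int(width), int(height)


def check_n(n: int) -> int:
    """Validate the documented 1-10 range for the number of images."""
    if not 1 <= n <= 10:
        raise ValueError(f"n must be between 1 and 10, got {n}")
    return n
```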

Response

  • Non-streaming: Edited images are returned in data[] with url and/or b64_json, the same as image generation.
  • Streaming: Partial image chunks (type: "image_edit.partial_image") stream until a final type: "image_edit.completed" chunk delivers the last image, as either a url or b64_json.
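
A minimal sketch of consuming such a stream on the client side follows. It assumes each SSE `data:` line carries one complete JSON object with a top-level `type` field, and that a literal `[DONE]` sentinel may appear and can be skipped. Both are assumptions about the wire format, and the helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON object carried by each 'data:' line of an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":  # assumed end-of-stream sentinel
            continue
        yield json.loads(payload)


def final_image(lines: Iterable[str]) -> Optional[dict]:
    """Return the 'image_edit.completed' chunk, or None if the stream lacks one."""
    for chunk in iter_sse_json(lines):
        if chunk.get("type") == "image_edit.completed":
            return chunk
    return None
```

Any iterable of text lines works as input, for example the decoded lines of a streaming HTTP response.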

Image Variation

Image variation is not supported by the Hugging Face provider.

Different inference backends expect different request shapes for the same operation. You always use the standard DeepIntShield endpoints (for example /v1/embeddings for embeddings) regardless of backend - the backend-specific request format is handled for you.

DeepIntShield discovers available Hugging Face models for you via the Hugging Face Hub API.

When you call GET /v1/models, DeepIntShield:

  1. Queries multiple backends to gather available models
  2. Filters by capability (chat, embedding, text-to-speech, transcription, image) so each model is listed under the methods it supports
  3. Aggregates the results into a single unified list
  4. Returns model IDs in the huggingface/{provider}/{model_id} format you use in requests
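
Once you have the returned IDs, grouping them by backend makes it easy to see which providers serve a given model. The sketch below works on a plain list of ID strings; how you extract those from the GET /v1/models response depends on its exact shape, which this page does not specify. `group_by_provider` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_provider(model_ids: list[str]) -> dict[str, list[str]]:
    """Group 'huggingface/{provider}/{model_id}' strings by provider."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for full_id in model_ids:
        parts = full_id.split("/", 2)  # keep slashes inside the Hub model ID
        if len(parts) == 3 and parts[0] == "huggingface":
            grouped[parts[1]].append(parts[2])
    return dict(grouped)
```

IDs that do not follow the Hugging Face composite format (for example, models from other providers) are ignored.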

DeepIntShield maps each Hugging Face model ID to the identifier the target backend expects (for example, meta-llama/Meta-Llama-3-8B-Instruct resolves to the right per-backend model name for cerebras, groq, and others). If a backend reports that a mapping is stale (HTTP 404), DeepIntShield refreshes it and retries automatically, so you don’t have to handle the retry yourself.
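
The refresh-and-retry behavior described above follows a common cache pattern, sketched below in generic form. This is not DeepIntShield's code: `StaleMapping`, `lookup`, and `call` are hypothetical stand-ins for "backend returned HTTP 404", "fetch the current mapping", and "send the request", and a single retry is assumed.

```python
from typing import Callable


class StaleMapping(Exception):
    """Raised by the hypothetical `call` when the backend answers HTTP 404."""


def call_with_refresh(cache: dict[str, str], hub_id: str,
                      lookup: Callable[[str], str],
                      call: Callable[[str], str]) -> str:
    """Call the backend with a cached mapping, refreshing it once if stale."""
    if hub_id not in cache:
        cache[hub_id] = lookup(hub_id)
    try:
        return call(cache[hub_id])
    except StaleMapping:
        # The cached backend name is out of date: refresh it and retry once.
        cache[hub_id] = lookup(hub_id)
        return call(cache[hub_id])
```

If the retry also fails, the exception propagates to the caller, which keeps a persistently missing model from causing an endless loop.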

When working with the Hugging Face provider:

  1. Check Payload Size: Ensure request bodies are under 2 MB
  2. Audio Format: Use MP3 for fal-ai, avoid WAV files
  3. Model Aliases: Always specify the provider in the model string: huggingface/{provider}/{model}
  4. Provider Selection: Use the capability table above to pick a backend that supports your operation (chat, embedding, TTS, ASR, image)
  5. Verify Capabilities: Confirm the model supports your use case (chat, embedding, TTS, ASR, image) before sending requests