Audio & OCR Configuration

SmartRAG converts audio files to text with Whisper.net and extracts text from images and PDFs with Tesseract OCR:

Whisper.net (Local Audio Transcription)

Whisper.net provides local, on-premise audio transcription with support for 99+ languages:

WhisperConfig Parameters

Parameter	Type	Default	Description
`ModelPath`	`string`	`"models/ggml-large-v3.bin"`	Path to Whisper model file
`DefaultLanguage`	`string`	`"auto"`	Language code for transcription
`MinConfidenceThreshold`	`double`	`0.3`	Minimum confidence score (0.0-1.0)
`PromptHint`	`string`	`""`	Context hint for better accuracy
`MaxThreads`	`int`	`0`	CPU threads (0 = auto-detect)
`ForceTranscribeOnly`	`bool`	`true`	When true, only transcribe in the source language; never translate to English.
`UseGpu`	`bool`	`false`	When true, use GPU if your application references a GPU runtime (CUDA on Windows/Linux, CoreML on macOS).
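
As a rough illustration of what MinConfidenceThreshold and MaxThreads express, the sketch below filters transcription segments by confidence and resolves MaxThreads = 0 to the machine's processor count. The Segment record and both helper methods are hypothetical names introduced for this example; they are not SmartRAG or Whisper.net types, and SmartRAG's internal handling may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical segment shape for illustration; not a SmartRAG or Whisper.net type.
public record Segment(string Text, double Confidence);

public static class WhisperSettingsSketch
{
    // Keep only segments whose confidence meets the threshold (0.0-1.0).
    public static List<Segment> FilterByConfidence(IEnumerable<Segment> segments, double minConfidence)
    {
        if (minConfidence < 0.0 || minConfidence > 1.0)
            throw new ArgumentOutOfRangeException(nameof(minConfidence), "Expected a value between 0.0 and 1.0.");
        return segments.Where(s => s.Confidence >= minConfidence).ToList();
    }

    // Treat 0 (or a negative value) as "auto-detect": use all logical processors.
    public static int ResolveThreadCount(int maxThreads) =>
        maxThreads > 0 ? maxThreads : Environment.ProcessorCount;
}
```

With the default threshold of 0.3, a segment scored 0.1 would be dropped and one scored 0.9 would be kept.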

GPU acceleration

By default Whisper runs on CPU (package Whisper.net.Runtime). To use GPU, the application that references SmartRAG must add the matching Whisper.net runtime package and set WhisperConfig.UseGpu = true in configuration.

Windows (NVIDIA): In your project, add package Whisper.net.Runtime.Cuda.Windows, then set UseGpu = true in SmartRAG:WhisperConfig.
Linux (NVIDIA): Add Whisper.net.Runtime.Cuda.Linux, then set UseGpu = true.
macOS (Apple Silicon): Add Whisper.net.Runtime.CoreML, then set UseGpu = true. If you see Metal init errors, leave UseGpu = false.

SmartRAG does not reference GPU runtimes by default, so CPU-only deployments work without extra packages. Only add a GPU runtime if you want acceleration and your environment supports it.
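
The platform-to-package mapping above can be written down as a small helper, for example to log which runtime package a deployment would need. This is a sketch only: the class and method names are hypothetical, and it does not detect whether a supported GPU or driver is actually present.

```csharp
using System.Runtime.InteropServices;

public static class WhisperGpuRuntimeSketch
{
    // Returns the GPU runtime package named in this section for the given platform,
    // or null when only the default CPU runtime (Whisper.net.Runtime) applies.
    public static string? SuggestPackage(OSPlatform os, Architecture arch)
    {
        if (os == OSPlatform.Windows) return "Whisper.net.Runtime.Cuda.Windows";
        if (os == OSPlatform.Linux) return "Whisper.net.Runtime.Cuda.Linux";
        if (os == OSPlatform.OSX && arch == Architecture.Arm64) return "Whisper.net.Runtime.CoreML";
        return null;
    }

    // Convenience overload for the machine the code is running on.
    public static string? SuggestPackageForCurrentMachine()
    {
        foreach (var os in new[] { OSPlatform.Windows, OSPlatform.Linux, OSPlatform.OSX })
        {
            if (RuntimeInformation.IsOSPlatform(os))
                return SuggestPackage(os, RuntimeInformation.OSArchitecture);
        }
        return null;
    }
}
```

A null result means there is nothing to add: keep UseGpu = false and run on CPU.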

Whisper Model Sizes

Model	Size	Speed	Accuracy	Use Case
`tiny`	75MB	⭐⭐⭐⭐⭐	⭐⭐	Fast prototyping
`base`	142MB	⭐⭐⭐⭐	⭐⭐⭐	Balanced performance
`small`	466MB	⭐⭐⭐	⭐⭐⭐⭐	Good accuracy
`medium`	1.5GB	⭐⭐	⭐⭐⭐⭐⭐	High accuracy
`large-v3`	2.9GB	⭐	⭐⭐⭐⭐⭐	Best accuracy

Model Download

GGML models are downloaded from Hugging Face automatically on first use and saved to the path configured in ModelPath:

Automatic Download:

Models are downloaded automatically when first used via WhisperGgmlDownloader
Downloaded from Hugging Face repository
Saved to the path specified in ModelPath (default: models/ggml-large-v3.bin)
No manual download required

Model Files:

Format: ggml-{model-name}.bin (e.g., ggml-base.bin, ggml-large-v3.bin)
Available models: tiny, base, small, medium, large-v3
First use downloads the model automatically (~5-10 minutes depending on connection and model size)
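
The naming convention above is easy to get wrong by hand, so a small helper can build the ModelPath value from a model name. This is a minimal sketch; WhisperModelFiles and its methods are hypothetical names and not part of SmartRAG or Whisper.net.

```csharp
using System;
using System.IO;
using System.Linq;

public static class WhisperModelFiles
{
    // Model names listed in this section.
    private static readonly string[] KnownModels = { "tiny", "base", "small", "medium", "large-v3" };

    // Builds "ggml-{model-name}.bin" and rejects names outside the documented list.
    public static string FileNameFor(string modelName)
    {
        if (!KnownModels.Contains(modelName))
            throw new ArgumentException(
                $"Unknown model '{modelName}'. Expected one of: {string.Join(", ", KnownModels)}.",
                nameof(modelName));
        return $"ggml-{modelName}.bin";
    }

    // Combines a directory with the model file name to produce a ModelPath value.
    public static string ModelPathFor(string directory, string modelName) =>
        Path.Combine(directory, FileNameFor(modelName));
}
```

For example, FileNameFor("large-v3") yields ggml-large-v3.bin, which matches the default ModelPath file name.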

Configuration:

{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin"
    }
  }
}

Important Notes:

Whisper.net uses its own GGML model format and download system
This is independent of Ollama, LM Studio, or cloud services
Models are stored locally at the ModelPath location
For on-premise deployments, ensure the application has write access to the model directory
For cloud deployments, consider pre-downloading models or using persistent storage volumes
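
Because the first-use download needs write access to the model directory, a startup check can surface permission problems before the first audio upload. The sketch below uses only standard .NET file APIs; the class and method names are hypothetical and SmartRAG does not necessarily perform this check itself.

```csharp
using System;
using System.IO;

public static class ModelStoragePreflight
{
    // True when the model file is already on disk, so no download is needed.
    public static bool ModelExists(string modelPath) => File.Exists(modelPath);

    // Creates the model directory if needed and verifies write access by
    // creating and deleting a small probe file.
    public static bool CanWriteToModelDirectory(string modelPath)
    {
        try
        {
            var directory = Path.GetDirectoryName(Path.GetFullPath(modelPath));
            if (directory is null) return false;

            Directory.CreateDirectory(directory);
            var probe = Path.Combine(directory, $".write-probe-{Guid.NewGuid():N}");
            File.WriteAllText(probe, string.Empty);
            File.Delete(probe);
            return true;
        }
        catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
        {
            return false;
        }
    }
}
```

In a container with a read-only file system, CanWriteToModelDirectory returns false, which is a signal to pre-download the model or mount a persistent volume.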

Transcribe vs Translate

Whisper supports two modes:

Transcribe: Output text in the same language as the speech.
Translate: Output always in English (SmartRAG never uses this mode).

SmartRAG always transcribes in the source language and never translates to English. When no language is specified (the API and configuration both default to "auto"), Whisper detects the spoken language and outputs text in that language. SmartRAG does not fall back to the system locale, so a server locale such as "en" is never forced onto the transcription. Use DefaultLanguage: "auto" for multi-language content, and set a concrete code (e.g. "tr") only when you want to pin the language for all uploads. ForceTranscribeOnly (default true) makes it explicit in configuration that translation is disabled.
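
The language selection described above can be pictured as a simple precedence rule. The following is an assumed sketch of that rule, not SmartRAG's actual implementation; the class, method, and parameter names are hypothetical.

```csharp
using System;

public static class LanguageResolutionSketch
{
    public const string Auto = "auto";

    // Order: explicit per-request language, then configured DefaultLanguage, then "auto".
    // The system or server locale is deliberately never consulted.
    public static string Resolve(string? requestedLanguage, string? defaultLanguage)
    {
        if (!string.IsNullOrWhiteSpace(requestedLanguage))
            return requestedLanguage.Trim().ToLowerInvariant();
        if (!string.IsNullOrWhiteSpace(defaultLanguage))
            return defaultLanguage.Trim().ToLowerInvariant();
        return Auto;
    }
}
```

Under this rule, a missing request language with DefaultLanguage set to "tr" pins Turkish, while leaving both empty results in auto-detection.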

Configuration Example

{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin",
      "DefaultLanguage": "auto",
      "ForceTranscribeOnly": true,
      "MinConfidenceThreshold": 0.3,
      "PromptHint": "",
      "MaxThreads": 0
    }
  }
}

builder.Services.AddSmartRag(configuration, options =>
{
    options.WhisperConfig = new WhisperConfig
    {
        ModelPath = "models/ggml-large-v3.bin",
        DefaultLanguage = "auto",
        ForceTranscribeOnly = true,
        MinConfidenceThreshold = 0.3,
        PromptHint = "",
        MaxThreads = 0
    };
});

Common DefaultLanguage values:

auto - Auto-detect language and transcribe in that language (recommended for multi-language content).
en - English
tr - Turkish
de - German
fr - French
es - Spanish
it - Italian
ru - Russian
ja - Japanese
ko - Korean
zh - Chinese
99+ languages supported

Usage Example

// Upload audio file
var document = await _documentService.UploadDocumentAsync(
    audioStream,
    "meeting-recording.mp3",
    "audio/mpeg",
    "user-id"
);

// Ask AI about audio file
var response = await _aiService.AskAsync(
    "What topics were discussed in this meeting?",
    "user-id"
);

Privacy First

Audio files are processed locally using Whisper.net. No audio data leaves your machine, which helps with GDPR, KVKK, and HIPAA requirements.

OCR Configuration

Tesseract OCR enables text extraction from images and PDFs with support for 100+ languages:

Tesseract Language Support

// Specify language for OCR when uploading images
var document = await _documentService.UploadDocumentAsync(
    imageStream,
    "invoice.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"  // English OCR
);

// Other values for the language argument:
// Turkish OCR:                        language: "tur"
// Multi-language (Turkish + English): language: "tur+eng"
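
Multi-language values such as "tur+eng" are Tesseract language codes joined with a plus sign. When the set of languages comes from user input or configuration, a small helper can build that string. This is a sketch; OcrLanguageSketch is a hypothetical name and not part of SmartRAG.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OcrLanguageSketch
{
    // Joins Tesseract language codes with '+', dropping blanks and duplicates
    // while keeping the original order.
    public static string Combine(IEnumerable<string> codes)
    {
        var cleaned = codes
            .Select(c => c.Trim())
            .Where(c => c.Length > 0)
            .Distinct()
            .ToList();

        if (cleaned.Count == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        return string.Join("+", cleaned);
    }
}
```

The result can be passed as the language argument shown in the upload example, for instance Combine(new[] { "tur", "eng" }).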

Supported OCR Languages

eng - English
tur - Turkish
deu - German
fra - French
spa - Spanish
ita - Italian
rus - Russian
ara - Arabic
chi_sim - Chinese (Simplified; use chi_tra for Traditional)
jpn - Japanese
kor - Korean
hin - Hindi
100+ languages supported

OCR Usage Examples

// Invoice analysis
var invoice = await _documentService.UploadDocumentAsync(
    invoiceStream,
    "invoice-2024-01.pdf",
    "application/pdf",
    "user-id",
    language: "eng"
);

var analysis = await _aiService.AskAsync(
    "What products are in this invoice and what is the total amount?",
    "user-id"
);

// ID card analysis
var idCard = await _documentService.UploadDocumentAsync(
    idCardStream,
    "id-card.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
);

var info = await _aiService.AskAsync(
    "What is the person's name and birth date on this ID card?",
    "user-id"
);

OCR Capabilities

✅ Works well: Printed documents, scanned text, digital screenshots
⚠️ Limited support: Handwritten text (very low accuracy)
💡 Best results: High-quality scans of printed documents
🔒 100% On-Premise: Tesseract runs locally, so no data is sent to the cloud

Supported File Formats

Audio Formats:

audio/mpeg - MP3 files
audio/wav - WAV files
audio/m4a - M4A files
audio/flac - FLAC files
audio/ogg - OGG files

Image Formats:

image/jpeg - JPEG images
image/png - PNG images
image/tiff - TIFF images
image/bmp - BMP images
image/gif - GIF images

PDF Formats:

application/pdf - PDF documents (page-by-page OCR)
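
When uploading files, the content type has to match one of the formats above. The sketch below derives it from the file name and reports whether the file would go to audio transcription or OCR. The extension list (including .jpg, .jpeg, .tif, and .tiff) and all type names are assumptions made for this example and are not part of SmartRAG.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum MediaKind { Audio, Image, Pdf, Unsupported }

public static class MediaTypeSketch
{
    // Extension-to-MIME pairs based on the lists in this section.
    private static readonly Dictionary<string, string> MimeByExtension =
        new(StringComparer.OrdinalIgnoreCase)
        {
            [".mp3"] = "audio/mpeg", [".wav"] = "audio/wav", [".m4a"] = "audio/m4a",
            [".flac"] = "audio/flac", [".ogg"] = "audio/ogg",
            [".jpg"] = "image/jpeg", [".jpeg"] = "image/jpeg", [".png"] = "image/png",
            [".tif"] = "image/tiff", [".tiff"] = "image/tiff", [".bmp"] = "image/bmp",
            [".gif"] = "image/gif",
            [".pdf"] = "application/pdf",
        };

    // Returns the MIME type for a file name, or null when the extension is not listed.
    public static string? MimeTypeFor(string fileName) =>
        MimeByExtension.TryGetValue(Path.GetExtension(fileName), out var mime) ? mime : null;

    // Classifies by MIME prefix only; it does not verify that the specific
    // subtype appears in the supported list.
    public static MediaKind KindOf(string? mimeType) => mimeType switch
    {
        null => MediaKind.Unsupported,
        "application/pdf" => MediaKind.Pdf,
        _ when mimeType.StartsWith("audio/", StringComparison.Ordinal) => MediaKind.Audio,
        _ when mimeType.StartsWith("image/", StringComparison.Ordinal) => MediaKind.Image,
        _ => MediaKind.Unsupported,
    };
}
```

A null result from MimeTypeFor is a reasonable point to reject the upload before any transcription or OCR work starts.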

Audio Quality Tips

Clear Audio: Avoid background noise and echo
Good Microphone: Use quality recording equipment
Correct Language: Specify the correct language of speech
File Format: MP3, WAV, M4A formats work best

OCR Quality Tips

High Resolution: At least 300 DPI scan quality
Clean Image: Avoid blurry or shadowy images
Correct Language: Specify the language of the text in the image
Contrast: Prefer high-contrast, black-and-white images

Audio and OCR Comparison

Compare Whisper.net and Tesseract OCR capabilities:

Feature	Whisper.net	Tesseract OCR
Data Privacy	100% On-premise	100% On-premise
Accuracy	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Language Support	⭐⭐⭐⭐⭐ (99+ languages)	⭐⭐⭐⭐ (100+ languages)
Setup	⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Cost	Free	Free
Performance	⭐⭐⭐⭐	⭐⭐⭐

Security and Privacy

Audio Security

// Whisper.net runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveAudioStream,
    "confidential-meeting.mp3",
    "audio/mpeg",
    "user-id"
    // Data is never sent to cloud
);

OCR Security

// OCR runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveImageStream,
    "confidential-document.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
    // Data is never sent to cloud
);

Next Steps

Advanced Configuration: Fallback providers and best practices

Examples: Audio and OCR usage examples