Audio & OCR Configuration

SmartRAG converts audio files to text with Whisper.net and extracts text from images and PDFs with Tesseract OCR.

Whisper.net (Local Audio Transcription)

Whisper.net provides local, on-premise audio transcription with support for 99+ languages:

WhisperConfig Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| ModelPath | string | "models/ggml-large-v3.bin" | Path to Whisper model file |
| DefaultLanguage | string | "auto" | Language code for transcription |
| MinConfidenceThreshold | double | 0.3 | Minimum confidence score (0.0-1.0) |
| PromptHint | string | "" | Context hint for better accuracy |
| MaxThreads | int | 0 | CPU threads (0 = auto-detect) |
| ForceTranscribeOnly | bool | true | When true, only transcribe in the source language; never translate to English. |
| UseGpu | bool | false | When true, use GPU if your application references a GPU runtime (CUDA on Windows/Linux, CoreML on macOS). |

GPU acceleration

By default Whisper runs on CPU (package Whisper.net.Runtime). To use GPU, the application that references SmartRAG must add the matching Whisper.net runtime package and set WhisperConfig.UseGpu = true in configuration.

  • Windows (NVIDIA): In your project, add package Whisper.net.Runtime.Cuda.Windows, then set UseGpu = true in SmartRAG:WhisperConfig.
  • Linux (NVIDIA): Add Whisper.net.Runtime.Cuda.Linux, then set UseGpu = true.
  • macOS (Apple Silicon): Add Whisper.net.Runtime.CoreML, then set UseGpu = true. If you see Metal init errors, leave UseGpu = false.

SmartRAG does not reference GPU runtimes by default, so CPU-only deployments work without extra packages. Only add a GPU runtime if you want acceleration and your environment supports it.
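
As a minimal sketch of the second step, the configuration below turns GPU use on. It assumes the matching runtime package has already been added to the host application (for example with `dotnet add package Whisper.net.Runtime.Cuda.Windows` on Windows) and reuses the default ModelPath from the parameter table:

```json
{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin",
      "UseGpu": true
    }
  }
}
```

Setting UseGpu back to false returns to the CPU runtime without removing the package.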

Whisper Model Sizes

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 75MB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Fast prototyping |
| base | 142MB | ⭐⭐⭐⭐ | ⭐⭐⭐ | Balanced performance |
| small | 466MB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Good accuracy |
| medium | 1.5GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | High accuracy |
| large-v3 | 2.9GB | ⭐ | ⭐⭐⭐⭐⭐ | Best accuracy |

Model Download

GGML models are downloaded from Hugging Face automatically on first use, so no manual download is required.

Automatic Download:

  • Triggered on first use through WhisperGgmlDownloader
  • Downloaded from the Hugging Face repository
  • Saved to the path specified in ModelPath (default: models/ggml-large-v3.bin)

Model Files:

  • Format: ggml-{model-name}.bin (e.g., ggml-base.bin, ggml-large-v3.bin)
  • Available models: tiny, base, small, medium, large-v3
  • The first-use download takes roughly 5-10 minutes, depending on connection speed and model size

Configuration:

{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin"
    }
  }
}

Important Notes:

  • Whisper.net uses its own GGML model format and download system
  • This is independent of Ollama, LM Studio, or cloud services
  • Models are stored locally at the ModelPath location
  • For on-premise deployments, ensure the application has write access to the model directory
  • For cloud deployments, consider pre-downloading models or using persistent storage volumes

Transcribe vs Translate

Whisper supports two modes:

  • Transcribe: Output text in the same language as the speech.
  • Translate: Output is always in English, regardless of the spoken language (SmartRAG never uses this mode).

SmartRAG always transcribes in the source language and never translates to English. When no language is specified (both the API and the configuration default to "auto"), Whisper detects the spoken language and outputs text in that language. SmartRAG does not fall back to the system locale, so a server locale such as "en" is never forced onto the audio. Use DefaultLanguage: "auto" for multi-language content, and set a concrete code (e.g. "tr") only when you want to pin the language for all uploads. ForceTranscribeOnly (default true) makes this behavior explicit: translation stays disabled.

Configuration Example

{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin",
      "DefaultLanguage": "auto",
      "ForceTranscribeOnly": true,
      "MinConfidenceThreshold": 0.3,
      "PromptHint": "",
      "MaxThreads": 0
    }
  }
}

The same settings in code:

builder.Services.AddSmartRag(configuration, options =>
{
    options.WhisperConfig = new WhisperConfig
    {
        ModelPath = "models/ggml-large-v3.bin",
        DefaultLanguage = "auto",
        ForceTranscribeOnly = true,
        MinConfidenceThreshold = 0.3,
        PromptHint = "",
        MaxThreads = 0
    };
});

Supported Languages

  • auto - Auto-detect language and transcribe in that language (recommended for multi-language content).
  • en - English
  • tr - Turkish
  • de - German
  • fr - French
  • es - Spanish
  • it - Italian
  • ru - Russian
  • ja - Japanese
  • ko - Korean
  • zh - Chinese
  • 99+ languages supported
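
For a deployment where every upload is in one language, the code can be pinned instead of relying on auto-detection. A minimal sketch, assuming all recordings are Turkish (only the DefaultLanguage value differs from the defaults):

```json
{
  "SmartRAG": {
    "WhisperConfig": {
      "DefaultLanguage": "tr"
    }
  }
}
```

Switch back to "auto" as soon as uploads in other languages are expected; a pinned code applies to every file.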

Usage Example

// Upload audio file
var document = await _documentService.UploadDocumentAsync(
    audioStream,
    "meeting-recording.mp3",
    "audio/mpeg",
    "user-id"
);

// Ask AI about audio file
var response = await _aiService.AskAsync(
    "What topics were discussed in this meeting?",
    "user-id"
);

Privacy First

Audio files are processed locally by Whisper.net. No audio data leaves your machine, which helps meet GDPR, KVKK, and HIPAA requirements.

OCR Configuration

Tesseract OCR enables text extraction from images and PDFs with support for 100+ languages:

Tesseract Language Support

// Specify language for OCR when uploading images
var document = await _documentService.UploadDocumentAsync(
    imageStream,
    "invoice.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"  // English OCR
);

// Other values for the language argument:
//   language: "tur"      -> Turkish OCR
//   language: "tur+eng"  -> Multi-language (Turkish and English)
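
Multi-language values such as "tur+eng" are individual Tesseract codes joined with "+". When the codes come from user input or settings, a small helper can build the string. This is an illustrative sketch: `OcrLanguages.Combine` is a hypothetical name, not a SmartRAG API.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not a SmartRAG API): builds a combined OCR language
// value such as "tur+eng" from individual Tesseract language codes.
public static class OcrLanguages
{
    public static string Combine(params string[] codes)
    {
        // Trim each code, drop empty entries, and remove duplicates
        // while keeping the caller's order.
        var cleaned = (codes ?? Array.Empty<string>())
            .Where(code => !string.IsNullOrWhiteSpace(code))
            .Select(code => code.Trim())
            .Distinct()
            .ToArray();

        if (cleaned.Length == 0)
        {
            throw new ArgumentException("At least one language code is required.", nameof(codes));
        }

        return string.Join("+", cleaned);
    }
}
```

The result, for example `OcrLanguages.Combine("tur", "eng")`, can then be passed as the language argument of UploadDocumentAsync shown above.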

Supported OCR Languages

  • eng - English
  • tur - Turkish
  • deu - German
  • fra - French
  • spa - Spanish
  • ita - Italian
  • rus - Russian
  • ara - Arabic
  • chi_sim - Chinese (Simplified); chi_tra - Chinese (Traditional)
  • jpn - Japanese
  • kor - Korean
  • hin - Hindi
  • 100+ languages supported

OCR Usage Examples

// Invoice analysis
var invoice = await _documentService.UploadDocumentAsync(
    invoiceStream,
    "invoice-2024-01.pdf",
    "application/pdf",
    "user-id",
    language: "eng"
);

var analysis = await _aiService.AskAsync(
    "What products are in this invoice and what is the total amount?",
    "user-id"
);

// ID card analysis
var idCard = await _documentService.UploadDocumentAsync(
    idCardStream,
    "id-card.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
);

var info = await _aiService.AskAsync(
    "What is the person's name and birth date on this ID card?",
    "user-id"
);

OCR Capabilities

  • ✅ Works well: Printed documents, scanned text, digital screenshots
  • ⚠️ Limited support: Handwritten text (very low accuracy)
  • 💡 Best results: High-quality scans of printed documents
  • 🔒 100% On-Premise: Tesseract runs locally, so no data is sent to the cloud

Supported File Formats

Audio Formats:

  • audio/mpeg - MP3 files
  • audio/wav - WAV files
  • audio/m4a - M4A files
  • audio/flac - FLAC files
  • audio/ogg - OGG files

Image Formats:

  • image/jpeg - JPEG images
  • image/png - PNG images
  • image/tiff - TIFF images
  • image/bmp - BMP images
  • image/gif - GIF images

PDF Formats:

  • application/pdf - PDF documents (page-by-page OCR)
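
The upload examples in this document pass the content type as a string next to the file name. The sketch below derives that string from the file extension using the formats listed above. `SupportedContentTypes` is a hypothetical helper, and the extension-to-type mapping (including the ".jpg"/".jpeg" and ".tif"/".tiff" aliases) is an assumption for illustration, not SmartRAG's own lookup.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not a SmartRAG API): maps file extensions to the
// content types listed in the supported-formats section.
public static class SupportedContentTypes
{
    private static readonly Dictionary<string, string> ByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            // Audio
            [".mp3"] = "audio/mpeg",
            [".wav"] = "audio/wav",
            [".m4a"] = "audio/m4a",
            [".flac"] = "audio/flac",
            [".ogg"] = "audio/ogg",
            // Images
            [".jpg"] = "image/jpeg",
            [".jpeg"] = "image/jpeg",
            [".png"] = "image/png",
            [".tif"] = "image/tiff",
            [".tiff"] = "image/tiff",
            [".bmp"] = "image/bmp",
            [".gif"] = "image/gif",
            // PDF
            [".pdf"] = "application/pdf",
        };

    // Returns false when the extension is not in the supported list.
    public static bool TryGetContentType(string fileName, out string contentType)
    {
        var extension = Path.GetExtension(fileName) ?? string.Empty;
        return ByExtension.TryGetValue(extension, out contentType);
    }
}
```

Calling `SupportedContentTypes.TryGetContentType("meeting-recording.mp3", out var type)` yields "audio/mpeg", and an unsupported file can be rejected before the upload call is made.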

Audio Quality Tips

  1. Clear Audio: Avoid background noise and echo
  2. Good Microphone: Use quality recording equipment
  3. Correct Language: Specify the correct language of speech
  4. File Format: MP3, WAV, M4A formats work best

OCR Quality Tips

  1. High Resolution: At least 300 DPI scan quality
  2. Clean Image: Avoid blurry or shadowy images
  3. Correct Language: Specify the correct language of text in image
  4. Contrast: Prefer high-contrast, black-and-white images

Audio and OCR Comparison

Compare Whisper.net and Tesseract OCR capabilities:

| Feature | Whisper.net | Tesseract OCR |
|---|---|---|
| Data Privacy | 100% On-premise | 100% On-premise |
| Accuracy | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Language Support | ⭐⭐⭐⭐⭐ (99+ languages) | ⭐⭐⭐⭐ (100+ languages) |
| Setup | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Cost | Free | Free |
| Performance | ⭐⭐⭐⭐ | ⭐⭐⭐ |

Security and Privacy

Audio Security

// Whisper.net runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveAudioStream,
    "confidential-meeting.mp3",
    "audio/mpeg",
    "user-id"
    // Data is never sent to cloud
);

OCR Security

// OCR runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveImageStream,
    "confidential-document.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
    // Data is never sent to cloud
);

Next Steps

  • Advanced Configuration: Fallback providers and best practices
  • Examples: Audio and OCR usage examples