Audio & OCR Configuration

SmartRAG can transcribe audio files to text with Whisper.net and extract text from images and PDFs with Tesseract OCR.

Whisper.net (Local Audio Transcription)

Whisper.net provides local, on-premise audio transcription with support for 99+ languages:

WhisperConfig Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| ModelPath | string | "models/ggml-large-v3.bin" | Path to Whisper model file |
| DefaultLanguage | string | "auto" | Language code for transcription |
| MinConfidenceThreshold | double | 0.3 | Minimum confidence score (0.0-1.0) |
| IncludeWordTimestamps | bool | false | Include word-level timestamps |
| PromptHint | string | "" | Context hint for better accuracy |
| MaxThreads | int | 0 | CPU threads (0 = auto-detect) |
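
The documented ranges can be checked before the application starts. The following is a minimal sketch, not SmartRAG's own validation: `WhisperSettings` and `WhisperSettingsValidator` are hypothetical names that mirror the parameters in the table above, and the rules are taken only from the table's descriptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in that mirrors the documented parameters and defaults.
// It is NOT SmartRAG's WhisperConfig type.
public sealed record WhisperSettings(
    string ModelPath = "models/ggml-large-v3.bin",
    string DefaultLanguage = "auto",
    double MinConfidenceThreshold = 0.3,
    bool IncludeWordTimestamps = false,
    string PromptHint = "",
    int MaxThreads = 0);

public static class WhisperSettingsValidator
{
    // Returns one message per violated rule; an empty list means the settings look valid.
    public static IReadOnlyList<string> Validate(WhisperSettings settings)
    {
        var errors = new List<string>();

        if (string.IsNullOrWhiteSpace(settings.ModelPath))
            errors.Add("ModelPath must not be empty.");

        if (string.IsNullOrWhiteSpace(settings.DefaultLanguage))
            errors.Add("DefaultLanguage must be \"auto\" or a language code.");

        // The table documents the confidence score as a value in 0.0-1.0.
        if (double.IsNaN(settings.MinConfidenceThreshold)
            || settings.MinConfidenceThreshold < 0.0
            || settings.MinConfidenceThreshold > 1.0)
            errors.Add("MinConfidenceThreshold must be between 0.0 and 1.0.");

        // 0 means auto-detect, so only negative values are rejected.
        if (settings.MaxThreads < 0)
            errors.Add("MaxThreads must be 0 (auto-detect) or a positive number.");

        return errors;
    }
}
```

Calling `WhisperSettingsValidator.Validate(new WhisperSettings())` returns an empty list, because the defaults satisfy every rule.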

Whisper Model Sizes

| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | ~75 MB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Fast prototyping |
| base | ~142 MB | ⭐⭐⭐⭐ | ⭐⭐⭐ | Balanced performance |
| small | ~466 MB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Good accuracy |
| medium | ~1.5 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | High accuracy |
| large-v3 | ~2.9 GB | ⭐ | ⭐⭐⭐⭐⭐ | Best accuracy |

Model Download

Whisper.net downloads GGML models from Hugging Face on first use and saves them to the path set in the ModelPath configuration.

Automatic Download:

  • Handled by WhisperGgmlDownloader the first time a model is needed
  • Source: the Hugging Face model repository
  • Destination: the path specified in ModelPath (default: models/ggml-large-v3.bin)
  • No manual download required

Model Files:

  • Format: ggml-{model-name}.bin (e.g., ggml-base.bin, ggml-large-v3.bin)
  • Available models: tiny, base, small, medium, large-v3
  • First use downloads the model automatically (~5-10 minutes depending on connection and model size)
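
The naming rule above can be turned into a small helper that builds a ModelPath value from a model name. This is an illustrative sketch only: `WhisperModelFiles` and `GetModelPath` are hypothetical names, and the default `models` directory is simply the one used in this guide's examples.

```csharp
using System;
using System.IO;
using System.Linq;

public static class WhisperModelFiles
{
    // Model names listed in this guide.
    private static readonly string[] KnownModels =
        { "tiny", "base", "small", "medium", "large-v3" };

    // Builds "<directory>/ggml-<model-name>.bin" for a known model name.
    public static string GetModelPath(string modelName, string directory = "models")
    {
        if (!KnownModels.Contains(modelName))
            throw new ArgumentException(
                $"Unknown model '{modelName}'. Expected one of: {string.Join(", ", KnownModels)}.",
                nameof(modelName));

        return Path.Combine(directory, $"ggml-{modelName}.bin");
    }
}
```

For example, `WhisperModelFiles.GetModelPath("base")` yields a path whose file name is `ggml-base.bin`, which can then be used as the ModelPath setting.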

Configuration:

{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin"
    }
  }
}

Important Notes:

  • Whisper.net uses its own GGML model format and download system
  • This is independent of Ollama, LM Studio, or cloud services
  • Models are stored locally at the ModelPath location
  • For on-premise deployments, ensure the application has write access to the model directory
  • For cloud deployments, consider pre-downloading models or using persistent storage volumes
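
Because the first-use download needs write access to the model directory, it can help to verify that at startup. The sketch below is one possible check, not part of SmartRAG: `ModelDirectoryCheck` is a hypothetical helper, and it creates the directory if it is missing, which may or may not be what a given deployment wants.

```csharp
using System;
using System.IO;

public static class ModelDirectoryCheck
{
    // Returns true if a file can be created in the directory that will hold the model.
    // Side effect: creates the directory if it does not exist yet.
    public static bool CanWriteTo(string modelPath)
    {
        var directory = Path.GetDirectoryName(Path.GetFullPath(modelPath)) ?? ".";

        try
        {
            Directory.CreateDirectory(directory);

            // Create a throwaway file that is deleted as soon as the handle closes.
            var probe = Path.Combine(directory, Path.GetRandomFileName());
            using (File.Create(probe, 1, FileOptions.DeleteOnClose)) { }

            return true;
        }
        catch (Exception ex) when (ex is UnauthorizedAccessException || ex is IOException)
        {
            return false;
        }
    }
}
```

A startup routine could call `ModelDirectoryCheck.CanWriteTo("models/ggml-large-v3.bin")` and log a warning or fail fast when it returns false.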

Configuration Example

{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin",
      "DefaultLanguage": "auto",
      "MinConfidenceThreshold": 0.3,
      "IncludeWordTimestamps": false,
      "PromptHint": "",
      "MaxThreads": 0
    }
  }
}

The same settings can also be applied in code:

builder.Services.AddSmartRag(configuration, options =>
{
    options.WhisperConfig = new WhisperConfig
    {
        ModelPath = "models/ggml-large-v3.bin",
        DefaultLanguage = "auto",
        MinConfidenceThreshold = 0.3,
        IncludeWordTimestamps = false,
        PromptHint = "",
        MaxThreads = 0
    };
});

Supported language codes for DefaultLanguage:

  • auto - Automatic language detection (recommended)
  • en - English
  • tr - Turkish
  • de - German
  • fr - French
  • es - Spanish
  • it - Italian
  • ru - Russian
  • ja - Japanese
  • ko - Korean
  • zh - Chinese
  • 99+ languages supported

Usage Example

// Upload audio file
var document = await _documentService.UploadDocumentAsync(
    audioStream,
    "meeting-recording.mp3",
    "audio/mpeg",
    "user-id"
);

// Ask AI about audio file
var response = await _aiService.AskAsync(
    "What topics were discussed in this meeting?",
    "user-id"
);

Privacy First

Audio files are processed locally using Whisper.net, so recordings never leave your machine. This supports GDPR, KVKK, and HIPAA compliance requirements.

OCR Configuration

Tesseract OCR enables text extraction from images and PDFs, with support for 100+ languages.

Tesseract Language Support

// Specify language for OCR when uploading images
var document = await _documentService.UploadDocumentAsync(
    imageStream,
    "invoice.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"  // English OCR
);

// Other values for the language argument:
//   "tur"      - Turkish OCR
//   "tur+eng"  - Multiple languages (Turkish and English)
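
Multi-language values such as "tur+eng" are individual Tesseract codes joined with "+". A minimal sketch of building that string from a list of codes follows; `OcrLanguages.Combine` is a hypothetical helper and does not check whether a code is actually installed.

```csharp
using System;
using System.Linq;

public static class OcrLanguages
{
    // Joins Tesseract language codes with '+', e.g. ("tur", "eng") -> "tur+eng".
    // Whitespace is trimmed and duplicate codes are dropped, keeping the first occurrence.
    public static string Combine(params string[] codes)
    {
        if (codes == null || codes.Length == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        var cleaned = codes.Select(code => (code ?? string.Empty).Trim()).ToList();

        if (cleaned.Any(code => code.Length == 0))
            throw new ArgumentException("Language codes must not be empty.", nameof(codes));

        return string.Join("+", cleaned.Distinct());
    }
}
```

The result can be passed as the `language` argument shown above, for example `language: OcrLanguages.Combine("tur", "eng")`.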

Supported OCR Languages

  • eng - English
  • tur - Turkish
  • deu - German
  • fra - French
  • spa - Spanish
  • ita - Italian
  • rus - Russian
  • ara - Arabic
  • chi_sim - Chinese (Simplified)
  • chi_tra - Chinese (Traditional)
  • jpn - Japanese
  • kor - Korean
  • hin - Hindi
  • 100+ languages supported

OCR Usage Examples

// Invoice analysis
var invoice = await _documentService.UploadDocumentAsync(
    invoiceStream,
    "invoice-2024-01.pdf",
    "application/pdf",
    "user-id",
    language: "eng"
);

var analysis = await _aiService.AskAsync(
    "What products are in this invoice and what is the total amount?",
    "user-id"
);

// ID card analysis
var idCard = await _documentService.UploadDocumentAsync(
    idCardStream,
    "id-card.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
);

var info = await _aiService.AskAsync(
    "What is the person's name and birth date on this ID card?",
    "user-id"
);

OCR Capabilities

  • ✅ Works well: Printed documents, scanned text, digital screenshots
  • ⚠️ Limited support: Handwritten text (very low accuracy)
  • 💡 Best results: High-quality scans of printed documents
  • 🔒 100% On-Premise: Tesseract runs locally, so no data is sent to the cloud

Supported File Formats

Audio Formats:

  • audio/mpeg - MP3 files
  • audio/wav - WAV files
  • audio/m4a - M4A files
  • audio/flac - FLAC files
  • audio/ogg - OGG files

Image Formats:

  • image/jpeg - JPEG images
  • image/png - PNG images
  • image/tiff - TIFF images
  • image/bmp - BMP images
  • image/gif - GIF images

PDF Formats:

  • application/pdf - PDF documents (page-by-page OCR)
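
When uploading files programmatically, the content type has to be chosen from the lists above. The following sketch maps file extensions to those content types. `SupportedFormats` is a hypothetical helper, and the extension spellings (for example .jpg and .jpeg, .tif and .tiff) are assumptions, since this guide lists content types only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SupportedFormats
{
    // Extension -> content type, using only the content types listed in this guide.
    private static readonly Dictionary<string, string> ContentTypes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            [".mp3"] = "audio/mpeg",
            [".wav"] = "audio/wav",
            [".m4a"] = "audio/m4a",
            [".flac"] = "audio/flac",
            [".ogg"] = "audio/ogg",
            [".jpg"] = "image/jpeg",
            [".jpeg"] = "image/jpeg",
            [".png"] = "image/png",
            [".tif"] = "image/tiff",
            [".tiff"] = "image/tiff",
            [".bmp"] = "image/bmp",
            [".gif"] = "image/gif",
            [".pdf"] = "application/pdf",
        };

    // Looks up the content type for a file name; returns false for unlisted extensions.
    public static bool TryGetContentType(string fileName, out string contentType)
    {
        if (ContentTypes.TryGetValue(Path.GetExtension(fileName), out var found))
        {
            contentType = found;
            return true;
        }

        contentType = string.Empty;
        return false;
    }

    // Audio content types go to transcription; images and PDFs go to OCR.
    public static bool IsAudio(string contentType) =>
        contentType.StartsWith("audio/", StringComparison.OrdinalIgnoreCase);
}
```

For instance, `TryGetContentType("meeting-recording.mp3", out var type)` sets `type` to `audio/mpeg`, which matches the upload example earlier in this guide.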

Audio Quality Tips

  1. Clear Audio: Avoid background noise and echo
  2. Good Microphone: Use quality recording equipment
  3. Correct Language: Specify the correct language of speech
  4. File Format: MP3, WAV, M4A formats work best

OCR Quality Tips

  1. High Resolution: At least 300 DPI scan quality
  2. Clean Image: Avoid blurry or shadowy images
  3. Correct Language: Specify the correct language of text in image
  4. Contrast: Prefer high-contrast, black-and-white images
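
The 300 DPI recommendation can be checked from an image's pixel dimensions and the physical size of the scanned page, since DPI is pixels divided by inches. This is an illustrative calculation only; `ScanResolution` is a hypothetical helper, and the page size must be supplied by the caller.

```csharp
using System;

public static class ScanResolution
{
    private const double MillimetersPerInch = 25.4;

    // Effective DPI along one axis: pixels divided by the physical length in inches.
    public static double EffectiveDpi(int pixels, double millimeters)
    {
        if (pixels <= 0) throw new ArgumentOutOfRangeException(nameof(pixels));
        if (millimeters <= 0) throw new ArgumentOutOfRangeException(nameof(millimeters));

        return pixels / (millimeters / MillimetersPerInch);
    }

    // True when both axes reach the minimum DPI (300 by default, as recommended above).
    public static bool MeetsMinimumDpi(
        int widthPx, int heightPx, double widthMm, double heightMm, double minDpi = 300)
    {
        return EffectiveDpi(widthPx, widthMm) >= minDpi
            && EffectiveDpi(heightPx, heightMm) >= minDpi;
    }
}
```

An A4 page (210 x 297 mm) scanned at 2481 x 3508 pixels passes the check, while the same page at 1240 x 1754 pixels (about 150 DPI) does not.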

Audio and OCR Comparison

Compare Whisper.net and Tesseract OCR capabilities:

| Feature | Whisper.net | Tesseract OCR |
|---|---|---|
| Data Privacy | 100% On-premise | 100% On-premise |
| Accuracy | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Language Support | ⭐⭐⭐⭐⭐ (99+ languages) | ⭐⭐⭐⭐ (100+ languages) |
| Setup | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Cost | Free | Free |
| Performance | ⭐⭐⭐⭐ | ⭐⭐⭐ |

Security and Privacy

Audio Security

// Whisper.net runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveAudioStream,
    "confidential-meeting.mp3",
    "audio/mpeg",
    "user-id"
    // Data is never sent to cloud
);

OCR Security

// OCR runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveImageStream,
    "confidential-document.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
    // Data is never sent to cloud
);

Next Steps

  • Advanced Configuration - Fallback providers and best practices
  • Examples - Audio and OCR usage examples