Audio & OCR
SmartRAG audio and OCR configuration - Whisper.net and Tesseract OCR settings
Audio & OCR Configuration
SmartRAG converts audio files to text with Whisper.net and extracts text from images and PDFs with Tesseract OCR.
Whisper.net (Local Audio Transcription)
Whisper.net provides local, on-premise audio transcription with support for 99+ languages:
WhisperConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `ModelPath` | `string` | `"models/ggml-large-v3.bin"` | Path to the Whisper model file |
| `DefaultLanguage` | `string` | `"auto"` | Language code for transcription |
| `MinConfidenceThreshold` | `double` | `0.3` | Minimum confidence score (0.0-1.0) |
| `PromptHint` | `string` | `""` | Context hint for better accuracy |
| `MaxThreads` | `int` | `0` | CPU threads (0 = auto-detect) |
| `ForceTranscribeOnly` | `bool` | `true` | When true, only transcribe in the source language; never translate to English. |
| `UseGpu` | `bool` | `false` | When true, use GPU if your application references a GPU runtime (CUDA on Windows/Linux, CoreML on macOS). |
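The ranges in the table can be checked at startup before the values reach Whisper. The following is a minimal sketch, not SmartRAG's own validation: the class and method names are hypothetical, and mapping `0` threads to `Environment.ProcessorCount` is an assumption about what "auto-detect" means.

```csharp
using System;

// Hypothetical helper; SmartRAG may validate these values differently.
public static class WhisperSettingsCheck
{
    // Throws if a value falls outside the range documented in the table.
    public static void Validate(double minConfidenceThreshold, int maxThreads)
    {
        if (double.IsNaN(minConfidenceThreshold)
            || minConfidenceThreshold < 0.0
            || minConfidenceThreshold > 1.0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(minConfidenceThreshold), "Expected a value between 0.0 and 1.0.");
        }

        if (maxThreads < 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(maxThreads), "Expected 0 (auto-detect) or a positive thread count.");
        }
    }

    // Assumption: auto-detect (0) resolves to the logical processor count.
    public static int ResolveThreadCount(int maxThreads) =>
        maxThreads == 0 ? Environment.ProcessorCount : maxThreads;
}
```

Calling `WhisperSettingsCheck.Validate(0.3, 0)` with the defaults passes without throwing, while a threshold such as `1.5` fails fast with a clear message.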
GPU acceleration
By default Whisper runs on CPU (package Whisper.net.Runtime). To use GPU, the application that references SmartRAG must add the matching Whisper.net runtime package and set WhisperConfig.UseGpu = true in configuration.
- Windows (NVIDIA): In your project, add package `Whisper.net.Runtime.Cuda.Windows`, then set `UseGpu = true` in `SmartRAG:WhisperConfig`.
- Linux (NVIDIA): Add `Whisper.net.Runtime.Cuda.Linux`, then set `UseGpu = true`.
- macOS (Apple Silicon): Add `Whisper.net.Runtime.CoreML`, then set `UseGpu = true`. If you see Metal init errors, leave `UseGpu = false`.
SmartRAG does not reference GPU runtimes by default, so CPU-only deployments work without extra packages. Only add a GPU runtime if you want acceleration and your environment supports it.
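The platform-to-package mapping above can be expressed as a small lookup, for example to log which runtime package a deployment is expected to reference. This is a sketch only: `WhisperRuntimeAdvisor` is a hypothetical name, the package names are the ones listed in this section, and the code does not check whether an NVIDIA GPU or Apple Silicon is actually present.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper that mirrors the package list in this section.
public static class WhisperRuntimeAdvisor
{
    // Returns the runtime package name for the given OS and UseGpu setting.
    public static string PackageFor(OSPlatform platform, bool useGpu)
    {
        const string cpuPackage = "Whisper.net.Runtime";
        if (!useGpu) return cpuPackage;

        if (platform == OSPlatform.Windows) return "Whisper.net.Runtime.Cuda.Windows";
        if (platform == OSPlatform.Linux) return "Whisper.net.Runtime.Cuda.Linux";
        if (platform == OSPlatform.OSX) return "Whisper.net.Runtime.CoreML";

        // No GPU runtime is listed for other platforms; stay on CPU.
        return cpuPackage;
    }

    // Detects the OS the process is currently running on.
    public static OSPlatform CurrentPlatform()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return OSPlatform.Windows;
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux)) return OSPlatform.Linux;
        if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX)) return OSPlatform.OSX;
        return OSPlatform.Create("UNKNOWN");
    }
}
```

A startup log line such as `WhisperRuntimeAdvisor.PackageFor(WhisperRuntimeAdvisor.CurrentPlatform(), useGpu: true)` makes it easier to spot a missing package reference when `UseGpu` is enabled.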
Whisper Model Sizes
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| `tiny` | 75MB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Fast prototyping |
| `base` | 142MB | ⭐⭐⭐⭐ | ⭐⭐⭐ | Balanced performance |
| `small` | 244MB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Good accuracy |
| `medium` | 769MB | ⭐⭐ | ⭐⭐⭐⭐⭐ | High accuracy |
| `large-v3` | 1.5GB | ⭐ | ⭐⭐⭐⭐⭐ | Best accuracy |
Model Download
Whisper.net automatically downloads GGML models from Hugging Face on first use. Models are saved to the path specified in ModelPath configuration:
Automatic Download:
- Models are downloaded automatically when first used via `WhisperGgmlDownloader`
- Downloaded from Hugging Face repository
- Saved to the path specified in `ModelPath` (default: `models/ggml-large-v3.bin`)
- No manual download required
Model Files:
- Format: `ggml-{model-name}.bin` (e.g., `ggml-base.bin`, `ggml-large-v3.bin`)
- Available models: `tiny`, `base`, `small`, `medium`, `large-v3`
- First use downloads the model automatically (~5-10 minutes depending on connection and model size)
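The naming convention above is regular enough to build a `ModelPath` value from a model name. The sketch below assumes nothing about Whisper.net itself: `WhisperModelPaths` is a hypothetical helper, and the `models` default directory is taken from the default `ModelPath` shown in this document.

```csharp
using System;
using System.Linq;

// Hypothetical helper for the ggml-{model-name}.bin convention.
public static class WhisperModelPaths
{
    // Model names listed in this section.
    private static readonly string[] KnownModels =
        { "tiny", "base", "small", "medium", "large-v3" };

    // Builds a ModelPath value such as "models/ggml-base.bin".
    public static string BuildModelPath(string modelName, string directory = "models")
    {
        if (!KnownModels.Contains(modelName))
        {
            throw new ArgumentException(
                $"Unknown model '{modelName}'. Expected one of: {string.Join(", ", KnownModels)}.",
                nameof(modelName));
        }

        // A forward slash matches the ModelPath values shown in this document.
        return $"{directory}/ggml-{modelName}.bin";
    }
}
```

For example, `WhisperModelPaths.BuildModelPath("base")` returns `models/ggml-base.bin`, and a misspelled name such as `"large"` throws instead of silently producing a path that will never resolve.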
Configuration:
```json
{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin"
    }
  }
}
```
Important Notes:
- Whisper.net uses its own GGML model format and download system
- This is independent of Ollama, LM Studio, or cloud services
- Models are stored locally at the `ModelPath` location
- For on-premise deployments, ensure the application has write access to the model directory
- For cloud deployments, consider pre-downloading models or using persistent storage volumes
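Because the first-use download needs to write the model file, it can help to verify write access to the model directory before the first transcription request. This is an illustrative sketch with a hypothetical helper name; note that it creates the directory if it does not exist, which may or may not be what you want in a locked-down environment.

```csharp
using System;
using System.IO;

// Hypothetical pre-flight check; not part of SmartRAG or Whisper.net.
public static class ModelDirectoryCheck
{
    // Returns true if a file can be created next to the configured model path.
    public static bool CanWriteModelDirectory(string modelPath)
    {
        try
        {
            var directory = Path.GetDirectoryName(Path.GetFullPath(modelPath));
            if (string.IsNullOrEmpty(directory)) return false;

            // Side effect: creates the directory when it is missing.
            Directory.CreateDirectory(directory);

            // The probe file is removed automatically when the stream closes.
            var probe = Path.Combine(directory, $".write-probe-{Guid.NewGuid():N}");
            using (File.Create(probe, 1, FileOptions.DeleteOnClose)) { }
            return true;
        }
        catch (Exception ex) when (ex is UnauthorizedAccessException or IOException)
        {
            return false;
        }
    }
}
```

Running `ModelDirectoryCheck.CanWriteModelDirectory("models/ggml-large-v3.bin")` at startup surfaces a permissions problem immediately rather than minutes into a failed download.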
Transcribe vs Translate
Whisper supports two modes:
- Transcribe: Output text in the same language as the speech.
- Translate: Output always in English (SmartRAG never uses this mode).
SmartRAG always transcribes in the source language and never translates to English. When no language is specified (the API and config both default to "auto"), Whisper auto-detects the language and outputs text in that language. SmartRAG does not fall back to the system locale, so a server locale such as "en" is never forced. Use `DefaultLanguage: "auto"` for multi-language content; set a concrete code (e.g. `"tr"`) only when you want to pin the language for all uploads. `ForceTranscribeOnly` (default `true`) documents that translation is disabled.
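The fallback behaviour described above can be pictured as a resolution step that ends at "auto" and never consults the system locale. The sketch below is an assumption-laden illustration, not SmartRAG's code: the helper name is hypothetical, and the precedence (a per-upload language first, then the configured default) is assumed rather than documented.

```csharp
#nullable enable
using System;

// Hypothetical illustration of language resolution; precedence is assumed.
public static class TranscriptionLanguage
{
    // Returns the language code to pass to Whisper, or "auto" when none is set.
    public static string Resolve(string? requested, string? configuredDefault)
    {
        if (!string.IsNullOrWhiteSpace(requested))
            return requested.Trim().ToLowerInvariant();

        if (!string.IsNullOrWhiteSpace(configuredDefault))
            return configuredDefault.Trim().ToLowerInvariant();

        // Deliberately no CultureInfo lookup: the server locale is never used.
        return "auto";
    }
}
```

With this shape, `TranscriptionLanguage.Resolve(null, null)` yields `auto`, and a pinned value such as `"tr"` only applies when someone has set it explicitly.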
Configuration Example
```json
{
  "SmartRAG": {
    "WhisperConfig": {
      "ModelPath": "models/ggml-large-v3.bin",
      "DefaultLanguage": "auto",
      "ForceTranscribeOnly": true,
      "MinConfidenceThreshold": 0.3,
      "PromptHint": "",
      "MaxThreads": 0
    }
  }
}
```

```csharp
builder.Services.AddSmartRag(configuration, options =>
{
    options.WhisperConfig = new WhisperConfig
    {
        ModelPath = "models/ggml-large-v3.bin",
        DefaultLanguage = "auto",
        ForceTranscribeOnly = true,
        MinConfidenceThreshold = 0.3,
        PromptHint = "",
        MaxThreads = 0
    };
});
```
Supported language codes:
- `auto` - Auto-detect language and transcribe in that language (recommended for multi-language content).
- `en` - English
- `tr` - Turkish
- `de` - German
- `fr` - French
- `es` - Spanish
- `it` - Italian
- `ru` - Russian
- `ja` - Japanese
- `ko` - Korean
- `zh` - Chinese
- 99+ languages supported
Usage Example
```csharp
// Upload audio file
var document = await _documentService.UploadDocumentAsync(
    audioStream,
    "meeting-recording.mp3",
    "audio/mpeg",
    "user-id"
);

// Ask AI about audio file
var response = await _aiService.AskAsync(
    "What topics were discussed in this meeting?",
    "user-id"
);
```
Privacy First
Audio files are processed locally using Whisper.net. Audio data never leaves your machine, which supports GDPR/KVKK/HIPAA compliance requirements.
OCR Configuration
Tesseract OCR enables text extraction from images and PDFs with support for 100+ languages:
Tesseract Language Support
```csharp
// Specify language for OCR when uploading images
var document = await _documentService.UploadDocumentAsync(
    imageStream,
    "invoice.jpg",
    "image/jpeg",
    "user-id",
    language: "eng" // English OCR
);

// Other values for the language argument:
//   language: "tur"      // Turkish OCR
//   language: "tur+eng"  // Multi-language
```
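Multi-language values such as `tur+eng` are individual language codes joined with `+`. A small helper can build that string from a list and avoid stray whitespace or duplicates. This is a sketch with a hypothetical class name; it does not verify that each code is one Tesseract actually has language data for.

```csharp
using System;
using System.Linq;

// Hypothetical helper for building the OCR language argument.
public static class OcrLanguages
{
    // Joins language codes with '+', e.g. ("tur", "eng") becomes "tur+eng".
    public static string Combine(params string[] codes)
    {
        var cleaned = codes
            .Where(code => !string.IsNullOrWhiteSpace(code))
            .Select(code => code.Trim().ToLowerInvariant())
            .Distinct()
            .ToArray();

        if (cleaned.Length == 0)
            throw new ArgumentException("At least one language code is required.", nameof(codes));

        return string.Join("+", cleaned);
    }
}
```

The result can be passed directly as the `language:` argument shown above, for example `language: OcrLanguages.Combine("tur", "eng")`.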
Supported OCR Languages
- `eng` - English
- `tur` - Turkish
- `deu` - German
- `fra` - French
- `spa` - Spanish
- `ita` - Italian
- `rus` - Russian
- `ara` - Arabic
- `chi` - Chinese
- `jpn` - Japanese
- `kor` - Korean
- `hin` - Hindi
- 100+ languages supported
OCR Usage Examples
```csharp
// Invoice analysis
var invoice = await _documentService.UploadDocumentAsync(
    invoiceStream,
    "invoice-2024-01.pdf",
    "application/pdf",
    "user-id",
    language: "eng"
);

var analysis = await _aiService.AskAsync(
    "What products are in this invoice and what is the total amount?",
    "user-id"
);

// ID card analysis
var idCard = await _documentService.UploadDocumentAsync(
    idCardStream,
    "id-card.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
);

var info = await _aiService.AskAsync(
    "What is the person's name and birth date on this ID card?",
    "user-id"
);
```
OCR Capabilities
- ✅ Works well: Printed documents, scanned text, digital screenshots
- ⚠️ Limited support: Handwritten text (very low accuracy)
- 💡 Best results: High-quality scans of printed documents
- 🔒 100% On-Premise: Tesseract runs locally; no data is sent to the cloud
Supported File Formats
Audio Formats:
- `audio/mpeg` - MP3 files
- `audio/wav` - WAV files
- `audio/m4a` - M4A files
- `audio/flac` - FLAC files
- `audio/ogg` - OGG files
Image Formats:
- `image/jpeg` - JPEG images
- `image/png` - PNG images
- `image/tiff` - TIFF images
- `image/bmp` - BMP images
- `image/gif` - GIF images
PDF Formats:
- `application/pdf` - PDF documents (page-by-page OCR)
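Since uploads take a content type alongside the file name, a lookup from file extension to the content types listed above can reject unsupported files early. The sketch below is illustrative: `SupportedFormats` is a hypothetical name, the content types come from the lists in this section, and the extension-to-type pairing (including `.jpeg` and `.tif` aliases) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical lookup; content types are those listed in this section.
public static class SupportedFormats
{
    private static readonly Dictionary<string, string> ContentTypeByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            // Audio
            [".mp3"] = "audio/mpeg",
            [".wav"] = "audio/wav",
            [".m4a"] = "audio/m4a",
            [".flac"] = "audio/flac",
            [".ogg"] = "audio/ogg",
            // Images
            [".jpg"] = "image/jpeg",
            [".jpeg"] = "image/jpeg",
            [".png"] = "image/png",
            [".tif"] = "image/tiff",
            [".tiff"] = "image/tiff",
            [".bmp"] = "image/bmp",
            [".gif"] = "image/gif",
            // PDF
            [".pdf"] = "application/pdf",
        };

    // Returns false when the file extension is not in the supported list.
    public static bool TryGetContentType(string fileName, out string contentType)
    {
        var extension = Path.GetExtension(fileName ?? string.Empty);
        return ContentTypeByExtension.TryGetValue(extension, out contentType);
    }
}
```

For instance, `SupportedFormats.TryGetContentType("meeting-recording.mp3", out var type)` returns `true` with `type` set to `audio/mpeg`, while a `.docx` file returns `false`.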
Audio Quality Tips
- Clear Audio: Avoid background noise and echo
- Good Microphone: Use quality recording equipment
- Correct Language: Specify the correct language of speech
- File Format: MP3, WAV, M4A formats work best
OCR Quality Tips
- High Resolution: At least 300 DPI scan quality
- Clean Image: Avoid blurry or shadowy images
- Correct Language: Specify the correct language of text in image
- Contrast: Prefer high-contrast, black-and-white images
Audio and OCR Comparison
Compare Whisper.net and Tesseract OCR capabilities:
| Feature | Whisper.net | Tesseract OCR |
|---|---|---|
| Data Privacy | 100% On-premise | 100% On-premise |
| Accuracy | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Language Support | ⭐⭐⭐⭐⭐ (99+ languages) | ⭐⭐⭐⭐ (100+ languages) |
| Setup | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Cost | Free | Free |
| Performance | ⭐⭐⭐⭐ | ⭐⭐⭐ |
Security and Privacy
Audio Security
```csharp
// Whisper.net runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveAudioStream,
    "confidential-meeting.mp3",
    "audio/mpeg",
    "user-id"
    // Data is never sent to cloud
);
```
OCR Security
```csharp
// OCR runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
    sensitiveImageStream,
    "confidential-document.jpg",
    "image/jpeg",
    "user-id",
    language: "eng"
    // Data is never sent to cloud
);
```