Audio & OCR
SmartRAG audio and OCR configuration - Whisper.net and Tesseract OCR settings
Audio & OCR Configuration
SmartRAG converts audio files to text with Whisper.net and extracts text from images and PDFs with Tesseract OCR.
Whisper.net (Local Audio Transcription)
Whisper.net provides local, on-premise audio transcription with support for 99+ languages:
WhisperConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `ModelPath` | `string` | `"models/ggml-large-v3.bin"` | Path to Whisper model file |
| `DefaultLanguage` | `string` | `"auto"` | Language code for transcription |
| `MinConfidenceThreshold` | `double` | `0.3` | Minimum confidence score (0.0-1.0) |
| `IncludeWordTimestamps` | `bool` | `false` | Include word-level timestamps |
| `PromptHint` | `string` | `""` | Context hint for better accuracy |
| `MaxThreads` | `int` | `0` | CPU threads (0 = auto-detect) |
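To make the two numeric parameters more concrete, here is a minimal sketch of how a confidence threshold and a thread-count setting of this kind are commonly interpreted. It does not show SmartRAG's internal implementation: the `(Text, Confidence)` tuple shape and the helpers `FilterByConfidence` and `ResolveThreadCount` are hypothetical, and mapping `0` to the logical processor count is an assumption about what "auto-detect" means.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical segment shape (Text, Confidence); not a SmartRAG or Whisper.net type.
List<(string Text, double Confidence)> FilterByConfidence(
    IEnumerable<(string Text, double Confidence)> segments, double minConfidence)
{
    if (minConfidence < 0.0 || minConfidence > 1.0)
        throw new ArgumentOutOfRangeException(nameof(minConfidence), "Expected a value between 0.0 and 1.0.");

    // Keep only segments whose confidence reaches the threshold.
    return segments.Where(s => s.Confidence >= minConfidence).ToList();
}

// Assumed interpretation of MaxThreads: 0 (or less) means "use all logical cores".
int ResolveThreadCount(int maxThreads) =>
    maxThreads > 0 ? maxThreads : Environment.ProcessorCount;

var segments = new List<(string Text, double Confidence)>
{
    ("Welcome to the meeting.", 0.92),
    ("[inaudible]", 0.12),
};

var kept = FilterByConfidence(segments, 0.3);
Console.WriteLine(kept.Count);            // 1
Console.WriteLine(ResolveThreadCount(4)); // 4
```

Under this reading, a higher threshold trades completeness for cleaner transcripts, while the default of 0.3 only drops segments the model is very unsure about.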
Whisper Model Sizes
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| `tiny` | 75MB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Fast prototyping |
| `base` | 142MB | ⭐⭐⭐⭐ | ⭐⭐⭐ | Balanced performance |
| `small` | 244MB | ⭐⭐⭐ | ⭐⭐⭐⭐ | Good accuracy |
| `medium` | 769MB | ⭐⭐ | ⭐⭐⭐⭐⭐ | High accuracy |
| `large-v3` | 1.5GB | ⭐ | ⭐⭐⭐⭐⭐ | Best accuracy |
Model Download
Whisper.net automatically downloads GGML models from Hugging Face on first use. Models are saved to the path specified in ModelPath configuration:
Automatic Download:
- Models are downloaded automatically when first used via `WhisperGgmlDownloader`
- Downloaded from Hugging Face repository
- Saved to the path specified in `ModelPath` (default: `models/ggml-large-v3.bin`)
- No manual download required
Model Files:
- Format: `ggml-{model-name}.bin` (e.g., `ggml-base.bin`, `ggml-large-v3.bin`)
- Available models: `tiny`, `base`, `small`, `medium`, `large-v3`
- First use downloads the model automatically (~5-10 minutes depending on connection and model size)
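The naming convention above can be sketched as a small helper that turns a model name into a `ModelPath` value. `BuildModelPath` is a hypothetical helper for illustration, not part of SmartRAG or Whisper.net; it only encodes the `ggml-{model-name}.bin` pattern and the model names listed in this section.

```csharp
using System;
using System.Linq;

// Model names listed in this section.
string[] availableModels = { "tiny", "base", "small", "medium", "large-v3" };

// Hypothetical helper: builds a ModelPath value that follows
// the ggml-{model-name}.bin naming convention.
string BuildModelPath(string directory, string modelName)
{
    if (!availableModels.Contains(modelName))
        throw new ArgumentException($"Unknown Whisper model '{modelName}'.", nameof(modelName));

    // Forward slashes match the paths used in this document.
    return $"{directory}/ggml-{modelName}.bin";
}

Console.WriteLine(BuildModelPath("models", "base")); // models/ggml-base.bin
```

Rejecting unknown names early may help catch typos such as `large_v3` before the first download attempt.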
Configuration:
{
"SmartRAG": {
"WhisperConfig": {
"ModelPath": "models/ggml-large-v3.bin"
}
}
}
Important Notes:
- Whisper.net uses its own GGML model format and download system
- This is independent of Ollama, LM Studio, or cloud services
- Models are stored locally at the `ModelPath` location
- For on-premise deployments, ensure the application has write access to the model directory
- For cloud deployments, consider pre-downloading models or using persistent storage volumes
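The write-access note above can be checked at startup with a sketch like the following. `CanProvideModel` is a hypothetical helper, not a SmartRAG API; it uses only `System.IO` and assumes that a model is usable if the file already exists, or if its directory can be created and written to so that the automatic download can succeed.

```csharp
using System;
using System.IO;

// Hypothetical startup check: returns true when the model file already exists,
// or when its directory is writable so the model can be downloaded into it.
// Note: this creates the model directory if it does not exist yet.
bool CanProvideModel(string modelPath)
{
    if (File.Exists(modelPath))
        return true; // Pre-downloaded model; no write access needed.

    string directory = Path.GetDirectoryName(Path.GetFullPath(modelPath))
                       ?? Directory.GetCurrentDirectory();
    try
    {
        Directory.CreateDirectory(directory);

        // Write and remove a small probe file to confirm write access.
        string probe = Path.Combine(directory, $".write-test-{Guid.NewGuid():N}");
        File.WriteAllBytes(probe, Array.Empty<byte>());
        File.Delete(probe);
        return true;
    }
    catch (Exception ex) when (ex is UnauthorizedAccessException || ex is IOException)
    {
        return false;
    }
}

Console.WriteLine(CanProvideModel("models/ggml-large-v3.bin"));
```

In a container with a read-only file system this returns `false` unless the model was baked into the image or a persistent volume is mounted at the model directory.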
Configuration Example
{
"SmartRAG": {
"WhisperConfig": {
"ModelPath": "models/ggml-large-v3.bin",
"DefaultLanguage": "auto",
"MinConfidenceThreshold": 0.3,
"IncludeWordTimestamps": false,
"PromptHint": "",
"MaxThreads": 0
}
}
}
builder.Services.AddSmartRag(configuration, options =>
{
options.WhisperConfig = new WhisperConfig
{
ModelPath = "models/ggml-large-v3.bin",
DefaultLanguage = "auto",
MinConfidenceThreshold = 0.3,
IncludeWordTimestamps = false,
PromptHint = "",
MaxThreads = 0
};
});
DefaultLanguage values:
- `auto` - Automatic language detection (recommended)
- `en` - English
- `tr` - Turkish
- `de` - German
- `fr` - French
- `es` - Spanish
- `it` - Italian
- `ru` - Russian
- `ja` - Japanese
- `ko` - Korean
- `zh` - Chinese
- 99+ languages supported
Usage Example
// Upload audio file
var document = await _documentService.UploadDocumentAsync(
audioStream,
"meeting-recording.mp3",
"audio/mpeg",
"user-id"
);
// Ask AI about audio file
var response = await _aiService.AskAsync(
"What topics were discussed in this meeting?",
"user-id"
);
Privacy First
Audio files are processed locally using Whisper.net. No audio data leaves your machine, which supports GDPR/KVKK/HIPAA compliance requirements.
OCR Configuration
Tesseract OCR enables text extraction from images and PDFs with support for 100+ languages:
Tesseract Language Support
// Specify language for OCR when uploading images
var document = await _documentService.UploadDocumentAsync(
imageStream,
"invoice.jpg",
"image/jpeg",
"user-id",
language: "eng" // English OCR
);
// Other values for the language parameter:
// language: "tur"      -> Turkish OCR
// language: "tur+eng"  -> Multi-language OCR (Turkish and English)
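Multi-language values follow Tesseract's convention of joining language codes with `+`. As a minimal sketch, a value for the `language` parameter can be built from a list of codes; `BuildOcrLanguage` is a hypothetical helper and not part of SmartRAG.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: joins Tesseract language codes with '+',
// dropping blank entries and duplicates.
string BuildOcrLanguage(IEnumerable<string> codes)
{
    var cleaned = codes
        .Select(c => c.Trim())
        .Where(c => c.Length > 0)
        .Distinct()
        .ToList();

    if (cleaned.Count == 0)
        throw new ArgumentException("At least one language code is required.", nameof(codes));

    return string.Join("+", cleaned);
}

Console.WriteLine(BuildOcrLanguage(new[] { "tur", "eng" })); // tur+eng
```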
Supported OCR Languages
- `eng` - English
- `tur` - Turkish
- `deu` - German
- `fra` - French
- `spa` - Spanish
- `ita` - Italian
- `rus` - Russian
- `ara` - Arabic
- `chi` - Chinese
- `jpn` - Japanese
- `kor` - Korean
- `hin` - Hindi
- 100+ languages supported
OCR Usage Examples
// Invoice analysis
var invoice = await _documentService.UploadDocumentAsync(
invoiceStream,
"invoice-2024-01.pdf",
"application/pdf",
"user-id",
language: "eng"
);
var analysis = await _aiService.AskAsync(
"What products are in this invoice and what is the total amount?",
"user-id"
);
// ID card analysis
var idCard = await _documentService.UploadDocumentAsync(
idCardStream,
"id-card.jpg",
"image/jpeg",
"user-id",
language: "eng"
);
var info = await _aiService.AskAsync(
"What is the person's name and birth date on this ID card?",
"user-id"
);
OCR Capabilities
- ✅ Works well: Printed documents, scanned text, digital screenshots
- ⚠️ Limited support: Handwritten text (very low accuracy)
- 💡 Best results: High-quality scans of printed documents
- 🔒 100% On-Premise: Tesseract runs locally, so no data is sent to the cloud
Supported File Formats
Audio Formats:
- `audio/mpeg` - MP3 files
- `audio/wav` - WAV files
- `audio/m4a` - M4A files
- `audio/flac` - FLAC files
- `audio/ogg` - OGG files

Image Formats:
- `image/jpeg` - JPEG images
- `image/png` - PNG images
- `image/tiff` - TIFF images
- `image/bmp` - BMP images
- `image/gif` - GIF images

PDF Formats:
- `application/pdf` - PDF documents (page-by-page OCR)
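When calling `UploadDocumentAsync`, the content type has to match one of the formats listed above. The following sketch derives it from the file name. The `GetContentType` helper is hypothetical, and the file extensions chosen for each content type (for example `.jpg` and `.jpeg`, `.tif` and `.tiff`) are assumptions, since this section lists only content types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Extension-to-content-type map built from the formats listed in this section.
// The extensions themselves are assumptions; lookups ignore case.
var contentTypes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    [".mp3"] = "audio/mpeg",
    [".wav"] = "audio/wav",
    [".m4a"] = "audio/m4a",
    [".flac"] = "audio/flac",
    [".ogg"] = "audio/ogg",
    [".jpg"] = "image/jpeg",
    [".jpeg"] = "image/jpeg",
    [".png"] = "image/png",
    [".tif"] = "image/tiff",
    [".tiff"] = "image/tiff",
    [".bmp"] = "image/bmp",
    [".gif"] = "image/gif",
    [".pdf"] = "application/pdf",
};

// Hypothetical helper: resolves the content type from the file extension
// and fails fast for formats that are not listed.
string GetContentType(string fileName)
{
    string extension = Path.GetExtension(fileName);
    return contentTypes.TryGetValue(extension, out var contentType)
        ? contentType
        : throw new NotSupportedException($"Unsupported file type: '{extension}'.");
}

Console.WriteLine(GetContentType("meeting-recording.mp3")); // audio/mpeg
```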
Audio Quality Tips
- Clear Audio: Avoid background noise and echo
- Good Microphone: Use quality recording equipment
- Correct Language: Specify the correct language of speech
- File Format: MP3, WAV, M4A formats work best
OCR Quality Tips
- High Resolution: At least 300 DPI scan quality
- Clean Image: Avoid blurry or shadowy images
- Correct Language: Specify the correct language of text in image
- Contrast: Prefer high-contrast, black-and-white images
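The 300 DPI guideline can be checked before upload when the physical size of the scanned page is known. This is a rough sketch under that assumption; `EffectiveDpi` and `MeetsOcrResolution` are hypothetical helpers, and the effective resolution is simply pixels divided by inches.

```csharp
using System;

// Effective DPI = pixel count along one edge / physical length of that edge in inches.
double EffectiveDpi(int pixels, double inches)
{
    if (inches <= 0)
        throw new ArgumentOutOfRangeException(nameof(inches), "Physical size must be positive.");
    return pixels / inches;
}

// Hypothetical check against the 300 DPI guideline from the tips above.
bool MeetsOcrResolution(int widthPx, double widthInches, double minDpi = 300) =>
    EffectiveDpi(widthPx, widthInches) >= minDpi;

// US Letter page (8.5 in wide) scanned 2550 px across: 2550 / 8.5 = 300 DPI.
Console.WriteLine(MeetsOcrResolution(2550, 8.5)); // True
// The same page at 1275 px across is only 150 DPI.
Console.WriteLine(MeetsOcrResolution(1275, 8.5)); // False
```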
Audio and OCR Comparison
Compare Whisper.net and Tesseract OCR capabilities:
| Feature | Whisper.net | Tesseract OCR |
|---|---|---|
| Data Privacy | 100% On-premise | 100% On-premise |
| Accuracy | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Language Support | ⭐⭐⭐⭐⭐ (99+ languages) | ⭐⭐⭐⭐ (100+ languages) |
| Setup | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Cost | Free | Free |
| Performance | ⭐⭐⭐⭐ | ⭐⭐⭐ |
Security and Privacy
Audio Security
// Whisper.net runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
sensitiveAudioStream,
"confidential-meeting.mp3",
"audio/mpeg",
"user-id"
// Data is never sent to cloud
);
OCR Security
// OCR runs completely on-premise
var document = await _documentService.UploadDocumentAsync(
sensitiveImageStream,
"confidential-document.jpg",
"image/jpeg",
"user-id",
language: "eng"
// Data is never sent to cloud
);