OCR Extractor
Pulls text out of PDFs and images. Multiple AI engines, multiple languages. Preview a single file live, or batch-process a whole folder.
What it does
Use this to extract data from scanned invoices, digitise old documents, turn image text into editable text, or make a scanned PDF archive searchable.
Note: This tool handles both PDFs and images. For PDF-only work, pdf_tools/ocr is simpler and faster.
How to use
Three modes:
Single (live preview of one file)
- Drag the file.
- The preview appears on the left.
- Pick language and engine.
- Recognised text updates live on the right.
Batch
- Add multiple files or a folder.
- Pick an Output Format: TXT, JSON, DOCX, or searchable PDF.
- Click Run.
Settings
Saves defaults for engine, language, DPI, and confidence threshold.
Supported formats
Input: JPG, PNG, WebP, BMP, TIFF, PDF.
Output: TXT, JSON (structured), DOCX, Searchable PDF.
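To pre-check a folder before adding it to a batch, the input list above can be turned into an extension filter. This is a minimal sketch, not the tool's own code: `collect_inputs` is a hypothetical helper, the `.jpeg` and `.tif` aliases are assumed, and searching subfolders is an assumption about how a folder would be scanned.

```python
from pathlib import Path

# Extensions from the Input list above; ".jpeg" and ".tif" are assumed aliases.
SUPPORTED_INPUT = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tif", ".tiff", ".pdf"}


def collect_inputs(folder):
    """Return supported files under `folder` (subfolders included), sorted by path."""
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_INPUT
    )
```

The suffix is lower-cased before the comparison, so `SCAN.PDF` and `scan.pdf` are treated alike.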
OCR engines
| Engine | Trait |
|---|---|
| Tesseract (default) | Fast, broad language support. |
| EasyOCR | Better on complex text. Speeds up with a GPU. |
| PaddleOCR | Good for Asian languages. GPU supported. |
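
Each engine in the table is a separate install: Tesseract is a system program, while EasyOCR and PaddleOCR are Python packages. The sketch below shows one way to check which of them are present and to fall back when the preferred one is missing. `available_engines` and `pick_engine` are hypothetical helpers; it assumes the binary is named `tesseract` and the packages import as `easyocr` and `paddleocr`.

```python
import importlib.util
import shutil


def available_engines():
    """Report which OCR backends look installed in this environment."""
    return {
        # Tesseract is a system binary, so look for it on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # EasyOCR and PaddleOCR are Python packages; check they can be found.
        "easyocr": importlib.util.find_spec("easyocr") is not None,
        "paddleocr": importlib.util.find_spec("paddleocr") is not None,
    }


def pick_engine(preferred, available):
    """Return the preferred engine if installed, else the first installed one, else None."""
    if available.get(preferred):
        return preferred
    for name, installed in available.items():
        if installed:
            return name
    return None
```

For example, `pick_engine("easyocr", available_engines())` returns `"easyocr"` only when that package is importable, and otherwise the first backend that is.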
Language options
Tesseract can use any language pack installed on the system. The default is English + Turkish (eng, tur). Turkish requires tur.traineddata to be installed for Tesseract.
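Tesseract takes multiple languages as codes joined with `+`, so the default pair becomes `eng+tur`. The sketch below builds that string and fails early when a pack is missing. `tesseract_lang_arg` is a hypothetical helper, and the set of installed codes is passed in by the caller.

```python
def tesseract_lang_arg(codes, installed):
    """Join language codes with '+' after checking that each pack is installed."""
    missing = [code for code in codes if code not in installed]
    if missing:
        raise ValueError(f"Missing Tesseract language packs: {', '.join(missing)}")
    return "+".join(codes)
```

The installed codes can be listed on the command line with `tesseract --list-langs`.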
DPI options
PDF pages are rendered at 150, 200, 300, 400, or 600 DPI. The default, 300, balances speed and accuracy. 600 gives the highest quality but is slow.
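The slowdown at high DPI follows from pixel count, which grows with the square of the DPI. A small sketch, assuming an A4 page (210 x 297 mm); `rendered_size` is a hypothetical helper, not part of the tool:

```python
# A4 page size in inches (210 mm x 297 mm); an assumed page size for illustration.
A4_INCHES = (210 / 25.4, 297 / 25.4)


def rendered_size(dpi, page_inches=A4_INCHES):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 200, 300, 400, 600):
    width, height = rendered_size(dpi)
    print(f"{dpi} DPI: {width} x {height} px, {width * height / 1e6:.1f} megapixels")
```

An A4 page comes out at about 2480 x 3508 pixels at 300 DPI. Doubling to 600 DPI doubles both dimensions, so each page has roughly four times as many pixels to render and recognise.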
Examples
Pull data from an invoice: Single mode, drag the invoice image, pick Turkish, copy the text.
Make a scanned archive searchable: Batch mode, add the folder, output Searchable PDF, run.
Turn photos of handwritten notes into text: Single mode, pick EasyOCR and Turkish. EasyOCR gives better results on cursive.
Extract structured JSON for programmatic use: Batch mode, add the form scans, output JSON, run.
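As an illustration of the programmatic use, the sketch below drops low-confidence words from a JSON result. The tool's actual JSON schema is not described here, so the field names (`file`, `words`, `text`, `confidence`), the 0 to 1 confidence scale, and the sample record are all placeholders to adapt to the real output.

```python
import json

# Hypothetical record; the real schema and values may differ.
SAMPLE = """
{"file": "form_001.png",
 "words": [{"text": "Total", "confidence": 0.97},
           {"text": "l2O", "confidence": 0.41}]}
"""


def confident_text(record, threshold=0.6):
    """Join the words whose confidence meets the threshold (assumed 0..1 scale)."""
    words = [w["text"] for w in record["words"] if w["confidence"] >= threshold]
    return " ".join(words)


record = json.loads(SAMPLE)
print(confident_text(record))
```

With the placeholder record, only the first word passes the 0.6 threshold.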
Watch out
- Tesseract must be installed on the system.
- EasyOCR and PaddleOCR need their Python packages installed.
- Turkish needs an extra language pack.
- If the PDF already has a text layer, OCR is overkill; extract the text directly.
- Very low-resolution images hurt accuracy.
- Handwriting support is limited; typewritten and digital text is the sweet spot.
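The text-layer check from the list above could be done before queuing a PDF. This sketch assumes the third-party `pypdf` package is installed. `has_text_layer` and `read_page_texts` are hypothetical helpers, and the 20-character cutoff is an arbitrary assumption.

```python
def has_text_layer(page_texts, min_chars=20):
    """True if any page already carries at least `min_chars` of extractable text."""
    return any(len(text.strip()) >= min_chars for text in page_texts)


def read_page_texts(pdf_path):
    """Extract text per page with pypdf (assumed installed: pip install pypdf)."""
    # Imported here so has_text_layer stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

If `has_text_layer(read_page_texts("scan.pdf"))` returns true, the text can be taken from `read_page_texts` directly. A PDF that mixes scanned and digital pages would still need OCR on the pages that come back empty.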
License
This tool is available in the Ultimate plan only. It is disabled in the Free and Office plans.