Choosing an OCR in 2025: the checklist
Last update:
February 7, 2025
5 minutes
Lost in the wide range of OCR solutions? Accuracy, speed, ease of use, flexibility, budget: check the 17 key points below to compare data extraction tools. Note that we focus here on short documents with specific formats (invoices, bank statements, tax notices, forms, etc.).
1. Success Rate on Your Use Cases
What is a good success rate for an OCR? For recognizing simple and unique fields, such as the total amount on an invoice, the name of a supplier, or the account holder's name, the achievable success rate is 99%.
For complex fields, like invoice lines with many fields, you can reach 95-96%.
If you're below these standards, it’s worth testing another tool to assess the potential quality gain. However, some documents remain complex, and the technology might not yet be able to meet the challenge.
A key point: providers often display success rates. These remain generic: use cases are always different, and it's essential to put the tool to the test on your documents. For a reliable test, gather a set of 20 documents of the same type to measure quality.
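As a minimal sketch of such a test, the snippet below computes a per-field success rate by comparing extracted values with manually verified ones. The function name, the field names, and the exact-match rule (after trimming whitespace) are illustrative assumptions, not part of any particular OCR tool.

```python
from typing import Dict, List


def field_success_rates(
    predictions: List[Dict[str, str]],
    ground_truth: List[Dict[str, str]],
) -> Dict[str, float]:
    """For each field, return the share of documents where the extracted
    value exactly matches the manually verified value."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total[field] = total.get(field, 0) + 1
            # A missing field counts as an error; whitespace is ignored.
            if predicted.get(field, "").strip() == expected_value.strip():
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}


# Hypothetical two-document test set (a real test would use about 20).
truth = [
    {"total": "120.00", "supplier": "ACME"},
    {"total": "80.50", "supplier": "Globex"},
]
extracted = [
    {"total": "120.00", "supplier": "ACME"},
    {"total": "80.60", "supplier": "Globex"},
]
rates = field_success_rates(extracted, truth)
```

Measuring per field rather than per document makes it easier to see whether a tool falls short on simple fields or only on complex ones.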
2. Human-in-the-loop Process
An OCR solution will never achieve 100% reliability for extracting your fields. The goal is to automatically isolate documents with error risks.
Check whether the tool provides a confidence score to identify files with low confidence. Are scores applied to each extracted field? Are they reliable? Can you set a threshold below which human verification is systematic?
Does the tool allow you to set up alerts when a condition is met? Examples of specific detections: documents longer than average, documents with annexes, documents photographed at an angle, or "off-topic" documents.
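The confidence-score check described above could be sketched as follows. The `ExtractedField` structure, the 0 to 1 score scale, and the 0.90 threshold are assumptions made for illustration; each tool exposes scores in its own format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to range from 0.0 to 1.0


def fields_needing_review(
    fields: List[ExtractedField], threshold: float = 0.90
) -> List[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


def needs_human_review(
    fields: List[ExtractedField], threshold: float = 0.90
) -> bool:
    """A document goes to a human as soon as one field is below the threshold."""
    return bool(fields_needing_review(fields, threshold))


# Hypothetical extraction result for one invoice.
document = [
    ExtractedField("total", "120.00", 0.99),
    ExtractedField("supplier", "ACME", 0.72),
]
to_check = fields_needing_review(document)
```

The right threshold is best found empirically, by comparing scores with actual errors on your own test documents.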
3. API Integration with Your Tools
Does the tool offer an API and a complete SDK (software development kit) with full documentation? The output formats of the API should be common and easily exploitable by developers (JSON, XML, or CSV). Native integrations with your tools can also be considered: Google Drive, Slack, or your ERP.
Beyond receiving or sending information, the API’s features may include creating "document extraction templates" and automatically routing documents to the right template, selecting pages to process, or excluding certain documents.
4. Budget per Page
For a volume of 1,000 to 10,000 pages per month, the budget ranges from €0.08 to €0.40 per page, depending on the tool’s power and features. There are also open-source solutions like Tesseract that can be run directly in the cloud, in which case only hosting costs need to be anticipated.
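To turn the per-page range into a monthly figure, a simple calculation is enough. The sketch below uses the €0.08 and €0.40 bounds quoted above; the function name is hypothetical, and actual pricing depends on each provider's plans.

```python
from decimal import Decimal
from typing import Tuple


def monthly_budget_range(
    pages_per_month: int,
    low_price: Decimal = Decimal("0.08"),
    high_price: Decimal = Decimal("0.40"),
) -> Tuple[Decimal, Decimal]:
    """Return the (minimum, maximum) monthly budget in euros."""
    return (pages_per_month * low_price, pages_per_month * high_price)


# 5,000 pages per month costs between 400 and 2,000 euros at these rates.
low, high = monthly_budget_range(5000)
```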
5. Speed
Processing speed depends on how the OCR works: whether it is a traditional OCR engine with machine learning or an OCR model based on LLMs (to learn more about this difference, refer to our article on the subject).
Traditional OCRs tend to be faster: 1 to 4 seconds per document, compared to 5-10 seconds for LLM-based OCRs using LLM vision technology. Tools like Koncile present a hybrid model combining both techniques to achieve better results.
6. Adding and Editing Custom Fields
If you want to add a field or give specific instructions for formatting output data, turn to OCRs with an LLM component. Traditional OCRs often extract a fixed list of fields from documents.
Let’s take an example: you want to extract a supplier’s name, and you want the tool to "choose" from a list of 5 suppliers, keeping the exact same text. This will be possible in the case of an LLM OCR, where you can specify this condition within a prompt.
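The supplier example could look like the sketch below: one function builds the instruction placed in the prompt, and another verifies that the model's answer is exactly one of the allowed values. The supplier names and both function names are hypothetical, and no specific LLM API is assumed.

```python
from typing import List, Optional

# Hypothetical list of five suppliers.
ALLOWED_SUPPLIERS = [
    "ACME Industries",
    "Globex SAS",
    "Initech Ltd",
    "Umbrella GmbH",
    "Wayne Enterprises",
]


def build_supplier_instruction(allowed: List[str]) -> str:
    """Build the instruction that constrains the answer to the allowed list."""
    options = "\n".join(f"- {name}" for name in allowed)
    return (
        "Extract the supplier name. Answer with exactly one of the following "
        "values, copied character for character:\n" + options
    )


def validate_supplier(answer: str, allowed: List[str]) -> Optional[str]:
    """Return the answer if it is exactly one of the allowed values, else None."""
    cleaned = answer.strip()
    return cleaned if cleaned in allowed else None


instruction = build_supplier_instruction(ALLOWED_SUPPLIERS)
```

Validating the answer after the fact is a useful safeguard, since a prompt constraint alone does not guarantee that the output respects the list.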
7. Usage and Access for Non-developers
The quality test of an OCR should be validated by business experts and the end-users of the extracted data, not just by technical teams. Therefore, check if it’s easy for a non-developer to consult the OCR platform and perform configurations.
In the case of LLM OCRs, a field-definition platform can be specifically designed for business people, so they can provide specific instructions for extraction.
8. Formatting, Correction, and Enrichment of Data
An OCR might only extract raw data from the document. For good data utilization, check if the OCR includes automatic formatting: date in English or French format, numbers in international format, currency format, etc.
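If a tool returns only raw strings, formatting can also be done downstream. Here is a minimal sketch, assuming day-first dates and French-style amounts (space as thousands separator, comma as decimal separator); the function names and the list of accepted formats are illustrative.

```python
from datetime import datetime
from decimal import Decimal
from typing import Sequence


def normalize_date(
    raw: str,
    formats: Sequence[str] = ("%d/%m/%Y", "%d.%m.%Y", "%Y-%m-%d"),
) -> str:
    """Convert a raw date string to ISO 8601 (YYYY-MM-DD).
    Ambiguous dates are read day first, which is an assumption."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def parse_french_amount(raw: str) -> Decimal:
    """Convert an amount such as '1 234,56 €' to Decimal('1234.56')."""
    cleaned = raw
    # Remove regular, non-breaking, and narrow non-breaking spaces, plus the euro sign.
    for char in (" ", "\u00a0", "\u202f", "€"):
        cleaned = cleaned.replace(char, "")
    return Decimal(cleaned.replace(",", "."))


iso_date = normalize_date("07/02/2025")
amount = parse_french_amount("1 234,56 €")
```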
An OCR with LLM features can also enrich and categorize extracted data. For example: deducing the city from a postal code, checking the consistency between two extracted values, or answering a simple yes/no question.
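A consistency check between extracted values can also be expressed as a simple rule. The sketch below assumes an invoice where the net amount plus tax should equal the total, within a one-cent tolerance; the function name and the tolerance are illustrative choices.

```python
from decimal import Decimal


def totals_are_consistent(
    net: Decimal,
    tax: Decimal,
    total: Decimal,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Return True when net + tax matches the total within the tolerance."""
    return abs(net + tax - total) <= tolerance


consistent = totals_are_consistent(
    Decimal("100.00"), Decimal("20.00"), Decimal("120.00")
)
```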
Check other possible enrichment examples on this page of our technical documentation.
9. Performance on Tables
With OCRs, you can extract two types of information: (i) unique information, such as the name of an identity card holder, the name of the supplier on an estimate, or the total amount of an invoice, and (ii) repeated information or information presented in tables.
Only some OCRs, like Koncile, can parse each piece of information from every line of a table and return a file with all the rows.
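To illustrate what "a file with all the rows" can mean in practice, the sketch below writes extracted table rows to CSV. The row structure and column names are assumptions; they do not reflect the output format of Koncile or any other tool.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Serialize extracted table rows to CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical invoice lines extracted from a table.
invoice_lines = [
    {"description": "Cement", "quantity": "10", "unit_price": "7.50"},
    {"description": "Sand", "quantity": "4", "unit_price": "3.20"},
]
csv_text = rows_to_csv(invoice_lines, ["description", "quantity", "unit_price"])
```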
10. Performance on Handwritten Text
Some tools specialize in extracting handwritten text with specific recognition models (HTR – Handwritten Text Recognition). It is important to check the tool’s performance on your documents by testing several types of handwriting (cursive, script, quick annotations, etc.).
Traditional OCRs often struggle to accurately extract complex handwritten text, while models based on LLMs or deep learning offer better performance.
Some tools also allow training models on your own data set to improve recognition of handwriting specific to your domain.
11. Performance on Low-Resolution Photos
Many documents are scanned or photographed with varying quality. Some OCRs offer automatic correction to improve the readability of degraded documents. It is essential to check how the tool handles blurry images, photos with shadows, creases, or slanted documents.
A good OCR will include pre-processing technologies like contrast enhancement, perspective correction, or automatic document straightening.
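As a rough idea of what contrast enhancement does, the sketch below applies a linear contrast stretch to a grayscale image represented as a nested list of pixel values from 0 to 255. This is a deliberately simplified, pure-Python illustration; real tools rely on image-processing libraries and more elaborate methods.

```python
from typing import List


def stretch_contrast(pixels: List[List[int]]) -> List[List[int]]:
    """Rescale grayscale values so the darkest pixel becomes 0
    and the brightest becomes 255."""
    flat = [p for row in pixels for p in row]
    if not flat:
        return [row[:] for row in pixels]
    darkest, brightest = min(flat), max(flat)
    if darkest == brightest:
        # A uniform image has no contrast to stretch.
        return [row[:] for row in pixels]
    scale = 255 / (brightest - darkest)
    return [[round((p - darkest) * scale) for p in row] for row in pixels]


# A washed-out 2x2 image whose values only span 100 to 150.
enhanced = stretch_contrast([[100, 120], [140, 150]])
```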
12. Multilingual Performance and Special Characters
If you handle documents in multiple languages, the OCR must be able to correctly detect and extract information without confusion between similar languages (French/Spanish, German/Dutch, etc.). Additionally, some documents contain specific characters, such as currency symbols (€,$,¥), diacritics (accents, cedillas), or non-Latin alphabets (Cyrillic, Chinese, Arabic). Ensure that the OCR supports these elements.
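When testing special characters, note that the same accented letter can be encoded in several ways in Unicode. The sketch below, whose function name is hypothetical, normalizes both strings to NFC before comparing them, so that a correct extraction is not counted as an error because of encoding alone.

```python
import unicodedata


def same_text(extracted: str, expected: str) -> bool:
    """Compare two strings after Unicode NFC normalization."""
    return unicodedata.normalize("NFC", extracted) == unicodedata.normalize(
        "NFC", expected
    )


# "é" as a single code point versus "e" followed by a combining accent.
matches = same_text("Caf\u00e9", "Cafe\u0301")
```

A dropped accent, on the other hand, still counts as a mismatch: `same_text("Café", "Cafe")` returns `False`.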
13. Performance on Page Breaks
Some short documents may be scanned across several pages, especially when they contain tables, annexes, or separate signatures.
A good OCR must be able to correctly associate data from different pages and reconstruct related information. Managing page breaks is particularly important when processing invoices and bank statements. Check if the tool allows merging the extracted data into a single file or if it automatically segments each page.
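Merging per-page results into a single record could look like the sketch below. The page structure (a `fields` dictionary and a `rows` list per page) and the rule of keeping the first occurrence of a field are assumptions made for illustration.

```python
from typing import Any, Dict, List


def merge_pages(pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-page extraction results into one document-level record."""
    merged: Dict[str, Any] = {"fields": {}, "rows": []}
    for page in pages:
        for name, value in page.get("fields", {}).items():
            # Keep the first occurrence of a field (for example a header
            # repeated on every page).
            merged["fields"].setdefault(name, value)
        # Table rows are concatenated in page order.
        merged["rows"].extend(page.get("rows", []))
    return merged


# Hypothetical two-page bank statement.
pages = [
    {"fields": {"account_holder": "J. Martin"}, "rows": [{"label": "Rent", "amount": "-800.00"}]},
    {"fields": {"closing_balance": "1520.00"}, "rows": [{"label": "Salary", "amount": "2400.00"}]},
]
statement = merge_pages(pages)
```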
14. Automatic Document Categorization
An advanced OCR does not just extract text; it should also be able to classify documents automatically by type. For example, it can recognize that a file is an invoice rather than a bank statement, or identify the supplier of a document automatically.
This feature is particularly useful if you handle large volumes of varied documents. Some OCRs use AI models to classify documents and direct information extraction to the appropriate processing templates.
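To make the routing idea concrete, here is a deliberately naive sketch that classifies a document by counting keywords. It stands in for the AI models mentioned above and is not how they work; the keyword lists and type names are hypothetical, and substring matching can produce false positives.

```python
from typing import Dict, Tuple

# Hypothetical keywords per document type.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "invoice": ("invoice", "vat", "amount due"),
    "bank_statement": ("statement", "balance", "iban"),
}


def classify_document(text: str) -> str:
    """Return the document type with the most keyword hits, or 'unknown'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(keyword in lowered for keyword in keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


doc_type = classify_document("INVOICE no. 42 - VAT 20% - Amount due: 120.00")
```

The returned type can then be used to select the extraction template applied to the document.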
15. Security and Compliance
Data security is a key factor when choosing an OCR, especially if you're handling sensitive documents containing personal or financial information. Check if the solution complies with applicable standards, such as GDPR (for Europe), CCPA (for California), or ISO 27001.
Ensure that the tool also offers encryption mechanisms for data in transit and at rest, as well as access controls to limit the risk of data leaks.
16. Data Storage
Some OCR solutions store processed documents temporarily or permanently on their servers. It’s important to understand where and for how long these data are kept. If you are handling sensitive documents, prioritize solutions that offer immediate file deletion after processing or allow hosting the data on your own infrastructure.
Also, check if the tool integrates with your existing storage solutions, such as Google Drive, AWS S3, or an internal server.
17. On-premise Deployment
If you have strict confidentiality requirements or if your company prohibits the use of external cloud services, it may be essential to choose an OCR solution that offers “on-premise” deployment (installed on your own servers). This will give you full control over your data and allow you to tailor processing capabilities to your internal needs.
However, not all OCR solutions offer this option. Also, verify if on-premise deployment requires a powerful server and what the maintenance and update constraints are.