How OCR works

Enable and set up OCR to detect sensitive content in images, photos, and scanned documents. Learn about the limitations of OCR, how to enable/disable OCR globally or for specific devices, and how to select OCR languages to enhance detection accuracy.

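In this article, you will learn:

  • How OCR works in Safetica
  • OCR limitations
  • How to enable/disable OCR globally
  • How to activate/deactivate OCR for a specific device
  • How to select an OCR language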

How OCR works in Safetica

Safetica analyzes the content of selected file types. With OCR, you can extend this analysis to images, such as photos and scanned documents. OCR converts text in images into machine-encoded text, which is particularly useful for recognizing text in documents that were digitized with a scanner. Once Safetica has converted the text from the image, it applies standard sensitive content analysis to it.
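The two stages described above (text extraction, then content analysis) can be pictured with a minimal sketch. It is illustrative only and does not show Safetica's implementation: `fake_ocr` is a hypothetical stand-in for an OCR engine, and `CARD_PATTERN` is a deliberately simplified, assumed pattern for a 16-digit card number.

```python
import re
from typing import Callable, List

# Assumed, simplified pattern: 16 digits in four groups of four,
# optionally separated by a space or a hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def analyze_image(image_bytes: bytes, ocr: Callable[[bytes], str]) -> List[str]:
    """Run OCR on an image, then search the extracted text for sensitive content."""
    text = ocr(image_bytes)             # stage 1: image -> machine-encoded text
    return CARD_PATTERN.findall(text)   # stage 2: content analysis on that text


def fake_ocr(image_bytes: bytes) -> str:
    """Hypothetical OCR engine that always 'reads' the same scanned invoice."""
    return "Invoice 2024-17\nCard: 4111 1111 1111 1111\nThank you"
```

Calling `analyze_image(b"", fake_ocr)` returns `["4111 1111 1111 1111"]`. The point of the sketch is that the analysis step never looks at pixels, only at the text the OCR step produced, so anything OCR misreads cannot be detected.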

Our OCR supports the following image types: .png, .tiff, .jpg, .jpeg, .jpe, .bmp.

Safetica can also extract images from other file types, such as .pdf documents, presentation files, and ebooks. You can find the list of supported formats here.
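If you want to check in advance which files in a test set are image types that OCR processes directly, the extension list above can be expressed as a small helper. This is a sketch under assumptions: the function name `is_ocr_image` is hypothetical, and matching extensions case-insensitively is an assumption, not documented Safetica behavior.

```python
from pathlib import Path

# Image extensions listed in this article as directly supported by OCR.
OCR_IMAGE_EXTENSIONS = {".png", ".tiff", ".jpg", ".jpeg", ".jpe", ".bmp"}


def is_ocr_image(filename: str) -> bool:
    """Return True if the file extension is one of the listed image types."""
    # Lowercasing the suffix is an assumption made for this sketch.
    return Path(filename).suffix.lower() in OCR_IMAGE_EXTENSIONS
```

For example, `is_ocr_image("scan.JPG")` is true, while `is_ocr_image("report.pdf")` is false, because a .pdf file is not itself an image even though images can be extracted from it.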

By default, OCR is a global setting that applies to all devices. It can also be activated or deactivated for specific devices.

OCR limitations

Safetica's OCR technology is primarily designed to extract sensitive content from scanned text documents. The scan quality is optimized to balance reasonable performance demands and the highest possible accuracy. Therefore, the technology may have difficulty processing certain types of text, such as:

  • text blended with background
  • low-quality images
  • scattered words without a clear paragraph structure
  • handwritten text
  • and more.

OCR is not bulletproof. We recommend testing its accuracy on sample documents and using it as a complement to other security measures. If OCR does not work as you expect, please contact Safetica Support.

 

How to enable/disable OCR globally

To enable or disable OCR for all devices in your company:

  1. In the Safetica console, go to Data classification and click Content analysis settings.
  2. Select or clear the OCR checkbox to enable or disable OCR for all devices.
  3. Save your settings.

 

How to activate/deactivate OCR for a specific device

  1. Open the Safetica Maintenance Console and go to Maintenance > Endpoint deactivation.
  2. In the user tree, select the device or group for which you want to activate/deactivate OCR.
  3. In the Protection features section, use the Optical character recognition slider to select the desired option.



How to select an OCR language

OCR only recognizes characters that are present in the selected character set, so you must choose the character set of the language you want to detect.

You can choose two different languages for OCR to scan, which can improve detection accuracy in multilingual environments. However, because a secondary language impacts device performance, we recommend setting one only if you need to analyze texts written in different scripts, such as Latin, Cyrillic, or Chinese (for example, English and Chinese). A document in German will most likely be processed correctly with the English character set, since both languages share the same alphabet.

The character sets of languages without special characters are subsets of the character sets of languages with special characters. For example, if you set Czech or German as your primary language, you do not need to set English as a secondary language, because all English characters are already contained in the Czech and German character sets.
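The subset relationship can be illustrated with ordinary sets. The character sets below are rough approximations made up for this sketch (lowercase letters only) and are not the sets Safetica uses. The helper `needs_secondary` is likewise hypothetical.

```python
import string

# Approximate lowercase character sets, for illustration only.
ENGLISH = set(string.ascii_lowercase)
GERMAN = ENGLISH | set("äöüß")
CZECH = ENGLISH | set("áčďéěíňóřšťúůýž")
RUSSIAN = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")


def needs_secondary(primary: set, secondary: set) -> bool:
    """Return True if the secondary set adds characters the primary set lacks."""
    return not secondary <= primary
```

With these sets, `needs_secondary(GERMAN, ENGLISH)` and `needs_secondary(CZECH, ENGLISH)` are both false, because every English character is already covered. `needs_secondary(ENGLISH, RUSSIAN)` is true, which corresponds to the case where a secondary language is worth its performance cost.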

 

Read next:

Data classification in Safetica

Data classification: What is Safetica unified classification

Data classification: How to create a new data classification

Data classification: How to select file types for content analysis

Policies: How they work in Safetica