Safetica content detection of sensitive data
Posted by Štěpán Horký, Last modified by Michael Skoupý on 07 November 2019 02:58 PM
Safetica 9.1 brings a more flexible configuration of sensitive data detection, which allows more accurate results and a lower rate of false-positive matches.
With the new content detection configuration you can:
A detection rule is a set of conditions which, when all of them are met, cause the data to be evaluated as sensitive.
The example configuration could be used to detect financial documents.
Detection rule #1 is matched when a credit card number AND the word "card" are found in a document within a range of 1,800 characters. Detection rule #2 is matched when the word "invoice" is found. Detection rule #3 is matched when the word "order" is found at least 5 times.
If detection rule #1 OR detection rule #2 OR detection rule #3 is matched anywhere in the file, the document will be evaluated as sensitive.
If you specify several conditions within 1 detection rule, all of these must be matched. In other words, the relationship between conditions within 1 detection rule is "AND". For the example above, detection rule #1 is matched when both a credit card number AND the word "card" are found.
If you specify several detection rules within 1 data category, at least 1 of these must be matched. The relationship between several detection rules is "OR". For the example above, at least one of the three detection rules must be matched to detect the configured data category.
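The AND/OR relationship described above can be sketched in a few lines of Python. This is only an illustration, not Safetica's implementation: the names `rule_matches` and `category_matches`, the representation of a rule as a list of regular expressions, and the simplified 16-digit card pattern are all assumptions made for the example. The detection range and threshold are deliberately left out, and only detection rules #1 and #2 are modelled.

```python
import re

# Simplified stand-in for a credit card detector (assumption, 16 digits only).
CARD_NUMBER = r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"

# Each detection rule is a list of conditions (regular expressions).
RULE_1 = [CARD_NUMBER, r"\bcard\b"]   # credit card number AND the word "card"
RULE_2 = [r"\binvoice\b"]             # the word "invoice"


def rule_matches(conditions, text):
    """Conditions inside one detection rule are combined with AND."""
    return all(re.search(c, text, re.IGNORECASE) for c in conditions)


def category_matches(rules, text):
    """Detection rules inside one data category are combined with OR."""
    return any(rule_matches(rule, text) for rule in rules)


if __name__ == "__main__":
    rules = [RULE_1, RULE_2]
    print(category_matches(rules, "Please charge card 4111 1111 1111 1111"))
    print(category_matches(rules, "Number 4111 1111 1111 1111"))
```

In this sketch a text containing only a card number is not flagged, because the second condition of rule #1 is missing and rule #2 does not apply.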
The detection range increases the accuracy of results and lowers the number of false positive matches.
Detection range in practice means that the conditions of an "AND" rule must all be matched within a range of 1,800 characters - roughly the amount of text which fits on a single A4 page. This range is applied to a plain text version of files and does not consider actual document pages.
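One way to picture the detection range is as a sliding window over the plain text. The following sketch is an assumption about how such a check could work, not a description of the product's internals: the function name `matches_within_range`, the use of regular expressions, and the choice to require every match to lie fully inside the window are all hypothetical.

```python
import re

DETECTION_RANGE = 1800  # characters, as stated in the article


def matches_within_range(text, patterns, window=DETECTION_RANGE):
    """Return True if every pattern has a match inside one window of text."""
    # Collect (start, end) spans of all matches for each pattern.
    spans = [
        [m.span() for m in re.finditer(p, text, re.IGNORECASE)]
        for p in patterns
    ]
    if any(not s for s in spans):
        return False  # at least one condition never matches at all

    # Try a window that begins at the start of each match in turn.
    for lo in sorted(start for s in spans for start, _ in s):
        hi = lo + window
        if all(any(lo <= a and b <= hi for a, b in s) for s in spans):
            return True
    return False


if __name__ == "__main__":
    card = r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"  # simplified pattern
    word = r"\bcard\b"
    near = "card 4111 1111 1111 1111"
    far = "card" + " " * 3000 + "4111 1111 1111 1111"
    print(matches_within_range(near, [card, word]))
    print(matches_within_range(far, [card, word]))
```

With these inputs, the word and the number in `far` are more than 1,800 characters apart, so the sketch does not treat them as a match even though both conditions occur in the text.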
The threshold setting specifies the number of occurrences of the detection rule which must be reached to evaluate the data as sensitive.
Setting the threshold to "1" will detect every occurrence of the detection rule - this is suitable for covering all files which contain at least a single occurrence of sensitive content. Setting the threshold to "100" will only detect data where the detection rule is matched at least 100 times in a single file. This is a trade-off: a threshold of "1" may generate more false positive results but detects every file which fulfills the condition. A threshold of "100", on the other hand, largely eliminates false positives and only detects files which contain a large amount of sensitive data.
The default threshold value is set to "5" to lower the number of false positive matches.
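The effect of the threshold can be illustrated with a small counting sketch. It assumes a detection rule with a single keyword condition, since the article does not say how occurrences of multi-condition rules are counted; the function names `count_occurrences` and `threshold_reached` are hypothetical.

```python
import re

DEFAULT_THRESHOLD = 5  # default value mentioned in the article


def count_occurrences(text, keyword):
    """Count whole-word, case-insensitive occurrences of a keyword."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))


def threshold_reached(text, keyword, threshold=DEFAULT_THRESHOLD):
    """A rule only fires once its occurrence count reaches the threshold."""
    return count_occurrences(text, keyword) >= threshold


if __name__ == "__main__":
    document = "Order 1. Order 2. Order 3. Order 4."
    print(count_occurrences(document, "order"))
    print(threshold_reached(document, "order"))               # default of 5
    print(threshold_reached(document, "order", threshold=1))  # any occurrence
```

The same document is flagged with a threshold of "1" but not with the default of "5", which is the trade-off between coverage and false positives described above.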
After updating to version 9.1, existing sensitive content configurations will be converted to the new system. Previously, sensitive data detection only used OR conditions. To maintain this logic, existing configurations will be converted into individual detection rules.
The new detection rules are backward compatible with older client versions to a certain degree:
On older endpoint clients, detection rule #1 is ignored because it includes multiple conditions. Detection rules #2 and #3 are both applied with threshold at 5, since it is the highest set value. Consequently, this is the applied configuration:
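As a rough illustration of the fallback just described, the sketch below drops multi-condition rules and applies the highest threshold to the rules that remain. The `Rule` class and `legacy_view` function are hypothetical, the thresholds of "1" for rules #1 and #2 are assumed for the example, and taking the maximum over the remaining rules only is one possible reading of "the highest set value".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    conditions: tuple   # one or more condition labels
    threshold: int = 1


def legacy_view(rules):
    """Approximate how an older client might interpret the new rules."""
    # Rules with several conditions are ignored by older clients.
    single = [r for r in rules if len(r.conditions) == 1]
    if not single:
        return []
    # Every remaining rule gets the highest threshold that was configured.
    shared = max(r.threshold for r in single)
    return [Rule(r.conditions, shared) for r in single]


if __name__ == "__main__":
    configured = [
        Rule(("credit card number", "card"), 1),  # rule #1 (threshold assumed)
        Rule(("invoice",), 1),                    # rule #2 (threshold assumed)
        Rule(("order",), 5),                      # rule #3
    ]
    for rule in legacy_view(configured):
        print(rule.conditions[0], rule.threshold)
```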
Safetica 9.3+ supports importing custom keyword dictionaries. These can contain a list of customer names or identifiers, project names, technical terms or other keywords which often signify sensitive content (common words and phrases in contracts, personal CVs, etc.).
In order to preserve endpoint performance, certain limitations are imposed on using custom dictionaries: