Learn more about creating regular expressions to detect sensitive content in Safetica.
In this article, you will find:
- What regex standard Safetica uses
- What are regular expressions
- Tips for creating regular expressions
- Examples of regular expressions
Introduction: What regex standard Safetica uses
Safetica uses the ECMAScript standard for regexes. You can validate your regexes here. Do not forget to select the ECMAScript option in the menu on the left.
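Because the standard is ECMAScript, a quick syntax check can also be run in any JavaScript or TypeScript environment. The sketch below is only an illustration: `isValidRegex` is a hypothetical helper, not part of Safetica, and it checks syntax only, not whether the expression finds the content you expect.

```typescript
// Returns true when the pattern compiles as an ECMAScript regular expression.
// Hypothetical helper for illustration; not a Safetica function.
function isValidRegex(pattern: string): boolean {
  try {
    new RegExp(pattern);
    return true;
  } catch {
    return false;
  }
}

const compiles = isValidRegex("Invoice[0-9]{4}"); // true
const rejected = isValidRegex("Invoice[0-9");     // false: the square bracket is never closed
```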
Data classifications in Safetica allow you to identify a wide range of sensitive content. Safetica comes with many predefined classification elements (e.g. credit card numbers, IBANs, national IDs), which can be easily selected when creating a data classification.
Some types of sensitive content, however, are more difficult to detect (e.g. invoice numbers or PINs). For these types, you can create a regular expression that finds the sensitive content within files.
What are regular expressions
A regular expression is a sequence of characters that defines a search pattern. This pattern is compared with individual words in the file being analyzed for sensitive content. Words are delimited by specific characters, such as: space, hyphen (-), comma (,), slash (/), and brackets ({}, [], ()).
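The word-by-word comparison can be pictured with a small sketch. It assumes that only the delimiters named above are used, which may not match the product's real behavior, and `splitIntoWords` and `anyWordMatches` are hypothetical helper names.

```typescript
// Delimiters named in the article: space, hyphen, comma, slash and brackets.
// Assumption: the real product may use a different or larger set.
const WORD_DELIMITERS = /[\s\-,\/{}\[\]()]+/;

// Splits text into words and drops the empty pieces left by leading delimiters.
function splitIntoWords(text: string): string[] {
  return text.split(WORD_DELIMITERS).filter((word) => word.length > 0);
}

// Returns true when at least one word matches the pattern.
function anyWordMatches(text: string, pattern: RegExp): boolean {
  return splitIntoWords(text).some((word) => pattern.test(word));
}

const sampleWords = splitIntoWords("Payment for Invoice4343, due 2024/05/01");
// ["Payment", "for", "Invoice4343", "due", "2024", "05", "01"]

const wordHit = anyWordMatches("Payment for Invoice4343, due soon", /Invoice[0-9]{4}/);
// true
```

Under this assumption, a pattern has to fit inside a single word: "Invoice 4343" is split into two words, so Invoice[0-9]{4} would not match it.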
You can form regular expressions from the following characters:
- letters [a-z],[A-Z]
- numbers [0-9]
- characters £&_–@,
- meta characters .*+[](){}
Each meta character has a special meaning.
. A dot matches any single character
* An asterisk matches zero or more repetitions of the preceding character or group
+ A plus sign matches one or more repetitions of the preceding character or group
[ ] Square brackets contain a list of characters. For example, [abc] matches any single character that is either a, b, or c
{} Curly brackets contain the number of repetitions of the preceding character or group. For example, (a){3} matches “aaa”
() Round brackets group parts of an expression together, which helps when the expression includes separators
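The following sketch checks each meta character against a short sample. `matchesWhole` is a hypothetical helper that wraps the pattern in ^ and $ so that the whole sample has to match, which makes the difference between * and + visible.

```typescript
// Wraps the pattern so that it must cover the entire sample text.
function matchesWhole(pattern: string, sample: string): boolean {
  return new RegExp("^(?:" + pattern + ")$").test(sample);
}

const metaChecks: boolean[] = [
  matchesWhole("a.c", "abc"),      // true:  . stands for any single character
  matchesWhole("ab*c", "ac"),      // true:  * allows zero occurrences of b
  matchesWhole("ab+c", "ac"),      // false: + needs at least one b
  matchesWhole("[abc]", "b"),      // true:  b is in the list
  matchesWhole("(a){3}", "aaa"),   // true:  exactly three repetitions of a
  matchesWhole("(ab)+", "ababab"), // true:  the group ab is repeated as a unit
];
```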
How you combine these characters depends on the content you are searching for. Generally, you can get the same results using various regular expressions.
Tips for creating regular expressions
Some tips for creating regular expressions:
- [aA][bB][cC][dD][a-zA-Z] – ignores case. Listing both the lower-case and the upper-case form of each letter makes the expression match either one. Write letter ranges as [a-zA-Z] without a comma, because a comma inside square brackets is matched literally.
- [0-9]{1,9} – matches a variable number of digits. This example matches 1 to 9 digits.
- ([a-zA-Z]{5})-([0-9]{5}) – round brackets group expressions together. This can help when the expression includes a separator. This example matches 5 letters and 5 digits separated by a hyphen.
- .*[a-z].* – the .* on each side matches any number of any characters before or after the expression. This is helpful when the expression is surrounded by other characters or lies in the middle of a word.
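The tips can be tried out in a short sketch. The sample strings are made up, and ^ and $ are added to most patterns here only so that the whole sample has to match.

```typescript
// Tip 1: list both cases of every letter to ignore case.
const anyCaseAbcd = /^[aA][bB][cC][dD]$/;
const tipCase = anyCaseAbcd.test("AbCd"); // true

// Tip 2: {1,9} allows between 1 and 9 repetitions.
// [0-9] is used here so that zeros are accepted; [1-9] would reject "1205".
const oneToNineDigits = /^[0-9]{1,9}$/;
const tipShort = oneToNineDigits.test("1205");      // true
const tipLong = oneToNineDigits.test("1234567890"); // false: 10 digits

// Tip 3: groups around a separator. The letter class is written [a-zA-Z]
// because a comma inside square brackets would match a literal comma.
const lettersAndDigits = /^([a-zA-Z]{5})-([0-9]{5})$/;
const groupedMatch = lettersAndDigits.exec("Alpha-12345");
const letterPart = groupedMatch ? groupedMatch[1] : null; // "Alpha"
const digitPart = groupedMatch ? groupedMatch[2] : null;  // "12345"

// Tip 4: .* on both sides lets the pattern sit in the middle of a word.
const surrounded = /.*[a-z].*/;
const tipMiddle = surrounded.test("123x456"); // true
```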
Examples of regular expressions
Please note that these expressions may generate a larger number of false positives.
If you want to tag all documents containing the word “Invoice4343”, you can simply create a regular expression like this: (Invoice4343). However, if you want to tag all invoices, not only the one with number 4343, you can use the expression:
(Invoice[0-9]{4})
This will tag all invoices from Invoice0000 to Invoice9999.
You can also ignore upper and lower case.
The words “INVOICE4343” and “invoice4343” are both matched by the regular expression: [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}
| Word | Regular expression |
| --- | --- |
| Invoice4343 | Invoice4343 |
| Invoice0000-Invoice9999 | Invoice[0-9]{4} |
| invoice0000-INVOICE9999 | [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4} |
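To compare the three invoice expressions side by side, the sketch below runs them against a few made-up words. The ^ and $ anchors are an addition for this sketch so that each expression has to cover the whole word.

```typescript
const exactInvoice = /^Invoice4343$/;
const anyInvoice = /^Invoice[0-9]{4}$/;
const anyCaseInvoice = /^[Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}$/;

const invoiceSamples = ["Invoice4343", "Invoice0001", "INVOICE9999", "Invoice123"];

const invoiceResults = invoiceSamples.map((word) => ({
  word,
  exact: exactInvoice.test(word),
  anyNumber: anyInvoice.test(word),
  anyCase: anyCaseInvoice.test(word),
}));
// Invoice4343 -> exact: true,  anyNumber: true,  anyCase: true
// Invoice0001 -> exact: false, anyNumber: true,  anyCase: true
// INVOICE9999 -> exact: false, anyNumber: false, anyCase: true
// Invoice123  -> exact: false, anyNumber: false, anyCase: false (only 3 digits)
```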
Specific words
| Word | Forms | Regular expression |
| --- | --- | --- |
| Payroll | Payroll, payrolls, … | .*[Pp]ayroll.* |
| Invoice | invoice, Invoices, … | .*[Ii]nvoice.* |
| Pin | PIN, pin, pIN | [Pp][Ii][Nn] |
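A short sketch shows which forms these expressions accept. The sample words are made up, and the ^ and $ anchors are added here so that each expression is compared with the whole word.

```typescript
const payrollPattern = /^.*[Pp]ayroll.*$/;
const pinPattern = /^[Pp][Ii][Nn]$/;

const payrollPlural = payrollPattern.test("payrolls");    // true
const payrollSuffix = payrollPattern.test("Payroll2024"); // true
const payrollUpper = payrollPattern.test("PAYROLL");      // false: only the first letter accepts both cases

const pinMixed = pinPattern.test("pIN");   // true
const pinPlural = pinPattern.test("pins"); // false: no .* to absorb the trailing s
```

The PAYROLL result illustrates the trade-off: an expression such as [Pp]ayroll covers fewer forms than one that lists both cases for every letter, as the Pin row does.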
Phrases
Regular expressions can also be used to search for phrases that consist of multiple words.
| Phrase | Forms | Regular expression |
| --- | --- | --- |
| European Central Bank | European - Central - Bank, European_Central_Bank, European - Central- Bank, ... | ^European$ ^Central$ ^Bank$ |
| Daniel Brown | daniel.brown, <name> Daniel Brown </name>, Daniel – Brown, … | ^[Dd]aniel$ ^[Bb]rown$ |
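One way to read the phrase expressions is that each space-separated part is compared with one word, and the parts have to match consecutive words. The sketch below follows that reading. It is an assumption based on the examples in this article, not a description of how Safetica is implemented. `phraseMatches` is a hypothetical helper, and the delimiter set (including _, ., <, > and the en dash) is inferred from the Forms column.

```typescript
// Assumed delimiters, inferred from the example forms; the real set may differ.
const PHRASE_DELIMITERS = /[\s\-_,.\/{}\[\]()<>–]+/;

// Returns true when the parts of the expression match consecutive words.
function phraseMatches(text: string, expression: string): boolean {
  const words = text.split(PHRASE_DELIMITERS).filter((w) => w.length > 0);
  const parts = expression.split(" ").map((p) => new RegExp(p));
  for (let start = 0; start + parts.length <= words.length; start++) {
    if (parts.every((part, i) => part.test(words[start + i]))) {
      return true;
    }
  }
  return false;
}

const bankExpression = "^European$ ^Central$ ^Bank$";
const bankSpaced = phraseMatches("European - Central- Bank", bankExpression); // true
const bankJoined = phraseMatches("European_Central_Bank", bankExpression);    // true
const bankOther = phraseMatches("European Investment Bank", bankExpression);  // false

const personExpression = "^[Dd]aniel$ ^[Bb]rown$";
const personTagged = phraseMatches("<name> Daniel Brown </name>", personExpression); // true
const personDotted = phraseMatches("daniel.brown", personExpression);                // true
```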
Regular expressions are a very complex topic. You can learn more about them at specialized websites, for example here.