Regular expressions

Learn more about creating regular expressions to detect sensitive content in Safetica.

Introduction: What regex standard Safetica uses

Safetica uses the ECMAScript standard for regular expressions (regexes). You can validate your regexes in an online regex tester. Remember to select the ECMAScript option in the tester's menu on the left.
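
Because the standard is ECMAScript, any JavaScript or TypeScript runtime can also serve as a quick syntax check. The sketch below is only an illustration: the helper name isValidEcmaScriptRegex is made up for this example, and it checks syntax only, not whether Safetica will detect the content you expect.

```typescript
// Hypothetical helper: returns true when `pattern` compiles as an ECMAScript regex.
function isValidEcmaScriptRegex(pattern: string): boolean {
  try {
    new RegExp(pattern);
    return true;
  } catch {
    // RegExp throws a SyntaxError for malformed patterns.
    return false;
  }
}

console.log(isValidEcmaScriptRegex("Invoice[0-9]{4}")); // well-formed pattern
console.log(isValidEcmaScriptRegex("Invoice[0-9"));     // unterminated character class
```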

Data classifications in Safetica allow you to identify a wide range of sensitive content. Safetica comes with many predefined classification elements (such as credit card numbers, IBANs, and national IDs), which you can select when creating a data classification.

Some types of sensitive content, however, are more difficult to detect (for example, invoice numbers or PINs). For these types, you can create a regular expression that finds the sensitive content within files.

 

What are regular expressions

A regular expression is a sequence of characters that defines a search pattern. This pattern is compared with the individual words in the file being analyzed for sensitive content. Words are delimited by specific characters, such as a space, hyphen (-), comma (,), slash (/), and brackets ({}, [], ()).

You can form regular expressions from the following characters: 

  • letters [a-z], [A-Z]
  • digits [0-9]
  • other characters, such as £ & _ – @ ,
  • meta characters . * + [ ] ( ) { }

Each meta character has a special meaning.

.      A dot matches any single character.
*      An asterisk matches zero or more occurrences of the preceding character or group.
+      A plus sign matches one or more occurrences of the preceding character or group.
[ ]    Square brackets contain a list of characters. For example, [abc] matches any single character that is a, b, or c.
{ }    Curly brackets contain the number of repetitions. For example, (a){3} matches “aaa”.
( )    Round brackets group expressions together, which helps when the pattern includes separators.

How you combine these characters depends on the content you are searching for. Generally, you can get the same results using various regular expressions. 
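
The meta characters above can be tried out in any ECMAScript engine. The following is a minimal sketch, not Safetica's own matching code. It assumes that a pattern has to match a whole word, so each pattern is wrapped in ^ (start of word) and $ (end of word). The sample words are invented for illustration.

```typescript
// Each entry: [pattern, word, whether the pattern should match the whole word].
const metaExamples: Array<[string, string, boolean]> = [
  ["^a.c$", "abc", true],      // . matches any single character
  ["^ab*c$", "ac", true],      // * allows zero occurrences of "b"
  ["^ab+c$", "ac", false],     // + requires at least one "b"
  ["^ab+c$", "abbbc", true],   // + also accepts several "b"s
  ["^[abc]$", "b", true],      // [abc] matches exactly one of a, b, c
  ["^(a){3}$", "aaa", true],   // {3} repeats the group three times
  ["^(ab){2}$", "abab", true], // () groups "ab" so {2} applies to both letters
];

for (const [pattern, word, expected] of metaExamples) {
  const matched = new RegExp(pattern).test(word);
  console.log(`${pattern} against "${word}": ${matched} (expected ${expected})`);
}
```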

 

Tips for creating regular expressions

The following patterns are useful building blocks:

  • [aA][bB][cC][dD][a-zA-Z] – ignores case. List both cases of each letter, or use [a-zA-Z] for any single letter. Do not separate the ranges with a comma, because [a-z,A-Z] also matches a literal comma.
  • [1-9]{1,9} – matches a variable number of digits. This example matches 1 to 9 digits, each in the range 1 to 9 (use [0-9] to include zero).
  • ([a-zA-Z]{5})-([1-9]{5}) – round brackets group expressions together. This can help when the expression includes a separator. This example matches 5 letters and 5 digits separated by a hyphen.
  • .*[a-z].* – matches any number of any characters before and after the pattern. This is helpful when the expression is surrounded by other characters or lies in the middle of a word.
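
The tips above can be checked with a short script. This is a sketch under the assumption that a pattern must match the whole word (hence the added ^ and $ anchors). The sample inputs are invented. It also shows why a comma inside square brackets is treated as a literal character rather than as a separator.

```typescript
// Five letters, a hyphen, five digits from 1 to 9.
const fiveLettersFiveDigits = /^([a-zA-Z]{5})-([1-9]{5})$/;
console.log(fiveLettersFiveDigits.test("ABCDE-12345")); // matches
console.log(fiveLettersFiveDigits.test("ABCDE-12305")); // no match: [1-9] excludes 0

// Between 1 and 9 digits.
const oneToNineDigits = /^[1-9]{1,9}$/;
console.log(oneToNineDigits.test("42"));         // matches
console.log(oneToNineDigits.test("1234567891")); // no match: 10 digits

// A comma inside square brackets is a literal comma.
const lettersAndComma = /^[a-z,A-Z]$/;
const lettersOnly = /^[a-zA-Z]$/;
console.log(lettersAndComma.test(",")); // matches the comma
console.log(lettersOnly.test(","));     // no match

// .* on both sides finds a lowercase letter in the middle of a longer word.
const containsLowercase = /^.*[a-z].*$/;
console.log(containsLowercase.test("123x456")); // matches
```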

 

Examples of regular expressions

Please note that these expressions may generate a large number of false positives.

If you want to tag all documents containing the word “Invoice4343”, you can create a regular expression like this: (Invoice4343). However, if you want to tag all invoices, not only the one with number 4343, you can use the expression:

(Invoice[0-9]{4})
This will tag all invoices from Invoice0000 to Invoice9999.

To match regardless of upper and lower case, list both cases of each letter.

For example, both “INVOICE4343” and “invoice4343” are matched by: [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}

Word                       Regular expression
Invoice4343                Invoice4343
Invoice0000-Invoice9999    Invoice[0-9]{4}
invoice0000-INVOICE9999    [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}
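
The three invoice expressions can be compared side by side. The sketch below assumes whole-word matching (the ^ and $ anchors are added for that purpose), and the sample words are invented for illustration.

```typescript
const exactInvoice = /^Invoice4343$/;
const anyInvoice = /^Invoice[0-9]{4}$/;
const anyCaseInvoice = /^[Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}$/;

const sampleWords = ["Invoice4343", "Invoice0001", "INVOICE4343", "invoice9999", "Invoice123"];

for (const word of sampleWords) {
  console.log(
    word,
    "| exact:", exactInvoice.test(word),
    "| any number:", anyInvoice.test(word),
    "| any case:", anyCaseInvoice.test(word),
  );
}
```

Only the last expression accepts “INVOICE4343” and “invoice9999”, and none of them accepts “Invoice123”, because {4} requires exactly four digits.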

 

Specific words

Word       Forms                   Regular expression
Payroll    Payroll, payrolls, …    .*[Pp]ayroll.*
Invoice    invoice, Invoices, …    .*[Ii]nvoice.*
Pin        PIN, pin, pIN           [Pp][Ii][Nn]
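
To see which word forms these expressions cover, you can test them against a few samples. This sketch again assumes whole-word matching (the ^ and $ anchors are an addition for the example), and the sample words are invented.

```typescript
const payrollPattern = /^.*[Pp]ayroll.*$/;
const pinPattern = /^[Pp][Ii][Nn]$/;

// .* before and after lets the word carry prefixes and suffixes.
console.log(payrollPattern.test("Payroll"));      // matches
console.log(payrollPattern.test("payrolls"));     // matches: trailing "s" is covered by .*
console.log(payrollPattern.test("2024payroll_")); // matches: surrounded by other characters
console.log(payrollPattern.test("PAYROLL"));      // no match: only the first letter is case-insensitive

// Without .* the whole word must be exactly three letters.
console.log(pinPattern.test("pIN"));  // matches
console.log(pinPattern.test("pins")); // no match: extra character
```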

 

Phrases

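Regular expressions can also be used to search for phrases that consist of multiple words. In the examples below, each word of the phrase gets its own expression, where ^ marks the start of the word and $ marks its end.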

Phrase                   Forms                            Regular expression
European Central Bank    European - Central - Bank        ^European$ ^Central$ ^Bank$
                         European_Central_Bank
                         European  -     Central- Bank
                         ...
Daniel Brown             daniel.brown                     ^[Dd]aniel$ ^[Bb]rown$
                         <name> Daniel Brown </name>
                         Daniel – Brown
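
One way to picture how a phrase expression relates to the forms above is to split the text into words and compare consecutive words with the per-word patterns. The sketch below is an illustration only and does not describe Safetica's internal implementation. The helper names splitIntoWords and matchesPhrase are hypothetical, and the delimiter set is an assumption inferred from the forms listed above.

```typescript
// Hypothetical helper: split text on whitespace and assumed delimiter characters.
function splitIntoWords(text: string): string[] {
  return text
    .split(/[\s\-–,\/{}\[\]()<>._]+/)
    .filter((word) => word.length > 0);
}

// Hypothetical helper: true when some run of consecutive words matches
// the per-word patterns in the given order.
function matchesPhrase(text: string, wordPatterns: RegExp[]): boolean {
  const words = splitIntoWords(text);
  for (let start = 0; start + wordPatterns.length <= words.length; start++) {
    const allMatch = wordPatterns.every((pattern, offset) =>
      pattern.test(words[start + offset] ?? ""),
    );
    if (allMatch) return true;
  }
  return false;
}

const europeanCentralBank = [/^European$/, /^Central$/, /^Bank$/];
const danielBrown = [/^[Dd]aniel$/, /^[Bb]rown$/];

console.log(matchesPhrase("European  -     Central- Bank", europeanCentralBank)); // phrase found
console.log(matchesPhrase("European Investment Bank", europeanCentralBank));      // phrase not found
console.log(matchesPhrase("<name> Daniel Brown </name>", danielBrown));           // phrase found
console.log(matchesPhrase("daniel.brown", danielBrown));                          // phrase found
```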

 

Regular expressions are a complex topic. You can learn more about them on specialized websites.