Regular expression examples

Data categories in Safetica DLP offer a wide range of options for identifying sensitive content. Besides context tags and third-party classification, Safetica comes with predefined categories of sensitive content, which can be easily toggled in Safetica Management Console. These categories are based on algorithms that check whether content is really sensitive, thus reducing the number of false positives.

There are types of sensitive content, however, that are more difficult to detect because of the way they are created. For these types, you can set a regular expression that finds the sensitive content within files.

A regular expression is a sequence of characters that defines a search pattern. This pattern is compared with individual words.

You can form regular expressions from the following characters: 

  • letters [a-z],[A-Z]
  • numbers [0-9]
  • characters  £&_–@,
  • meta characters .*+[](){}

Each meta character has a special meaning.

.     -   A dot matches any single character.
*     -   An asterisk matches zero or more occurrences of the preceding character or group.
+     -   A plus sign matches one or more occurrences of the preceding character or group.
[ ]   -   Square brackets contain a list of characters. For example, [abc] matches any single character that is a, b, or c.
{ }   -   Curly brackets contain the number of repetitions. For example, (a){3} matches “aaa”.
( )   -   Round brackets group expressions together, which helps when the expression contains separators.
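
To make the meta characters concrete, here is a minimal sketch that uses Python's `re` module as a stand-in. It is an assumption that Safetica's engine behaves the same way for these basic constructs; the sample patterns and words are invented for illustration. `re.fullmatch` is used because it compares the pattern with one whole word.

```python
import re

# (pattern, word) pairs that are expected to match.
examples = [
    ("c.t", "cat"),       # dot: any single character
    ("ab*", "a"),         # asterisk: zero or more "b"
    ("ab+", "abbb"),      # plus: one or more "b"
    ("[abc]", "b"),       # square brackets: one character from the list
    ("(a){3}", "aaa"),    # curly brackets: exactly three repetitions
    ("(ab)+", "ababab"),  # round brackets: the whole group is repeated
]
results = [re.fullmatch(p, w) is not None for p, w in examples]

# Counter-examples that are expected NOT to match.
non_matches = [
    re.fullmatch("ab+", "a") is not None,      # plus needs at least one "b"
    re.fullmatch("(a){3}", "aa") is not None,  # only two repetitions
]

print(results)
print(non_matches)
```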

How you combine these characters depends on the content you are searching for. Generally, you can get the same results using various regular expressions.

Some tips for creating regular expressions:

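  • [aA][bB][cC][dD][a-zA-Z] – ignores case. Each bracket pair accepts the upper- or lowercase form of one letter, and [a-zA-Z] accepts any letter in either case. Do not put a comma between the ranges: in standard regular expression syntax, a comma inside square brackets is a literal character.
  • [1-9]{1,9} – matches a variable number of digits. The example matches 1 to 9 digits, each between 1 and 9; use [0-9] to include zero.
  • ([a-zA-Z]{5})-([1-9]{5}) – round brackets group expressions together. This can help when the expression includes a separator. The example matches 5 letters and 5 digits separated by a hyphen.
  • .*[a-z].* – matches a lowercase letter preceded and followed by any number of any characters. This is helpful when the expression is surrounded by other characters or lies in the middle of a word.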
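
The tips above can be tried out with a short sketch in Python's `re` module. Python is used here only for illustration, and the helper `matches` and the sample words are hypothetical; Safetica's own engine may differ in details.

```python
import re

def matches(pattern: str, word: str) -> bool:
    """Return True if the pattern matches the whole word."""
    return re.fullmatch(pattern, word) is not None

checks = {
    # Bracket pairs accept either case of each letter.
    "case": matches("[aA][bB][cC][dD]", "aBcD"),
    # {1,9} allows between one and nine digits...
    "one_to_nine_digits": matches("[0-9]{1,9}", "2024"),
    # ...so a ten-digit word is rejected.
    "ten_digits": matches("[0-9]{1,9}", "1234567890"),
    # Groups on both sides of a hyphen separator.
    "grouped": matches("([a-zA-Z]{5})-([0-9]{5})", "ABcde-12345"),
    # .* on both sides lets the letter sit in the middle of a word.
    "wrapped": matches(".*[a-z].*", "123x456"),
    # In Python, a comma inside square brackets is matched literally.
    "comma_is_literal": matches("[a-z,A-Z]", ","),
}

for name, result in checks.items():
    print(name, result)
```

Only the `ten_digits` check is expected to be false. The last check shows why a comma between ranges is best avoided: at least in Python, it adds the comma itself to the list of accepted characters.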

Below are some examples of regular expressions.

Please note that these expressions may generate a larger number of false positives.

If you want to tag all documents containing the word “Invoice4343”, you can simply create the regular expression (Invoice4343). However, if you want to tag all invoices, not only the one with number 4343, you can use the expression:

(Invoice[0-9]{4})
This will tag all invoices (Invoice0000 - Invoice9999).
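
As a quick illustration, the same expression can be tested against sample text with Python's `re` module. This is only a sketch: the sample sentence is invented, and Python stands in for whatever engine Safetica uses internally.

```python
import re

# The expression from the text: "Invoice" followed by exactly four digits.
pattern = re.compile(r"Invoice[0-9]{4}")

text = "Paid Invoice4343 and Invoice0017, but not Invoice12 or invoice9999."
found = pattern.findall(text)

print(found)  # ['Invoice4343', 'Invoice0017']
```

"Invoice12" is skipped because it has too few digits, and "invoice9999" is skipped because the expression is case sensitive.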

You can also ignore the difference between upper and lower case.

The words “INVOICE4343” and “invoice4343” can both be described by the regular expression: [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}
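
The sketch below checks the case-insensitive expression in Python and compares it with Python's own `re.IGNORECASE` flag. The flag is a Python feature shown only for comparison; whether Safetica offers an equivalent option is not stated here, so the explicit bracket form remains the safe choice. The word list is invented for illustration.

```python
import re

# Explicit form from the text: every letter lists both cases.
explicit = re.compile(r"[Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}")
# Python-only shortcut with the same effect (an assumption-free comparison
# within Python, not a statement about Safetica).
flagged = re.compile(r"invoice[0-9]{4}", re.IGNORECASE)

words = ["INVOICE4343", "invoice4343", "InVoIcE0001", "Invoice43"]

explicit_hits = [w for w in words if explicit.fullmatch(w)]
flagged_hits = [w for w in words if flagged.fullmatch(w)]

print(explicit_hits)
print(flagged_hits)
```

Both lists contain the first three words; "Invoice43" is rejected because it has only two digits.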

Word                      | Regular expression
Invoice4343               | Invoice4343
Invoice0000 - Invoice9999 | Invoice[0-9]{4}
invoice0000 - INVOICE9999 | [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}

Specific words

Word    | Forms                | Regular expression
Payroll | Payroll, payrolls, … | .*[Pp]ayroll.*
Invoice | invoice, Invoices, … | .*[Ii]nvoice.*
pin     | PIN, pin, pIN        | [Pp][Ii][Nn]
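
The word forms listed above can be checked against their expressions with a small table-driven sketch in Python. The dictionary name `word_patterns` and the extra probe words are hypothetical and serve only to show where each expression is broad or narrow.

```python
import re

# Expression -> word forms it is expected to match.
word_patterns = {
    ".*[Pp]ayroll.*": ["Payroll", "payrolls"],
    ".*[Ii]nvoice.*": ["invoice", "Invoices"],
    "[Pp][Ii][Nn]": ["PIN", "pin", "pIN"],
}

# Collect every listed form that its expression fails to match.
unmatched = [
    (pattern, form)
    for pattern, forms in word_patterns.items()
    for form in forms
    if re.fullmatch(pattern, form) is None
]
print(unmatched)  # []

# Probes that show the limits of each expression.
all_caps = re.fullmatch(".*[Pp]ayroll.*", "PAYROLL") is not None   # only the first letter ignores case
embedded = re.fullmatch(".*[Pp]ayroll.*", "prepayrollx") is not None  # .* also accepts surrounding text
longer = re.fullmatch("[Pp][Ii][Nn]", "pins") is not None           # no .* so longer words are rejected
print(all_caps, embedded, longer)
```

The probes illustrate the trade-off: surrounding an expression with .* catches more word forms but also raises the chance of false positives.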


Phrases

Regular expressions also make it possible to search for phrases that consist of multiple words.

Phrase                | Forms                         | Regular expression
European Central Bank | European - Central - Bank     | ^European$ ^Central$ ^Bank$
                      | European_Central_Bank         |
                      | European  -     Central- Bank |
                      | ...                           |
Daniel Brown          | daniel.brown                  | ^[Dd]aniel$ ^[Bb]rown$
                      | <name> Daniel Brown </name>   |
                      | Daniel – Brown                |
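
One way to picture how such phrase expressions could work is sketched below in Python. This is an assumed model, not Safetica's documented behavior: the text is split into words at every character that is not a letter or digit, each space-separated part of the expression is compared with one word, and the parts must match consecutive words. The function `phrase_found` is hypothetical.

```python
import re

def phrase_found(phrase_pattern: str, text: str) -> bool:
    """Assumed model: each space-separated sub-pattern must match one
    word in a run of consecutive words."""
    word_patterns = [re.compile(p) for p in phrase_pattern.split()]
    # Assumption: anything that is not a letter or digit separates words.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]
    n = len(word_patterns)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        # ^ and $ in each sub-pattern anchor it to the whole word.
        if all(p.search(w) for p, w in zip(word_patterns, window)):
            return True
    return False

bank = "^European$ ^Central$ ^Bank$"
name = "^[Dd]aniel$ ^[Bb]rown$"

bank_hits = [
    phrase_found(bank, "European - Central - Bank"),
    phrase_found(bank, "European_Central_Bank"),
    phrase_found(bank, "European  -     Central- Bank"),
]
name_hits = [
    phrase_found(name, "daniel.brown"),
    phrase_found(name, "<name> Daniel Brown </name>"),
    phrase_found(name, "Daniel – Brown"),
]
misses = [
    phrase_found(bank, "European Bank"),           # middle word missing
    phrase_found(bank, "Europeans Central Bank"),  # ^...$ rejects the longer word
]
print(bank_hits, name_hits, misses)
```

Under this model all listed forms are found regardless of the separator, while the anchors ^ and $ keep each part from matching inside a longer word.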