Regular expression examples

Learn more about creating regular expressions to detect sensitive content in Safetica.

Data categories in Safetica allow you to identify a wide range of sensitive data. Besides context rules and third-party classification, Safetica also comes with predefined types of sensitive content (e.g. credit card numbers, IBANs, or national IDs), which can be easily selected in the Safetica Management Console.

However, some types of sensitive content are more difficult to detect (e.g. invoice numbers or PINs). For these types, you can create a regular expression that finds the sensitive content within files.

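In this article, you will find:

  • What are regular expressions
  • Tips for creating regular expressions
  • Examples of regular expressions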

What are regular expressions

A regular expression is a sequence of characters that defines a search pattern. This pattern is compared with the individual words in the file that is scanned for sensitive content. Words are delimited by specific characters, such as a space, a hyphen (-), a comma (,), a slash (/), and brackets ({}, [], ()).
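
The word-by-word comparison described above can be sketched in Python. This is only an illustration under stated assumptions: the delimiter set is taken from the examples in the text, the helper names `split_words` and `matching_words` are hypothetical, and Python's `re` module stands in for Safetica's own engine, which may split and match differently.

```python
import re

# Assumed delimiter set: whitespace, hyphen, comma, slash and brackets.
DELIMITERS = r"[\s\-,/{}\[\]()]+"

def split_words(text):
    """Split text into words on the assumed delimiter characters."""
    return [w for w in re.split(DELIMITERS, text) if w]

def matching_words(pattern, text):
    """Return the words that match the pattern in full."""
    compiled = re.compile(pattern)
    return [w for w in split_words(text) if compiled.fullmatch(w)]

sample = "Total due (Invoice4343), see file/Invoice12"
print(split_words(sample))
print(matching_words(r"Invoice[0-9]{4}", sample))
```

In this sketch, only "Invoice4343" is reported, because "Invoice12" does not contain four digits.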

You can form regular expressions from the following characters: 

  • letters [a-z],[A-Z]
  • digits [0-9]
  • other characters, e.g. £ & _ – @ ,
  • meta characters .*+[](){}

Each meta character has a special meaning:

.     -   A dot matches any single character.
*     -   An asterisk matches zero or more repetitions of the preceding character or group.
+     -   A plus sign matches one or more repetitions of the preceding character or group.
[ ]   -   Square brackets contain a list of characters. For example, [abc] matches any single character that is either a, b or c.
{ }   -   Curly brackets contain the number of repetitions. For example, (a){3} matches “aaa”.
( )   -   Round brackets group expressions together, which helps when adding separators.
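
The meta characters listed above can be tried out with a short Python sketch. It assumes that Python's `re` module behaves like the engine used by Safetica for these basic constructs, which is not confirmed here; the `cases` list is purely illustrative.

```python
import re

# Each entry: (pattern, word, whether the whole word is expected to match)
cases = [
    (r"a.c", "abc", True),       # dot: any single character
    (r"ab*", "a", True),         # asterisk: zero or more "b"
    (r"ab+", "a", False),        # plus: at least one "b" is required
    (r"[abc]", "b", True),       # square brackets: one of a, b or c
    (r"(a){3}", "aaa", True),    # curly brackets: exactly three repetitions
    (r"(ab)+", "ababab", True),  # round brackets: the group repeats as a unit
]

results = [re.fullmatch(p, w) is not None for p, w, _ in cases]
for (pattern, word, expected), got in zip(cases, results):
    print(f"{pattern!r} vs {word!r}: {got} (expected {expected})")
```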

How you combine these characters depends on the content you are searching for. Generally, you can get the same results using different regular expressions.

Tips for creating regular expressions

Some tips for creating regular expressions:

  • [aA][bB][cC][dD][a-zA-Z] – ignores case by listing both the lowercase and the uppercase form of each letter.
  • [0-9]{1,9} – matches a variable number of digits. This example matches 1 to 9 digits.
  • ([a-zA-Z]{5})-([0-9]{5}) – round brackets group expressions together. This can help when the expression includes a separator. This example matches 5 letters and 5 digits separated by a hyphen.
  • .*[a-z].* – matches any number of any characters before and after the pattern. This is helpful when the expression is surrounded by other characters or lies in the middle of a word.
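
The tips above can be checked with a small Python sketch. The helper `full` is a hypothetical name, and Python's `re` module is used only as a stand-in for Safetica's engine. Note that in Python a comma inside square brackets is a literal comma, so the sketch writes letter ranges as [a-zA-Z], and it uses [0-9] so that the digit 0 is also accepted.

```python
import re

def full(pattern, word):
    """True when the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# Case-insensitive matching by listing both cases of each letter.
print(full(r"[aA][bB][cC][dD]", "AbCd"))

# Variable repetition: between 1 and 9 digits (10 digits is too long).
print(full(r"[0-9]{1,9}", "2024"), full(r"[0-9]{1,9}", "1234567890"))

# Grouping around a separator: 5 letters, a hyphen, 5 digits.
print(full(r"([a-zA-Z]{5})-([0-9]{5})", "ABcde-01234"))

# Surrounding .* lets the pattern match in the middle of a word.
print(full(r".*[a-z].*", "12a34"), full(r".*[a-z].*", "1234"))

# A comma inside brackets is part of the character list in Python's re.
print(full(r"[a-z,A-Z]", ","))
```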

Examples of regular expressions

Please note that these expressions may generate a large number of false positives.

If you want to tag all documents containing the word “Invoice4343”, you can create a regular expression like this: (Invoice4343). However, if you want to tag all invoices, not only the one with number 4343, you can use the expression:

(Invoice[0-9]{4})
This will tag all invoices with a four-digit number (Invoice0000-Invoice9999).

You can also ignore upper and lower case. The words “INVOICE4343” and “invoice4343” can both be described by the regular expression: [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}

| Word | Regular expression |
|---|---|
| Invoice4343 | Invoice4343 |
| Invoice0000-Invoice9999 | Invoice[0-9]{4} |
| invoice0000-INVOICE9999 | [Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4} |
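
The three invoice expressions can be compared side by side with the following Python sketch. The word list and the helper `hits` are illustrative assumptions, and whole-word matching via `re.fullmatch` is assumed to mirror how the expression is compared with each word.

```python
import re

exact = re.compile(r"Invoice4343")
any_number = re.compile(r"Invoice[0-9]{4}")
any_case = re.compile(r"[Ii][Nn][Vv][Oo][Ii][Cc][Ee][0-9]{4}")

# Hypothetical words as they might appear in a scanned file.
words = ["Invoice4343", "Invoice0001", "INVOICE9999", "invoice12", "Invoice12345"]

def hits(compiled):
    """Return the words that match the compiled pattern in full."""
    return [w for w in words if compiled.fullmatch(w)]

print(hits(exact))
print(hits(any_number))
print(hits(any_case))
```

Words with fewer or more than four digits ("invoice12", "Invoice12345") are not matched by any of the three expressions in this sketch.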

 

Specific words

| Word | Forms | Regular expression |
|---|---|---|
| Payroll | Payroll, payrolls, … | .*[Pp]ayroll.* |
| Invoice | invoice, Invoices, … | .*[Ii]nvoice.* |
| Pin | PIN, pin, pIN | [Pp][Ii][Nn] |
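
As a rough check of the expressions in this table, the Python sketch below runs them against a few word forms. The word list is made up for illustration, and Python's `re` module is assumed to behave like Safetica's engine for these patterns.

```python
import re

patterns = {
    "Payroll": r".*[Pp]ayroll.*",
    "Invoice": r".*[Ii]nvoice.*",
    "Pin": r"[Pp][Ii][Nn]",
}

# Hypothetical word forms found in a document.
words = ["payrolls", "Invoices", "pIN", "pins", "PAYROLL"]

matched = {
    name: [w for w in words if re.fullmatch(pattern, w)]
    for name, pattern in patterns.items()
}
for name, found in matched.items():
    print(name, found)
```

In this sketch, "PAYROLL" is not matched because only the first letter of the pattern accepts both cases, and "pins" is not matched because the Pin expression has no surrounding .* part.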

 

Phrases

Regular expressions can also be used to search for phrases that consist of multiple words.

| Phrase | Forms | Regular expression |
|---|---|---|
| European Central Bank | European - Central - Bank | ^European$ ^Central$ ^Bank$ |
| | European_Central_Bank | |
| | European  -     Central- Bank | |
| | ... | |
| Daniel Brown | daniel.brown | ^[Dd]aniel$ ^[Bb]rown$ |
| | <name> Daniel Brown </name> | |
| | Daniel – Brown | |
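
One possible reading of these phrase expressions is that each space-separated part must match one of several consecutive words. The Python sketch below models that reading. It is an assumption rather than a description of Safetica's actual matching: the delimiter set is inferred from the forms in the table, and the helpers `split_words` and `contains_phrase` are hypothetical.

```python
import re

# Assumed delimiters: whitespace plus separators seen in the forms above.
DELIMITERS = r"[\s\-–_.,/<>{}\[\]()]+"

def split_words(text):
    """Split text into words on the assumed delimiter characters."""
    return [w for w in re.split(DELIMITERS, text) if w]

def contains_phrase(expression, text):
    """Check whether consecutive words match the space-separated patterns."""
    patterns = [re.compile(p) for p in expression.split()]
    words = split_words(text)
    for start in range(len(words) - len(patterns) + 1):
        window = words[start:start + len(patterns)]
        if all(p.search(w) for p, w in zip(patterns, window)):
            return True
    return False

bank = "^European$ ^Central$ ^Bank$"
name = "^[Dd]aniel$ ^[Bb]rown$"
print(contains_phrase(bank, "European  -     Central- Bank"))
print(contains_phrase(name, "<name> Daniel Brown </name>"))
print(contains_phrase(name, "Daniel and Brown"))
```

Under these assumptions, the first two calls report a match, while "Daniel and Brown" does not match because the two words are not adjacent.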

 

Regular expressions are a complex topic. You can learn more about them on specialized websites.