How to create a data classification

Learn how to create data classifications, what classification rules and elements are, and which types of elements you can combine to protect your sensitive data.

Data classifications allow you to define what data is considered sensitive in your company and help you classify files into different groups based on who can work with them, where, and how. Data classifications in Safetica combine various classification technologies. You don’t have to decide which technology to use; you simply create rules with a combination of various elements. Then you use the data classification in policies to detect and protect files that contain sensitive information.

What are rules and elements

Rules and elements are two layers that define the scope of a data classification.

Example: A GDPR data classification consists of several rules related to personal data from various EU countries. Each rule consists of several elements that specify that rule. Usually, these elements combine regexes and keywords.

Data classification: GDPR

Rule: Austrian passport numbers
  • Element (regex): [A-Za-z] ?\d{7}
  • Element (keywords): österreichisch reisepass, reisepass

Rule: French driver’s license numbers
  • Element (regex): \d{12}
  • Element (keywords): permis de conduire, drivers license

Rule: Czech birth numbers
  • Element (regex): \b[0-9]{2}(?:[0257][1-9]|[1368][0-2])(?:0[1-9]|[12][0-9]|3[01])/?[0-9]{3,4}\b
  • Element (keywords): Rodné číslo, identifikační číslo, osobní identifikační číslo, czech republic id

Rule: German social security numbers
  • Element (regex): [0-9]{2} ?[0-3][0-9][0-1][0-9][0-9]{2} ?[A-Z][0-9]{3}
  • Element (keywords): ausweis, identifizierungsnummer, personalausweis, sozialversicherungsausweis, sozialversicherungsnummer, versicherungsnummer

During policy evaluation, a classification is applied if at least one of its rules is matched (OR relationship between rules). A rule is matched only if all of its elements are matched (AND relationship between elements).

Example: A document contains Czech birth numbers and the term “rodné číslo”. Our previously defined GDPR data classification will apply to this document because one of its rules (Czech birth numbers) is matched. The rule is valid because both of its elements were matched (the Czech birth numbers regex and the keyword “rodné číslo”).
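
The OR/AND logic described above can be sketched in a few lines of Python. This is only an illustration, not Safetica's implementation: the Rule class and the classification_applies function are hypothetical names, and the sketch assumes that a keyword element is satisfied when any one of its keywords is found (which is consistent with the example above). The regexes and keywords are taken from the GDPR example.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    regex: str            # regex element
    keywords: list[str]   # keyword element

    def matches(self, text: str) -> bool:
        # AND between elements: the regex and the keyword element must both match.
        regex_hit = re.search(self.regex, text) is not None
        # Assumption: any one keyword from the list satisfies the keyword element.
        folded = text.casefold()
        keyword_hit = any(k.casefold() in folded for k in self.keywords)
        return regex_hit and keyword_hit


def classification_applies(rules: list[Rule], text: str) -> bool:
    # OR between rules: one matched rule is enough.
    return any(rule.matches(text) for rule in rules)


GDPR = [
    Rule("Austrian passport numbers", r"[A-Za-z] ?\d{7}",
         ["österreichisch reisepass", "reisepass"]),
    Rule("Czech birth numbers",
         r"\b[0-9]{2}(?:[0257][1-9]|[1368][0-2])(?:0[1-9]|[12][0-9]|3[01])/?[0-9]{3,4}\b",
         ["rodné číslo", "identifikační číslo",
          "osobní identifikační číslo", "czech republic id"]),
]

print(classification_applies(GDPR, "Rodné číslo: 850615/1234"))  # True
print(classification_applies(GDPR, "Order 850615/1234"))         # False
```

The second document contains a string that looks like a birth number, but no keyword, so no rule has all of its elements matched and the classification does not apply.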

 

How to create a data classification

  1. Open the Safetica console.
  2. Go to the Data classification section and click Add classification.
  3. Name the data classification and add its description.
  4. Now create a rule. In the Rules section, click Add rule.
  5. Click Add element.
  6. Set up all the elements you want to have in the rule. You can read more about elements below.
  7. Name the rule and click Save.
  8. You can add more rules or save the data classification.

 

What elements can you use in your data classification

Elements are divided into 4 sections, and you can combine them as needed:

1. Elements related to content analysis

You can specify what file content should be considered sensitive. Safetica uses three methods for sensitive content analysis: predefined algorithms (built-in healthcare, financial, and personal details algorithms), keywords, and regexes. For example, you can define that a file is sensitive if it contains certain keywords, matches our predefined algorithms, or matches your own custom regular expressions.

Dictionaries: You can copy and paste up to 10,000 keywords into one rule to serve as a custom dictionary. The keywords must be separated by line breaks.
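
Before pasting a long keyword list, it can help to check it against the two constraints above (one keyword per line, at most 10,000 keywords). The following is a minimal sketch of such a check. The parse_dictionary function is a hypothetical helper, not part of Safetica, and skipping blank lines is an assumption of this sketch.

```python
MAX_KEYWORDS = 10_000  # limit per rule, as stated in this article


def parse_dictionary(raw: str) -> list[str]:
    """Split pasted text into keywords, one per line."""
    # Assumption: surrounding whitespace and blank lines are ignored.
    keywords = [line.strip() for line in raw.splitlines() if line.strip()]
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError(
            f"{len(keywords)} keywords exceed the limit of {MAX_KEYWORDS} per rule"
        )
    return keywords


print(parse_dictionary("invoice\n\nconfidential\r\ncredit card\n"))
# ['invoice', 'confidential', 'credit card']
```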

You can set the detection trigger for the whole sensitive data section.

Click the Add element button to add more elements to the rule.

What is the detection trigger and how does it work

You can narrow down the number of classified documents and avoid false positives by enabling the detection trigger. The detection trigger defines the minimum number of occurrences of sensitive data that must be found in a file for the data classification to apply. It only works for elements related to content analysis.

The detection trigger works on the rule level, so it applies to all combinations of keywords, regexes, predefined algorithms, and dictionaries that you choose for the rule.

Example: A company works with documents that contain a birth number on a daily basis. It is unusual, however, for one document to contain several birth numbers. The company sets the detection trigger to 5 on the “Czech birth numbers” rule. This way, only files with at least 5 combinations of the birth number and the “rodné číslo” keyword will be classified with the GDPR data classification. Files with fewer than 5 matches will not be classified.

 

Duplicate occurrences are counted as one match only.

This means that if you have one keyword (e.g. the word "confidential") in a document multiple times, it is counted as one match.

Example 1: You create a rule with the predefined algorithm for credit card numbers and set the threshold to 5.

Files with 5 or more unique credit card numbers will hit the threshold and will be considered sensitive. Files with 5 or more identical credit card numbers will not be considered sensitive, because duplicate occurrences are counted as one match only.

Example 2: You create a rule with 10 keywords (e.g. “invoice”, “confidential”, “credit card”, …) and set the threshold to 5.

To hit the threshold and be considered sensitive, a document must contain 5 or more different keywords. Files that contain 5 identical keywords will not be considered sensitive, because duplicate occurrences are counted as one match only.

Example 3: You create a rule combining the predefined algorithm for credit card numbers and a keyword (e.g. “confidential”) and set the threshold to 5.

Documents that contain the word “confidential” together with five different credit card numbers will hit the threshold and be considered sensitive.
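
The rule that duplicate occurrences count as one match can be illustrated with a short Python sketch covering Examples 1 and 2. All names here are hypothetical, and CARD_PATTERN is a deliberately simplified stand-in (16 consecutive digits), not Safetica's predefined credit card algorithm. How matches from different element types are combined (Example 3) is not modeled.

```python
import re

# Simplified stand-in for a credit card detector (assumption of this sketch).
CARD_PATTERN = re.compile(r"\b\d{16}\b")


def unique_regex_matches(pattern: re.Pattern, text: str) -> set[str]:
    # A set keeps each distinct value once, so duplicates count as one match.
    return set(pattern.findall(text))


def unique_keyword_matches(keywords: list[str], text: str) -> set[str]:
    # Each keyword counts once, no matter how often it appears in the text.
    folded = text.casefold()
    return {k for k in keywords if k.casefold() in folded}


def meets_trigger(matches: set[str], threshold: int) -> bool:
    return len(matches) >= threshold


same_card = " ".join(["4111111111111111"] * 5)
different_cards = " ".join(str(4111111111111111 + i) for i in range(5))

print(meets_trigger(unique_regex_matches(CARD_PATTERN, same_card), 5))        # False
print(meets_trigger(unique_regex_matches(CARD_PATTERN, different_cards), 5))  # True
```

Five copies of the same number produce a single unique match and stay below the threshold, while five different numbers reach it.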

 

2. Elements related to where the file was transferred from

You can specify where the file came from. Safetica can classify files that were transferred from:

  • an app category - Choose one or more categories from the drop-down list. Any file that is exported from the selected application category will be classified. The App category element is not supported on macOS devices yet.
  • a file path - Enter one or more local or network file paths. All files that are currently stored in these paths will be classified. Also, if a protected user creates or copies a file to the selected file paths at any time while the classification is in place, that file will be classified too. Files carry the classification even when they later move to different locations.
  • a website - Enter one or more web addresses; any file downloaded from the website will be classified.

For example, you can define that a file is sensitive if it comes from a CRM system OR accounting software AND is also stored in a particular location.
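
The “CRM system OR accounting software AND particular location” combination can be written out as a small Python sketch. It only illustrates the boolean logic: the category names, the folder, and the is_sensitive function are hypothetical, and Safetica evaluates these elements internally.

```python
from pathlib import PureWindowsPath

SENSITIVE_APP_CATEGORIES = {"CRM", "Accounting"}   # hypothetical category names
PROTECTED_FOLDER = PureWindowsPath(r"C:\Finance")  # hypothetical file path


def is_sensitive(source_app_category: str, file_path: str) -> bool:
    # OR inside one element: any of the selected app categories is enough.
    from_listed_app = source_app_category in SENSITIVE_APP_CATEGORIES
    # The file must also be stored under the selected path (Python 3.9+).
    in_protected_folder = PureWindowsPath(file_path).is_relative_to(PROTECTED_FOLDER)
    # AND between elements: both conditions must hold.
    return from_listed_app and in_protected_folder


print(is_sensitive("CRM", r"C:\Finance\Q1\report.xlsx"))      # True
print(is_sensitive("Accounting", r"C:\Temp\report.xlsx"))     # False
print(is_sensitive("Browser", r"C:\Finance\Q1\report.xlsx"))  # False
```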

3. Elements related to file properties

You can specify the file types to which the rule should apply. Classifying files by their type allows you to add a more granular level to data classification. It is commonly used in combination with other criteria to narrow down restrictions (or exceptions) to selected file types only.

In the file type element, you can either choose a single file extension or add a whole category of extensions.

4. Elements related to already existing data classification

If you already use a third-party data classification tool, you can define that files classified by this tool are sensitive. Safetica is compatible with third-party classification tools that store classification information (a tag) in file properties. We have specifically confirmed compatibility with the following classification tools:

  • Microsoft Azure Information Protection
  • Boldon James
  • Tukan GREENmod

What are classification identifiers

  • Classification identifier specifies the classification type in general and is required.
  • Tag identifier specifies a more specific parameter of the classification and is optional.
  • The Identifiers are regexes checkbox is an optional setting that makes Safetica evaluate the identifiers as regular expressions instead of specific strings.

You can specify both the Classification identifier and the Tag identifier or leave one of them empty for a more general search.

You can obtain the identifiers either from sample classified documents or from Azure Information Protection.

Examples of various identifiers:

  • Microsoft MIP sensitivity labels
      Classification identifier: 0034f115-2835-4348-b421-de66a63e347f
      Tag identifier: (empty)
  • Boldon James
      Classification identifier: DLPTRIGGER
      Tag identifier: [*{Internal}*]
  • Tukan GREENmod
      Classification identifier: TukanITGREENmodCATEGORY
      Tag identifier: RESTRICTED

For MIP sensitivity labels, the Tag identifier may be left empty.
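
To make the roles of the two identifiers more concrete, here is a minimal Python sketch of how they could be compared with a file's custom properties. It is an illustration under stated assumptions, not Safetica's matching logic: the has_classification function is hypothetical, the Classification identifier is assumed to be compared with the property name and the Tag identifier with the property value, plain identifiers are treated as substrings, and an empty identifier is treated as “no constraint”. The sample property names are illustrative.

```python
import re


def has_classification(properties: dict[str, str],
                       classification_id: str,
                       tag_id: str = "",
                       identifiers_are_regexes: bool = False) -> bool:
    if not classification_id and not tag_id:
        raise ValueError("at least one identifier must be set")

    def hit(identifier: str, value: str) -> bool:
        if not identifier:
            return True  # empty identifier: more general search
        if identifiers_are_regexes:
            return re.search(identifier, value) is not None
        return identifier in value  # plain string, matched as a substring

    # One property must satisfy both identifiers (name and value).
    return any(hit(classification_id, name) and hit(tag_id, value)
               for name, value in properties.items())


mip_file = {"MSIP_Label_0034f115-2835-4348-b421-de66a63e347f_Enabled": "true"}
tukan_file = {"TukanITGREENmodCATEGORY": "RESTRICTED"}

print(has_classification(mip_file, "0034f115-2835-4348-b421-de66a63e347f"))   # True
print(has_classification(tukan_file, "TukanITGREENmodCATEGORY", "RESTRICTED"))  # True
print(has_classification(tukan_file, "TukanITGREENmodCATEGORY", "PUBLIC"))      # False
```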

 

 

How to obtain classification identifier from sample documents

If you have access to sample classified documents from your company, you can use them to find the classification identifiers:

  1. Open the document(s) in the Office suite.
  2. Click File > Info > Properties > Advanced properties.
  3. Under the Custom tab, you will find various properties.
  4. Identify the one that is common for your classified files. Then simply copy its name (or a part of it) into the Classification identifier field. Optionally, copy its “value” into the Tag identifier field.
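
If you need to inspect many sample files, the custom properties can also be listed with a short script instead of opening each file. The sketch below assumes the samples are Office Open XML files (.docx, .xlsx, .pptx), which are ZIP archives that store custom properties in the docProps/custom.xml part. The read_custom_properties function is a hypothetical helper and is not part of Safetica.

```python
import xml.etree.ElementTree as ET
import zipfile

CUSTOM_PROPS_PART = "docProps/custom.xml"


def read_custom_properties(source) -> dict[str, str]:
    """Return {property name: value} for an Office Open XML file.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        if CUSTOM_PROPS_PART not in archive.namelist():
            return {}  # the file has no custom properties
        root = ET.fromstring(archive.read(CUSTOM_PROPS_PART))

    properties = {}
    for prop in root:
        name = prop.get("name")
        if name is not None:
            # The child element (for example vt:lpwstr) holds the value text.
            properties[name] = "".join(prop.itertext()).strip()
    return properties
```

For example, calling read_custom_properties("sample.docx") (a hypothetical file name) returns a dictionary whose keys are candidates for the Classification identifier field and whose values are candidates for the Tag identifier field.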

 

 

How to obtain classification identifier from Azure Information Protection

If you’re using Azure Information Protection classification, you can easily obtain the required information from the Azure AD admin center.

  1. In the admin center, go to the Azure Information Protection section.
  2. There, in the Policies section, you will find all existing policies.
  3. Choose the policy that you need Safetica to detect and open it. At the very bottom of the configuration window, you will find your Label ID.
  4. For Safetica to register your files labeled with Microsoft MIP sensitivity labels, enter one of the following into the Classification identifier field:

  • MSIP_Label_ - if you enter this string, Safetica will detect all files classified by MIP, regardless of label ID.
  • The specific label ID - e.g. cf8068d2-8761-4163-baee-5442b203479c - Safetica will detect only files classified by MIP with that label (“Confidential” in this example).

For MIP sensitivity labels, the tag identifier may be left empty.

 

Read next:

Data classification in Safetica

Data classification: What is Safetica unified classification

Policies: How they work in Safetica