What’s Inside the Data Loss Prevention System?

Data Loss Prevention (DLP) solutions used to serve mostly as protection against data breaches. Today, the situation has changed. DLP tools are growing not only in breadth but also in depth, as their creators focus on improving data interception and analysis. The information that DLP solutions collect has become an important input for business decisions, and InfoSec tools like DLP now act as additional services for many business units, from accounting to HR.

Scope of DLP solutions

True, an ounce of prevention is worth a pound of cure, and DLP is, first and foremost, designed to prevent. Can data loss prevention work without any analysis? In theory, yes. In practice, the resulting restrictions would be excessive: a big business cannot survive under an absolute prohibition policy. Analysis lets a DLP system select the specific entities and processes to restrict, which is why the selective approach to blocking dominates in DLP.

A DLP system constantly monitors and intercepts different types of content, then marks and arranges it. Templates and labels turn the bulk of information you hold into a searchable system. Otherwise, every search request would have to process all the intercepted data, which might take too long and fail to return appropriate results.

Let us say you are going to search for a credit card number in your DLP dump. A credit card number typically consists of 16 digits. However, because the number can be written with varying formatting, full-text queries are likely to miss some or all of the matches. If you label the different formatting options with a “credit card” tag and apply a standard form, the search will succeed, because it processes credit card data only. The standard form strips any formatting and stores the number as plain text. Assigned the “credit card” tag, the captured number is then listed in your database.
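The tagging and standard-form steps can be illustrated with a minimal Python sketch. It is not how any particular DLP product works: the pattern, the function name tag_credit_cards, and the choice of spaces and hyphens as the only separators are assumptions made for this example.

```python
import re

# Assumed pattern: 16 digits in four groups, optionally separated
# by a single space or hyphen. Real systems handle more variants.
CARD_PATTERN = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")


def tag_credit_cards(text):
    """Return (tag, standard_form) pairs for each card-like number in text."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        # Standard form: strip every non-digit so all variants compare equal.
        standard_form = re.sub(r"\D", "", match.group())
        hits.append(("credit card", standard_form))
    return hits


if __name__ == "__main__":
    sample = "Card 4111 1111 1111 1111 was also typed as 4111-1111-1111-1111."
    print(tag_credit_cards(sample))
```

Because both spellings reduce to the same standard form, a later search only has to compare digit strings within the “credit card” tag instead of scanning all intercepted text.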

A DLP system also reviews event chains. This paves the way for User Behavior Analytics (UBA) tools, which explore the events spawned by users in order to evaluate their behavior. Appropriate classification of events enables early detection of both non-compliance and exposure of devices to malware.

For instance, you can estimate how likely a staff member is to quit by building event chains. Such a chain may include an employee sending a resume by email, visiting an employment website, or contacting potential employers.
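A rough Python sketch of such an event chain is shown here. The event labels, the QUIT_CHAIN set, and the min_hits threshold are hypothetical names chosen for illustration; a real UBA tool would work with its own event taxonomy and scoring.

```python
from collections import defaultdict

# Hypothetical labels for the "likely to quit" chain described in the text.
QUIT_CHAIN = {"resume_sent", "job_site_visit", "employer_contact"}


def users_matching_chain(events, chain=QUIT_CHAIN, min_hits=2):
    """Return users who triggered at least min_hits distinct chain events.

    events is an iterable of (user, event_label) pairs.
    """
    seen = defaultdict(set)
    for user, label in events:
        if label in chain:
            seen[user].add(label)
    return sorted(user for user, labels in seen.items() if len(labels) >= min_hits)


if __name__ == "__main__":
    log = [
        ("alice", "resume_sent"),
        ("bob", "job_site_visit"),
        ("alice", "job_site_visit"),
        ("alice", "employer_contact"),
        ("bob", "file_upload"),
    ]
    print(users_matching_chain(log))
```

Counting distinct events rather than raw occurrences keeps one repeated action, such as many visits to the same job site, from triggering the chain on its own.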

Data formats to deal with

Data comes in many representations. Archives save a huge amount of storage space. Office files combine complex markup, pictures, text units, and other auxiliary items.

Fast handling of information requires data to be instantly available for processing, and preventing serious damage requires ever quicker action. For that purpose, DLP relies on format-specific data retrievers. These retrievers derive primitives from any data format your business might use, such as databases, images, and text files.

Needless to say, data laid out as plain text works best for any kind of analysis. Optical Character Recognition (OCR) is widely used in DLP to transform image files into text. Up-to-date machine vision systems process images quickly, providing lots of relevant and searchable data.

Vector graphics have lately become a data primitive of their own, as they became available for examination in a structured format.

The odds are that upcoming IT developments will enable us to retrieve comprehensive textual details from all data types.

Three ways to analyze DLP data

1. Semantic

This method typically uses a classifier. When there is no exact sample to search against, the semantic search detects classes of information across the data to be analyzed.

2. Formal

This approach seeks to establish data patterns and forms rather than semantics. Regular expressions are a common implementation of this method.

3. Sample-driven

As its name suggests, this technique sets a sample to be found. It uses one or more such samples to detect the targets across the searchable data primitives.

Assigning to a class

Where your data has distinct values, it can be assigned to a certain category or class of information based on those values. Until recently, images were not subject to this assignment. Progress in IT and growing computing capacity have made it possible to assign classes to images, too.

DLP adopts new methods only if they substantially improve both output quality and processing time. Data processing cannot wait where security is at stake, and a late response might be to no avail. The number of events a data leak prevention system deals with usually exceeds a million a day. Present-day security principles do not allow any delays, as the anticipated damages are huge.

A labeled training set powers data classification. The DLP system attributes each tracked file to one or more of its established categories. File folders on your computer are an example of such a system. The classifier gets trained as follows: first, the files in the collection undergo a kind of sampling that selects their distinct traits. For example, in images, it searches for distinctive points; in documents, it looks for keywords and terminology. The training is then based on the traits established. A trained classifier is ready to process the data stream.

Businesses in the same industry tend to differ in the lexicons they use, even when they describe the same subject matter. They also use different data formats and types. This implies that companies cannot share the same classifier: DLP system operators must train a classifier for each company individually. As classes, distinct traits, and data types may change, your classifier should also be re-trained in the future to incorporate all the updates.

When it comes to text formats, many machine learning techniques are available, such as logistic regression and classification based on cosine similarity.
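To make the training and classification steps more concrete, here is a minimal Python sketch of a text classifier built on cosine similarity. It is only an illustration under simplifying assumptions: words are split on whitespace, each class is represented by its summed word counts, and the function names (vectorize, cosine, train, classify) and the sample labels are invented for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labeled_docs):
    """Build one summed word-count vector per class from (label, text) pairs."""
    class_vectors = {}
    for label, text in labeled_docs:
        class_vectors.setdefault(label, Counter()).update(vectorize(text))
    return class_vectors


def classify(text, class_vectors):
    """Return the class whose vector is most similar to the text."""
    vec = vectorize(text)
    return max(class_vectors, key=lambda label: cosine(vec, class_vectors[label]))


if __name__ == "__main__":
    model = train([
        ("finance", "invoice payment bank transfer"),
        ("hr", "resume vacancy interview salary"),
    ])
    print(classify("payment invoice overdue", model))
```

Retraining for another company then amounts to calling train again with that company's own labeled documents.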

“In the beginning, there was the Word.” DLP uses words as distinct traits. For each word (lexeme), a language has a set of inflected word forms, while the base form (lemma) stays the same. Classifiers do not search for individual word forms. They work with words that have all been brought to their normal form. Morphological dictionaries therefore contribute most to the classification of textual data; without them, the classifier can only process specific word forms. Another way to improve system performance is to detect and correct misspelled words.
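The following Python sketch shows the idea of bringing word forms to a normal form and tolerating small misspellings. The tiny NORMAL_FORMS table stands in for a real morphological dictionary and is purely hypothetical; the spelling tolerance uses difflib from the standard library, which is an assumption of this example rather than what DLP products necessarily use.

```python
import difflib

# Hypothetical miniature morphological dictionary: word form -> normal form.
NORMAL_FORMS = {
    "contract": "contract",
    "contracts": "contract",
    "contracted": "contract",
    "pay": "pay",
    "pays": "pay",
    "paid": "pay",
    "paying": "pay",
}


def normalize(word, cutoff=0.8):
    """Map a word form to its normal form, tolerating small misspellings."""
    word = word.lower()
    if word in NORMAL_FORMS:
        return NORMAL_FORMS[word]
    # Fall back to the closest known word form, if one is similar enough.
    close = difflib.get_close_matches(word, list(NORMAL_FORMS), n=1, cutoff=cutoff)
    return NORMAL_FORMS[close[0]] if close else word


if __name__ == "__main__":
    print([normalize(w) for w in "Paid contarcts yesterday".split()])
```

Words that are neither in the dictionary nor close to a known form pass through unchanged, so the classifier still sees them as specific word forms.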

Fuzzy matching

Fuzzy matching (also known as copyright analysis) is used to look for parts of your reference sample in the data to be analyzed. Fuzzy matching splits into techniques specific to the data type it deals with. However, each such technique implements similar workflows. DLP uses the samples set as references to find matches among the data items it captures. While each fuzzy match method targets one data type only, the DLP system can handle a great number of reference samples. You can set a million files as references for fuzzy matching.

Let us take a look at the most common fuzzy matching methods.

1. If you set a text file as a reference and work exclusively with primitives, you are doing a classical copyright analysis. The DLP algorithm calculates what proportion of each tracked item matches fragments of one or more reference samples, which shows the relevance of intercepted documents. It also highlights the matches in the graphical interface.

2. Binary data is also available for classic fuzzy matching. For binary data, there is no exact text comparison; the method determines only the relevance.

3. Raster graphics are eligible for fuzzy matching too. In this case, the performance critically depends on setting a feasible speed/quality ratio.

4. Fuzzy matching also processes vector graphics. It picks up the primitives and compares their in-image position against the samples set as references. You can configure most DLP systems to retrieve parts of vector images.

5. Dedicated fuzzy matching comes into play where you deal with a specific issue that occurs often enough. Various forms and surveys are an ever-growing business asset. For instance, you may want to be notified when a document is a questionnaire. You can set a blank template as a reference sample to detect its fuzzy matches among the tracked files. The DLP system can also retrieve answers from the analyzed questionnaires.

6. Another popular implementation of fuzzy matching analyzes graphical data where seals and stamps are set as reference samples.

7. With fuzzy matching, you can even find a picture that is a part of another picture. You can detect credit cards not only by their 16 digits but also by a payment system logo.
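For the text case in the list above, the relevance calculation can be sketched in Python with word shingles (overlapping word sequences). Shingling is a technique assumed here for illustration; the article does not say which algorithm DLP products use, and the names shingles and relevance are invented for this example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def relevance(intercepted, reference, size=3):
    """Share of the intercepted text's shingles that also occur in the reference."""
    doc = shingles(intercepted, size)
    if not doc:
        return 0.0
    return len(doc & shingles(reference, size)) / len(doc)


if __name__ == "__main__":
    reference = "the quarterly revenue forecast is strictly confidential until release"
    intercepted = "fyi the quarterly revenue forecast is strictly confidential"
    print(round(relevance(intercepted, reference), 2))
```

A score close to 1 means most of the intercepted text was copied from the reference, while a score of 0 means no three-word fragment is shared. The matching shingles themselves are what an interface could highlight.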


Data loss prevention systems have become an indispensable part of business IT infrastructure. However, to get the most from a DLP tool, every customer should do their best to adjust the system to their specific needs. Provider engagement in this fine-tuning is critical.

Demand for data loss prevention is growing and, what is even more important, changing. This presents new challenges as new types of data, events, and communication channels require enhanced security. As ever more people work remotely, the demand for on-premises and cloud DLP is growing dramatically.

The DLP market has evolved greatly in terms of both the systems’ performance and their analytical capabilities. Features of the products available on the market include, but are not limited to, tracking and reviewing staff liaisons with third parties, visualizing such relations, detecting odd employee behaviors, determining informal corporate links, and responding to challenges and emergencies in advance.

DLP solutions have been developing since the early 2000s, and their market offers a wide variety of products. At the same time, rumor has it that the game is over because there is no room for further growth. Do not fall for it: data loss prevention is not limited to cybersecurity. Corporate and private users leverage its functionality to address a variety of new business issues.

David Balaban