Guide · 10 Feb 2025 · 5 min read

Trainable classifiers vs Sensitive Information Types - and why you should use both

Information Protection · Data Loss Prevention · Data Lifecycle Management

SITs match patterns. Classifiers match content types. On their own they are useful. Together they dramatically reduce false positives and strengthen your auto-labelling, DLP, and retention policies.

What Sensitive Information Types do

Sensitive Information Types match patterns in content. A credit card number, a National Insurance number, a passport number - these all follow predictable formats. SITs use a combination of regular expressions, keyword proximity, checksums, and confidence levels to find these patterns.
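As an intuition for how pattern-plus-checksum detection works, here is a simplified sketch of a credit-card-style detector. Purview's actual SIT engine is internal to the service; this just illustrates why a checksum (here, the standard Luhn check) filters out most random 16-digit strings that a regex alone would match.

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: the check that valid card numbers must pass."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# The pattern alone: any standalone run of 16 digits.
CARD_PATTERN = re.compile(r"\b\d{16}\b")

def find_card_like_numbers(text: str) -> list[str]:
    """Return only the 16-digit strings that also pass the Luhn check."""
    return [m for m in CARD_PATTERN.findall(text) if luhn_valid(m)]
```

A random reference number such as `1234567890123456` matches the regex but fails the checksum, while a genuine test card number like `4111111111111111` passes both, which is exactly the layering the real SITs use to raise confidence.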

The built-in SITs cover hundreds of data types across dozens of countries. You can also create custom SITs for your own data formats - employee IDs, project codes, internal reference numbers.

Strengths: Very precise for structured data. A credit card number either matches the pattern or it does not. Low false negatives - if the data is there, the SIT will find it.

Weakness: False positives. A 16-digit number in a spreadsheet might match a credit card pattern but actually be a product code. A string that looks like a National Insurance number might be a random reference. SITs match the pattern without understanding what the document is about.

What trainable classifiers do

Trainable classifiers match content types, not patterns. Instead of looking for specific strings, they analyse the overall structure and language of a document to determine what kind of content it is.

Microsoft provides pre-trained classifiers for common content types: financial statements, resumes, source code, tax forms, contracts, invoices, legal affairs, healthcare records, and more. These have been trained on large datasets and work out of the box.

You can also train custom classifiers on your own data. Provide examples of the content type you want to detect (at least 50 positive samples), and the classifier learns the patterns that make that type of document distinctive.
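To make "learns the patterns that make a document type distinctive" concrete, here is a toy bag-of-words classifier trained on positive and negative samples. This is an illustration only: Purview's trainable classifiers are a managed machine-learning service, not something you implement yourself, and the sample texts below are invented.

```python
import math
import re
from collections import Counter

def tokenise(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

class ToyClassifier:
    """Toy Naive Bayes-style scorer: learns which words distinguish
    positive samples from negative ones. Illustrative only."""

    def train(self, positives: list[str], negatives: list[str]) -> None:
        self.pos = Counter(w for doc in positives for w in tokenise(doc))
        self.neg = Counter(w for doc in negatives for w in tokenise(doc))
        self.vocab = set(self.pos) | set(self.neg)

    def score(self, text: str) -> float:
        """Log-odds that the text matches the trained content type."""
        pos_total = sum(self.pos.values()) + len(self.vocab)
        neg_total = sum(self.neg.values()) + len(self.vocab)
        logodds = 0.0
        for w in tokenise(text):
            if w not in self.vocab:
                continue
            logodds += math.log((self.pos[w] + 1) / pos_total)
            logodds -= math.log((self.neg[w] + 1) / neg_total)
        return logodds

clf = ToyClassifier()
clf.train(
    positives=["balance sheet assets liabilities equity",
               "income statement revenue expenses profit"],
    negatives=["team meeting agenda notes",
               "project plan milestones schedule"],
)
```

A new document scoring above zero reads more like the positive samples than the negatives. Note that nothing here matches a specific string: the classifier responds to overall vocabulary and structure, which is why it catches content that contains no obvious pattern.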

Strengths: Catches content that does not contain obvious patterns. A board pack might not have credit card numbers in it, but a classifier can recognise it as a financial document based on its structure and language.

Weakness: Less precise than SITs for specific data. A classifier might flag a document as "financial" when it is actually a financial training manual, not a real financial report. It understands the type of document, not whether the data inside it is genuinely sensitive.

When to use each on their own

Use SITs when you need to detect specific, structured data regardless of context. Credit card numbers in any document. Passport numbers anywhere they appear. National IDs in emails.

Use classifiers when you need to detect a category of content without knowing exactly what data it contains. All documents that look like contracts. Any file that appears to be source code. Emails that read like legal correspondence.

Quick decision: If you can describe what you are looking for as a data format ("16-digit numbers starting with 4"), use a SIT. If you can describe it as a document type ("anything that looks like a financial statement"), use a classifier.

Why combining them changes the game

On their own, SITs cast a wide net and classifiers make broad judgements. Together, they are far more accurate than either one alone.

The logic works like AND conditions in a rule. Instead of "block any document containing credit card numbers" (lots of false positives) or "block any document that looks financial" (too broad), you get:

"Block documents that look like financial statements AND contain credit card numbers."

That second rule is dramatically more targeted. A spreadsheet with a random 16-digit product code will not match because it does not look like a financial statement. A financial training manual will not match because it does not contain actual credit card numbers. You only catch documents that are genuinely sensitive.
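The AND logic above can be sketched as two independent predicates that must both hold. The field names and the two checks below are hypothetical stand-ins: assume `classifier_label` is the trainable classifier's verdict and `sit_matches` is the count of SIT hits in the document.

```python
def looks_like_financial_statement(doc: dict) -> bool:
    """Stand-in for the trainable classifier condition."""
    return doc["classifier_label"] == "financial statement"

def contains_card_number(doc: dict) -> bool:
    """Stand-in for the Sensitive Information Type condition."""
    return doc["sit_matches"] > 0

def should_block(doc: dict) -> bool:
    # Both legs of the AND must hold: the product catalogue fails the
    # classifier leg, the training manual fails the SIT leg.
    return looks_like_financial_statement(doc) and contains_card_number(doc)

docs = [
    {"name": "Q3 accounts.xlsx", "classifier_label": "financial statement", "sit_matches": 3},
    {"name": "product catalogue.xlsx", "classifier_label": "other", "sit_matches": 1},
    {"name": "finance training.docx", "classifier_label": "financial statement", "sit_matches": 0},
]
blocked = [d["name"] for d in docs if should_block(d)]  # only "Q3 accounts.xlsx"
```

Only the document that satisfies both conditions is caught, which is the false-positive reduction the combined rule buys you.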

This works in any Purview policy that supports condition groups:

Auto-labelling: Label a document "Confidential - Financial" only when it is classified as a financial document AND contains bank account numbers or credit card data. Without the classifier, you would label every document with a bank account number in it, including invoices you send to customers.

DLP: Block external sharing of documents classified as source code AND containing API keys or connection strings. Without the classifier, you would block any document with a string that looks like an API key, including documentation about how to use APIs.

Retention: Retain documents classified as contracts AND containing personally identifiable information for 7 years. Without the classifier, you would retain every document with PII in it, which is practically everything.

Practical examples

HR data protection

Classifier: "Resumes" or "HR documents". SIT: National Insurance numbers, dates of birth, salary figures. Rule: block external sharing of HR documents containing personal identifiers.

Why it works: a document mentioning someone's date of birth in a project plan will not trigger. An actual CV with personal details will.

Intellectual property

Classifier: "Source code" or custom classifier trained on your proprietary documents. SIT: custom SIT for internal project codes, API keys, connection strings. Rule: block upload to personal cloud storage of source code containing internal identifiers.

Why it works: a developer sharing a public code snippet is fine. Uploading proprietary source code with embedded credentials is caught.

Regulatory compliance

Classifier: "Financial statements" or "Tax forms". SIT: credit card numbers, bank account numbers, tax identification numbers. Rule: auto-label as "Highly Confidential" and restrict to internal only.

Why it works: financial training materials or public filings will not be over-classified. Actual sensitive financial documents with real account data will be protected.

Getting started

1. Audit your existing SIT-only policies. Look at your false positive rate. If you are getting alerts for documents that are technically matching but not actually sensitive, adding a classifier condition will help.

2. Start with the pre-trained classifiers. Microsoft provides dozens out of the box. Test them against your real content before building custom ones.

3. Use condition groups in your rules. In the DLP rule builder, add a condition group with an AND operator. Put the classifier in one group and the SIT in another.

4. Test in audit mode first. Run the combined rule alongside your existing SIT-only rule for a couple of weeks. Compare what each catches. You should see a significant drop in false positives from the combined rule.
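The audit-mode comparison in step 4 amounts to running both rules over the same traffic and diffing the results. A hedged sketch, with invented sample data standing in for your audit log:

```python
# Hypothetical audit-log entries: one per document the policies evaluated.
audit_log = [
    {"name": "invoice 1042.pdf", "classifier": "invoice", "sit_hits": 1},
    {"name": "board pack.pptx", "classifier": "financial statement", "sit_hits": 2},
    {"name": "parts list.xlsx", "classifier": "other", "sit_hits": 4},
    {"name": "team notes.docx", "classifier": "other", "sit_hits": 0},
]

# Old rule: SIT match alone. New rule: SIT match AND classifier match.
sit_only = [d["name"] for d in audit_log if d["sit_hits"] > 0]
combined = [d["name"] for d in audit_log
            if d["sit_hits"] > 0 and d["classifier"] == "financial statement"]

# Documents the old rule flagged but the new one did not are the ones
# to review: were they real incidents, or false positives?
to_review = sorted(set(sit_only) - set(combined))
```

If nearly everything in `to_review` turns out to be noise, the combined rule is safe to enforce; if genuine incidents appear there, the classifier condition is too narrow and needs adjusting before you switch over.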

5. Build custom classifiers for your specific content. If you have document types unique to your organisation (internal memos, client reports, proprietary formats), train a classifier on them. This is where the real value is - detecting content types that no built-in classifier covers.
