Gotcha · 11 Jun 2026 · 5 min read

Five things Purview's SIT engine does that the docs never mention

Information Protection · Data Loss Prevention

Five behaviours of Purview's sensitive information type (SIT) classification engine that the documentation does not mention, all now built into the SIT X-Ray.

How this was tested

The SIT X-Ray needed to behave exactly like Purview, not approximately. That meant working from Microsoft's built-in SIT definitions (identical in every tenant) and checking the simulation against how Purview actually classifies crafted sample content, case by case.

Where Purview's behaviour disagreed with the documentation, the behaviour won, and the simulation was corrected to match. That process also surfaced five behaviours the docs never mention.

Keyword matching is case-sensitive when you least expect it

Keyword lists come in two styles, word match and string match, and the difference is bigger than the docs let on.

Word match is case-insensitive and respects word boundaries, which is what you would expect.

String match is a case-sensitive substring search with no boundary check. The IP Address SIT corroborates on the term 'IP' using string match. In testing, the word PIPELINE triggers it - uppercase IP sits right there inside the word - while a lowercase 'ip', or a word like 'recipe', does nothing.

So a document containing 'PIPELINE node 10.2.1.4' matches the combined IP Address SIT at full confidence, while the same sentence in lower case loses that corroboration. (The bare address still matches the separate IP Address v4 SIT at medium - the case-sensitivity is specifically about the combined SIT's keyword.) If you have ever stared at two near-identical documents wondering why only one fired, this may be your answer.
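
The difference between the two styles can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not Purview's implementation; the function names `word_match` and `string_match` are made up for the example.

```python
import re


def word_match(keyword: str, text: str) -> bool:
    """Case-insensitive match that requires a word boundary on both sides."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def string_match(keyword: str, text: str) -> bool:
    """Case-sensitive substring search with no boundary check."""
    return keyword in text


samples = [
    "PIPELINE node 10.2.1.4",  # uppercase IP hidden inside a word
    "pipeline node 10.2.1.4",  # same sentence in lower case
    "recipe",                  # lowercase 'ip' inside a word
    "ip 10.2.1.4",             # standalone lowercase keyword
]

for sample in samples:
    print(f"{sample!r}: word={word_match('IP', sample)} "
          f"string={string_match('IP', sample)}")
```

Only the first sample satisfies the string match, and only the last satisfies the word match, which is the asymmetry that makes two near-identical documents classify differently.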

The engine rejects values the docs never mention

The documented SSN rules are familiar: no area 000 or 666, no 900+ areas, no group 00, no serial 0000.

The engine also rejects serial 9999, which appears nowhere in the documentation. 219-09-4821 with an SSN keyword matches at high confidence; 219-09-9999 matches nothing, at any confidence.

The practical trap is test data. If your team validates DLP policies with made-up SSNs ending 9999 (a natural choice for fake data), every test document will slip through without a match and prove nothing. The policy is fine; your test value was quietly rejected before evaluation started.
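
A quick way to vet test data before using it is to run it through the same rejection rules. The sketch below encodes the documented checks plus the undocumented 9999 serial; it assumes the dashed `AAA-GG-SSSS` layout only, and `ssn_passes_checks` is a hypothetical helper name, not anything Purview exposes.

```python
import re

# Simplified: only the dashed AAA-GG-SSSS layout is handled here.
SSN_SHAPE = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")


def ssn_passes_checks(candidate: str) -> bool:
    """Return True if the value survives the rejection rules described above."""
    match = SSN_SHAPE.match(candidate)
    if match is None:
        return False
    area, group, serial = match.groups()
    if area in ("000", "666") or int(area) >= 900:
        return False  # documented: invalid area numbers
    if group == "00":
        return False  # documented: invalid group number
    if serial == "0000":
        return False  # documented: invalid serial number
    if serial == "9999":
        return False  # undocumented: observed in testing
    return True


for value in ("219-09-4821", "219-09-9999"):
    print(value, ssn_passes_checks(value))
```

Running fake SSNs through a check like this first tells you whether a non-match in Purview says something about the policy or only about the test value.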

Instance counts dedupe by value

DLP rule conditions let you set instance count ranges, and most people assume the count is the number of matches in the document.

It is the number of distinct values. A document containing the same card number twice reports both occurrences but counts one instance. Two different card numbers count as two.

This changes how you design thresholds. A rule that fires at 'instance count 10 or more' is not asking for a document that mentions a card number ten times - it is asking for ten different card numbers, which is a much stronger signal of a data export. That is almost certainly what you wanted anyway, but now you know it is what you are getting.
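
In code terms, the count behaves like the size of a set rather than the length of a list. The sketch below assumes matched values are compared as exact strings; whether Purview normalises spacing or separators before comparing was not part of this testing. The helper names and the sample card numbers (widely used test numbers) are illustrative.

```python
def instance_count(matches: list[str]) -> int:
    """Count distinct matched values, not total occurrences."""
    return len(set(matches))


def rule_fires(matches: list[str], minimum_instances: int) -> bool:
    """Hypothetical threshold check for an 'instance count N or more' rule."""
    return instance_count(matches) >= minimum_instances


same_card_twice = ["4111 1111 1111 1111", "4111 1111 1111 1111"]
two_cards = ["4111 1111 1111 1111", "5555 5555 5555 4444"]

print(len(same_card_twice), instance_count(same_card_twice))  # 2 occurrences, 1 instance
print(len(two_cards), instance_count(two_cards))              # 2 occurrences, 2 instances
```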

Confidence values the portal never shows you

The portal offers three confidence levels: low (65), medium (75), high (85). Microsoft's own SITs are not so constrained.

Crack open the built-in definitions and you find patterns at 55 (the U.S. SSN type has one) and 95 (IP Address, when the 'IP' keyword corroborates). The dropdown never offers these values, and the portal's Test feature reports results in bands capped at 85 - so a 95-confidence match comes back labelled plain 'high', and a 55 hides inside 'low'.

It matters because rules trigger at thresholds. That 55-confidence SSN pattern sits below the portal's low value of 65, yet a rule set to 'low confidence' still catches it. You cannot see any of this in the portal; the SIT X-Ray shows every pattern with its real value.
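
The way real values disappear into the portal's labels can be sketched as a simple banding function. The band edges used here are an assumption for illustration: the testing above only establishes that 95 is reported as 'high' and 55 as 'low', and `portal_band` is a hypothetical name.

```python
def portal_band(confidence: int) -> str:
    """Map a pattern's real confidence value to the label the portal shows.

    Band edges are assumed: 85 and above is 'high', 75 to 84 is 'medium',
    and everything lower is 'low'.
    """
    if confidence >= 85:
        return "high"
    if confidence >= 75:
        return "medium"
    return "low"


for value in (55, 65, 75, 85, 95):
    print(value, portal_band(value))
```

Two different values landing on the same label (55 and 65, or 85 and 95) is exactly the information the portal drops.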

The hidden exclusion list

The ITIN is a US taxpayer number issued to people who cannot get a Social Security Number: nine digits, always starting with 9. So why does the ITIN type ignore some numbers that fit that shape exactly? Because Microsoft quietly added a 'not any of these' exclusion: any candidate preceded by 'Phone Conference ID:' is suppressed. Teams meeting invites were lighting the SIT up, and this was the fix.

Two lessons. First, if a built-in SIT mysteriously will not match your test data, an exclusion may be eating it - the X-Ray shows these groups in red. Second, the same mechanism is available to you: when your custom SIT keeps firing on some recurring piece of tenant noise, an exclusion group is cleaner than loosening the pattern. This one shows up in Purview too: conference IDs are suppressed, and the same number with a taxpayer keyword matches at high confidence.
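
The exclusion mechanism can be sketched as a filter applied to candidates after the pattern has matched. Everything here is simplified for illustration: the nine-digit shape is a stand-in for the real ITIN pattern, 'preceded by' is assumed to mean the phrase appears immediately before the number (ignoring whitespace), and `itin_candidates` is a hypothetical function name.

```python
import re

# Simplified stand-in for the ITIN shape: nine contiguous digits starting with 9.
ITIN_SHAPE = re.compile(r"\b9\d{8}\b")

# The exclusion phrase described above.
EXCLUSION_PREFIX = "Phone Conference ID:"


def itin_candidates(text: str) -> list[str]:
    """Return ITIN-shaped values that are not suppressed by the exclusion."""
    kept = []
    for match in ITIN_SHAPE.finditer(text):
        preceding = text[:match.start()].rstrip()
        if preceding.endswith(EXCLUSION_PREFIX):
            continue  # suppressed: looks like a meeting invite, not a taxpayer number
        kept.append(match.group())
    return kept


print(itin_candidates("Phone Conference ID: 912345678"))
print(itin_candidates("Taxpayer ID 912345678"))
```

The same value is kept or dropped purely on its surrounding text, which is why an exclusion group is a more targeted fix than tightening the pattern itself.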

Every built-in SIT x-rayed: real patterns, confidence values, and exclusions, with live in-browser testing for the heavy hitters.
