Anatomy of a custom sensitive information type
Employee IDs, customer reference numbers, project codenames: the data that matters most to your organisation has no built-in classifier. Here is how a custom SIT is actually constructed, part by part, with a worked example.
When you need one
Purview comes with over 300 built-in sensitive information types covering credit cards, passports, national IDs, and bank accounts worldwide. None of them know what your employee ID looks like.
The data that identifies your organisation specifically - staff numbers, customer references, internal account formats, unannounced project codenames - needs a custom SIT. The good news: if you can describe the data as a pattern, you can build one in an afternoon. The bad news: most custom SITs are built badly, because the parts are misunderstood.
The four parts of a pattern
A custom SIT contains one or more patterns, and each pattern has four parts.
1. Primary element. The trigger. A regular expression ([A-Z]{2}\d{6} for two uppercase letters and six digits), a keyword list, or a keyword dictionary. The primary element finds candidate matches; everything else decides whether to keep them.
2. Supporting elements. Context that validates the match. The string EM483920 could be an employee ID or a flight reference. The words 'employee id' or 'payroll' nearby are what settle it.
3. Character proximity. How close the supporting evidence must be, in characters either side of the primary match. The default is 300. It is a hard window, and the entire supporting element must sit inside it.
4. Confidence level. A label - low (65), medium (75), or high (85) - that you assign to the pattern. Policies use it to decide which matches to act on. It is not a quality score, and Purview never checks whether the pattern earns it.
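The four parts can be sketched in a few lines of Python. This is an illustration of the idea only, not Purview's engine: the Pattern class, the find_matches function, and the way the window is measured (from the edges of the primary match) are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Pattern:
    """Hypothetical container for the four parts of one pattern."""
    primary: str            # regular expression for the primary element
    supporting: list[str]   # keywords that validate a match (empty = none needed)
    proximity: int = 300    # characters either side of the primary match
    confidence: int = 75    # label chosen by the author: 65, 75 or 85


def find_matches(pattern: Pattern, text: str) -> list[str]:
    """Return primary matches that have supporting evidence inside the window."""
    kept = []
    lowered = text.lower()
    for m in re.finditer(pattern.primary, text):
        if not pattern.supporting:
            kept.append(m.group())
            continue
        # Hard window: a keyword only counts if all of it falls inside the slice.
        start = max(0, m.start() - pattern.proximity)
        end = min(len(text), m.end() + pattern.proximity)
        window = lowered[start:end]
        if any(kw.lower() in window for kw in pattern.supporting):
            kept.append(m.group())
    return kept


employee_id = Pattern(r"[A-Z]{2}\d{6}", ["employee id", "payroll"])
print(find_matches(employee_id, "Employee ID: EM483920"))   # kept
print(find_matches(employee_id, "Boarding pass EM483920"))  # dropped, no context
```

Note that the confidence value plays no part in the matching. It is carried along as a label, which is exactly why a pattern can claim a level it has not earned.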
Additional checks and validators
Beyond the four parts, each pattern can carry validation on the matched value itself:
- Exclude specific matches - drop known test values, like the 4111111111111111 test card
- Starts with / doesn't start with - require or reject leading characters
- Ends with / doesn't end with - the same for trailing characters
- Exclude duplicate characters - ignore matches where all the digits are identical, like 111111
- Include or exclude prefixes and suffixes - reject a match preceded by GUID:, for example
- Checksum validators - run a Luhn check, or a custom weighted modulo checksum for ID schemes that carry a check digit
- Date validator - confirm an embedded date segment is a real date, useful when an ID starts with a hire date in DDMMYY format
These checks are the difference between detecting '16 digits' and detecting 'a number that is mathematically a card number'. Use them.
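To make three of these checks concrete, here is a minimal Python sketch of a Luhn check, a duplicate-character exclusion, and a DDMMYY date validator. The function names and the EXCLUDED set are hypothetical, and the code illustrates the logic rather than how Purview implements it.

```python
from datetime import datetime

EXCLUDED = {"4111111111111111"}  # known test values to drop


def luhn_valid(number: str) -> bool:
    """Standard Luhn check: double every second digit counting from the right."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def all_same_digit(value: str) -> bool:
    """True when every digit in the value is identical, like 111111."""
    digits = [c for c in value if c.isdigit()]
    return len(digits) > 1 and len(set(digits)) == 1


def valid_ddmmyy(segment: str) -> bool:
    """True when a six-digit segment is a real calendar date in DDMMYY form."""
    if len(segment) != 6 or not segment.isdigit():
        return False
    try:
        datetime.strptime(segment, "%d%m%y")
        return True
    except ValueError:
        return False


def keep_card_match(value: str) -> bool:
    """Apply the exclusion list, the duplicate check, then the checksum."""
    if value in EXCLUDED or all_same_digit(value):
        return False
    return luhn_valid(value)
```

The 4111111111111111 test card passes the Luhn check, which is why a checksum alone is not enough and the explicit exclusion earns its place.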
One more lever, and it is the best false positive tool of the lot: 'not any of these' exclusions, a group of terms that suppresses the match when any of them is found nearby. Microsoft uses the trick itself: the built-in ITIN type (a US taxpayer number for people who cannot get a Social Security Number - nine digits, always starting with 9) ignores matches that follow 'Phone Conference ID:', because Teams meeting invites kept lighting it up. Your custom SITs can do the same with whatever noise plagues your tenant.
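A rough Python sketch of the exclusion idea follows. The regex is a simplified stand-in for a nine-digit number starting with 9, not Microsoft's actual ITIN pattern, and the 50-character look-behind window is an arbitrary value chosen for the example.

```python
import re

PRIMARY = re.compile(r"\b9\d{8}\b")      # simplified stand-in, not the built-in ITIN regex
NOT_ANY_OF = ["phone conference id:"]    # terms that suppress a match
WINDOW = 50                              # hypothetical look-behind distance in characters


def matches_after_exclusions(text: str) -> list[str]:
    """Return primary matches that are not preceded by an excluded term."""
    lowered = text.lower()
    kept = []
    for m in PRIMARY.finditer(text):
        before = lowered[max(0, m.start() - WINDOW):m.start()]
        if any(term in before for term in NOT_ANY_OF):
            continue  # noise: suppress this match
        kept.append(m.group())
    return kept


print(matches_after_exclusions("Phone Conference ID: 912345678"))  # suppressed
print(matches_after_exclusions("ITIN 912345678 on file"))          # kept
```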
A worked example
Employee IDs in the format two letters plus six digits, like EM483920. Build three patterns in one SIT:
Low confidence: regex [A-Z]{2}\d{6} with the duplicate-digit exclusion. Catches everything, including flight references. You will not block on this, but in simulation it shows you the shape of the noise.
Medium confidence: the same regex, plus supporting keywords (employee, staff, payroll, HR) within 300 characters.
High confidence: the same regex, plus the specific phrases 'employee id', 'employee number', or 'staff id' within 100 characters, plus excluded test values.
Now DLP rules can route by tier: audit the lows, notify on medium, block on high. And when a real employee ID slips past the high pattern, the medium tier shows you why - usually a proximity distance you can see and fix.
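The three tiers can be simulated in Python to see how a given piece of text would be routed. This is a sketch under stated assumptions: the highest matching tier wins, keywords match as whole words, the duplicate-digit exclusion is applied to all three tiers, and EM123456 is a made-up test value. None of the names below come from Purview.

```python
import re

PRIMARY = re.compile(r"[A-Z]{2}\d{6}")
HIGH_TERMS = ["employee id", "employee number", "staff id"]
MEDIUM_TERMS = ["employee", "staff", "payroll", "hr"]
TEST_VALUES = {"EM123456"}  # hypothetical excluded test ID


def has_term_near(text, match, terms, proximity):
    """True if any term appears as whole words within the proximity window."""
    start = max(0, match.start() - proximity)
    window = text[start:match.end() + proximity].lower()
    return any(re.search(r"\b" + re.escape(t) + r"\b", window) for t in terms)


def confidence(text, match):
    """Return 85, 75 or 65 for a match, or None if it is excluded outright."""
    value = match.group()
    if len(set(value[2:])) == 1:
        return None  # duplicate-digit exclusion, e.g. AA111111
    if value not in TEST_VALUES and has_term_near(text, match, HIGH_TERMS, 100):
        return 85
    if has_term_near(text, match, MEDIUM_TERMS, 300):
        return 75
    return 65


def classify(text):
    """List each surviving match with the highest tier it reaches."""
    results = []
    for m in PRIMARY.finditer(text):
        level = confidence(text, m)
        if level is not None:
            results.append((m.group(), level))
    return results


print(classify("Employee ID: EM483920"))
print(classify("Gate change for EM483920"))
```

Running a near miss through the sketch shows the diagnostic value of the middle tier: put 'employee id' 150 characters before the number and the match drops from high to medium, which points straight at the 100-character proximity setting.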
Test before you deploy
Two practical notes on testing.
Test in the portal before any policy references the SIT. The SIT editor has a built-in test that accepts a sample file and shows match counts per confidence level. One gotcha: the tenant needs at least one Exchange Online license or the Test option is greyed out.
Design and iterate outside the tenant first. The portal feedback loop is slow when you are still shaping regex and proximity values. The Custom SIT Builder simulates Purview's evaluation - primary, supporting, proximity windows, checks, and the highest-match-wins result - instantly in your browser. It flags weak pattern design as you build and exports a specification you can follow in the portal. Its engine is built to mirror how Purview evaluates content. If a built-in SIT is close to what you need, open it in the SIT X-Ray and use Open in Builder to start from Microsoft's own patterns.