Gotcha · 8 Feb 2025 · 5 min read

Auto-labelling doesn't classify all your data at rest

Information Protection

You deploy auto-labelling and assume everything is covered. It is not. Service-side classification only processes recently active files, so years of historical data sit untouched. Here's what actually happens and what to do about it.

What happens

You build a trainable classifier. Simulation mode looks good. You switch to enforce. Within 24 hours, every PDF in three SharePoint sites gets labelled Confidential. Thousands of documents. Users cannot share files externally. Your helpdesk gets flooded.

Why this happens

Trainable classifiers work on content patterns, not file types. PDFs are particularly tricky because the text extracted from them often matches common patterns. A classifier trained on 'financial documents' will match any PDF with numbers, currency symbols, and formal language.

The real issue is scope. The default is to apply auto-labelling across all SharePoint sites and OneDrive accounts. If you do not scope it, it runs everywhere. Simulation mode can be misleading if you only review the first page of results.
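One way to look past the first page of simulation results is to export the matches and count them by site and file type. The sketch below assumes you already have the matches as CSV text. The column names `site` and `file_path` and the sample rows are hypothetical, not the format of any real export.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath

# Stand-in for an exported match list. The column names are hypothetical;
# adjust them to whatever your export actually contains.
EXPORT = """site,file_path
Finance,/Shared Documents/q1-report.pdf
Finance,/Shared Documents/invoice-0042.pdf
Finance,/Shared Documents/canteen-menu.pdf
Finance,/Shared Documents/budget.xlsx
Marketing,/Shared Documents/brochure.pdf
Marketing,/Shared Documents/price-list.pdf
"""


def matches_by_site_and_type(export_text):
    """Count matched files per (site, lower-cased file extension)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(export_text)):
        extension = PurePosixPath(row["file_path"]).suffix.lower()
        counts[(row["site"], extension)] += 1
    return counts


counts = matches_by_site_and_type(EXPORT)
for (site, extension), n in counts.most_common():
    print(f"{site:<10} {extension:<6} {n}")
```

If one file type dominates the matches on a site where you did not expect sensitive content, that is a pattern worth checking by hand before you enforce.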

It also does not cover everything you think it does

Service-side auto-labelling only processes recently active files. Files that have been created or modified recently are in what Microsoft considers active ("hot") storage and get classified. Older files that have not been touched sit in cold storage and will not be picked up by your auto-labelling policies at all.

At the time of writing, Microsoft does not publicly document how it decides whether a file is in hot or cold storage. There is no clear threshold for how recently a file needs to have been accessed or modified to qualify. This makes it difficult to predict exactly what your auto-labelling policies will and will not cover.

This catches people out. You deploy auto-labelling, it classifies recent documents, and you assume everything is covered. Months later you discover that years of historical files in SharePoint are completely unlabelled.
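A quick way to see the gap is to group a file inventory by last-modified year and check how much of each year has no label. This is a minimal sketch. The inventory rows are made up, and how you obtain a real inventory with label information is left open.

```python
from collections import Counter
from datetime import date

# Hypothetical inventory rows: (path, last_modified, sensitivity_label or None).
# In practice these would come from whatever file inventory you can export.
inventory = [
    ("/sites/finance/2019-budget.xlsx", date(2019, 3, 1), None),
    ("/sites/finance/2024-forecast.xlsx", date(2024, 11, 5), "Confidential"),
    ("/sites/hr/2018-policy.docx", date(2018, 6, 12), None),
    ("/sites/hr/2025-handbook.docx", date(2025, 1, 20), "General"),
    ("/sites/hr/2024-review.docx", date(2024, 9, 2), None),
]


def unlabelled_share_by_year(rows):
    """Return {year: fraction of files with no label}, keyed by last-modified year."""
    total = Counter()
    unlabelled = Counter()
    for _path, modified, label in rows:
        total[modified.year] += 1
        if label is None:
            unlabelled[modified.year] += 1
    return {year: unlabelled[year] / total[year] for year in sorted(total)}


shares = unlabelled_share_by_year(inventory)
for year, share in shares.items():
    print(f"{year}: {share:.0%} unlabelled")
```

A sharp jump in the unlabelled share for older years suggests those files were never picked up, rather than evaluated and found not to match.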

Microsoft introduced on-demand classification to address this. It scans inactive and historical files across SharePoint and OneDrive, but it is a paid feature with pay-as-you-go billing. You can scan up to 20 million files per run. It is worth it if you need full coverage, especially before enabling Copilot, but budget for it.
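For budgeting, a rough estimate is enough to start the conversation. The sketch below uses the 20 million files per run figure quoted above. The per-file price is a placeholder, and billing per file is itself an assumption, so check the current pricing before relying on the number.

```python
import math

MAX_FILES_PER_RUN = 20_000_000  # per-run cap quoted in this article


def estimate_on_demand_scan(total_files, price_per_file):
    """Return (number of runs needed, estimated cost).

    price_per_file is a placeholder. Billing per file is an assumption
    made for this sketch, not a documented pricing model.
    """
    runs = math.ceil(total_files / MAX_FILES_PER_RUN)
    cost = total_files * price_per_file
    return runs, cost


# Example: 45 million historical files at a made-up price of 0.0001 per file.
runs, cost = estimate_on_demand_scan(45_000_000, 0.0001)
print(f"{runs} runs, estimated cost {cost:,.2f}")
```

Splitting by site or library keeps each run under the cap and lets you scan the riskiest locations first.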

How to avoid it

Always scope auto-labelling policies to specific SharePoint sites or groups. Never run tenant-wide.

Run simulation for at least 14 days. Review the full match list, not just the first 10 results. Export and look for false positive patterns.

Raise the confidence threshold. The default is often too low. Start at 85-90% and lower gradually.

Layer conditions. Require both a classifier match AND a sensitive info type before auto-labelling.

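Use 'recommend' instead of auto-apply where you can. Users see a banner suggesting a label rather than having it forced. Recommendations are a feature of client-side labelling in Office apps; service-side auto-labelling policies apply labels without prompting the user.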
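The threshold and layering advice above boils down to a simple rule: label only when the classifier is confident and a sensitive info type is also present. Here is that rule as a minimal sketch. It models the decision only and is not how a policy is configured. The `Evaluation` type, the 0.85 default, and the info type names are illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative sensitive info type names, not a real policy definition.
REQUIRED_TYPES = frozenset({"Credit Card Number", "IBAN"})


@dataclass(frozen=True)
class Evaluation:
    """Hypothetical result of evaluating one file."""
    classifier_confidence: float  # 0.0 to 1.0
    sensitive_info_types: frozenset = frozenset()


def should_auto_label(evaluation, min_confidence=0.85):
    """Label only if BOTH conditions hold: a confident classifier match
    and at least one of the required sensitive info types."""
    classifier_hit = evaluation.classifier_confidence >= min_confidence
    info_type_hit = bool(evaluation.sensitive_info_types & REQUIRED_TYPES)
    return classifier_hit and info_type_hit


# A formal-looking PDF with numbers but no sensitive info type is skipped.
brochure = Evaluation(classifier_confidence=0.92)
invoice = Evaluation(0.92, frozenset({"Credit Card Number"}))
print(should_auto_label(brochure), should_auto_label(invoice))
```

The second condition is what stops a classifier that matches "any PDF with numbers" from labelling a price list.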

If it already happened

There is no easy way to bulk-remove sensitivity labels. Disable the auto-label policy first, then use simulation mode to identify everything it labelled. After that, use PowerShell with the Security & Compliance module to remove labels site by site.

Consider whether some labels were actually correct. Often the classifier was right about 60-70% of the content. The problem was the 30-40% false positives.
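To decide where labels are worth keeping, review a sample by hand and work out how often the label was right on each site. A minimal sketch, assuming you record each reviewed file as a site name and a correct or incorrect verdict. The sample data and the 0.7 cut-off are made up for illustration.

```python
from collections import defaultdict

# Hypothetical manual review of a sample: (site, label_was_correct).
reviewed = [
    ("Finance", True), ("Finance", True), ("Finance", True), ("Finance", False),
    ("Marketing", False), ("Marketing", False), ("Marketing", True), ("Marketing", False),
]


def precision_by_site(sample):
    """Return {site: share of reviewed labels that were correct}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for site, was_correct in sample:
        total[site] += 1
        if was_correct:
            correct[site] += 1
    return {site: correct[site] / total[site] for site in total}


def sites_to_keep(precisions, keep_at_or_above=0.7):
    """Sites where the labels are mostly right; the cut-off is an arbitrary choice."""
    return sorted(site for site, p in precisions.items() if p >= keep_at_or_above)


precisions = precision_by_site(reviewed)
keep = sites_to_keep(precisions)
print(precisions, keep)
```

Sites below the cut-off are candidates for label removal. Sites above it may only need the false positives fixed individually.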
