At Hugging Face, we have noticed a growing concern around machine learning (ML) datasets hosted on the Hub: the presence of undocumented personal information about individuals. This poses some unique challenges for ML practitioners. In this blog post, we will explore the types of datasets that contain Personally Identifiable Information (PII), the issues they present, and a new feature we are experimenting with on the Dataset Hub to help address these challenges.
Types of datasets with PII
We have identified two types of datasets that contain PII:
1. Annotated PII datasets: Datasets such as PII-Masking-300K from AI4Privacy are purpose-built for training models that detect and mask PII. Such models can help, for example, with moderating online content or providing anonymized databases (a short loading sketch follows this list).
2. Pre-training datasets: These are large-scale datasets, often terabytes in size, typically obtained through web crawls. While these datasets are usually filtered to remove certain types of PII, small amounts of sensitive information can still slip through the cracks due to the sheer volume of data and imperfections in PII detection models.
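As an illustration, here is a minimal sketch of loading an annotated PII dataset from the Hub with the `datasets` library. The repository ID `ai4privacy/pii-masking-300k` is our assumption of how the dataset is published; check the Hub for the exact identifier.

```python
# Minimal sketch: load an annotated PII dataset from the Hugging Face Hub.
# The repo ID "ai4privacy/pii-masking-300k" is an assumed identifier;
# verify the exact name on the Hub before running.
from datasets import load_dataset

dataset = load_dataset("ai4privacy/pii-masking-300k", split="train")
# Each record pairs source text with PII annotations/masked text,
# suitable for training PII detection and masking models.
print(dataset[0])
```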
The challenges of PII in ML datasets
The presence of PII in ML datasets can pose several challenges for practitioners. First and foremost, it raises privacy concerns and can be used to infer sensitive information about individuals. Moreover, PII can affect the performance of ML models if it is not properly handled: a model trained on a dataset containing PII may learn to associate specific PII with particular outcomes, leading to biased predictions, or it may even generate PII verbatim from the training set.
A new experiment on the Dataset Hub: Presidio reports
To address these challenges, we are experimenting with a new feature on the Dataset Hub that uses Presidio, a state-of-the-art open-source tool for PII detection. Presidio relies on detection patterns and machine learning models to identify PII.
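To give a sense of how Presidio works, here is a minimal sketch of running its analyzer on a string, assuming the `presidio-analyzer` package and a spaCy English model are installed; the sample text is ours.

```python
# Minimal sketch: detect PII in a string with Presidio's analyzer.
# Assumes: pip install presidio-analyzer (plus a spaCy English model).
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
text = "Contact Jane Doe at jane.doe@example.com or 212-555-0123."
results = analyzer.analyze(text=text, language="en")
for result in results:
    # Each result carries the entity type, character span, and confidence score.
    print(result.entity_type, text[result.start:result.end], result.score)
```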
This new feature lets users view a report that estimates the presence of PII in a dataset. This information is valuable to ML practitioners and can help them make informed decisions before training a model. For example, if the report indicates that a dataset contains sensitive PII, practitioners may choose to filter the dataset further using tools such as Presidio.
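For instance, a practitioner could drop rows that Presidio flags as containing PII before training. The following is only a sketch under assumptions: `your-dataset` is a placeholder repository ID, we assume a `text` column, and row-by-row analysis like this can be slow on large datasets.

```python
# Hedged sketch: filter out rows that Presidio flags as containing PII.
# "your-dataset" is a placeholder repo ID; we assume a "text" column.
from datasets import load_dataset
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def contains_pii(example):
    # Presidio returns a list of detected entities; empty means no PII found.
    return len(analyzer.analyze(text=example["text"], language="en")) > 0

dataset = load_dataset("your-dataset", split="train")
clean_dataset = dataset.filter(lambda example: not contains_pii(example))
```

An alternative to dropping rows is to anonymize the detected spans in place using the companion presidio-anonymizer package.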
Dataset owners can also benefit from this feature by using the reports to verify their PII filtering process before releasing a dataset.
Presidio Report Example
Let’s take a look at an example Presidio report for this pre-training dataset.
In this case, Presidio detected small amounts of emails and other sensitive PII in the dataset.
Conclusion
The presence of PII in ML datasets is an evolving challenge for the ML community. At Hugging Face, we are committed to transparency and to helping practitioners navigate these challenges. By experimenting with new features like Presidio reports on the Dataset Hub, we hope to empower users to make informed decisions and build more robust and ethical ML models.
We would also like to thank CNIL for their help with GDPR compliance. Their guidance has been invaluable in navigating the complexities of AI and personal data. Check out their updated AI how-to sheets here.
Stay tuned for more updates on this exciting development!