After examining ten of the most frequently cited datasets for testing machine learning systems, MIT computer science researchers found that these datasets contain critical labeling errors.

These errors could cause deep problems for AI systems developed using them. Together, the datasets tested for accuracy have been cited hundreds of thousands of times. They include text-based datasets built from newsgroup posts, IMDB, and Amazon reviews. Errors emerged even in basic Amazon product reviews, with negative reviews labeled as positive and positive reviews labeled as negative.

Image-based errors included mislabeled animal species and labels applied to less-prominent objects (for example, labeling a water bottle attached to a mountain bike instead of the bike itself).

For audio labeling, a clip from a YouTube video in which the speaker talks for 3.5 minutes was labeled as a “church bell,” even though the bell can be heard only in the final 30 seconds.

Pervasive Label Errors in ML Datasets Destabilize Benchmarks

How did the MIT researchers find these labeling errors?

Using the confident learning framework, the researchers examined the datasets for incorrectly labeled examples (known as label noise). The framework flagged possible mistakes, which were then validated by human reviewers on Mechanical Turk. About 54% of the flagged examples were confirmed to have incorrect labels.
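
To make the idea concrete, here is a minimal sketch of the core confident learning step. It is a simplified illustration, not the researchers' implementation. It assumes you already have, for each example, the label it was given and a model's out-of-sample predicted probabilities; the function names and the toy data are made up for this example.

```python
from collections import defaultdict


def class_thresholds(labels, probs):
    """Per-class threshold: the average predicted probability of class k
    over the examples that were given the label k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for given, p in zip(labels, probs):
        sums[given] += p[given]
        counts[given] += 1
    return {k: sums[k] / counts[k] for k in counts}


def flag_label_issues(labels, probs):
    """Return indices whose given label disagrees with the class the
    model is confident about (a simplified confident learning rule)."""
    thresholds = class_thresholds(labels, probs)
    flagged = []
    for i, (given, p) in enumerate(zip(labels, probs)):
        # Classes whose predicted probability clears that class's threshold.
        confident = [k for k in thresholds if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != given:
            flagged.append(i)
    return flagged


# Toy data (made up): 0 = negative review, 1 = positive review.
labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled negative, but the model is confident it is positive
    [0.25, 0.75],
    [0.10, 0.90],
    [0.30, 0.70],
]

suspects = flag_label_issues(labels, probs)
print(suspects)
```

The flagged indices are only candidates. As in the study, they still need human review before a label is treated as wrong.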

The QuickDraw dataset contained the most errors: around 5 million, or about 10% of the dataset. The team has also created a website where anyone can browse these labeling errors.

Label Error counts and percentages across 10 popular ML datasets

What is the impact of using incorrectly labeled data sets for machine learning/AI?

Using mislabeled datasets to train or test your ML/AI algorithms can have serious effects. Models trained on wrong labels learn to make wrong predictions, and errors in test sets make benchmark results unreliable, which can undermine the entire ML/AI process.
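
As a rough illustration of how test-set label errors can distort a benchmark, the sketch below scores two hypothetical models against the same ten test items, first with the labels as given and then with two labels corrected. All of the data is invented for this example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Ten test items (made up). Items 2 and 7 carry wrong labels in the
# published test set and are fixed in the corrected version.
given_labels     = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
corrected_labels = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]

# Model A reproduces the two labeling mistakes; Model B does not.
model_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
model_b = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]

for name, preds in (("A", model_a), ("B", model_b)):
    print(
        f"Model {name}: "
        f"given labels {accuracy(preds, given_labels):.0%}, "
        f"corrected labels {accuracy(preds, corrected_labels):.0%}"
    )
```

On the given labels Model A looks better, but on the corrected labels Model B does. With enough label errors, the model that tops a leaderboard may not be the one that performs best in practice.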

How can you prevent mislabeled data or incorrect annotations for ML/AI?

The datasets you use to train your ML/AI models largely determine the quality of their output, so it is important to take great care of your training data and its operations. Best practices include, but are not limited to:

  1. Investing in the right team for training data labeling
  2. Closely monitoring and improving the data labeling practices in your organization
  3. Investing in the RIGHT training data software, vendors, and tools
  4. Making sure that only qualified personnel are labeling your training data sets
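
One common way to monitor labeling quality is to have two annotators label the same sample and measure how often they agree beyond what chance would produce. The sketch below computes Cohen's kappa for that purpose. It is a swapped-in, general-purpose technique rather than anything specific to the MIT study, and the annotator data is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance. 1.0 means perfect agreement and
    0.0 means agreement no better than chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty sample")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same eight images.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
annotator_2 = ["cat", "cat", "dog", "cat", "cat", "dog", "dog", "dog"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A low score on a shared sample is a signal to review the labeling guidelines or the annotators' training before labeling more data.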

How can Diffgram’s Open Source Data Labeling Solution Help?

The quality of your training data decides the quality of your ML/AI’s outputs. Diffgram can help you manage almost EVERY aspect of your training data on a single platform.

  1. Manage your data labeling pipeline, flag errors, and have detailed review processes
  2. Allow ONLY QUALIFIED data annotators to label data. (Diffgram’s examination feature allows you to qualify and award credentials to your annotators)
  3. Diffgram’s open-source nature holds us to the highest quality and pushes us to set the standard for the market.
  4. Our UNLIMITED model means your organization avoids high data labeling bills and can annotate freely and with high quality.

Example of Video labeling on Diffgram

Try Diffgram’s Online Platform today (no credit card required) or install our open-source software.