Data Labeling is the transfer of human knowledge to a computer by annotating data. In Data Labeling, a raw media element (image, text, video, 3D, audio, etc.) is loaded along with a set of labels (the Schema). The user reviews the media and labels it, for example by declaring a selected region of an image to be valid or invalid.

This captures the user’s knowledge and intent without specifying how they arrived at the conclusion. For example, if we label a “bird”, we aren’t telling the computer what makes up the bird, only that it is a bird.

What is Data Labeling?

Data Labeling produces structured data that is ready to be consumed by a machine learning model (and by the Data Science team). This is required because raw media is considered unstructured, meaning it is not directly usable for training. As a result, Data Labeling is required for most modern machine learning use cases, including computer vision, natural language processing, and speech recognition.
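To make "structured" concrete, here is a minimal Python sketch of what one labeled record might look like. The field names (`media_id`, `label`, the box coordinates) and the three-label schema are illustrative assumptions, not a format defined by any particular tool.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical schema: the set of labels an annotator may choose from.
SCHEMA = {"bird", "cat", "dog"}


@dataclass
class BoxAnnotation:
    media_id: str  # reference to the raw (unstructured) image
    label: str     # must be one of the schema labels
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def __post_init__(self):
        # Reject labels outside the schema and degenerate boxes.
        if self.label not in SCHEMA:
            raise ValueError(f"unknown label: {self.label!r}")
        if self.x_min >= self.x_max or self.y_min >= self.y_max:
            raise ValueError("box must have positive width and height")


# One human judgment, captured as structured data.
annotation = BoxAnnotation("image_001.jpg", "bird", 40, 30, 200, 180)
record = json.dumps(asdict(annotation))
```

The record states what is in the region (a bird), not how the annotator recognized it, which mirrors the idea of capturing intent without the "how".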

Fig. 1.1: What is Data Labeling? (Also known as Training Data or simply Annotation)

What Makes Data Labeling Difficult?

The apparent simplicity of Data Labeling hides the vast complexity and volume of work involved:

  • Experts may be required for data labeling
  • The volume of data labeling work may be very large
  • People may lack awareness of, access to, or familiarity with the right data labeling tools
  • As a new art form, general data labeling ideas and concepts are not well known
  • Label Schemas may be complex, with thousands of elements and nested conditional structures
  • Media formats impose challenges such as series, relationships, and 3D navigation
  • The knowledge task itself may be difficult with unclear answers
  • Most automation tools introduce new challenges and difficulties
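
As a small illustration of the schema complexity mentioned in the list above, the following Python sketch models one nested, conditional rule: an attribute that is asked only when another attribute has a given value. The schema layout, the `show_if` key, and the `validate` helper are all hypothetical and invented for this example. Real schemas can hold thousands of such elements.

```python
# Hypothetical schema: each label lists attributes; an attribute may be
# conditional on the value of another attribute via "show_if".
SCHEMA = {
    "vehicle": {
        "type": {"options": ["car", "truck"]},
        "trailer_attached": {
            "options": ["yes", "no"],
            "show_if": {"type": "truck"},  # only asked for trucks
        },
    },
}


def validate(label, attributes, schema=SCHEMA):
    """Return a list of problems; an empty list means the annotation is valid."""
    if label not in schema:
        return [f"unknown label {label!r}"]
    problems = []
    spec = schema[label]
    for name, rule in spec.items():
        condition = rule.get("show_if", {})
        # An attribute applies when every condition it depends on is met.
        applies = all(attributes.get(k) == v for k, v in condition.items())
        if applies and name not in attributes:
            problems.append(f"missing attribute {name!r}")
        elif not applies and name in attributes:
            problems.append(f"attribute {name!r} does not apply")
        elif applies and attributes[name] not in rule["options"]:
            problems.append(f"invalid value for {name!r}: {attributes[name]!r}")
    for name in attributes:
        if name not in spec:
            problems.append(f"unknown attribute {name!r}")
    return problems
```

For example, `validate("vehicle", {"type": "car", "trailer_attached": "yes"})` reports that the trailer attribute does not apply, because that question is only valid for trucks.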

Annotation is to data labeling as typing is to writing. Understanding how to annotate does not mean you can write a novel! Being able to annotate is becoming a basic component of modern literacy. 

More broadly, Data Labeling (Training Data) is the art of supervising machines through data. It is a new paradigm around which a growing list of mindsets, theories, research, and standards is emerging. This involves technical representations, people decisions, processes, tooling, system design, and a variety of new concepts specific to it. Now that you know what Data Labeling is and what makes it difficult, let’s think about media types.

What Media Types Work with Data Labeling?

Any media type can be used. Popular types include Images, Videos, Text, Audio, Timeseries, 3D, and more. There are many lateral supports, including Ingestion, Storage, Workflow, Automations, Exploration, Prioritization, Debugging, and more. These supports are often conceptually similar for each media type.

Fig. 1.2 Map of Overall Landscape of Training Data

What are common types of Labels in Data Labeling?

  • Images: Boxes, Polygons, Lines, Keypoints, Classification, Curves, Cuboids, Segmentation
  • Videos: All image types, plus Series and Events
  • Text: Named Entity Recognition, Part of Speech Tagging, Coreference Resolution, Dependency Parsing
  • 3D: Cuboids

Other types include: Audio, Timeseries, GEO & SAR, DICOM and more.

Why does Data Labeling Matter?

First, the art of teaching machines is at the heart of the 4th industrial revolution.

Therefore, it’s one of the most important software technologies of our time.

Data labeling is required to create your machine learning system. It determines what the system can do. Without data labeling, there is no system. With data labeling, the opportunities are only bounded by your imagination. Anything that you can map into the system can be repeated on new data. This means the intelligence and ability of the system depend on the quality, volume, and variety of the data you teach it with.

Second, data labeling work is upstream of Data Science work. This means Data Science is dependent on data labeling, and errors and failures in data labeling flow down to Data Science. Or, to use the well-worn cliché: garbage in, garbage out.

Conceptual position of data labeling and data science. This helps zoom out from what data labeling is to how it is used.

Third, the art of data labeling represents a shift in thinking about how to build AI systems. Instead of exclusively trying to improve mathematical algorithms, we also optimize the structure of the labeled data to better match our needs. This is the heart of the AI transformation taking place and the core of modern automation. For the first time, knowledge work is being automated.

How can Data Labeling be done Economically?

You know what Data Labeling is, but how can it be done effectively and economically?

  • Education. Framing the problem with optimal media types, label schema, and data types. This means research and education for the admins, managers, data scientists, engineers etc.
  • Learning Tools. Knowing data labeling tools inside and out. Functions like automations, segmentations, issue tracking, hotkeys, etc. all add up.
  • Rethink Annotation Talent. In-source: get your annotations from staff at your company. From data entry staff to experts, your existing people are often the best annotators. Don’t assume that you need a dedicated outsourced team.
  • Learn Automations. Pre-labeling data (and reviewing production predictions) is the most popular and successful automation strategy. Next, craft advanced automation strategies such as interactive automations.
  • Use Open Source Tools. Data Labeling tooling, especially from closed source companies, is expensive. With Diffgram open source you can label for free with up to 20 users. The Diffgram Unlimited Model frees you from paying for every annotation, every media element, every frame, every pixel!
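
The pre-labeling strategy from the list above can be sketched in a few lines of Python. This is a sketch under stated assumptions: `fake_model` is a stand-in for a real model, and the record fields and the `review_threshold` value are invented for the example. The idea is that model predictions become draft annotations, and humans then review them, starting with the least confident ones.

```python
def fake_model(media_id):
    """Stand-in for a real model; returns a (label, confidence) guess."""
    canned = {"img_1.jpg": ("bird", 0.94), "img_2.jpg": ("cat", 0.41)}
    return canned.get(media_id, ("unknown", 0.0))


def pre_label(media_ids, predict=fake_model, review_threshold=0.8):
    """Turn model predictions into draft annotations for human review."""
    drafts = []
    for media_id in media_ids:
        label, confidence = predict(media_id)
        drafts.append({
            "media_id": media_id,
            "label": label,
            "confidence": confidence,
            "source": "model",  # marks the draft as machine-generated
            # Low-confidence drafts are flagged so reviewers see them first.
            "priority": "high" if confidence < review_threshold else "normal",
        })
    # Lowest confidence first.
    return sorted(drafts, key=lambda d: d["confidence"])


drafts = pre_label(["img_1.jpg", "img_2.jpg"])
```

The annotator then corrects or accepts each draft instead of labeling from scratch, which is where the savings would come from.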

What are Best Practices for Data Labeling?

  • Use Open Source. Use standards-based data labeling. Diffgram is leading the industry in setting the new de facto standard. In fact, Diffgram is the only modern platform that’s fully open source.
  • Appoint a Data Labeling Leader. All revolutions need leaders. Someone to preach the new message. To rally the troops. To reassure doubts. And the leader must have a team. 
  • Task Management. Use Task Management concepts and quality assurance tools for data labeling.
  • Database for Data Labeling. Set up one place to store and query labeled data.
  • Interactive Automations Running Locally. Run automations locally, such as with JavaScript, instead of making API calls. This is more cost effective and performs better.
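
To illustrate the "one place to store and query labeled data" practice from the list above, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, column names, and sample rows are assumptions made up for the example. A production setup would likely use a server database and a richer schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(
    "CREATE TABLE annotation ("
    " id INTEGER PRIMARY KEY,"
    " media_id TEXT NOT NULL,"
    " label TEXT NOT NULL,"
    " annotator TEXT NOT NULL,"
    " reviewed INTEGER NOT NULL DEFAULT 0)"
)
rows = [
    ("img_1.jpg", "bird", "alice", 1),
    ("img_2.jpg", "bird", "bob", 0),
    ("img_3.jpg", "cat", "alice", 0),
]
conn.executemany(
    "INSERT INTO annotation (media_id, label, annotator, reviewed)"
    " VALUES (?, ?, ?, ?)",
    rows,
)


def count_by_label(connection):
    """How many annotations exist per label?"""
    cursor = connection.execute(
        "SELECT label, COUNT(*) FROM annotation GROUP BY label ORDER BY label"
    )
    return dict(cursor.fetchall())


def unreviewed(connection, label):
    """Which media items with this label still need review?"""
    cursor = connection.execute(
        "SELECT media_id FROM annotation WHERE label = ? AND reviewed = 0"
        " ORDER BY media_id",
        (label,),
    )
    return [row[0] for row in cursor]
```

With everything in one table, questions such as "how balanced are my labels?" or "what is still unreviewed?" become single queries rather than scripts over scattered files.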

What are Anti-Patterns with Data Labeling (2022)?

  • Consensus. It doubles or triples the cost and is not necessary in most cases. If there is that much disagreement among humans, then the way the labels are structured is wrong. Instead, use a Review loop, which can also be applied to an arbitrary subset so that not every sample needs to be reviewed.
  • Closed Source Labeling Tools. They increase costs and are no longer needed now that open source software like Diffgram is available.
  • Over-focus on Automation. Automation matters, but it is only one part of the puzzle.
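
The cost argument against consensus can be checked with simple arithmetic. The sketch below counts human passes over the data, under the simplifying assumption that one review pass costs about the same as one annotation pass. The function names, the 20% review fraction, and the fixed random seed are illustrative choices, not recommendations.

```python
import random


def consensus_passes(n_samples, annotators_per_sample):
    """Every sample is labeled independently by several annotators."""
    return n_samples * annotators_per_sample


def review_loop_passes(n_samples, review_fraction):
    """Every sample is labeled once; only a fraction is reviewed."""
    return n_samples + round(n_samples * review_fraction)


def pick_for_review(sample_ids, review_fraction, seed=0):
    """Choose an arbitrary subset of samples to send to review."""
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    k = round(len(sample_ids) * review_fraction)
    return sorted(rng.sample(sample_ids, k))


three_way_consensus = consensus_passes(1000, 3)    # 3000 passes
review_one_in_five = review_loop_passes(1000, 0.2)  # 1200 passes
to_review = pick_for_review(list(range(1000)), 0.2)
```

Under these assumptions, three-way consensus on 1,000 samples takes 3,000 passes, while labeling once and reviewing 20% takes 1,200.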

How Can I get started with Data Labeling?

The first step is to choose a platform.

Thanks for Reading!

Recap: In this article we covered “What is Data Labeling” and other data annotation topics, all in the context of machine learning (ML) and AI.