What is the difference between labeled and unlabeled data?

Labeled data, used by Supervised learning add meaningful tags or labels or class to the observations (or rows). These tags can come from observations or asking people or specialists about the data.

Classification and Regression could be applied to labelled datasets for Supervised learning.

Machine learning models can be applied to the labeled data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted. enter image description here

Unlabeled data, used by Unsupervised learning however do not have any meaningful tags or labels associated with it. enter image description here Unsupervised learning has more difficult algorithms than supervised learning since we know little to no information about the data, or the outcomes that are to be expected.

Clustering is considered to be one of the most popular unsupervised machine learning techniques used for grouping data points, or objects that are somehow similar.

Unsupervised learning has fewer models, and fewer evaluation methods that can be used to ensure that the outcome of the model is accurate. As such, unsupervised learning creates a less controllable environment as the machine is creating outcomes for us.

Picture courtesy of Coursera: Machine Learning with Python


Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of that unlabeled data with meaningful tags that are informative. For example, labels might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, whether the dot in an x-ray is a tumor, etc.


There are many different problems in Machine Learning so I'll pick classification as a case in point. In classification, labelled data typically consists of a bag of multidimensional feature vectors (normally called X) and for each vector a label, Y which is often just an integer corresponding to a category eg. (face=1, non-face=-1). Unlabelled data misses the Y component. There are many scenarios where unlabelled data is plentiful and easily obtained but labelled data often requires a human/expert to annotate.


Typically, unlabeled data consists of samples of natural or human-created artifacts that you can obtain relatively easily from the world. Some examples of unlabeled data might include photos, audio recordings, videos, news articles, tweets, x-rays (if you were working on a medical application), etc. There is no "explanation" for each piece of unlabeled data -- it just contains the data, and nothing else.

Labeled data typically takes a set of unlabeled data and augments each piece of that unlabeled data with some sort of meaningful "tag," "label," or "class" that is somehow informative or desirable to know. For example, labels for the above types of unlabeled data might be whether this photo contains a horse or a cow, which words were uttered in this audio recording, what type of action is being performed in this video, what the topic of this news article is, what the overall sentiment of this tweet is, whether the dot in this x-ray is a tumor, etc.

Labels for data are often obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?") and are significantly more expensive to obtain than the raw unlabeled data.

After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data.

There are many active areas of research in machine learning that are aimed at integrating unlabeled and labeled data to build better and more accurate models of the world. Semi-supervised learning attempts to combine unlabeled and labeled data (or, more generally, sets of unlabeled data where only some data points have labels) into integrated models. Deep neural networks and feature learning are areas of research that attempt to build models of the unlabeled data alone, and then apply information from the labels to the interesting parts of the models.