Graph‑Based Semi‑Supervised Learning is a machine learning approach that uses both labelled and unlabelled data by modelling their relationships as a graph. Here, data points are treated as nodes and edges represent similarities, allowing labels to propagate through the graph based on structural connections.
- Makes effective use of limited labelled data
- Exploits relationships and similarity between data points
- Spreads labels through the graph from labelled to unlabelled nodes
- Commonly applied in text classification, image recognition and recommendation systems

Need for Graph-Based Learning
In many real-world scenarios, labelling data is expensive and time consuming, while unlabelled data is often abundant. Graph-based methods offer a useful way to combine both:
- Captures relationships between data points
- Uses unlabelled data to improve accuracy
- Reduces reliance on large labelled datasets
How It Works
In Graph-Based Semi-Supervised Learning, the dataset is first represented as a graph where each data point becomes a node. Nodes that are similar to each other are connected through edges. These connections allow the model to use the structure of the data to spread label information from a small set of labelled nodes to many unlabelled ones.
The learning process mainly happens in two stages:
1. Graph Construction
In this step, the graph structure is created to represent relationships among data points.
- Each data point is represented as a node in the graph.
- Edges are added between nodes that are similar to each other.
- Similarity can be measured using distance metrics, k-nearest neighbours or other similarity functions.
- Edge weights indicate how strong the relationship between two nodes is.
- Stronger connections mean higher influence during label propagation.
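The construction steps above can be sketched with scikit-learn's kneighbors_graph. The five sample points and the Gaussian bandwidth sigma below are hypothetical choices for illustration, not part of any fixed recipe:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Hypothetical 2-D points: three near the origin, two near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9]])

# k-NN graph storing raw Euclidean distances on the edges
D = kneighbors_graph(X, n_neighbors=2, mode='distance', include_self=False)

# Turn distances into similarity weights with a Gaussian (RBF) kernel:
# near-identical points get weights close to 1, distant ones close to 0
sigma = 1.0
W = D.copy()
W.data = np.exp(-W.data ** 2 / (2 * sigma ** 2))

print(W.toarray().round(2))
```

Here stronger edges (weights near 1) will dominate during propagation, matching the intuition that close neighbours should influence each other most.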

2. Label Propagation
Once the graph is built, labels are spread across it using the connections.
- The process starts with a small number of labelled nodes.
- Labels of these nodes are kept fixed throughout learning.
- Unlabelled nodes receive label information from their neighbouring nodes.
- Nodes with stronger connections have a greater influence on label assignment.
- This process is repeated iteratively until the labels stabilize and no longer change.
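The iterative procedure above can be sketched from scratch with NumPy. The tiny 4-node adjacency matrix and the two clamped labels below are made up for illustration:

```python
import numpy as np

# Hypothetical graph: nodes 0-1 strongly connected, nodes 2-3 strongly
# connected, and a weak bridge between 1 and 2. Node 0 is labelled
# class 0 and node 3 is labelled class 1; the rest are unlabelled.
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
labelled = {0: 0, 3: 1}   # node -> class
n_classes = 2

# Label matrix F: labelled rows are one-hot, unlabelled rows start uniform
F = np.full((4, n_classes), 0.5)
for node, cls in labelled.items():
    F[node] = np.eye(n_classes)[cls]

# Row-normalised transition matrix: each node averages its neighbours
P = W / W.sum(axis=1, keepdims=True)

for _ in range(100):
    F = P @ F
    for node, cls in labelled.items():   # keep labelled nodes fixed (clamped)
        F[node] = np.eye(n_classes)[cls]

pred = F.argmax(axis=1)
print(pred)   # [0 0 1 1]
```

Node 1 ends up class 0 and node 2 class 1 because each is pulled far more strongly by its heavy edge than by the weak bridge between them.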

Types of Graph-Based Semi-Supervised Learning Methods
Graph-based semi-supervised learning methods fall into a few broad families:
Graph Regularization Methods
These methods enforce smoothness over the graph.
- Assume connected or similar nodes should have similar labels.
- Use smoothness constraints to prevent sudden label changes across neighbours.
- Often based on graph Laplacian or energy minimization.
- Common examples include label propagation and label spreading.
- They are simple, effective and work well when data naturally forms clusters.
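The smoothness idea can be made concrete with the graph Laplacian L = D - W: the quadratic form f^T L f sums w_ij (f_i - f_j)^2 over the edges, so label assignments that change across strong edges incur a high penalty. A minimal sketch on a hypothetical 4-node chain graph:

```python
import numpy as np

# Adjacency matrix for a chain graph: 0-1-2-3, all edge weights 1
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalised graph Laplacian

# f^T L f equals the sum of w_ij * (f_i - f_j)^2 over edges (each edge once)
smooth = np.array([0.0, 0.0, 1.0, 1.0])   # labels change across only one edge
rough  = np.array([0.0, 1.0, 0.0, 1.0])   # labels flip across every edge

print(smooth @ L @ smooth)   # 1.0 -- low energy, preferred
print(rough @ L @ rough)     # 3.0 -- high energy, penalised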
Graph Embedding Methods
These methods convert nodes into vector representations.
- Learn embeddings that capture graph structure and similarity.
- Nodes with similar embeddings are likely to share labels.
- Embeddings can be used with traditional classifiers or deep learning models.
- They help reduce dimensionality.
- They make graph data easier to use in downstream tasks.
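As a small sketch of this family, scikit-learn's SpectralEmbedding learns vector representations from the leading eigenvectors of the Laplacian of an affinity graph built over the data. The two synthetic clusters below are made up for illustration:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

# Two hypothetical clusters of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2))])

# Build an RBF affinity graph internally and embed each node using
# eigenvectors of the graph Laplacian
emb = SpectralEmbedding(n_components=2, affinity='rbf', random_state=0)
Z = emb.fit_transform(X)

# Each point now has a 2-D embedding a standard classifier can consume
print(Z.shape)   # (20, 2)
```

Nodes that sit in the same region of the graph land near each other in embedding space, so any conventional classifier trained on the few labelled embeddings can label the rest.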
Implementation
Let's implement Graph-Based Semi-Supervised Learning in Python:
Step 1: Import Required Libraries
We will import the necessary libraries: NumPy and Matplotlib for data handling and plotting, scikit-learn for LabelPropagation and kneighbors_graph, and NetworkX for graph visualization.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.semi_supervised import LabelPropagation
from sklearn.neighbors import kneighbors_graph
import networkx as nx
Step 2: Create Sample Data
This block defines the input data and labels. The data forms two clusters:
- One near (2, 2): class 0
- Another near (8, 8): class 1
- X holds the feature data (each row is a data point)
- y holds the labels, where -1 marks an unlabelled sample
X = np.array([
    [1, 2],
    [1, 3],
    [2, 2],
    [2, 3],
    [3, 2],
    [4, 4],
    [5, 5],
    [6, 6],
    [8, 8],
    [8, 9],
    [9, 8],
    [9, 9]
])
y = np.array([
    0, -1, 0, -1, -1,
    -1, -1, -1,
    1, -1, 1, -1
])
Step 3: Construct Graph Using Nearest Neighbours
This block converts the dataset into a graph.
- Each data point becomes a node.
- Edges connect points to their 3 nearest neighbours.
- The adjacency matrix (A) stores connectivity information.
- NetworkX converts this into a visual graph structure.
A = kneighbors_graph(X, n_neighbors=3, mode='connectivity', include_self=False)
G = nx.from_scipy_sparse_array(A)
pos = {i: X[i] for i in range(len(X))}
Step 4: Visualize Graph Before Label Propagation
This visualization shows the initial state of the graph.
- Red nodes: class 0
- Blue nodes: class 1
- Gray nodes: unlabelled
plt.figure(figsize=(6, 6))
before_colors = []
for label in y:
    if label == 0:
        before_colors.append("red")
    elif label == 1:
        before_colors.append("blue")
    else:
        before_colors.append("gray")
nx.draw(G, pos, node_color=before_colors, with_labels=True, node_size=500)
plt.title("Before Label Propagation (Gray = Unlabeled)")
plt.show()
Output: a plot of the 3-nearest-neighbour graph, with red nodes (class 0), blue nodes (class 1) and gray nodes (unlabelled).
Step 5: Apply Label Propagation Model
This block performs semi-supervised learning.
- The algorithm uses graph connections to spread labels.
- Labelled nodes act as sources of information.
- Nearby unlabelled nodes receive labels based on similarity.
model = LabelPropagation(kernel='knn', n_neighbors=3)
model.fit(X, y)
predicted_labels = model.transduction_
print("Predicted labels:", predicted_labels)
Output: Predicted labels: [0 0 0 0 0 0 0 0 1 1 1 1]
Step 6: Visualize Graph After Label Propagation
This shows the final result after learning.
- Previously gray nodes now receive predicted labels.
- The graph visually demonstrates how labels spread across connected regions.
- Comparing before vs after makes the learning process intuitive.
plt.figure(figsize=(6,6))
after_colors = ["red" if label == 0 else "blue" for label in predicted_labels]
nx.draw(G, pos, node_color=after_colors, with_labels=True, node_size=500)
plt.title("After Label Propagation")
plt.show()
Output: the same graph after propagation, with every node coloured red (class 0) or blue (class 1).
Applications
Graph-based semi-supervised learning is widely used in areas where data points are naturally related or connected, such as:
- Text Classification: Documents are linked based on similarity, allowing labels to spread to unlabelled text such as articles or emails.
- Image Recognition: Similar images are connected in a graph, helping label images even when only a few are manually annotated.
- Social Network Analysis: Users are treated as nodes and relationships help predict interests, communities or behaviours.
- Bioinformatics: Genes or proteins are connected based on biological similarity to assist in function prediction.
- Recommendation Systems: Users and items are linked through interactions, enabling better recommendations using limited labelled data.
Limitations
Graph-Based Semi-Supervised Learning still has some limitations:
- Graph Construction: Creating graphs that correctly capture real relationships between data points is still difficult.
- Label Propagation Accuracy: Making label spreading more stable and reliable, especially with noisy data, needs more work.
- Integration with Deep Learning: Combining graph-based methods with neural networks is an active research area.
- Scalability and New Domains: Applying GSSL to very large datasets and new fields remains challenging.