Scikit-learn (also written as sklearn) is a popular open-source Python library for machine learning. Built on top of NumPy, SciPy and Matplotlib, it provides efficient tools for data analysis and predictive modeling. It offers simple and reusable functions to build models for tasks such as classification, regression, clustering and dimensionality reduction.
Key Features of Scikit-learn
Below are several key features of Scikit-learn that make data preparation, modeling and evaluation simple and efficient.
1. Data Preprocessing: Preparing data is an important step in any machine learning project. Scikit-learn simplifies this process with built-in tools for:
- Data Splitting: Divide data into training and testing sets.
- Feature Scaling: Normalize or standardize feature values.
- Feature Selection: Choose the most relevant features.
- Feature Extraction: Create new features from existing data.
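As a quick illustration of the first two tools (a minimal sketch using the built-in Iris dataset), data splitting and feature scaling look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize features to zero mean and unit variance,
# fitting the scaler on the training data only to avoid leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training shape:", X_train_scaled.shape)
```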
2. Model Evaluation: Tools to measure how well a trained model performs on unseen data.
- Metrics: Evaluate model performance (accuracy, precision, recall and F1-score).
- Model Selection: Tools for selecting the best model hyperparameters through techniques like grid search and randomized search.
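For example, a grid search over a small set of SVM hyperparameters (the grid values here are arbitrary choices for illustration) can be run with GridSearchCV:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Evaluate every combination in the grid with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", round(grid.best_score_, 3))
```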
3. Pipeline Support: Combine preprocessing and modeling steps efficiently.
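A pipeline chains preprocessing and a model into one object, so a single fit or predict call runs every step in order. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and classification combined into one estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)  # fits the scaler, then the classifier
print("Test accuracy:", round(pipe.score(X_test, y_test), 3))
```

Because the pipeline is itself an estimator, it can be passed directly to tools like GridSearchCV or cross-validation.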
4. Integration: Works seamlessly with Python libraries like NumPy, Pandas and Matplotlib.
5. Ease of Use: Simple, consistent and user-friendly API for all tasks.
Installing and Importing Scikit-learn
To install Scikit-learn, use Python's package manager pip with the following command:
pip install scikit-learn
Once installed, import Scikit-learn modules into a Python script or environment using the import statement. In practice, you usually import the specific modules or classes you need. For example:
import sklearn
from sklearn.linear_model import LogisticRegression
Machine Learning Techniques Supported by Scikit-learn
1. Supervised Learning
Supervised learning involves training models using labeled data, where the correct output is already known.
- Classification: Scikit-learn provides multiple algorithms to predict categorical outcomes, such as logistic regression, decision trees, random forests, support vector machines (SVMs) and gradient boosting.
- Regression: Regression models are used to predict continuous numerical values. Scikit-learn supports linear regression, support vector regression and decision tree regression.
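As a short regression sketch (using a synthetic dataset generated with make_regression, chosen here purely for illustration), fitting and evaluating a linear model looks like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: 200 samples, 3 features, with added noise
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Mean squared error:", round(mean_squared_error(y_test, y_pred), 2))
```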
Example: Logistic Regression Algorithm
Logistic Regression is a supervised machine learning algorithm used to predict categories (yes/no, spam/not spam, disease/no disease). It works by estimating the probability of each outcome and is simple, easy to interpret and effective when the classes are roughly linearly separable.
This example uses Logistic Regression to classify flowers in the Iris dataset and check how accurately the model predicts their types.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features, fitting the scaler on the training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model and evaluate it on the held-out test set
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Output
Accuracy: 1.0
2. Unsupervised Learning
Unsupervised learning works with unlabeled data to discover patterns and structure.
- Clustering: Scikit-learn offers clustering techniques to group similar data points, including K-means clustering, DBSCAN and hierarchical clustering.
- Dimensionality Reduction: To handle high-dimensional data efficiently, Scikit-learn provides techniques like principal component analysis (PCA).
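As a brief sketch of dimensionality reduction, PCA can project the 4-dimensional Iris data onto its first two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep only the 2 directions of greatest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio shows how much of the original variation each component retains, which helps decide how many components to keep.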
Example: KMeans Algorithm
KMeans groups data into k clusters based on similarity. It is an unsupervised learning algorithm, ideal for tasks like customer segmentation, image compression and anomaly detection, especially when the underlying data structure is unknown.
This program demonstrates how to use KMeans clustering from Scikit-learn to group the Iris dataset into three clusters based on feature similarity.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load the Iris data and group it into 3 clusters
iris = load_iris()
kmeans = KMeans(n_clusters=3)
kmeans.fit(iris.data)

# Each sample is assigned a cluster number (0, 1 or 2)
cluster_labels = kmeans.labels_
print("Cluster Labels:", cluster_labels)
Output (cluster numbering is arbitrary and may vary between runs):
Cluster Labels: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
2 0]
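Because the cluster numbers are arbitrary, comparing them directly to the true species labels is misleading. A label-invariant metric such as the adjusted Rand index (ARI) is one way to check clustering quality against known labels (this sketch fixes random_state and n_init only to make the run reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(iris.data)

# ARI ignores how clusters are numbered:
# 1.0 means a perfect match with the true species, ~0.0 means random
ari = adjusted_rand_score(iris.target, labels)
print("Adjusted Rand Index:", round(ari, 3))
```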
Advantages
- Simple and user-friendly interface for machine learning tasks.
- Offers a wide range of algorithms for various tasks like classification, regression, clustering and more.
- Provides tools for data preprocessing, including scaling, normalization and handling missing values.
- Offers metrics for evaluating model performance and techniques like cross-validation for robust assessment.
- Integrates well with other Python libraries like NumPy, Pandas and Matplotlib.