Author: saqibkhan

  • K-Means Clustering Algorithm

    K-Means Clustering Algorithm

    The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known, and it is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by ‘K’ in K-means.

    In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and the centroids is minimized. Less variation within a cluster means the data points within that cluster are more similar to each other.

    Working of K-Means Algorithm

    We can understand the working of K-Means clustering algorithm with the help of following steps −

    • Step 1 − First, we need to specify the number of clusters, K, to be generated by this algorithm.
    • Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple words, classify the data based on the number of data points.
    • Step 3 − Now it will compute the cluster centroids.
    • Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of data points to clusters no longer changes −
      • Step 4.1 − First, the sum of squared distances between the data points and the centroids is computed.
      • Step 4.2 − Now, assign each data point to the cluster whose centroid is closer to it than any other centroid.
      • Step 4.3 − At last, compute the centroids of the clusters by taking the average of all data points in each cluster.

    K-means follows the Expectation-Maximization approach to solve the problem. The Expectation step assigns the data points to the closest cluster, and the Maximization step computes the centroid of each cluster.
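
    To make the Expectation and Maximization steps concrete, below is a minimal from-scratch sketch of the K-means loop using NumPy. The array X, the number of clusters k, and the iteration limit are illustrative assumptions rather than part of the original example.

    import numpy as np
    
    def kmeans(X, k, n_iters=100, seed=0):
       rng = np.random.default_rng(seed)
       # Initialize centroids by picking k random data points
       centroids = X[rng.choice(len(X), size=k, replace=False)]
       for _ in range(n_iters):
          # Expectation step: assign each point to its nearest centroid
          distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = distances.argmin(axis=1)
          # Maximization step: move each centroid to the mean of its assigned points
          # (empty clusters are not handled in this minimal sketch)
          new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
          if np.allclose(new_centroids, centroids):
             break
          centroids = new_centroids
       return labels, centroids
    
    # Example usage on random 2D data
    X = np.random.rand(300, 2)
    labels, centroids = kmeans(X, k=3)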

    While working with K-means algorithm we need to take care of the following things −

    • While working with clustering algorithms including K-Means, it is recommended to standardize the data because such algorithms use distance-based measurement to determine the similarity between data points.
    • Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to use different centroid initializations, as sketched below.
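
    The following sketch illustrates both recommendations with scikit-learn: standardizing the features before clustering and running K-means with several random initializations (the n_init parameter). The data array X is an illustrative placeholder.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    
    X = np.random.rand(300, 2)   # placeholder data
    
    # Standardize features so distances are not dominated by any single feature
    X_scaled = StandardScaler().fit_transform(X)
    
    # n_init=10 runs K-means with 10 different centroid initializations and
    # keeps the run with the lowest within-cluster sum of squares (inertia)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    print(kmeans.inertia_)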

    The K-Means algorithm is a straightforward and efficient algorithm, and it can handle large datasets. However, it has some limitations, such as its sensitivity to the initial centroids, its tendency to converge to local optima, and its assumption of equal variance for all clusters.

    Objective of K-means Clustering

    The main goals of cluster analysis are −

    • To get a meaningful intuition from the data we are working with.
    • Cluster-then-predict where different models will be built for different subgroups.

    Implementation of K-Means Algorithm Using Python

    Python has several libraries that provide implementations of various machine learning algorithms, including K-Means clustering. Let’s see how to implement the K-Means algorithm in Python using the scikit-learn library.

    Example 1

    This is a simple example to understand how K-means works. In this example, we generate 300 random data points with two features and apply the K-means algorithm to group them into clusters.

    Step 1 − Import Required Libraries

    To implement the K-Means algorithm in Python, we first need to import the required libraries. We will use the numpy and matplotlib libraries for data processing and visualization, respectively, and the scikit-learn library for the K-Means algorithm.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    

    Step 2 − Generate Data

    To test the K-Means algorithm, we need to generate some sample data. In this example, we will generate 300 random data points with two features. We will visualize the data also.

    X = np.random.rand(300,2)
    
    plt.figure(figsize=(7.5,3.5))
    plt.scatter(X[:,0], X[:,1], s=20, cmap='summer');
    plt.show()

    Output

    K-Means Clustering

    Step 3 − Initialize K-Means

    Next, we need to initialize the K-Means algorithm by specifying the number of clusters (K) and the maximum number of iterations.

    kmeans = KMeans(n_clusters=3, max_iter=100)

    Step 4 − Train the Model

    After initializing the K-Means algorithm, we can train the model by fitting the data to the algorithm.

    kmeans.fit(X)

    Step 5 − Visualize the Clusters

    To visualize the clusters, we can plot the data points and color them based on their assigned cluster.

    plt.figure(figsize=(7.5,3.5))
    plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, s=20, cmap='summer')
    plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],
    marker='x', c='r', s=50, alpha=0.9)
    plt.show()

    Output

    The output of the above code will be a plot with the data points colored based on their assigned cluster, and the centroids marked with an ‘x’ symbol in red color.

    K-Means Clustering Plot

    Example 2

    In this example, we are going to first generate a 2D dataset containing 4 different blobs and then apply the K-means algorithm to see the result.

    First, we will start by importing the necessary packages −

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans
    

    The following code will generate the 2D dataset containing four blobs −

    from sklearn.datasets import make_blobs
    X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

    Next, the following code will help us to visualize the dataset −

    plt.scatter(X[:,0], X[:,1], s=20);
    plt.show()
    Visualizing 2D Blobs

    Next, make an object of KMeans along with providing number of clusters, train the model and do the prediction as follows −

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

    Now, with the help of following code we can plot and visualize the cluster’s centers picked by k-means Python estimator −

    plt.scatter(X[:,0], X[:,1], c=y_kmeans, s=20, cmap='summer')
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:,0], centers[:,1], c='blue', s=100, alpha=0.9);
    plt.show()
    Visualizing Cluster Centers

    Example 3

    Let us move to another example in which we are going to apply K-means clustering on simple digits dataset. K-means will try to identify similar digits without using the original label information.

    First, we will start by importing the necessary packages −

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans
    

    Next, load the digits dataset from sklearn and make an object of it. We can also find the number of rows and columns in this dataset as follows −

    from sklearn.datasets import load_digits
    digits = load_digits()
    digits.data.shape
    

    Output

    (1797, 64)
    

    The above output shows that this dataset has 1797 samples with 64 features.

    We can perform the clustering as we did in Example 1 above −

    kmeans = KMeans(n_clusters=10, random_state=0)
    clusters = kmeans.fit_predict(digits.data)
    kmeans.cluster_centers_.shape
    

    Output

    (10, 64)
    

    The above output shows that K-means created 10 cluster centers, each with 64 features (one per pixel of the 8×8 digit images).

    fig, ax = plt.subplots(2, 5, figsize=(8, 3))
    centers = kmeans.cluster_centers_.reshape(10, 8, 8)
    for axi, center in zip(ax.flat, centers):
       axi.set(xticks=[], yticks=[])
       axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

    Output

    As output, we will get the following image showing the cluster centers learned by K-means.

    Visualizing Digits Clusters Centers

    The following lines of code will match each learned cluster label with the most common true label found in that cluster −

    from scipy.stats import mode
    labels = np.zeros_like(clusters)
    for i in range(10):
       mask = (clusters == i)
       labels[mask] = mode(digits.target[mask])[0]

    Next, we can check the accuracy as follows −

    from sklearn.metrics import accuracy_score
    accuracy_score(digits.target, labels)

    Output

    0.7935447968836951
    

    The above output shows that the accuracy is around 80%.

    Advantages of K-Means Clustering Algorithm

    The following are some advantages of K-Means clustering algorithms −

    • It is very easy to understand and implement.
    • If we have a large number of variables, then K-means is faster than hierarchical clustering.
    • On re-computation of centroids, an instance can change its cluster.
    • Tighter clusters are formed with K-means as compared to hierarchical clustering.

    Disadvantages of K-Means Clustering Algorithm

    The following are some disadvantages of K-Means clustering algorithms −

    • It is a bit difficult to predict the number of clusters i.e. the value of k.
    • Output is strongly impacted by initial inputs like number of clusters (value of k).
    • Order of data will have strong impact on the final output.
    • It is very sensitive to rescaling. If we rescale our data by means of normalization or standardization, then the final output will completely change.
    • It is not good in doing clustering job if the clusters have a complicated geometric shape.

    Applications of K-Means Clustering

    K-Means clustering is a versatile algorithm with various applications in several fields. Here we have highlighted some of the important applications −

    Image Segmentation

    K-Means clustering can be used to segment an image into different regions based on the color or texture of the pixels. This technique is widely used in computer vision applications, such as object recognition, image retrieval, and medical imaging.

    Customer Segmentation

    K-Means clustering can be used to segment customers into different groups based on their purchasing behavior or demographic characteristics. This technique is widely used in marketing applications, such as customer retention, loyalty programs, and targeted advertising.

    Anomaly Detection

    K-Means clustering can be used to detect anomalies in a dataset by identifying data points that do not belong to any cluster. This technique is widely used in fraud detection, network intrusion detection, and predictive maintenance.

    Genomic Data Analysis

    K-Means clustering can be used to analyze gene expression data to identify different groups of genes that are co-regulated or co-expressed. This technique is widely used in bioinformatics applications, such as drug discovery, disease diagnosis, and personalized medicine.

  • Centroid-Based Clustering

    Centroid-based clustering is a class of machine learning algorithms that aims to partition a dataset into groups or clusters based on the proximity of data points to the centroid of each cluster.

    The centroid of a cluster is the arithmetic mean of all the data points in that cluster and serves as a representative point for that cluster.

    The two most popular centroid-based clustering algorithms are −

    K-means Clustering

    K-Means clustering is a popular unsupervised machine learning algorithm used for clustering data. It is a simple and efficient algorithm that can group data points into K clusters based on their similarity. The algorithm works by first randomly selecting K centroids, which are the initial centers of each cluster. Each data point is then assigned to the cluster whose centroid is closest to it. The centroids are then updated by taking the mean of all the data points in the cluster. This process is repeated until the centroids no longer move or the maximum number of iterations is reached.

    K-Medoids Clustering

    K-medoids clustering is a partition-based clustering algorithm that is used to cluster a set of data points into “k” clusters. Unlike K-means clustering, which uses the mean value of the data points to represent the center of the cluster, K-medoids clustering uses a representative data point, called a medoid, to represent the center of the cluster. The medoid is the data point that minimizes the sum of the distances between it and all the other data points in the cluster. This makes K-medoids clustering more robust to outliers and noise than K-means clustering.

    We will discuss these two clustering methods in the next two chapters.

  • Clustering Algorithms

    Clustering Algorithms are one of the most useful unsupervised machine learning methods. These methods are used to find similarity as well as relationship patterns among data samples and then cluster those samples into groups that are similar based on their features.

    Clustering is important because it determines the intrinsic grouping among unlabeled data. Clustering algorithms make some assumptions about data points to define their similarity, and each assumption constructs different but equally valid clusters.

    For example, below is a diagram which shows how a clustering system groups similar kinds of data into different clusters −

    clustering system grouped

    Cluster Formation Methods

    It is not necessary that clusters will be formed in a spherical shape. The following are some other cluster formation methods −

    Density-based

    In these methods, the clusters are formed as dense regions of data points. The advantage of these methods is that they have good accuracy as well as a good ability to merge two clusters. Examples: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS), etc.

    Hierarchical-based

    In these methods, the clusters are formed as a tree-type structure based on hierarchy. They have two categories, namely Agglomerative (bottom-up approach) and Divisive (top-down approach). Examples: Clustering Using Representatives (CURE), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), etc.

    Partitioning

    In these methods, the clusters are formed by partitioning the objects into k clusters. The number of clusters is equal to the number of partitions. Examples: K-means, Clustering Large Applications based upon RANdomized Search (CLARANS).

    Grid

    In these methods, the clusters are formed as a grid-like structure. The advantage of these methods is that all the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STatistical INformation Grid (STING), CLustering In QUEst (CLIQUE).

    Clustering Algorithms in Machine Learning

    The following are the most important and useful machine learning clustering algorithms −

    K-Means Clustering

    The K-Means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known, and it is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by ‘K’ in K-means.

    K-Medoids Clustering

    K-Medoids clustering is an improved version of the K-means clustering algorithm. It works as follows (a minimal sketch is shown after the list) −

    • Select k random data points from the dataset as the initial medoids.
    • Assign each data point to the nearest medoid.
    • For each cluster, select the data point that minimizes the sum of distances to all the other data points in the cluster, and set it as the new medoid.
    • Repeat steps 2 and 3 until convergence or a maximum number of iterations is reached.
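
    A minimal NumPy sketch of these steps is shown below. It assumes Euclidean distances and a small toy array X; production code would typically use a dedicated implementation such as the one in the scikit-learn-extra package.

    import numpy as np
    
    def k_medoids(X, k, n_iters=100, seed=0):
       rng = np.random.default_rng(seed)
       # Step 1: pick k random data points as the initial medoids
       medoid_idx = rng.choice(len(X), size=k, replace=False)
       for _ in range(n_iters):
          # Step 2: assign each point to the nearest medoid
          dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Step 3: within each cluster, pick the point that minimizes the
          # sum of distances to all other points in that cluster
          new_idx = medoid_idx.copy()
          for j in range(k):
             members = np.where(labels == j)[0]
             if len(members) == 0:
                continue
             intra = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
             new_idx[j] = members[intra.sum(axis=1).argmin()]
          # Step 4: stop when the medoids no longer change
          if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
             break
          medoid_idx = new_idx
       return labels, X[medoid_idx]
    
    X = np.random.rand(100, 2)
    labels, medoids = k_medoids(X, k=3)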

    Mean-Shift Clustering

    Mean-Shift clustering is another powerful clustering algorithm used in unsupervised learning. Unlike K-means clustering, it does not make any assumptions about the number or shape of clusters; hence it is a non-parametric algorithm.

    DBSCAN Clustering

    The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is one of the most common density-based clustering algorithms. The DBSCAN algorithm requires two parameters: the minimum number of points needed to form a dense region (minPts) and the maximum distance between two points for them to be considered neighbors (eps).
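
    As an illustration, the snippet below runs DBSCAN from scikit-learn on blob data like that used earlier; the eps and min_samples values are illustrative choices, not prescribed ones.

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    
    X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
    
    # eps is the neighborhood radius, min_samples the density threshold
    db = DBSCAN(eps=0.5, min_samples=5).fit(X)
    labels = db.labels_    # -1 marks points treated as noise
    print(set(labels))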

    OPTICS Clustering

    OPTICS (Ordering Points to Identify the Clustering Structure) is like DBSCAN, another popular density-based clustering algorithm. However, OPTICS has several advantages over DBSCAN, including the ability to identify clusters of varying densities, the ability to handle noise, and the ability to produce a hierarchical clustering structure.

    HDBSCAN Clustering

    HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that is based on density clustering. It is a newer algorithm that builds upon the popular DBSCAN algorithm and offers several advantages over it, such as better handling of clusters of varying densities and the ability to detect clusters of different shapes and sizes.

    BIRCH algorithm

    BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical clustering algorithm that is designed to handle large datasets efficiently. The algorithm builds a tree-like structure of clusters by recursively partitioning the data into subclusters until a stopping criterion is met.

    Affinity Propagation Clustering

    Affinity Propagation is a clustering algorithm that identifies “exemplars” in a dataset and assigns each data point to one of these exemplars. It is a type of clustering algorithm that does not require a pre-specified number of clusters, making it a useful tool for exploratory data analysis. Affinity Propagation was introduced by Frey and Dueck in 2007 and has since been widely used in many fields such as biology, computer vision, and social network analysis.

    Agglomerative Clustering

    Agglomerative clustering is a hierarchical clustering algorithm that starts with each data point as its own cluster and iteratively merges the closest clusters until a stopping criterion is reached. It is a bottom-up approach that produces a dendrogram, which is a tree-like diagram that shows the hierarchical relationship between the clusters. The algorithm can be implemented using the scikit-learn library in Python.

    Gaussian Mixture Model

    Gaussian Mixture Models (GMM) is a popular clustering algorithm used in machine learning that assumes that the data is generated from a mixture of Gaussian distributions. In other words, GMM tries to fit a set of Gaussian distributions to the data, where each Gaussian distribution represents a cluster in the data.
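
    As a quick illustration, scikit-learn provides GaussianMixture; the sketch below fits a mixture with four components to blob data like that generated earlier in this chapter (the number of components is an illustrative choice).

    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs
    
    X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
    
    # Fit a mixture of 4 Gaussians and assign each point to its most likely component
    gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
    labels = gmm.predict(X)
    print(gmm.means_)    # estimated cluster means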

    Measuring Clustering Performance

    One of the most important considerations regarding an ML model is assessing its performance, or you can say the model's quality. In the case of supervised learning algorithms, assessing the quality of our model is easy because we already have labels for every example.

    On the other hand, in the case of unsupervised learning algorithms we are not so fortunate because we deal with unlabeled data. But we still have some metrics that give the practitioner insight into how the clusters change depending on the algorithm.

    Before we dive deep into such metrics, we must understand that these metrics only evaluate the comparative performance of models against each other rather than measuring the validity of a model's predictions. The following are some of the metrics that we can apply to clustering algorithms to measure the quality of the model −

    1. Silhouette Analysis
    2. Davies-Bouldin Index
    3. Dunn Index

    1. Silhouette Analysis

    Silhouette analysis is used to check the quality of a clustering model by measuring the distance between the clusters. It basically provides us a way to assess parameters like the number of clusters with the help of the Silhouette score. This score measures how close each point in one cluster is to points in the neighboring clusters.

    Analysis of Silhouette Score

    The range of Silhouette score is [-1, 1]. Its analysis is as follows −

    • +1 Score − A Silhouette score near +1 indicates that the sample is far away from its neighboring cluster.
    • 0 Score − A Silhouette score of 0 indicates that the sample is on or very close to the decision boundary separating two neighboring clusters.
    • -1 Score − A Silhouette score near -1 indicates that the samples have been assigned to the wrong clusters.

    The calculation of the Silhouette score can be done by using the following formula −

    $$S = \frac{b - a}{\max(a, b)}$$

    Here, b = mean distance to the points in the nearest cluster

    And, a = mean intra-cluster distance to all the points.
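
    In practice the score can be computed with scikit-learn's silhouette_score function. The sketch below applies it to illustrative blob data and K-means labels; the choice of three clusters is an assumption for the example.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    
    X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
    labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
    
    # Mean silhouette coefficient over all samples, in the range [-1, 1]
    print(silhouette_score(X, labels))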

    2. Davies-Bouldin Index

    The Davies-Bouldin (DB) index is another good metric for analyzing clustering algorithms. With the help of the DB index, we can understand the following points about a clustering model −

    • Whether the clusters are well-spaced from each other or not.
    • How dense the clusters are.

    We can calculate the DB index with the help of the following formula −

    $$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$$

    Here, n = number of clusters

    σi = average distance of all points in cluster i from the cluster centroid ci

    d(ci, cj) = distance between the centroids of clusters i and j

    The lower the DB index, the better the clustering model is.
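
    scikit-learn also provides this metric directly via davies_bouldin_score, as sketched below on the same kind of illustrative blob data.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score
    
    X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
    labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)
    
    # Lower values indicate better-separated, denser clusters
    print(davies_bouldin_score(X, labels))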

    3. Dunn Index

    It works in the same way as the DB index, but the two differ in the following points −

    • The Dunn index considers only the worst case, i.e. the clusters that are closest together, while the DB index considers the dispersion and separation of all the clusters in the clustering model.
    • The Dunn index increases as the performance increases, while the DB index decreases (gets better) when clusters are well-spaced and dense.

    We can calculate the Dunn index with the help of the following formula −

    $$D = \frac{\min_{1 \leq i < j \leq n} p(i, j)}{\max_{1 \leq k \leq n} q(k)}$$

    Here, i, j, k = indices of the clusters

    p(i, j) = inter-cluster distance between clusters i and j

    q(k) = intra-cluster distance of cluster k
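
    Since scikit-learn does not ship a Dunn index function, here is a minimal NumPy sketch under the common convention that the inter-cluster distance is the minimum pairwise distance between points of two clusters and the intra-cluster distance is a cluster's diameter; other definitions exist.

    import numpy as np
    
    def dunn_index(X, labels):
       clusters = [X[labels == c] for c in np.unique(labels)]
       # Intra-cluster distance q(k): diameter (largest pairwise distance) of each cluster
       diameters = [
          np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2).max()
          for c in clusters
       ]
       # Inter-cluster distance p(i, j): smallest distance between points of two clusters
       min_inter = np.inf
       for i in range(len(clusters)):
          for j in range(i + 1, len(clusters)):
             d = np.linalg.norm(
                clusters[i][:, None, :] - clusters[j][None, :, :], axis=2
             ).min()
             min_inter = min(min_inter, d)
       return min_inter / max(diameters)
    
    # Example usage on illustrative data
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
    print(dunn_index(X, KMeans(n_clusters=3, random_state=0).fit_predict(X)))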

    Applications of Clustering

    We can find clustering useful in the following areas −

    Data summarization and compression − Clustering is widely used in the areas where we require data summarization, compression and reduction as well. The examples are image processing and vector quantization.

    Collaborative systems and customer segmentation − Since clustering can be used to find similar products or same kind of users, it can be used in the area of collaborative systems and customer segmentation.

    Serve as a key intermediate step for other data mining tasks − Cluster analysis can generate a compact summary of data for classification, testing, hypothesis generation; hence, it serves as a key intermediate step for other data mining tasks also.

    Trend detection in dynamic data − Clustering can also be used for trend detection in dynamic data by making various clusters of similar trends.

    Social network analysis − Clustering can be used in social network analysis. The examples are generating sequences in images, videos or audios.

    Biological data analysis − Clustering can also be used to make clusters of images and videos; hence it can successfully be used in biological data analysis.

  • Stochastic Gradient Descent

    Stochastic Gradient Descent (SGD) is a popular optimization technique in machine learning. It iteratively updates the model parameters (weights and bias) using individual training examples instead of the entire dataset. It is a variant of gradient descent, and it is more efficient and faster for large and sparse datasets.

    What is Gradient Descent?

    Gradient Descent is a popular optimization algorithm that is used to minimize the cost function of a machine learning model. It works by iteratively adjusting the model parameters to minimize the difference between the predicted output and the actual output. The algorithm works by calculating the gradient of the cost function with respect to the model parameters and then adjusting the parameters in the opposite direction of the gradient.

    What is Stochastic Gradient Descent (SGD)?

    Stochastic Gradient Descent is a variant of Gradient Descent that updates the parameters using each training example instead of updating them after evaluating the entire dataset. This means that instead of using the entire dataset to calculate the gradient of the cost function, SGD only uses a single training example (or a mini batch). This approach allows the algorithm to converge faster and requires less memory to store the data.

    Stochastic Gradient Descent Algorithm

    Stochastic Gradient Descent works by randomly selecting a single (or a small mini batch) training example from the dataset and using it to update the model parameters. This process is repeated for a fixed number of epochs, or until the model converges to a minimum of the cost function.

    Here’s how the Stochastic Gradient Descent algorithm works −

    • Initialize the model parameters to random values.
    • For each epoch, randomly shuffle the training data.
    • For each training example −
      • Calculate the gradient of the cost function with respect to the model parameters.
      • Update the model parameters in the opposite direction of the gradient.
    • Repeat until convergence

    The parameters or weights update rule for SGD is as follows −

    $$w := w - \alpha \nabla J(w; x_i, y_i)$$

    where,

    • xi − the ith data point of the input data
    • yi − the corresponding target value
    • α − the learning rate
    • J − the loss or cost function
    • ∇J − the gradient of the loss or cost function J w.r.t. w

    Here “:=” denotes the update of a variable in the algorithm.

    The main difference between Stochastic Gradient Descent and regular Gradient Descent is the way that the gradient is calculated and the way that the model parameters are updated. In Stochastic Gradient Descent, the gradient is calculated using a single training example, while in Gradient Descent, the gradient is calculated using the entire dataset.
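
    To make the update rule concrete, here is a minimal from-scratch sketch of SGD for simple linear regression with a squared-error loss; the synthetic data, learning rate, and epoch count are illustrative assumptions.

    import numpy as np
    
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))                      # one input feature
    y = 3.0 * X[:, 0] + 0.5 + rng.normal(0, 0.1, size=200)     # true line plus noise
    
    w, b = 0.0, 0.0      # initialize parameters (here simply zero)
    alpha = 0.1          # learning rate
    
    for epoch in range(20):
       # Shuffle the training data at the start of each epoch
       order = rng.permutation(len(X))
       for i in order:
          pred = w * X[i, 0] + b
          error = pred - y[i]
          # Gradients of the squared error 0.5 * (pred - y)^2 for a single example
          grad_w = error * X[i, 0]
          grad_b = error
          # Update parameters in the opposite direction of the gradient
          w -= alpha * grad_w
          b -= alpha * grad_b
    
    print(w, b)    # should be close to 3.0 and 0.5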

    Implementation of Stochastic Gradient Descent in Python

    Let’s look at an example of how to implement Stochastic Gradient Descent in Python. We will use the scikit-learn library to implement the algorithm on the Iris dataset, which is a popular dataset used for classification tasks. In this example we will predict the Iris flower species using two of its features, namely sepal length and sepal width −

    Example

    # Import required libraries
    import sklearn
    import numpy as np
    from sklearn import datasets
    from sklearn.linear_model import SGDClassifier
    
    # Loading Iris flower dataset
    iris = datasets.load_iris()
    X_data, y_data = iris.data, iris.target
    
    # Dividing the dataset into training and testing datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    
    # Getting the Iris dataset with only the first two attributes
    X, y = X_data[:,:2], y_data
    
    # Split the dataset into a training and a testing set (20 percent)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
       test_size=0.20, random_state=1)
    
    # Standardize the features
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Create the linear model SGDClassifier
    clfmodel_SGD = SGDClassifier(alpha=0.001, max_iter=200)
    
    # Train the classifier using the fit() function
    clfmodel_SGD.fit(X_train, y_train)
    
    # Evaluate the result
    from sklearn import metrics
    y_train_pred = clfmodel_SGD.predict(X_train)
    print("\nThe Accuracy of SGD classifier is:",
       metrics.accuracy_score(y_train, y_train_pred)*100)

    Output

    When you run this code, it will produce the following output −

    The Accuracy of SGD classifier is: 77.5
    

    Applications of Stochastic Gradient Descent

    Stochastic Gradient Descent (SGD) is not a full-fledged machine learning model, but just an optimization technique. It has been successfully applied to different machine learning problems, mainly when the data is sparse. Sparse ML problems are mainly encountered in text classification and natural language processing. This technique is very efficient for sparse data and can scale to problems with more than tens of thousands of examples and more than tens of thousands of features.

    Advantages of SGD

    The following are some advantages of Stochastic Gradient Descent −

    • Efficiency − Processes data in smaller batches, reducing memory requirements.
    • Faster Convergence − Can converge faster than batch gradient descent, especially for large datasets.
    • Escaping Local Minima − The stochastic nature of SGD can help it escape local minima and find better solutions.

    Challenges of Stochastic Gradient Descent

    Stochastic Gradient Descent (SGD) is an efficient optimization algorithm, but it comes with challenges that can affect its effectiveness. The following are some challenges of SGD −

    • Noisy Gradients − The stochastic nature of SGD can lead to noisy gradients, which may slow down convergence.
    • Learning Rate Tuning − Choosing the right learning rate is crucial for effective optimization.
    • Mini-batch Size − The choice of mini-batch size affects the convergence speed and stability of the algorithm.
  • Confusion Matrix

    What is Confusion Matrix?

    The confusion matrix in machine learning is the easiest way to measure the performance of a classification problem where the output can be of two or more type of classes. It is nothing but a table with two dimensions viz. “Actual” and “Predicted” and furthermore, both the dimensions have “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)”, “False Negatives (FN)” as shown below −

    Confusion Matrix

    Take an example of classifying emails as “spam” and “not spam” for a better understanding. Here a spam email is labeled as “positive” and a legitimate (not spam) email is labeled as negative.

    The terms associated with the confusion matrix are explained as follows −

    • True Positives (TP) − It is the case when both actual class & predicted class of data point is 1. The classification model correctly predicts the positive class label for data sample. For example, a “spam” email is classified as “spam”.
    • True Negatives (TN) − It is the case when both actual class & predicted class of data point is 0. The model correctly predicts the negative class label for data sample. For example, a “not spam” email is classified as “not spam”.
    • False Positives (FP) − It is the case when actual class of data point is 0 & predicted class of data point is 1. The model incorrectly predicts the positive class label for data sample. For example, a “not spam” email is misclassified as “spam”. It is known as a Type I error.
    • False Negatives (FN) − It is the case when actual class of data point is 1 & predicted class of data point is 0. The model incorrectly predicts the negative class label for data sample. For example, a “spam” email is misclassified as “not spam”. It is also known as Type II error.

    We use the confusion matrix to find correct and incorrect classifications −

    • Correct classification − TP and TN are correctly classified data points.
    • Incorrect classification − FP and FN are incorrectly classified data points.

    We can use the confusion matrix to calculate different classification metrics such as accuracy, precision, recall, etc. But before discussing these metrics, let’s understand how to create a confusion matrix with the help of a practical example.

    Confusion Matrix Practical Example

    Let’s take a practical example of classifying emails as “spam” or “not spam”. Here we represent the class for a spam email as positive (1) and a not spam email as negative (0). So emails are classified as either −

    • spam (1) − positive class label
    • not spam (0) − negative class label

    The actual and predicted classes/ categories are as follows −

    Actual Classification:    0  1  0  1  1  0  0  1  1  1
    Predicted Classification: 0  1  0  1  0  1  0  0  1  1

    So with the above results, let’s find out whether a particular classification falls under TP, TN, FP or FN. Look at the table below −

    Actual Classification:    0   1   0   1   1   0   0   1   1   1
    Predicted Classification: 0   1   0   1   0   1   0   0   1   1
    Result:                   TN  TP  TN  TP  FN  FP  TN  FN  TP  TP

    In the above table, when we compare the actual classifications to the predicted classifications, we observe four different types of outcomes. First, true positive (1,1), i.e. the actual classification is positive and the predicted classification is also positive. This means the classifier has identified a positive sample correctly. Second, false negative (1,0), i.e. the actual classification is positive and the predicted classification is negative. The classifier has identified a positive sample as negative.

    Third, false positive (0,1), i.e. the actual classification is negative and the predicted classification is positive. The negative sample is incorrectly identified as positive. Fourth, true negative (0,0), i.e. the actual and predicted classifications are both negative. The negative sample is correctly identified by the model as negative.

    Let’s find the total number of samples in each category.

    • TP (True Positive): 4
    • FN (False Negative): 2
    • FP (False Positive): 1
    • TN (True Negative): 3

    Let’s now create the confusion matrix as follows −

                                       Actual Class
                                       Positive (1)      Negative (0)
    Predicted Class     Positive (1)   4 (TP)            1 (FP)
                        Negative (0)   2 (FN)            3 (TN)

    So far we have created the confusion matrix for the above problem. Let’s infer some meaning from the above matrix −

    • Amongst 10 emails, four “spam” emails are correctly classified as “spam” (TP).
    • Amongst 10 emails, two “spam” emails are incorrectly classified as “not spam” (FN).
    • Amongst 10 emails, one “not spam” email is incorrectly classified as “spam” (FP).
    • Amongst 10 emails, three “not spam” emails are correctly classified as “not spam” (TN).
    • So amongst 10 emails, seven emails are correctly classified (TP & TN) and three emails are incorrectly classified (FP & FN).

    Classification Metrics Based on Confusion Matrix

    We can define many classification performance metrics using the confusion matrix. We will consider the above practical example and calculate the metrics using the values in that example. Some of them are as follows −

    • Accuracy
    • Precision
    • Recall or Sensitivity
    • Specificity
    • F1 Score
    • Type I Error Rate
    • Type II Error Rate

    Accuracy

    Accuracy is the most common metric used to evaluate a classification model. It is the ratio of total correct predictions to all predictions made. Mathematically, we can use the following formula to calculate accuracy −

    $$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

    Let’s calculate the accuracy −

    $$\text{Accuracy} = \frac{4 + 3}{4 + 1 + 2 + 3} = \frac{7}{10} = 0.7$$

    Hence the model’s classification accuracy is 70%.

    Precision

    Precision measures the proportion of true positive instances out of all predicted positive instances. It is calculated as ratio of the number of true positive instances and the sum of true positive and false positive instances.

    $$\text{Precision} = \frac{TP}{TP + FP}$$

    Let’s calculate the precision −

    $$\text{Precision} = \frac{4}{4 + 1} = \frac{4}{5} = 0.8$$

    Recall or Sensitivity

    Recall (Sensitivity) is defined as the proportion of actual positive samples that the classifier correctly identifies as positive. We can calculate it with the help of the following formula −

    $$\text{Recall} = \frac{TP}{TP + FN}$$

    Let’s calculate recall −

    $$\text{Recall} = \frac{4}{4 + 2} = \frac{4}{6} = 0.667$$

    Specificity

    Specificity, in contrast to recall, is defined as the proportion of actual negative samples that the classifier correctly identifies as negative. We can calculate it with the help of the following formula −

    $$\text{Specificity} = \frac{TN}{TN + FP}$$

    Let’s calculate the specificity −

    $$\text{Specificity} = \frac{3}{3 + 1} = \frac{3}{4} = 0.75$$

    F1 Score

    F1 score is a balanced measure that takes into account both precision and recall. It is the harmonic mean of precision and recall.

    We can calculate F1 score with the help of following formula −

    $$\text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}$$

    Let’s calculate F1 score −

    $$\text{F1 Score} = \frac{2 \times (0.8 \times 0.667)}{0.8 + 0.667} = 0.727$$

    Hence, F1 score is 0.727.

    Type I Error Rate

    Type I error occurs when the classifier predicts the positive class but the actual class is negative. The Type I error rate is calculated as −

    $$\text{Type I Error Rate} = \frac{FP}{FP + TN}$$

    $$\text{Type I Error Rate} = \frac{1}{1 + 3} = \frac{1}{4} = 0.25$$

    Type II Error Rate

    Type II error occurs when the classifier predicts the negative class but the actual class is positive. The Type II error rate is calculated as −

    $$\text{Type II Error Rate} = \frac{FN}{FN + TP}$$

    $$\text{Type II Error Rate} = \frac{2}{2 + 4} = \frac{2}{6} = 0.333$$
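
    The sketch below reproduces these calculations in Python directly from the TP, FN, FP, and TN counts of the example above, so the formulas can be checked numerically.

    # Counts from the spam / not spam example above
    TP, FN, FP, TN = 4, 2, 1, 3
    
    accuracy    = (TP + TN) / (TP + FP + FN + TN)
    precision   = TP / (TP + FP)
    recall      = TP / (TP + FN)
    specificity = TN / (TN + FP)
    f1_score    = 2 * (precision * recall) / (precision + recall)
    type1_error = FP / (FP + TN)
    type2_error = FN / (FN + TP)
    
    print(accuracy, precision, recall, specificity)   # 0.7, 0.8, ~0.667, 0.75
    print(f1_score, type1_error, type2_error)         # ~0.727, 0.25, ~0.333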

    How to Implement Confusion Matrix in Python?

    To implement the confusion matrix in Python, we can use the confusion_matrix() function from the sklearn.metrics module of the scikit-learn library.

    Note: Please note that the confusion_matrix() function returns a 2D array that corresponds to the following confusion matrix −

                                    Predicted Class
                                    Negative (0)            Positive (1)
    Actual Class     Negative (0)   True Negative (TN)      False Positive (FP)
                     Positive (1)   False Negative (FN)     True Positive (TP)

    Here is a simple example of how to use the confusion_matrix() function −

    from sklearn.metrics import confusion_matrix
    
    # Actual values
    y_actual = [0, 1, 0, 1, 1, 0, 0, 1, 1, 1]
    
    # Predicted values
    y_pred = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
    
    # Confusion matrix
    cm = confusion_matrix(y_actual, y_pred)
    print(cm)

    In this example, we have two arrays: y_actual contains the actual values of the target variable, and y_pred contains the predicted values of the target variable. We then call the confusion_matrix() function, passing in y_actual and y_pred as arguments. The function returns a 2D array that represents the confusion matrix.

    The output of the code above will look like this −

    [[3 1]
     [2 4]]
    

    Compare the above result with the confusion matrix we created above.

    • True Negative (TN): 3
    • False Positive (FP): 1
    • False Negative (FN): 2
    • True Positive (TP): 4

    We can also visualize the confusion matrix using a heatmap. Below is how we can do that using the heatmap() function from the seaborn library

    import seaborn as sns
    
    # Plot confusion matrix as heatmap
    sns.heatmap(cm, annot=True, cmap='summer')

    This will produce a heatmap that shows the confusion matrix −

    heatmap

    In this heatmap, the x-axis represents the predicted values, and the y-axis represents the actual values. The color of each square in the heatmap indicates the number of samples that fall into each category.

  • Random Forest Algorithm

    Random Forest is a machine learning algorithm that uses an ensemble of decision trees to make predictions. The algorithm was first introduced by Leo Breiman in 2001. The key idea behind the algorithm is to create a large number of decision trees, each of which is trained on a different subset of the data. The predictions of these individual trees are then combined to produce a final prediction.

    Working of Random Forest Algorithm

    We can understand the working of Random Forest algorithm with the help of following steps −

    • Step 1 − First, start with the selection of random samples from a given dataset.
    • Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the prediction result from every decision tree.
    • Step 3 − In this step, voting will be performed for every predicted result.
    • Step 4 − At last, select the most voted prediction result as the final prediction result.

    The following diagram illustrates how the Random Forest Algorithm works −

    Random Forest Algorithm

    Random Forest is a flexible algorithm that can be used for both classification and regression tasks. In classification tasks, the algorithm uses the mode of the predictions of the individual trees to make the final prediction. In regression tasks, the algorithm uses the mean of the predictions of the individual trees.

    Advantages of Random Forest Algorithm

    Random Forest algorithm has several advantages over other machine learning algorithms. Some of the key advantages are −

    • Robustness to Overfitting − Random Forest algorithm is known for its robustness to overfitting. This is because the algorithm uses an ensemble of decision trees, which helps to reduce the impact of outliers and noise in the data.
    • High Accuracy − Random Forest algorithm is known for its high accuracy. This is because the algorithm combines the predictions of multiple decision trees, which helps to reduce the impact of individual decision trees that may be biased or inaccurate.
    • Handles Missing Data − Random Forest algorithm can handle missing data without the need for imputation. This is because the algorithm only considers the features that are available for each data point and does not require all features to be present for all data points.
    • Non-Linear Relationships − Random Forest algorithm can handle non-linear relationships between the features and the target variable. This is because the algorithm uses decision trees, which can model non-linear relationships.
    • Feature Importance − Random Forest algorithm can provide information about the importance of each feature in the model. This information can be used to identify the most important features in the data and can be used for feature selection and feature engineering, as illustrated in the sketch after this list.
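
    For example, once a forest is fitted, the per-feature importance scores are exposed through the feature_importances_ attribute; the short sketch below uses the Iris data purely for illustration.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    
    iris = load_iris()
    rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(iris.data, iris.target)
    
    # One importance score per feature, summing to 1
    for name, score in zip(iris.feature_names, rfc.feature_importances_):
       print(name, round(score, 3))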

    Implementation of Random Forest Algorithm in Python

    Let’s take a look at the implementation of Random Forest Algorithm in Python. We will be using the scikit-learn library to implement the algorithm. The scikit-learn library is a popular machine learning library that provides a wide range of algorithms and tools for machine learning.

    Step 1 − Importing the Libraries

    We will begin by importing the necessary libraries. We will be using the pandas library for data manipulation, and the scikit-learn library for implementing the Random Forest algorithm.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    

    Step 2 − Loading the Data

    Next, we will load the data into a pandas dataframe. For this tutorial, we will be using the famous Iris dataset, which is a classic dataset for classification tasks.

    # Loading the iris dataset
    
    iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
    
    iris.columns =['sepal_length','sepal_width','petal_length','petal_width','species']

    Step 3 − Data Preprocessing

    Before we can use the data to train our model, we need to preprocess it. This involves separating the features and the target variable and splitting the data into training and testing sets.

    # Separating the features and target variable
    X = iris.iloc[:,:-1]
    y = iris.iloc[:,-1]
    
    # Splitting the data into training and testing sets
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

    Step 4 − Training the Model

    Next, we will train our Random Forest classifier on the training data.

    # Creating the Random Forest classifier object
    rfc = RandomForestClassifier(n_estimators=100)
    
    # Training the model on the training data
    rfc.fit(X_train, y_train)

    Step 5 − Making Predictions

    Once we have trained our model, we can use it to make predictions on the test data.

    # Making predictions on the test data
    y_pred = rfc.predict(X_test)

    Step 6 − Evaluating the Model

    Finally, we will evaluate the performance of our model using various metrics such as accuracy, precision, recall, and F1-score.

    # Importing the metrics library
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    # Calculating the accuracy, precision, recall, and F1-score
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1-score:", f1)

    Complete Implementation Example

    Below is the complete implementation example of Random Forest Algorithm in python using the iris dataset −

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    # Loading the iris dataset
    iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
    
    iris.columns = ['sepal_length','sepal_width','petal_length','petal_width','species']
    
    # Separating the features and target variable
    X = iris.iloc[:,:-1]
    y = iris.iloc[:,-1]
    
    # Splitting the data into training and testing sets
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)
    
    # Creating the Random Forest classifier object
    rfc = RandomForestClassifier(n_estimators=100)
    
    # Training the model on the training data
    rfc.fit(X_train, y_train)
    
    # Making predictions on the test data
    y_pred = rfc.predict(X_test)
    
    # Importing the metrics library
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    # Calculating the accuracy, precision, recall, and F1-score
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1-score:", f1)

    Output

    This will give us the performance metrics of our Random Forest classifier as follows −

    Accuracy: 0.9811320754716981
    Precision: 0.9821802935010483
    Recall: 0.9811320754716981
    F1-score: 0.9811157396063056
    

    Pros and Cons of Random Forest

    Pros

    The following are the advantages of Random Forest algorithm −

    • It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
    • Random forests work well for a larger range of data items than a single decision tree does.
    • Random forest has less variance than a single decision tree.
    • Random forests are very flexible and possess very high accuracy.
    • Scaling of data is not required in the random forest algorithm. It maintains good accuracy even when the data is provided without scaling.

    Cons

    The following are the disadvantages of Random Forest algorithm −

    • Complexity is the main disadvantage of Random forest algorithms.
    • Construction of random forests is much harder and more time-consuming than that of decision trees.
    • More computational resources are required to implement Random Forest algorithm.
    • It is less intuitive when we have a large collection of decision trees.
    • The prediction process using random forests is very time-consuming in comparison with other algorithms.
  • Support Vector Machine (SVM) 

    What is Support Vector Machine (SVM)

    Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, though generally they are used for classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. They have a unique way of implementation as compared to other machine learning algorithms. Nowadays, they are extremely popular because of their ability to handle multiple continuous and categorical variables.

    Working of SVM

    The goal of SVM is to find a hyperplane that separates the data points into different classes. A hyperplane is a line in 2D space, a plane in 3D space, or a higher-dimensional surface in n-dimensional space. The hyperplane is chosen in such a way that it maximizes the margin, which is the distance between the hyperplane and the closest data points of each class. The closest data points are called the support vectors.

    The distance between the hyperplane and a data point “x” can be calculated using the formula −

    distance = (w · x + b) / ||w||

    where “w” is the weight vector, “b” is the bias term, and “||w||” is the Euclidean norm of the weight vector. The weight vector “w” is perpendicular to the hyperplane and determines its orientation, while the bias term “b” determines its position.
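
    As a tiny numerical illustration of this formula (with an arbitrary weight vector, bias, and point chosen purely for the example):

    import numpy as np
    
    w = np.array([2.0, 1.0])    # weight vector (illustrative)
    b = -1.0                    # bias term (illustrative)
    x = np.array([1.5, 0.5])    # a data point
    
    # Signed distance from x to the hyperplane w . x + b = 0
    distance = (np.dot(w, x) + b) / np.linalg.norm(w)
    print(distance)             # about 1.118 for these values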

    The optimal hyperplane is found by solving an optimization problem, which is to maximize the margin subject to the constraint that all data points are correctly classified. In other words, we want to find the hyperplane that maximizes the margin between the two classes while ensuring that no data point is misclassified. This is a convex optimization problem that can be solved using quadratic programming.

    If the data points are not linearly separable, we can use a technique called the kernel trick to map the data points into a higher-dimensional space where they become separable. The kernel function computes the inner product between the mapped data points without computing the mapping itself. This allows us to work with the data points in the higher-dimensional space without incurring the computational cost of mapping them.

    Let’s understand it in detail with the help of following diagram −

    Working Of Svm

    Given below are the important concepts in SVM −

    • Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
    • Hyperplane − As we can see in the above diagram, it is a decision plane or space which divides a set of objects having different classes.
    • Margin − It may be defined as the gap between two lines at the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

    Implementing SVM Using Python

    For implementing SVM in Python we will start with the standard libraries import as follows −

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    import seaborn as sns; sns.set()

    Next, we create a sample dataset having linearly separable data, using make_blobs from sklearn.datasets, for classification using SVM −

    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer');

    The following would be the output after generating sample dataset having 100 samples and 2 clusters −

    SVM Plotting blobs of datapoints

    We know that SVM supports discriminative classification. It divides the classes from each other by finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is implemented on the above dataset as follows −

    xfit = np.linspace(-1, 3.5)
    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer')
    plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)
    
    for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
       plt.plot(xfit, m * xfit + b, '-k')
    
    plt.xlim(-1, 3.5);

    The output is as follows −

    SVM plotting line/ hyperplane

    We can see from the above output that there are three different separators that perfectly discriminate the above samples.

    As discussed, the main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH) hence rather than drawing a zero line between classes we can draw around each line a margin of some width up to the nearest point. It can be done as follows −

    xfit = np.linspace(-1, 3.5)
    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer')
    
    for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
       yfit = m * xfit + b
       plt.plot(xfit, yfit, '-k')
       plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
          color='#AAAAAA', alpha=0.4)
    
    plt.xlim(-1, 3.5);

    Plotting Maximum Marginal Hyperplane

    From the above image in output, we can easily observe the “margins” within the discriminative classifiers. SVM will choose the line that maximizes the margin.

    Next, we will use Scikit-Learn’s support vector classifier to train an SVM model on this data. Here, we are using linear kernel to fit SVM as follows −

    from sklearn.svm import SVC # "Support vector classifier"
    model = SVC(kernel='linear', C=1E10)
    model.fit(X, y)

    The output is as follows −

    SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

    Now, for a better understanding, the following will plot the decision functions for 2D SVC −

    def decision_function(model, ax=None, plot_support=True):
       if ax is None:
          ax = plt.gca()
       xlim = ax.get_xlim()
       ylim = ax.get_ylim()

    For evaluating the model, we need to create a grid as follows (these lines continue the body of the function above) −

       x = np.linspace(xlim[0], xlim[1], 30)
       y = np.linspace(ylim[0], ylim[1], 30)
       Y, X = np.meshgrid(y, x)
       xy = np.vstack([X.ravel(), Y.ravel()]).T
       P = model.decision_function(xy).reshape(X.shape)

    Next, we need to plot decision boundaries and margins as follows −

       ax.contour(X, Y, P, colors='k',
          levels=[-1, 0, 1], alpha=0.5,
          linestyles=['--', '-', '--'])

    Now, similarly plot the support vectors as follows −

       if plot_support:
          ax.scatter(model.support_vectors_[:,0],
             model.support_vectors_[:,1],
             s=300, linewidth=1, facecolors='none');
       ax.set_xlim(xlim)
       ax.set_ylim(ylim)

    Now, use this function to plot the fitted model as follows −

    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer')
    decision_function(model);
    SVM Best Fit Hyperplane

    We can observe from the above output that the SVM classifier is fit to the data with margins (dashed lines) and support vectors, the pivotal elements of this fit, touching the dashed lines. These support vector points are stored in the support_vectors_ attribute of the classifier as follows −

    model.support_vectors_
    

    The output is as follows −

    array([[0.5323772 , 3.31338909],
       [2.11114739, 3.57660449],
       [1.46870582, 1.86947425]])
    

    SVM Kernels

    In practice, SVM algorithm is implemented with kernel that transforms an input data space into the required form. SVM uses a technique called the kernel trick in which kernel takes a low dimensional input space and transforms it into a higher dimensional space. In simple words, kernel converts non-separable problems into separable problems by adding more dimensions to it. It makes SVM more powerful, flexible and accurate. The following are some of the types of kernels used by SVM −

    Linear Kernel

    It can be used as a dot product between any two observations. The formula of linear kernel is as below −

    k(x,xi) = sum(x*xi)

    From the above formula, we can see that the product between two vectors, say x and xi, is the sum of the multiplication of each pair of input values.

    Polynomial Kernel

    It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

    K(x, xi) = 1 + sum(x * xi)^d

    Here d is the degree of polynomial, which we need to specify manually in the learning algorithm.

    Radial Basis Function (RBF) Kernel

    The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional space. The following formula explains it mathematically −

    K(x, xi) = exp(-gamma * sum((x - xi)^2))

    Here, gamma is a positive parameter, typically chosen between 0 and 1, which we need to specify manually in the learning algorithm. A good default value of gamma is 0.1.

    As we implemented SVM for linearly separable data, we can implement it in Python for the data that is not linearly separable. It can be done by using kernels.

    Example

    The following is an example for creating an SVM classifier by using kernels. We will be using iris dataset from scikit-learn −

    We will start by importing following packages −

    import pandas as pd
    import numpy as np
    from sklearn import svm, datasets
    import matplotlib.pyplot as plt
    

    Now, we need to load the input data −

    iris = datasets.load_iris()

    From this dataset, we are taking first two features as follows −

    X = iris.data[:,:2]
    y = iris.target
    

    Next, we will plot the SVM boundaries with original data as follows −

    x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
    y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
    h = (x_max - x_min)/100
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
       np.arange(y_min, y_max, h))
    X_plot = np.c_[xx.ravel(), yy.ravel()]

    Now, we need to provide the value of regularization parameter as follows −

    C =1.0

    Next, SVM classifier object can be created as follows −

    svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

    Z = svc_classifier.predict(X_plot)
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, cmap=plt.cm.Set1)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.title('Support Vector Classifier with linear kernel')

    Output

    Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')
    
    Curve

    For creating SVM classifier with rbf kernel, we can change the kernel to rbf as follows −

    svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
    Z = svc_classifier.predict(X_plot)
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(15,5))
    plt.subplot(121)
    plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
    plt.scatter(X[:,0], X[:,1], c=y, cmap=plt.cm.Set1)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.title('Support Vector Classifier with rbf kernel')

    Output

    Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')
    
    Classifier

    We set the value of gamma to 'auto', but you can also provide a value between 0 and 1.

    Tuning SVM Parameters

    In practice, SVMs often require tuning of their parameters to achieve optimal performance. The most important parameters to tune are the kernel, the regularization parameter C, and the kernel-specific parameters.

    The kernel parameter determines the type of kernel to use. The most common kernel types are linear, polynomial, radial basis function (RBF), and sigmoid. The linear kernel is used for linearly separable data, while the other kernels are used for non-linearly separable data.

    The regularization parameter C controls the trade-off between maximizing the margin and minimizing the classification error. A higher value of C means that the classifier will try to minimize the classification error at the expense of a smaller margin, while a lower value of C means that the classifier will try to maximize the margin even if it means more misclassifications.

    The kernel-specific parameters depend on the type of kernel being used. For example, the polynomial kernel has parameters for the degree of the polynomial and the coefficient of the polynomial, while the RBF kernel has a parameter for the width of the Gaussian function.

    We can use cross-validation to tune the parameters of the SVM. Cross-validation involves splitting the data into several subsets and training the classifier on each subset while using the remaining subsets for testing. This allows us to evaluate the performance of the classifier on different subsets of the data and choose the best set of parameters.

    Example

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # define the parameter grid
    param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                  'degree': [2, 3, 4], 'coef0': [0.0, 0.1, 0.5], 'gamma': ['scale', 'auto']}
    # create an SVM classifier
    svm = SVC()
    # perform grid search to find the best set of parameters
    grid_search = GridSearchCV(svm, param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    # print the best set of parameters and their accuracy
    print("Best parameters:", grid_search.best_params_)
    print("Best accuracy:", grid_search.best_score_)

    We start by importing the GridSearchCV module from scikit-learn, which is a tool for performing grid search on a set of parameters. We define a parameter grid that contains the possible values for each parameter we want to tune.

    We create an SVM classifier using SVC() and then pass it to GridSearchCV along with the parameter grid and the number of cross-validation folds (cv=5). We then call grid_search.fit(X_train, y_train) to perform the grid search.

    Once the grid search is complete, we print the best set of parameters and their accuracy using grid_search.best_params_ and grid_search.best_score_, respectively.

    Output

    On executing this program, you will get the following output −

    Best parameters: {'C': 0.1, 'coef0': 0.5, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}
    Best accuracy: 0.975
    

    This means that the best set of parameters found by the grid search are: C=0.1, coef0=0.5, degree=3, gamma=scale, and kernel=poly. The accuracy achieved by this set of parameters on the training set is 97.5%.

    You can now use these parameters to create a new SVM classifier and test its performance on the testing set.
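    A minimal sketch of that final step, assuming X_train, y_train, X_test and y_test come from a standard train/test split (these variable names are illustrative) −

    # Build a new classifier from the best parameters found by the grid search
    best_svm = SVC(**grid_search.best_params_)
    best_svm.fit(X_train, y_train)

    # Evaluate on the held-out testing set
    test_accuracy = best_svm.score(X_test, y_test)
    print("Test accuracy:", test_accuracy)

    Alternatively, since GridSearchCV refits the best model on the full training data by default, grid_search.best_estimator_ can be used directly.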

    Pros and Cons of SVM Classifiers

    Pros of SVM classifiers

    SVM classifiers offer great accuracy and work well with high-dimensional spaces. SVM classifiers basically use only a subset of the training points, and hence use very little memory.

    Cons of SVM classifiers

    They have high training time, hence in practice they are not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.

  • Decision Trees Algorithm

    Decision Tree Algorithm

    The decision tree algorithm is a hierarchical tree-based algorithm that is used to classify or predict outcomes based on a set of rules. It works by splitting the data into subsets based on the values of the input features. The algorithm recursively splits the data until it reaches a point where the data in each subset belongs to the same class or has the same value for the target variable. The resulting tree is a set of decision rules that can be used to make predictions or classify new data.

    The Decision Tree algorithm works by selecting the best feature to split the data at each node. The best feature is the one that provides the most information gain or the most reduction in entropy. Information gain is a measure of the amount of information gained by splitting the data at a particular feature, while entropy is a measure of the randomness or disorder in the data. The algorithm uses these measures to determine the best feature to split the data at each node.
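    As a small illustration of these measures (not the library's internal implementation), entropy and information gain for a candidate split can be sketched as follows; the example labels are made up −

    import numpy as np

    def entropy(labels):
       # H = -sum(p * log2(p)) over the class proportions
       _, counts = np.unique(labels, return_counts=True)
       p = counts / counts.sum()
       return -np.sum(p * np.log2(p))

    def information_gain(parent_labels, left_labels, right_labels):
       # gain = entropy(parent) - weighted entropy of the two children
       n = len(parent_labels)
       child = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
       return entropy(parent_labels) - child

    parent = np.array([0, 0, 1, 1, 1, 0])
    print(information_gain(parent, parent[:3], parent[3:]))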

    An example of a binary tree for predicting whether a person is fit or unfit, given information such as age, eating habits and exercise habits, is shown below −

    Decision Tree Algorithm

    In the above decision tree, the questions are decision nodes and the final outcomes are leaves.

    Types of Decision Tree Algorithm

    There are two main types of Decision Tree algorithm −

    • Classification Tree − A classification tree is used to classify data into different classes or categories. It works by splitting the data into subsets based on the values of the input features and assigning each subset to a different class.
    • Regression Tree − A regression tree is used to predict numerical values or continuous variables. It works by splitting the data into subsets based on the values of the input features and assigning each subset a numerical value.

    Implementing Decision Tree Algorithm

    Gini Index

    It is the name of the cost function that is used to evaluate the binary splits in the dataset and works with a categorical target variable such as "Success" or "Failure".

    The lower the value of the Gini index, the higher the homogeneity. A perfect Gini index value is 0 and the worst is 0.5 (for a 2-class problem). The Gini index for a split can be calculated with the help of the following steps −

    • First, calculate the Gini index for the sub-nodes by using the formula 1 − (p^2 + q^2), where p^2 + q^2 is the sum of the squares of the probabilities of success and failure.
    • Next, calculate the Gini index for the split using the weighted Gini score of each node of that split.

    Classification and Regression Tree (CART) algorithm uses Gini method to generate binary splits.
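    A minimal sketch of the two steps above in plain Python. The data layout is an assumption made for illustration: a split is a list of groups, each group a list of rows, with the class label in the last column −

    def gini_index(groups, classes):
       # total number of samples across all groups of the split
       n_instances = sum(len(group) for group in groups)
       gini = 0.0
       for group in groups:
          size = len(group)
          if size == 0:
             continue
          # node score: p^2 + q^2 + ... over the class proportions
          score = 0.0
          for class_val in classes:
             p = [row[-1] for row in group].count(class_val) / size
             score += p * p
          # weight (1 - score) by the relative size of the node
          gini += (1.0 - score) * (size / n_instances)
       return gini

    # a perfect split scores 0.0, the worst 2-class split scores 0.5
    print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))   # 0.0
    print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))   # 0.5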

    Split Creation

    A split basically consists of an attribute in the dataset and a value for that attribute. We can create a split in the dataset with the help of the following three parts −

    • Part1: Calculating Gini Score − We have just discussed this part in the previous section.
    • Part2: Splitting a dataset − It may be defined as separating a dataset into two lists of rows, given the index of an attribute and a split value for that attribute. After getting the two groups − left and right − from the dataset, we can calculate the value of the split by using the Gini score calculated in the first part. The split value decides in which group a row will reside.
    • Part3: Evaluating all splits − The next part, after finding the Gini score and splitting the dataset, is the evaluation of all splits. For this purpose, we must first check every value of each attribute as a candidate split. Then we need to find the best possible split by evaluating the cost of each split. The best split will be used as a node in the decision tree, as illustrated in the sketch after this list.
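    Parts 2 and 3 can be sketched as follows, reusing the gini_index function sketched above (again assuming rows are plain Python lists with the class label in the last column) −

    def test_split(index, value, dataset):
       # Part 2: separate rows into left/right groups based on an attribute value
       left = [row for row in dataset if row[index] < value]
       right = [row for row in dataset if row[index] >= value]
       return left, right

    def get_best_split(dataset):
       # Part 3: try every value of every attribute as a candidate split
       class_values = list(set(row[-1] for row in dataset))
       best_index, best_value, best_score, best_groups = None, None, float('inf'), None
       for index in range(len(dataset[0]) - 1):
          for row in dataset:
             groups = test_split(index, row[index], dataset)
             gini = gini_index(groups, class_values)
             if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
       return {'index': best_index, 'value': best_value, 'groups': best_groups}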

    Building a Tree

    As we know, a tree has a root node and terminal nodes. After creating the root node, we can build the tree in the following two parts −

    Part1: Terminal node creation

    While creating the terminal nodes of a decision tree, one important point is to decide when to stop growing the tree, i.e. when to stop creating further terminal nodes. This can be done by using two criteria, namely maximum tree depth and minimum node records, as follows −

    • Maximum Tree Depth − As the name suggests, this is the maximum number of levels of nodes in the tree below the root node. We must stop adding terminal nodes once the tree has reached this maximum depth.
    • Minimum Node Records − It may be defined as the minimum number of training patterns that a given node is responsible for. We must stop adding terminal nodes once a node reaches this minimum number of records or falls below it.

    Terminal node is used to make a final prediction.

    Part2: Recursive Splitting

    Now that we understand when to create terminal nodes, we can start building our tree. Recursive splitting is a method to build the tree. In this method, once a node is created, we can create its child nodes (nodes added to an existing node) recursively on each group of data generated by splitting the dataset, by calling the same function again and again.
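    A sketch of recursive splitting with the two stopping criteria from the previous part, building on the get_best_split sketch above (the node-dictionary layout is an assumption for illustration) −

    def to_terminal(group):
       # a terminal node predicts the most common class in its group
       outcomes = [row[-1] for row in group]
       return max(set(outcomes), key=outcomes.count)

    def split(node, max_depth, min_size, depth):
       left, right = node['groups']
       del node['groups']
       # no split happened: both children become the same terminal node
       if not left or not right:
          node['left'] = node['right'] = to_terminal(left + right)
          return
       # stopping criterion 1: maximum tree depth reached
       if depth >= max_depth:
          node['left'], node['right'] = to_terminal(left), to_terminal(right)
          return
       # stopping criterion 2: minimum node records, otherwise keep splitting recursively
       if len(left) <= min_size:
          node['left'] = to_terminal(left)
       else:
          node['left'] = get_best_split(left)
          split(node['left'], max_depth, min_size, depth + 1)
       if len(right) <= min_size:
          node['right'] = to_terminal(right)
       else:
          node['right'] = get_best_split(right)
          split(node['right'], max_depth, min_size, depth + 1)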

    Prediction

    After building a decision tree, we need to make predictions with it. Basically, prediction involves navigating the decision tree with a specific row of data.

    We can make a prediction with the help of a recursive function, as above. The same prediction routine is called again with the left or the right child node.
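    A sketch of such a recursive prediction routine, assuming each internal node is a dictionary with 'index', 'value', 'left' and 'right' keys (as produced by the split sketch above) and leaves are plain class labels −

    def predict(node, row):
       # navigate left or right depending on the split stored at this node
       if row[node['index']] < node['value']:
          if isinstance(node['left'], dict):
             return predict(node['left'], row)   # keep descending
          return node['left']                    # terminal node: final prediction
       else:
          if isinstance(node['right'], dict):
             return predict(node['right'], row)
          return node['right']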

    Assumptions

    The following are some of the assumptions we make while creating decision tree −

    • While preparing decision trees, the whole training set is treated as the root node.
    • The decision tree classifier prefers the feature values to be categorical. If you want to use continuous values, they must be discretized prior to model building.
    • Based on the attribute values, the records are distributed recursively.
    • A statistical approach is used to place attributes at any node position, i.e. as the root node or an internal node.

    Implementation in Python

    Let’s implement the Decision Tree algorithm in Python using a popular dataset for classification tasks named Iris dataset. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

    First, we will import the necessary libraries and load the dataset −

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    
    # Load the iris dataset
    iris = load_iris()

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(iris.data,
       iris.target, test_size=0.3, random_state=0)

    We then create an instance of the Decision Tree classifier and train it on the training set −

    # Create a Decision Tree classifier
    dtc = DecisionTreeClassifier()

    # Fit the classifier to the training data
    dtc.fit(X_train, y_train)

    We can now use the trained classifier to make predictions on the testing set −

    # Make predictions on the testing data
    y_pred = dtc.predict(X_test)

    We can evaluate the performance of the classifier by calculating its accuracy −

    # Calculate the accuracy of the classifier
    accuracy = np.sum(y_pred == y_test)/len(y_test)
    print("Accuracy:", accuracy)

    We can visualize the Decision Tree using Matplotlib library −

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree
    
    # Visualize the Decision Tree using Matplotlib
    plt.figure(figsize=(20,10))
    plot_tree(dtc, filled=True, feature_names=iris.feature_names,
    class_names=iris.target_names)
    plt.show()

    The plot_tree function from the sklearn.tree module can be used to plot the Decision Tree. We can pass in the trained Decision Tree classifier, the filled argument to fill the nodes with color, the feature_names argument to label the features, and the class_names argument to label the target classes. We also specify the figsize argument to set the size of the figure and call the show function to display the plot.

    Complete Implementation Example

    Given below is the complete implementation example of Decision Tree Classification algorithm in python using the iris dataset −

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    # Load the iris dataset
    iris = load_iris()

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

    # Create a Decision Tree classifier
    dtc = DecisionTreeClassifier()

    # Fit the classifier to the training data
    dtc.fit(X_train, y_train)

    # Make predictions on the testing data
    y_pred = dtc.predict(X_test)

    # Calculate the accuracy of the classifier
    accuracy = np.sum(y_pred == y_test)/len(y_test)
    print("Accuracy:", accuracy)

    # Visualize the Decision Tree using Matplotlib
    plt.figure(figsize=(20,10))
    plot_tree(dtc, filled=True, feature_names=iris.feature_names,
       class_names=iris.target_names)
    plt.show()

    Output

    This will create a plot of the Decision Tree that looks like this −

    Plot Of Decision Tree
    Accuracy: 0.9777777777777777
    

    As you can see, the plot shows the structure of the Decision Tree, with each node representing a decision based on the value of a feature, and each leaf node representing a class or numerical value. The color of each node indicates the majority class or value of the samples in that node, and the numbers at the bottom indicate the number of samples that reach that node.

  • Naive Bayes Algorithm

    What is Naive Bayes Algorithm?

    The Naive Bayes algorithm is a classification algorithm based on Bayes’ theorem. The algorithm assumes that the features are independent of each other, which is why it is called “naive.” It calculates the probability of a sample belonging to a particular class based on the probabilities of its features. For example, a phone may be considered smart if it has a touch screen, internet facility, a good camera, etc. Even if these features depend on each other, each of them independently contributes to the probability that the phone is a smartphone.

    In Bayesian classification, the main interest is to find the posterior probabilities i.e. the probability of a label given some observed features, P(L | features). With the help of Bayes theorem, we can express this in quantitative form as follows −

    P(L | features) = P(L) * P(features | L) / P(features)

    Here,

    • P(L | features) is the posterior probability of the class.
    • P(L) is the prior probability of the class.
    • P(features | L) is the likelihood, which is the probability of the predictor given the class.
    • P(features) is the prior probability of the predictor.

    In the Naive Bayes algorithm, we use Bayes’ theorem to calculate the probability of a sample belonging to a particular class. We calculate the probability of each feature of the sample given the class and multiply them to get the likelihood of the sample belonging to the class. We then multiply the likelihood with the prior probability of the class to get the posterior probability of the sample belonging to the class. We repeat this process for each class and choose the class with the highest probability as the class of the sample.
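    As a toy illustration of this per-class computation (the priors, likelihoods and binary features below are hypothetical values, assumed conditionally independent given the class) −

    # Hypothetical training statistics for a "smart phone" classifier with two
    # binary features: touch_screen and good_camera.
    priors = {'smart': 0.6, 'not_smart': 0.4}
    likelihoods = {
       'smart':     {'touch_screen': 0.9, 'good_camera': 0.8},
       'not_smart': {'touch_screen': 0.3, 'good_camera': 0.2},
    }

    sample = {'touch_screen': True, 'good_camera': True}

    posteriors = {}
    for label in priors:
       p = priors[label]
       for feature, present in sample.items():
          prob = likelihoods[label][feature]
          p *= prob if present else (1 - prob)   # multiply independent feature likelihoods
       posteriors[label] = p                     # proportional to P(L | features)

    # normalise by P(features) and pick the most probable class
    total = sum(posteriors.values())
    for label in posteriors:
       posteriors[label] /= total
    print(posteriors, max(posteriors, key=posteriors.get))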

    Types of Naive Bayes Algorithm

    There are many types of Naive Bayes Algorithm. Here we discuss the following three types −

    Gaussian Naive Bayes

    Gaussian Naive Bayes is the simplest Naive Bayes classifier, with the assumption that the data from each label is drawn from a simple Gaussian distribution. It is used when the features are continuous variables that follow a normal distribution.

    Multinomial Naive Bayes

    Another useful Naive Bayes classifier is Multinomial Naive Bayes, in which the features are assumed to be drawn from a simple multinomial distribution. This kind of Naive Bayes is most appropriate for features that represent discrete counts. It is commonly used in text classification tasks where the features are the frequencies of words in a document.

    Bernoulli Naive Bayes

    Another important model is Bernoulli Naive Bayes, in which the features are assumed to be binary (0s and 1s). Text classification with the ‘bag of words’ model can be an application of Bernoulli Naive Bayes.

    Implementation of Naive Bayes Algorithm in Python

    Depending on our data set, we can choose any of the Naive Bayes models explained above. Here, we are implementing the Gaussian Naive Bayes model in Python −

    We will start with required imports as follows −

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()

    Now, by using the make_blobs() function of Scikit-learn, we can generate blobs of points with a Gaussian distribution as follows −

    from sklearn.datasets import make_blobs
    X, y = make_blobs(300,2, centers=2, random_state=2, cluster_std=1.5)
    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer');
    Blobs of Points with Gaussian Distribution

    Next, for using GaussianNB model, we need to import and make its object as follows −

    from sklearn.naive_bayes import GaussianNB
    model_GNB = GaussianNB()
    model_GNB.fit(X, y);

    Now, we have to do prediction. It can be done after generating some new data as follows −

    rng = np.random.RandomState(0)
    Xnew =[-6,-14]+[14,18]* rng.rand(2000,2)
    ynew = model_GNB.predict(Xnew)

    Next, we are plotting new data to find its boundaries −

    plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap='summer')
    lim = plt.axis()
    plt.scatter(Xnew[:,0], Xnew[:,1], c=ynew, s=20, cmap='summer', alpha=0.1)
    plt.axis(lim);
    Plotting the prediction with new data

    Now, with the help of the following lines of code, we can find the posterior probabilities of the first and second labels −

    yprob = model_GNB.predict_proba(Xnew)
    yprob[-10:].round(3)

    Output

    array([[0.998, 0.002],
       [1.   , 0.   ],
       [0.987, 0.013],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [0.   , 1.   ],
       [0.986, 0.014]]
    )
    

    Pros & Cons of Naive Bayes Classification

    Let’s discuss some of the advantages and limitations of Naive Bayes classification algorithm.

    Pros

    The following are some pros of using Naive Bayes classifiers −

    • Naive Bayes classification is easy to implement and fast.
    • It will converge faster than discriminative models like logistic regression.
    • It requires less training data.
    • It is highly scalable in nature; it scales linearly with the number of predictors and data points.
    • It can make probabilistic predictions and can handle continuous as well as discrete data.
    • The Naive Bayes classification algorithm can be used for both binary and multi-class classification problems.

    Cons

    The following are some cons of using Naive Bayes classifiers −

    • One of the most important cons of Naive Bayes classification is its strong feature-independence assumption, because in real life it is almost impossible to have a set of features that are completely independent of each other.
    • Another issue with Naive Bayes classification is the ‘zero frequency’ problem: if a categorical variable has a category that was not observed in the training data set, then the Naive Bayes model will assign it a zero probability and will be unable to make a prediction.

    Applications of Naive Bayes Classification

    The following are some common applications of Naive Bayes classification −

    Real-time prediction − Due to its ease of implementation and fast computation, it can be used to make predictions in real time.

    Multi-class prediction − The Naive Bayes classification algorithm can be used to predict the posterior probability of multiple classes of the target variable.

    Text classification − Due to its multi-class prediction capability, Naive Bayes classification algorithms are well suited for text classification. That is why they are also used to solve problems like spam filtering and sentiment analysis.

    Recommendation system − Along with algorithms like collaborative filtering, Naive Bayes can be used to build a recommendation system which filters unseen information and predicts whether a user would like a given resource or not.

  • K-Nearest Neighbors (KNN) in Machine Learning

    K-Nearest Neighbors (KNN) Algorithm

    K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification as well as regression predictive problems. However, it is mainly used for classification predictive problems in industry. The main idea behind KNN is to find the k-nearest data points to a given test data point and use these nearest neighbors to make a prediction. The value of k is a hyperparameter that needs to be tuned, and it represents the number of neighbors to consider.

    For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k-nearest neighbors. In other words, the class with the highest number of neighbors is the predicted class.

    For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest neighbors’ values.

    The distance metric used to measure the similarity between two data points is an essential factor that affects the KNN algorithm’s performance. The most commonly used distance metrics are Euclidean distance, Manhattan distance, and Minkowski distance.
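    These three distance metrics can be written in a few lines of NumPy; this is a small sketch and the sample vectors are only for illustration −

    import numpy as np

    def euclidean(a, b):
       return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
       return np.sum(np.abs(a - b))

    def minkowski(a, b, p=3):
       # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
       return np.sum(np.abs(a - b) ** p) ** (1 / p)

    a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    print(euclidean(a, b), manhattan(a, b), minkowski(a, b))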

    The following two properties would define KNN well −

    • Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase and uses all of the data for training during classification.
    • Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it doesn’t assume anything about the underlying data.

    How Does K-Nearest Neighbors Algorithm Work?

    K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new datapoints which further means that the new data point will be assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of following steps −

    • Step 1 − For implementing any algorithm, we need dataset. So during the first step of KNN, we must load the training as well as test data.
    • Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K can be any integer.
    • Step 3 − For each point in the test data, do the following − 3.1 − Calculate the distance between the test data point and each row of training data using a method such as Euclidean, Manhattan or Hamming distance (Euclidean is the most commonly used). 3.2 − Sort the distances in ascending order. 3.3 − Choose the top K rows from the sorted array. 3.4 − Assign a class to the test point based on the most frequent class among these rows.
    • Step 4 − End

    Example

    The following is an example to understand the concept of K and working of KNN algorithm −

    Suppose we have a dataset which can be plotted as follows −

    Violate

    Now, we need to classify new data point with black dot (at point 60,60) into blue or red class. We are assuming K = 3 i.e. it would find three nearest data points. It is shown in the next diagram −

    Circle

    We can see in the above diagram the three nearest neighbors of the data point with the black dot. Among those three, two lie in the Red class, hence the black dot will also be assigned to the Red class.
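    The same K = 3 vote can be reproduced with a few lines of scikit-learn. The coordinates below are made up to mimic the diagram; only the test point at (60, 60) and the majority vote matter −

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # hypothetical training points: 0 = blue class, 1 = red class
    X_train = np.array([[20, 35], [30, 40], [58, 52], [55, 65], [62, 58], [70, 70]])
    y_train = np.array([0, 0, 0, 1, 1, 1])

    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    # the black-dot test point at (60, 60) gets the majority class of its 3 neighbours
    print(knn.predict([[60, 60]]))   # -> [1], i.e. the red class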

    Building a K Nearest Neighbors Model

    We can follow the below steps to build a KNN model −

    • Load the data − The first step is to load the dataset into memory. This can be done using various libraries such as pandas or numpy.
    • Split the data − The next step is to split the data into training and test sets. The training set is used to train the KNN algorithm, while the test set is used to evaluate its performance.
    • Normalize the data − Before training the KNN algorithm, it is essential to normalize the data to ensure that each feature contributes equally to the distance metric calculation.
    • Calculate distances − Once the data is normalized, the KNN algorithm calculates the distances between the test data point and each data point in the training set.
    • Select k-nearest neighbors − The KNN algorithm selects the k-nearest neighbors based on the distances calculated in the previous step.
    • Make a prediction − For classification problems, the KNN algorithm assigns the test data point to the class that appears most frequently among the k-nearest neighbors. For regression problems, the KNN algorithm assigns the test data point the average of the k-nearest neighbors’ values.
    • Evaluate performance − Finally, the KNN algorithm’s performance is evaluated using various metrics such as accuracy, precision, recall, and F1-score.

    Implementation of KNN Algorithm in Python

    As we know K-nearest neighbors (KNN) algorithm can be used for both classification as well as regression. The following are the recipes in Python to use KNN as classifier as well as regressor −

    KNN as Classifier

    First, start with importing necessary python packages −

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    

    Next, download the iris dataset from its weblink as follows −

    path ="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

    Next, we need to assign column names to the dataset as follows −

    headernames =['sepal-length','sepal-width','petal-length','petal-width','Class']

    Now, we need to read dataset to pandas dataframe as follows −

    dataset = pd.read_csv(path, names=headernames)
    dataset.head()
       sepal-length  sepal-width  petal-length  petal-width        Class
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa

    Data Preprocessing will be done with the help of following script lines −

    X = dataset.iloc[:,:-1].values
    y = dataset.iloc[:,4].values
    

    Next, we will divide the data into training and test splits. The following code will split the dataset into 60% training data and 40% testing data −

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

    Next, data scaling will be done as follows −

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −

    from sklearn.neighbors import KNeighborsClassifier
    classifier = KNeighborsClassifier(n_neighbors=8)
    classifier.fit(X_train, y_train)

    At last we need to make prediction. It can be done with the help of following script −

    y_pred = classifier.predict(X_test)

    Next, print the results as follows −

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    result = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:")
    print(result)
    result1 = classification_report(y_test, y_pred)
    print("Classification Report:")
    print(result1)
    result2 = accuracy_score(y_test, y_pred)
    print("Accuracy:", result2)

    Output

    Confusion Matrix:
    [[21  0  0]
     [ 0 16  0]
     [ 0  7 16]]
    Classification Report:
                     precision    recall  f1-score   support
        Iris-setosa       1.00      1.00      1.00        21
    Iris-versicolor       0.70      1.00      0.82        16
     Iris-virginica       1.00      0.70      0.82        23

          micro avg       0.88      0.88      0.88        60
          macro avg       0.90      0.90      0.88        60
       weighted avg       0.92      0.88      0.88        60

    Accuracy: 0.8833333333333333

    KNN as Regressor

    First, start with importing necessary Python packages −

    import numpy as np
    import pandas as pd
    

    Next, download the iris dataset from its weblink as follows −

    path ="https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

    Next, we need to assign column names to the dataset as follows −

    headernames =['sepal-length','sepal-width','petal-length','petal-width','Class']

    Now, we need to read dataset to pandas dataframe as follows −

    data = pd.read_csv(path, names=headernames)
    array = data.values
    X = array[:, :2]
    y = array[:, 2]
    data.shape
    

    Output: (150, 5)

    Next, import KNeighborsRegressor from sklearn to fit the model −

    from sklearn.neighbors import KNeighborsRegressor
    knnr = KNeighborsRegressor(n_neighbors=10)
    knnr.fit(X, y)

    At last, we can find the MSE as follows −

    print("The MSE is:",format(np.power(y-knnr.predict(X),2).mean()))

    Output

    The MSE is: 0.12226666666666669
    

    Pros and Cons of KNN

    Pros

    • It is a very simple algorithm to understand and interpret.
    • It is very useful for nonlinear data because the algorithm makes no assumptions about the data.
    • It is a versatile algorithm, as we can use it for classification as well as regression.
    • It has relatively high accuracy, although there are better supervised learning models than KNN.

    Cons

    • It is a computationally somewhat expensive algorithm because it stores all the training data.
    • It requires higher memory storage compared to other supervised learning algorithms.
    • Prediction is slow when N is large.
    • It is very sensitive to the scale of the data as well as to irrelevant features.

    Applications of KNN

    The following are some of the areas in which KNN can be applied successfully −

    Banking System

    KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.

    Calculating Credit Ratings

    KNN algorithms can be used to find an individual’s credit rating by comparing them with persons having similar traits.

    Politics

    With the help of KNN algorithms, we can classify a potential voter into various classes like “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘Congress’” and “Will Vote for Party ‘BJP’”.

    Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting Detection, Image Recognition and Video Recognition.