Exploratory Data Analysis: Classifying the Species of Iris Flower

Segmenting the Iris dataset based on the optimum number of clusters, using K-Means Clustering in Unsupervised Learning.

Usman Aftab Khan
10 min read · Jul 11, 2021
The Iris flower. Photo by Christina Brinza on Unsplash

Introduction

Background — Unsupervised Machine Learning

Photo by Daniel Cheung on Unsplash

Unsupervised Learning is a branch of Machine Learning (ML) in which the model is trained on unlabelled data. In this technique, the machine groups the data on its own, with no human intervention, by detecting the characteristics of the data. Manual classification is not involved: only input variables (x) are given, with no corresponding output variables, so there is no ground truth to check against and more errors can occur in the final predictions. To achieve a certain level of correctness with this technique, practitioners often find it useful to combine it with other approaches, such as reinforcement learning.

In Unsupervised Learning, the algorithm searches for previously unknown patterns within a dataset containing unlabeled features. When it comes to unearthing similarities or anomalies in data sets, unsupervised learning tends to be much more helpful than supervised learning. Amongst the different methods of unsupervised learning, the most prominent are:

  • Clustering: K-Means, Hierarchical Cluster Analysis (HCA)
  • Dimensionality Reduction: Principal Component Analysis (PCA)
On the LHS is supervised learning, where regression techniques are used to find a line of best fit between features. On the RHS is unsupervised learning, where the inputs are segregated based on features and a prediction is made based on which cluster a point belongs to.
A snippet of how Unsupervised Learning works and what problems it is best suited for. Source: Google Images.

The ability to discover unknown trends is invaluable and has countless applications across many industries. Some real-world applications include, but are not limited to:

  1. Anomaly Detection, such as fraudulent transactions, cyber-anomalies, policy violations, etc.
  2. Product Recommendation Engines, used by many organizations across different sectors like Amazon, Netflix, Spotify, etc.
  3. Image & Pattern Recognition, such as tagging suggestions on Facebook, labeling an X-Ray report as cancerous or not, etc.

Application

A. Use Case — Objective

Irises, from the Iridaceae family, are easy-to-grow perennial plants with around 300 different species. Amongst them lie three particular species, called Setosa, Versicolor, and Virginica, which look nearly identical. This makes them highly difficult to classify individually, even though certain physical measurements can be used to distinguish between different flowers and species.

This project aims to provide a comprehensive overview and analysis of a typical test case. Statistical classification techniques will be used to predict the optimum number of clusters in a given dataset, based on the number of species of the Iris flower. Data with relevant parameters such as petal length, petal width, etc. will be utilized. Machine Learning will be deployed to predict the output after studying the hidden trends and patterns in our dataset, and Python libraries will be used to implement the algorithm.

The goal is to perform exploratory analysis on the dataset, build a K-Means Clustering model, predict the optimum number of clusters, and represent it visually.

B. Description of Data

The Iris flower data set was introduced by British biologist and statistician Ronald Fisher in his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems”. It is a multivariate data set and is conceivably the best-known dataset in pattern recognition.

Based on Fisher’s linear discriminant model, it became a typical test case for many statistical classification techniques in machine learning. Thus, the data set is a proven example for explaining the difference between supervised and unsupervised techniques in data mining.

Our data below is a CSV (comma-separated values) file provided for this analysis.

A snippet of the CSV file of our dataset.

The raw data was cleaned, modified, and given a well-formed look for general interpretation. As seen above, the Iris dataset contains four features:

  1. SepalLength, i.e. the length of the sepal in centimeters (cm).
  2. SepalWidth, i.e. the width of the sepal in centimeters (cm).
  3. PetalLength, i.e. the length of the petal in centimeters (cm).
  4. PetalWidth, i.e. the width of the petal in centimeters (cm).

This data set comprises 50 samples from each of the three species of Iris, i.e. Setosa, Virginica, and Versicolor.

A snippet of the three Iris species used in this model. Source: Google Images

The combination of the four features can be used for the training and evaluation of machine learning models. This dataset is often used in other data mining, classification, and clustering examples.

C. Libraries Used

Python libraries are sets of useful functions that eliminate the need to write code from scratch. As of today, there are over 137,000 libraries available in Python. However, we will only be utilizing five for our task: Pandas, NumPy, Matplotlib, Seaborn, and Scikit-Learn.

  • Pandas is an open-source software library written for the Python programming language for high-performance data manipulation and analysis. In particular, it offers powerful data structures and operations for manipulating numerical tables and time series. Python with Pandas is used in various fields, including finance, economics, and other commercial domains.
  • NumPy is a library adding support for large, multi-dimensional arrays and matrices and a collection of routines for processing those arrays. NumPy also provides a large collection of high-level mathematical functions and logical operations to perform on these arrays.
  • Matplotlib is a plotting library and one of the most popular Python packages used for data visualization. It provides an object-oriented API for making 2D plots from data in arrays.
  • For analyzing a set of data in Python, we can also utilize the Seaborn library. It is a visualization library built on the foundations of Matplotlib. In the world of analytics, the best way to get insights is by visualizing the data, representing it as plots that are easy to understand, explore, and grasp. Such plots help draw attention to the key elements. We use Seaborn because it complements and extends Matplotlib.
  • Scikit-learn is a machine learning library used to build models, as it has tons of tools for predictive modeling and analysis. It features various algorithms such as support vector machines, random forests, and k-nearest neighbors.

This project gives Data Scientists and Engineers the opportunity to apply their knowledge of data science and go sequentially through the processes their profession entails: from retrieving raw data from external sources, to parsing and cleaning the data, followed by analytical assessment through Machine Learning algorithms and tools. The evaluation from the final analysis leads to a conclusion which can then be leveraged by stakeholders. As this project has many aspects to consider, it is open for discussion.

Building a Prediction Model

A. Analytical Approach

Live visualization of K-Means Clustering algorithm.

Unsupervised Machine Learning will be applied to predict the number of clusters formed. For this analysis, we use a clustering technique called K-Means, a common machine learning algorithm that clusters data points based on similar characteristics. The algorithm is fast and efficient for medium and large-sized datasets and is useful for quickly discovering insights from unlabeled data. By observing and examining every cluster, we can then determine the characteristics that distinguish them from one another.

B. Data Analysis

I. Importing Libraries & Loading Data

We will import the libraries involved. Please note that Scikit-Learn will be imported later on.

A snippet of code to import libraries involved.
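The code itself appears as an image in the original post; a minimal sketch of these imports, assuming the conventional aliases used throughout the rest of the walkthrough, would look like this:

    # Core libraries for data handling and visualization
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns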

The next step is to load the dataset into the Python interpreter hosted on Jovian, to proceed with training the model. Pandas is used to load the CSV file and print a confirmation when the data is loaded successfully.

Note: I saved the file in the same directory as this interpreter.

A snippet of code indicating that our data has been imported successfully.
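As the original snippet is an image, here is a rough sketch of the loading step; the filename iris.csv is an assumption, since the exact name is not shown in the article:

    # Load the Iris dataset from a CSV file kept in the working directory
    df = pd.read_csv("iris.csv")   # hypothetical filename
    print("Data imported successfully")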

II. Visualizing Raw Data & Gaining Insights

As we proceed, it is customary to check a summary of the technical information of our dataset. The info() function is used which prints a concise summary of a specific DataFrame. This function provides us with information like the index type, non-null values, etc. about a DataFrame.

A snippet of code displaying information regarding our DataFrame by using the info() function.
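A minimal sketch of that call (the original is shown as an image):

    # Concise technical summary of the DataFrame:
    # index type, column names, non-null counts, dtypes, and memory usage
    df.info()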

Based on the information given in the snippet of code above, we can establish that there are 4 columns, each representing one of the four features measured across the three species:

  1. SepalLength (cm)
  2. SepalWidth (cm)
  3. PetalLength (cm)
  4. PetalWidth (cm)

There are 50 samples for each species. Thus, there are 150 samples and 4 features in the data being fed to the machine learning model. The data type (Dtype) of each feature is float.

Our dataset can be viewed using the head() function.

A snippet of code displaying the data. A confirmation message is printed at the end.

Notice how the head() function only previews the top five rows by default. This can be customized by simply passing the number of rows required between the parentheses.

A snippet of the dataset’s first ten elements is shown.
The tail() function displays the last five samples in our dataset.
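A quick sketch of these preview calls, assuming the DataFrame is named df as above:

    df.head()     # first five rows by default
    df.head(10)   # first ten rows
    df.tail()     # last five rows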

Being a Data Scientist requires a diverse combination of skills: technical skills, functional skills, and soft skills. Having a good sense of numbers is part of the functional skills. A Data Scientist should be able to analyze and interpret what the numbers are saying, which requires a solid grounding in statistics. Fortunately, the describe() function provides a set of important values for further statistical analysis.

A snippet of code showing important statistical features of the dataset.
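A minimal sketch of that step:

    # Count, mean, standard deviation, min, quartiles, and max for each numeric column
    df.describe()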

III. Plotting the Data

The third and essential part of data analysis is to examine the distributions of the features and plot them as required. The data points are plotted on a 2-D pair plot, which visualizes the attributes of the dataset and their dependencies on each other in one command.

A snippet of code showing the plotting of a 2-D pair plot to identify relationships in the dataset.
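Since the original snippet is an image, here is a plausible sketch using Seaborn; the Species column name is an assumption about how the label column is named in the CSV:

    # Pairwise scatterplots of the four features, colored by species
    sns.pairplot(df, hue="Species")
    plt.show()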

PetalLength and PetalWidth appear to be positively correlated, i.e. having a linear relationship. Iris Setosa seems to have a smaller petal length and petal width as compared to others. Overall, it seems to be the case that Iris Setosa has smaller dimensions than other flowers. Thus, it can be conclusively said that the Iris Setosa makes a distinctive cluster in every parameter, while the other two species tend to overlap each other to a certain extent.

A snippet of code showing the correlation matrix quantitatively examining the relationship between variables.
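A rough sketch of how such a correlation matrix might be computed and visualized, again assuming a Species label column that has to be excluded from the numeric correlation:

    # Correlation matrix of the four numeric features, shown as an annotated heatmap
    corr = df.drop(columns=["Species"]).corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm")
    plt.show()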

The petal measurements have a highly positive correlation with each other, while the two sepal measurements are largely uncorrelated. It is worth mentioning that the petal features also have a relatively high correlation with SepalLength, but not with SepalWidth.

Some important points to remember while interpreting a correlation matrix:

  • -1 indicates a perfect negative linear correlation between two variables.
  • 0 indicates no linear correlation between two variables.
  • 1 indicates a perfect positive linear correlation between two variables.
A snippet of code showing the Countplot. There are 50 samples of each species.
A snippet of code showing the scatterplot using Matplotlib library, of SepalLength and PetalLength.

The above scatterplot comprises two variables: SepalLength on the x-axis and PetalLength on the y-axis. If the value along the y-axis tends to increase (or decrease) as the x-axis increases, it indicates a positive (or negative) linear relationship.

A snippet of code showing the scatterplot using Matplotlib library, of SepalLength and SepalWidth.
A snippet of code showing the scatterplot using Matplotlib library, of PetalLength and PetalWidth.
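The plots above appear as images in the original post; minimal sketches of the countplot and the three scatterplots might look like this, assuming the column names listed earlier:

    # Countplot: number of samples per species (50 each)
    sns.countplot(x="Species", data=df)
    plt.show()

    # Scatterplots of selected feature pairs
    for x_col, y_col in [("SepalLength", "PetalLength"),
                         ("SepalLength", "SepalWidth"),
                         ("PetalLength", "PetalWidth")]:
        plt.scatter(df[x_col], df[y_col])
        plt.xlabel(x_col + " (cm)")
        plt.ylabel(y_col + " (cm)")
        plt.show()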

C. Finding Optimal No. of Clusters — Elbow Method

In this process, the first step is to extract the values of the input attributes, X, from the data.

A snippet of code where values of X are taken from data.
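A minimal sketch of this step, assuming the four measurement columns are the first four columns of the DataFrame:

    # Take the four feature columns as the input matrix X
    X = df.iloc[:, [0, 1, 2, 3]].values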

Now that our attributes have been established, the next step is to identify the optimum number of clusters from our data. This is done using the Elbow Method in the Scikit-Learn library.

A snippet of code illustrating the Elbow plot. The x-axis represents the Number of Clusters (K) and the y-axis represents the within-cluster Sum of Square, aka Inertia value.
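Since the original code is an image, here is a sketch of a typical Elbow Method loop with Scikit-Learn; the exact parameter values are assumptions based on the settings described later in the article:

    from sklearn.cluster import KMeans

    # Compute the within-cluster sum of squares (inertia) for k = 1..10
    wcss = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, init="k-means++",
                        n_init=10, max_iter=300, random_state=0)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    # Plot inertia against the number of clusters to locate the "elbow"
    plt.plot(range(1, 11), wcss, marker="o")
    plt.xlabel("Number of Clusters (K)")
    plt.ylabel("Within-Cluster Sum of Squares (Inertia)")
    plt.title("The Elbow Method")
    plt.show()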

Within-Cluster Sum of Squares, or cluster inertia, is defined as the sum of squared distances of the data points to their closest centroid, i.e. the center of their cluster. Hence, the smaller the inertia, the better the compactness of the clusters, i.e. the points are closer together.

Inertia decreases as we increase the value of k: a model with two clusters will always have a higher inertia score than one with four clusters, and so on. At this juncture, we use a heuristic such as the “elbow plot” (often combined with the “silhouette score”) to decide the optimum value of k.

It can be clearly seen from the graph above why this is called ‘The Elbow Method’. The optimum number of clusters is where the “elbow” point occurs, i.e. at 3. This is the point beyond which the Within-Cluster Sum of Squares (WCSS) does not decrease significantly with every additional cluster. So, it can be concluded that 3 is the best value of k to use for the final model.

D. Prepare Data for Machine Learning Algorithm

Data today is present in a host of different formats, structures, and sources. An integral part of a Data Scientist’s job is to prepare this very data by cleaning, organizing, and optimizing for use by end-users. End-users include business stakeholders, analysts, and programmers. The “prepared” data is then used to interpret the results and relay information for the management to make better-informed decisions.

A snippet of code indicating there are no null values.
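A one-line sketch of that check:

    # Count missing values per column; all zeros means there is nothing to impute
    df.isnull().sum()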

E. Creating the K-Means Classifier

The final model is built using K-Means with k = 3. The n_init parameter has been set to 10, which is the number of times the K-Means algorithm will be run with different initial centroids. The max_iter parameter has been set to 300, which is the maximum number of iterations allowed in a single run for the algorithm to find the final clusters.
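A sketch of the final model under those settings (random_state is an assumption added for reproducibility):

    # Final model: k = 3 clusters, 10 random initializations, up to 300 iterations each
    kmeans = KMeans(n_clusters=3, init="k-means++",
                    n_init=10, max_iter=300, random_state=0)
    y_kmeans = kmeans.fit_predict(X)   # cluster label (0, 1, or 2) for every sample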

F. Visualising the Clusters

Three clusters are formed, matching the actual data. According to the results, our machine learning model suggests that there is some overlap between Iris Versicolor and Iris Virginica, while Iris Setosa forms a distinct cluster.
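A possible sketch of such a plot; which pair of features is drawn on the axes is an assumption, since the original figure is an image:

    # Plot the three clusters on the first two feature columns, plus their centroids
    plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=50, label="Cluster 1")
    plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=50, label="Cluster 2")
    plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=50, label="Cluster 3")
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                s=200, c="black", marker="*", label="Centroids")
    plt.xlabel("SepalLength (cm)")
    plt.ylabel("SepalWidth (cm)")
    plt.legend()
    plt.show()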

G. Predicting Clusters

First, we will identify the center points of the data.

Next, we predict clustered results and compare them with our original data.
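A sketch of these two steps; comparing the predicted clusters against a Species column via a crosstab is one plausible way to do this, assuming such a label column exists:

    # Coordinates of the three cluster centroids
    print(kmeans.cluster_centers_)

    # Attach the predicted cluster to each sample and compare against the true species
    df["Cluster"] = y_kmeans
    print(pd.crosstab(df["Species"], df["Cluster"]))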

Conclusion

The project’s main goal was to use the Iris dataset for introductory machine learning development. It also served as a basis for developing a better understanding of how the K-Means algorithm works and what goes into its training and evaluation. Like everything else, there are many advantages and disadvantages when it comes to deploying this algorithm.

Advantages

  • Simple and computationally efficient.
  • Highly intuitive with results easy to visualize.

Disadvantages

  • It is highly dependent on scale, and thus not well suited to data with clusters of varying sizes, shapes, and densities.
  • Evaluating the results is somewhat subjective, requiring more human evaluation and interpretation than more objective metrics.

Thank you,

Usman Aftab Khan

Note: Everything in this article is documented in my GitHub repository. Please do pay a visit if you’re interested in deciphering the full code.

Important Terminologies

  • Feature: An input variable used in making predictions.
  • Predictions: A model’s output when provided with an input example.
  • Example: One row of a dataset. An example contains one or more features and possibly a label.
  • Label: The result or output associated with an example’s features.
  • Multivariate: Two or more individual measurements taken as variables.


Usman Aftab Khan

An avid explorer living at the interstice of business, data, and cutting-edge technology | Find me on LinkedIn: https://www.linkedin.com/in/usmankayy/