An introduction to k-means clustering, with an implementation in R and guidance on choosing the best K

Clustering algorithms in machine learning are unsupervised techniques: they work on input data without labelled responses. Their objective is to find patterns in the data and group observations into clusters based on their similarities. K-means clustering is one such algorithm, and it is useful for summarizing high-dimensional data.

K-means clustering partitions a set of observations into a fixed number of clusters, specified in advance, grouping observations by their similar characteristics.
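As a minimal sketch of this idea, the snippet below runs base R's `kmeans()` on the built-in `iris` measurements. The dataset, the choice of K = 3, and the range of K values tried are assumptions made for illustration, and the "elbow" check at the end is only one common heuristic for choosing K.

```r
# Minimal k-means sketch; iris and K = 3 are illustrative assumptions.
set.seed(42)                           # k-means starts from random centers
features <- scale(iris[, 1:4])         # standardize so no variable dominates the distance

# The number of clusters (centers) must be fixed before fitting.
fit <- kmeans(features, centers = 3, nstart = 25)

table(fit$cluster)                     # how many observations landed in each cluster
fit$tot.withinss                       # total within-cluster sum of squares

# Elbow heuristic: total within-cluster sum of squares for K = 1..6.
# A bend in this curve suggests a reasonable K.
wss <- sapply(1:6, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

Using `nstart = 25` reruns the algorithm from several random starts and keeps the best result, which makes the outcome less sensitive to the initial centers.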

However, to group observations, two questions arise:

1) What does it mean for things to be similar to each other?

2) How do we determine things are…

What is dimension reduction, and how can we use principal component analysis in R to determine the important features?

When you’re working in data science and analytics, handling high-dimensional data is part of the job. You may have a dataset with 600 or even 6000 variables: some columns prove important in modelling while others are insignificant, some are correlated with each other (e.g. weight and height) and some are entirely independent of one another. Because using thousands of features is both tedious and impractical for a model, our objective is to create a dataset with a reduced number of dimensions (all uncorrelated) that explains as much variation in the original dataset…
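A rough sketch of that objective, using base R's `prcomp()` on the built-in `mtcars` data. The dataset and the 90% variance threshold are assumptions chosen purely for illustration, not values taken from the article.

```r
# PCA sketch; mtcars and the 0.90 threshold are illustrative assumptions.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Share of total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Keep the smallest number of components that reaches the threshold.
n_keep <- which(cum_var >= 0.90)[1]
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

dim(reduced)                           # same rows, fewer columns
cor(pca$x[, 1], pca$x[, 2])            # components are uncorrelated (numerically ~0)
```

Setting `scale. = TRUE` standardizes the variables first, which matters when columns are measured in different units.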

A guide to understanding clustering techniques, their applications, pros and cons, and creating dendrograms in R.

In the early stages of data analysis, it is important to get a high-level understanding of the multi-dimensional data and find some sort of pattern between the different variables. This is where clustering comes in. A simple way to define hierarchical clustering is:

`partitioning a huge dataset into smaller groups based on similar characteristics that would help make sense of the data in an informative way.`
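To make the definition concrete, here is a small sketch that builds and plots a dendrogram with base R's `hclust()`. Note that `hclust()` builds the tree bottom-up (agglomerative). The `USArrests` dataset, the complete-linkage method, and the cut at four groups are assumptions for illustration only.

```r
# Hierarchical clustering sketch; USArrests, "complete" linkage and k = 4
# are illustrative assumptions.
d <- dist(scale(USArrests))            # Euclidean distances on standardized data
hc <- hclust(d, method = "complete")   # agglomerative (bottom-up) clustering

# Draw the dendrogram.
plot(hc, cex = 0.6, main = "Dendrogram (illustrative)", xlab = "", sub = "")

# Cut the tree into a chosen number of groups.
groups <- cutree(hc, k = 4)
table(groups)
```

Cutting the tree at a different `k` gives coarser or finer groups without refitting, which is one practical advantage of the hierarchical approach.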

Hierarchical Clustering can be classified into 2 types:

· Divisive (Top-down) : A clustering technique in which N nodes belong to a single cluster initially and are then broken down into…

Exploratory Data Analysis (EDA) is a major component of data science. It helps you spot patterns in data and understand its properties in a ‘quick and dirty’ way. Graphical analysis with R's base plotting system is broadly divided into two parts: 1) graph generation (initializing the plot) and 2) graph annotation (setting its properties, attributes, axes etc.).
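The two parts can be sketched as follows with base graphics. The built-in `airquality` dataset, the labels, and the colours are assumptions chosen for illustration; they are not the data used in the article's project.

```r
# Base plotting sketch; airquality and all labels are illustrative assumptions.

# 1) Graph generation: initialize the plot (titles left blank for now).
h <- hist(airquality$Ozone, col = "steelblue", main = "", xlab = "")

# 2) Graph annotation: add titles, a reference line and a legend
#    to the plot that already exists.
title(main = "Ozone distribution", xlab = "Ozone (ppb)")
abline(v = median(airquality$Ozone, na.rm = TRUE), lwd = 2, lty = 2)
legend("topright", legend = "Median", lty = 2, lwd = 2)
```

Annotation functions such as `title()`, `abline()` and `legend()` only add to an existing plot, so they must be called after a generating function such as `hist()` or `plot()`.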

This article introduces EDA through a course project using the ‘Individual household electric power consumption Data Set’ from the UC Irvine Machine Learning Repository (a repository of datasets for machine learning projects). …

A data scientist or analyst in the making needs to format and clean data before performing any kind of exploratory data analysis, because raw data has numerous problems that need fixing.

So when we say we are cleaning data into a tidy data set to be used for analysis later, we are actually (among many other things):

1. Removing duplicate values

2. Removing null values

3. Changing column names to readable, understandable, formatted names

4. Removing commas from numeric values, e.g. 1,000,657 to 1000657

5. Converting data types into their appropriate types for analysis
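The five steps above can be sketched in base R as follows. The `raw` data frame, its column names, and its values are hypothetical and made up for illustration, and "null values" are treated here as `NA`.

```r
# Data-cleaning sketch; the raw data frame below is hypothetical.
raw <- data.frame(
  "Customer Name" = c("Ann", "Bob", "Bob", NA),
  "Total.Sales"   = c("1,000,657", "2,500", "2,500", "300"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

clean <- unique(raw)                                   # 1. remove duplicate rows
clean <- clean[complete.cases(clean), ]                # 2. remove rows containing NA
names(clean) <- c("customer_name", "total_sales")      # 3. readable column names
clean$total_sales <- gsub(",", "", clean$total_sales)  # 4. strip commas from numbers
clean$total_sales <- as.numeric(clean$total_sales)     # 5. convert to a numeric type

str(clean)
```

The order matters for the last two steps: calling `as.numeric()` on a string that still contains commas would produce `NA` with a warning.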


Maria Gulzar

Passionate about data science, heart-breaking books, amateur writing and the joy of cooking! Ex-Editor-in-Chief, Scribes (
