What is dimension reduction, and how can we use principal component analysis in R to determine the important features?

When you work in data science and analytics, handling high-dimensional data is part of the job. You may have a dataset with 600 or even 6,000 variables: some columns prove important in modelling while others are insignificant, some are correlated with each other (e.g. weight and height), and some are entirely independent of one another. Because using thousands of features is both tedious and impractical for a model, our objective is to create a dataset with a reduced number of dimensions (all uncorrelated) explaining as much variation in the original dataset…
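As a rough illustration of the idea, here is a minimal sketch using `prcomp()` from base R. The built-in `mtcars` data and the 90% variance threshold are assumptions chosen for the example, not taken from the article.

```r
# Assumption: the built-in mtcars data stands in for a real high-dimensional dataset.
data(mtcars)

# Centre and scale so that variables measured on large scales do not dominate.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Smallest number of components explaining at least 90% of the variance
# (the 90% cut-off is an arbitrary illustrative choice).
n_keep <- which(cum_explained >= 0.90)[1]

# The reduced dataset: uncorrelated component scores for each observation.
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

# Loadings show which original variables contribute most to each kept component.
round(pca$rotation[, seq_len(n_keep), drop = FALSE], 2)
```

The columns of `reduced` are uncorrelated by construction, and the loadings give a first hint of which original features matter most.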

A guide to understanding clustering techniques, their applications, pros and cons, and creating dendrograms in R.

In the early stages of data analysis, it is important to get a high-level understanding of the multi-dimensional data and find some sort of pattern among the different variables; this is where clustering comes in. A simple way to define hierarchical clustering is:

`partitioning a huge dataset into smaller groups based on similar characteristics that would help make sense of the data in an informative way.`

Image via @jeremythomasphoto on unsplash.com

Hierarchical clustering can be classified into two types:

· Divisive (top-down): A clustering technique in which N nodes belong to a single cluster initially and are then broken down into…
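For a feel of what a dendrogram looks like in R, here is a minimal sketch with base R's `hclust()`. Note that `hclust()` builds the hierarchy bottom-up (agglomerative), the reverse of the top-down approach described above. The `mtcars` data, the complete-linkage method and the choice of three groups are assumptions made for illustration only.

```r
# Assumption: the built-in mtcars data stands in for a real dataset.
scaled <- scale(mtcars)          # put all variables on a comparable scale

# Pairwise Euclidean distances between observations (the dist() default).
d <- dist(scaled)

# Agglomerative (bottom-up) hierarchical clustering with complete linkage.
hc <- hclust(d, method = "complete")

# Draw the dendrogram.
plot(hc, main = "Dendrogram of mtcars", xlab = "", sub = "")

# Cut the tree into three groups (an arbitrary illustrative choice).
groups <- cutree(hc, k = 3)
table(groups)
```

For a genuinely divisive clustering, `diana()` from the `cluster` package is one option.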

Exploratory Data Analysis (EDA) is a major component of data science. It helps you deduce data patterns and understand data properties in a ‘quick and dirty’ way. Graphical analysis with the Base Plotting System is broadly divided into two parts: 1) graph generation (initializing the plot) and 2) graph annotation (setting its properties, attributes, axes, etc.).
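The two parts can be sketched as follows with base R graphics. The hourly readings below are invented for illustration and are not taken from any real dataset.

```r
# Hypothetical data invented for illustration: one power reading per hour.
hours <- 0:23
power_kw <- c(0.4, 0.3, 0.3, 0.3, 0.4, 0.6, 1.2, 1.8, 1.5, 1.1, 1.0, 1.1,
              1.3, 1.2, 1.0, 1.1, 1.4, 2.1, 2.6, 2.4, 1.9, 1.3, 0.8, 0.5)

# 1) Graph generation: plot() initialises the plot. Axes and labels are
#    suppressed here so they can be added in the annotation step.
plot(hours, power_kw, type = "l", axes = FALSE, xlab = "", ylab = "")

# 2) Graph annotation: each call below adds to the existing plot.
axis(1, at = seq(0, 23, by = 6))          # x axis with a tick every 6 hours
axis(2)                                   # default y axis
box()                                     # frame around the plot region
title(main = "Household power over one day",
      xlab = "Hour of day", ylab = "Power (kW)")
abline(h = mean(power_kw), lty = 2)       # dashed line at the mean reading
legend("topleft", legend = c("Reading", "Mean"), lty = c(1, 2))
```

Calling an annotation function such as `title()` or `axis()` before any plot exists raises an error, which is why generation always comes first.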

This article introduces EDA through a course project using the ‘Individual household electric power consumption Data Set’ from the UC Irvine Machine Learning Repository (a repository of datasets for machine learning projects). …

A data scientist or analyst in the making needs to format and clean data before performing any kind of exploratory data analysis, because raw data has numerous problems that need fixing.

So when we say we are cleaning data into a tidy data set to be used for analysis later, we are actually (among many other things):

1. Removing duplicate values

2. Removing null values

3. Changing column names to readable, understandable, formatted names

4. Removing commas from numeric values, e.g. 1,000,657 to 1000657

5. Converting data types into their appropriate types for analysis
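
The five steps above can be sketched in base R as follows. The data frame, its column names and its values are hypothetical, invented purely for illustration.

```r
# Hypothetical raw data; column names and values are invented for illustration.
raw <- data.frame(
  "Customer ID" = c(1, 2, 2, 3, 4),
  "Total.Sales" = c("1,000,657", "2,500", "2,500", NA, "12,345"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Remove duplicate rows.
clean <- unique(raw)

# 2. Remove rows containing missing values (NA in R).
clean <- na.omit(clean)

# 3. Change column names to readable, consistently formatted names.
names(clean) <- c("customer_id", "total_sales")

# 4. Remove commas from numeric values stored as text.
clean$total_sales <- gsub(",", "", clean$total_sales)

# 5. Convert columns to types appropriate for analysis.
clean$total_sales <- as.numeric(clean$total_sales)
clean$customer_id <- as.integer(clean$customer_id)

str(clean)
```

Dropping every row with a missing value is the bluntest option; depending on the analysis, imputing or keeping the `NA` values may be more appropriate.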


Maria Gulzar

Passionate for data science, heart-breaking books, amateur writing and the joy of cooking! Ex-Editor-in-Chief, Scribes (medium.com/scribes)
