
Data Analytics

  • neovijayk
  • Jul 6, 2020
  • 5 min read




Fig: Major steps in Data Science


In this article I am going to explain some of the important steps we generally follow in most Data Science projects. Data analysis is done to solve a problem, or problems, using the insights gained after processing the available data.

In this article we will take a look at some of the steps we generally take during data analysis. These steps are explained below.

Fig: Analyse the prepared data


STEP 1: Know your data (study the data in detail)

Developers often ignore the data or make assumptions about it (its quality, the definitions of the fields, columns, or features) and start developing code right away. When they get stuck somewhere, they spend a lot of time looking for problems in their code, only to find out later that the problem was in the data itself :D. In the process they waste valuable time and resources.

Hence studying the data is the most important thing to do in Data Science. Real-world data is really messy: plenty of impurities can be added during the data's journey from collection to storage and back to analysis. You have to make sure your data is good to go before the analysis.

First, you should know: How was the data collected (briefly, the process)? How was it stored (platform, format used)? What was the organisation's purpose in collecting it? What is the definition of each field in the data? Verify any assumptions made about the data fields. Getting precise answers and recording them for future reference, as a data dictionary, is the most important thing we should do as Data Scientists/Analysts. Here we need the help of domain experts, or domain knowledge, to verify our assumptions, definitions, and so on.

This kind of data investigation helps at every stage of project development, from pre-processing to finally gaining insights from the output. After getting this information, one can start studying the data to find any gaps in it and to get more general information about the data and the fields in it.

Let's understand this with a simple example. For the purpose of explanation, I have applied the following techniques to study the data:

  1. Getting basic information about the data using pandas functions

  2. Visualising the data using the pandas plot() function

  3. Finding gaps and outliers using box plots and the DBSCAN clustering method (a sketch of all three checks is shown below)

To see an example implementation, refer to this GitHub repository, which contains the example code and data.

You can use a similar list of techniques to check the quantity and quality of, or gaps in, any new input data. Note that this list of techniques will grow or shrink with project requirements, and with your experience over time.
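To make the three checks above concrete, here is a minimal sketch in Python. The file name sales_data.csv and the columns amount and quantity are hypothetical stand-ins, not taken from the repository mentioned above:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset and column names -- substitute your own
df = pd.read_csv("sales_data.csv")

# 1. Basic information: dtypes, non-null counts, summary statistics
df.info()
print(df.describe())
print(df.isna().sum())  # missing values per column

# 2. Quick visual checks with pandas' plot()
df["amount"].plot(kind="hist", bins=30, title="Distribution of amount")
plt.show()
df[["amount"]].plot(kind="box", title="Box plot: spot extreme values")
plt.show()

# 3. DBSCAN on scaled features: points labelled -1 are noise/outliers
features = df[["amount", "quantity"]].dropna()
scaled = StandardScaler().fit_transform(features)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)
print(f"DBSCAN flagged {(labels == -1).sum()} potential outliers")
```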

STEP 2: Handle missing, empty, or NaN data

Remove such data points, fields, or features if they are not going to impact the algorithm or contribute to solving the problem.

But what if the data is important, and will be required to make a judgement or be used as an input feature in our algorithm? In that case removal will not work, and we can instead use imputation techniques. I have explained a few of these methods in the following article:

Imputation should be done in such a way that it contributes to improving the output.
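As a minimal sketch of both options (the dataset and column names are hypothetical), using pandas and scikit-learn:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("sales_data.csv")  # hypothetical dataset

# Option 1: drop rows where a non-essential field is missing
df_dropped = df.dropna(subset=["optional_note"])

# Option 2: impute an important numeric feature with its median
imputer = SimpleImputer(strategy="median")
df[["amount"]] = imputer.fit_transform(df[["amount"]])

# Option 3: fill a categorical feature with its most frequent value
df["category"] = df["category"].fillna(df["category"].mode()[0])
```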

STEP 3: In-depth data analysis, deriving new input features

Advanced analytics allows us to dive deeper into the data. It can be used to derive new features from the existing raw data columns/fields.

Now we will take a look at some useful techniques:

  1. Exploring the plot() function's useful features in depth

  2. Detailed analysis with pandas functions: pivot_table, groupby, merge, and melt (coming soon)

These derived features can be used to gain further insights. Note that quite often you may find that deriving features and applying simple statistical techniques to them solves the problem at hand, without moving to machine learning or deep learning algorithms at all.

Therefore, always be open and flexible enough to consider possible solutions other than machine learning and deep learning, or a mix of both, before starting the project and during its development. A short sketch of feature derivation with pandas follows.
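As an illustrative sketch (the columns user_id, amount, age_group, and product_category are hypothetical), deriving new features with groupby and pivot_table might look like this:

```python
import pandas as pd

df = pd.read_csv("sales_data.csv")  # hypothetical dataset

# Derive per-user aggregate features with groupby
user_stats = df.groupby("user_id")["amount"].agg(["mean", "sum", "count"])
user_stats.columns = ["avg_purchase", "total_spend", "n_purchases"]
user_stats = user_stats.reset_index()

# Summarise behaviour across two dimensions with pivot_table
pivot = pd.pivot_table(df, values="amount", index="age_group",
                       columns="product_category", aggfunc="mean")
print(pivot)

# Merge the derived features back into the input table as new columns
df = df.merge(user_stats, on="user_id", how="left")
print(df.head())
```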

Data Preparation:

Now suppose we have prepared the input table for our algorithm but want to study the prepared features further. We can use the following steps for that purpose (a correlation-matrix sketch appears after the list):

  1. Correlation matrix: find the correlation between different input features and ask whether it makes sense to use them together as inputs to the machine learning algorithm. For more explanation, refer to this article.

  2. Causality: if we find a particular pattern or behaviour in the output, try to find the factors causing it. E.g. users visiting a store are in the age group 15-50, but we find that users aged 20 end up as buyers more frequently than users aged 30. Why? There may be some causal factor.
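A minimal correlation-matrix sketch (the file and feature names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("prepared_features.csv")  # hypothetical prepared input table

# Pairwise Pearson correlation between the numeric input features
corr = df[["avg_purchase", "n_purchases", "age"]].corr()
print(corr)

# Flag highly correlated pairs: near-duplicate features add little signal
threshold = 0.9
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(f"{a} and {b} are strongly correlated ({corr.loc[a, b]:.2f})")
```

Note that correlation alone does not establish causality; investigating the causal factors in item 2 still requires domain knowledge.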

STEP 4: Study the output and analyse it to gain insights from the data

We know the main goal of data analysis is to gain insights from the data. It is not necessary that all the insights appear only at the end of steps 1 to 3; most of the time, some of the important insights are learned while implementing those steps.

Hence, after implementing every step from pre-processing to testing, always question what we can see in the results. Our job is not just to apply techniques and tools to generate outputs from the data; it is equally important to decode and understand those outputs to gain the insights, and that can only be done through careful investigation and study of the output.

Is it true that the insights gained from the data always solve the problems data scientists are supposed to solve?

  1. The simple and straight answer is NO.

  2. It is true that industries and organisations face many problems that were initially thought to be completely solvable using data science alone, but in reality most data science projects do not solve 100% of the problem.

  3. You will most likely end up building some traditional hard-coded rules or heuristic-based solutions alongside data science techniques, because of limitations in the data, the analysis techniques, the available resources (people, computing power, finance, domain expertise), and time, among other reasons.

  4. Therefore, data science projects often end up as hybrid solutions, with some or most problems tackled using data science techniques and the rest using rule-based or heuristic solutions.

  5. Note that a data science project may fail outright, unable to solve the problem at all. Therefore, from the start and throughout development, always ask whether the project is leading to the solution we want, or whether we should stop it to avoid wasting resources.

That's it. I tried to keep this article as short as possible. I hope it helps you learn some useful steps in data analytics.

Some Useful Data Analysis information:

Sharing some useful articles and videos:

  1. Gartner Identifies Top 10 Data and Analytics Technology Trends for 2019: Augmented analytics, continuous intelligence and explainable artificial intelligence (AI) are among the top trends in data and analytics technology that have significant disruptive potential over the next three to five years, according to Gartner, Inc.


If you have any questions, feel free to ask in the comments section. Also, please like and subscribe to my blog :).
