Python for Data Science
- neovijayk
- Jul 6, 2020
- 6 min read

Fig: Major steps in Data Science
Hello friends, on this page I will share some important links and information useful for those who want to learn Python for data science and gain insights from data. Below I will discuss some of the important resources that will help you learn Python for data science by yourself.
An Editor for Python
First we will need a good editor to write, debug, and execute Python code:
You can use Anaconda-jupyter notebook (click for the installation steps)
It is free to use, easy to install, and it keeps all the Python packages and libraries updated over time.
But depending on your comfort level you can choose other available editors as well.
Also you can check Google Colaboratory (Colab). It is a free Jupyter notebook environment that requires no setup and runs entirely (writing, running, and sharing code) in the cloud. Google even provides a free GPU, so you can run, practice, or test deep learning code on this cloud-based platform as well.
After the editor is installed you can read about the basics of Python, such as the different data structures, and implement them in a Jupyter notebook. For reference, you can also use this cheat sheet (I found it on the net).

AI cheat sheet for Python basics
Some important Libraries to start with:
If you are new to Python, you can start with these very important Python libraries:
Pandas:
Just start with some basic functions to perform simple tasks with Pandas, such as reading CSV/Excel files, saving them, printing a column, etc.
The name ‘Pandas’ is derived from the term “panel data” (an econometrics term for multidimensional structured data sets).
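A minimal sketch of those first Pandas steps, using a small made-up table and a hypothetical filename in place of a real dataset:

```python
import pandas as pd

# Build a tiny DataFrame by hand (hypothetical data) and save it as CSV
df = pd.DataFrame({"brand": ["A", "B", "A"], "units": [10, 5, 8]})
df.to_csv("sales.csv", index=False)

# Read the CSV back and inspect it
loaded = pd.read_csv("sales.csv")
print(loaded["brand"])   # print a single column
print(loaded.head())     # preview the first rows
```

`read_excel`/`to_excel` work the same way for Excel files (they need the `openpyxl` package installed).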
Numpy:
It is mainly used for mathematical computations.
NumPy is much faster; whenever possible, try to use NumPy for computation.
It addresses the slowness problem partly by providing multidimensional arrays, along with functions and operators that operate efficiently on those arrays.
Therefore, start to play with this library and its basic functions for mathematical computations.
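A small sketch of what "efficient operations on arrays" looks like in practice: whole-array (vectorized) math instead of Python loops, plus a multidimensional array:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Vectorized: one expression operates on the whole array at C speed,
# instead of a slow Python-level for-loop
squares = a ** 2
mean_square = squares.mean()

# Multidimensional arrays: reshape and reduce along an axis
m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
col_sums = m.sum(axis=0)         # sums down each column
print(col_sums)
```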
Matplotlib:
Mainly used for visualization.
Using this, try plotting graphs and data from Pandas, etc.
pyplot is a matplotlib module which provides a MATLAB-like interface.
matplotlib is designed to be as usable as MATLAB, with the added ability to use Python, and with the advantage that it is free.
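A minimal pyplot sketch plotting a small made-up Pandas series; it saves the figure to a file (`Agg` backend) so it also runs on machines without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to file, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data to plot
df = pd.DataFrame({"day": [1, 2, 3, 4], "sales": [10, 12, 9, 15]})

fig, ax = plt.subplots()
ax.plot(df["day"], df["sales"], marker="o")  # MATLAB-like plotting call
ax.set_xlabel("day")
ax.set_ylabel("sales")
ax.set_title("Daily sales")
fig.savefig("sales.png")
```

In a Jupyter notebook you would normally just call `plt.show()` (or rely on inline display) instead of saving to a file.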
Scipy:
SciPy builds on the NumPy array object and is part of the NumPy stack (which includes tools like Matplotlib, Pandas, and SymPy, along with an expanding set of scientific computing libraries).
The NumPy stack is also sometimes referred to as the SciPy stack.
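A quick sketch of SciPy building on NumPy arrays: descriptive statistics on an array, and a one-line numerical minimization (the function minimized here is a made-up example):

```python
import numpy as np
from scipy import stats, optimize

# Descriptive statistics over a NumPy array
data = np.array([2.1, 2.5, 3.0, 2.8, 2.2])
print(stats.describe(data))  # count, min/max, mean, variance, skew, kurtosis

# Numerical optimization: find the minimum of f(x) = (x - 3)^2
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)  # close to 3
```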
An initial introduction and basic understanding, with some hands-on implementation of these libraries in Python code, is important. With time, as you practice more and more, you will gain a deeper understanding of how these libraries work, which functions are important, and how to use them in the various steps/phases of data science, from pre-processing to gaining data insights and visualization.
How to learn Python for Data Science by myself?
You will find lots of blogs and websites from which you can start learning. Here are some sources you can learn from:
Courses on Coursera, Edx
To see how to apply for the Financial Aid on Coursera: Link
YouTube channels
Blogs, published Papers,
Books
Of the above four sources, the first three you can access on the internet free of cost. But remember, while learning, try to execute the code or examples by yourself.
Where can I get answers to the questions that come up during code implementation?
StackOverFlow:
If you check my profile on StackOverFlow you will find I have answered some of the questions that people ask on this forum.
In the same way, you can search for the problems or questions that you face during a project and get answers from such forums.
Github:
From GitHub you can get code, data, and explanations to practice with yourself.
If you check my GitHub code repository, I am uploading code that you can practice with yourself to get a better idea while reading my articles.
In the same way, you can find some good example code on GitHub.
After studying Python for a few days you may say, now I am ready to start working on ML projects. But wait, I don't have data and also no problem statement, LOL :D. Don't worry, you can refer to the following website where you can find both:
Kaggle: on this website you will find data + problem statements + solutions uploaded by other people.
Since we download or upload data as zip files on this site, to learn how to unzip these files in Python refer to this link.
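Unzipping in Python needs nothing beyond the standard library. A self-contained sketch (it first creates a small zip as a stand-in for a Kaggle download):

```python
import zipfile
import os

# Create a small zip to work with (stand-in for a downloaded Kaggle archive)
with zipfile.ZipFile("data.zip", "w") as zf:
    zf.writestr("train.csv", "id,label\n1,0\n2,1\n")

# Extract everything into a folder
with zipfile.ZipFile("data.zip") as zf:
    print(zf.namelist())      # see what is inside before extracting
    zf.extractall("data")

print(os.listdir("data"))
```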
Before starting a new project, what should I do?
Always look at the versions of the packages or libraries that you are going to use for the project. To learn how to install new packages and check their versions, please refer to this article, where I have explained how to install the required packages from the Jupyter notebook itself: link
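A quick sketch of the version check itself, runnable from any notebook cell or script:

```python
import sys
import numpy as np
import pandas as pd

# Check the interpreter and key library versions before starting a project
print("python:", sys.version.split()[0])
print("numpy:", np.__version__)
print("pandas:", pd.__version__)

# Inside a Jupyter notebook you can install a missing package with the
# built-in magic:  %pip install <package-name>
```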
The above list of resources is more than enough to gain a good grip on Python for machine learning. Though I am intentionally keeping the list short, you can refer to other sources available on the internet as per your requirements. But to start with data science, the above information will surely help you improve your knowledge of Python and how to use it for data science.
Now let's take a look at some important libraries in Python and where these libraries are generally used in data science projects:
Python for data pipeline:
Generally in big organisations the data pipeline development job is done by Data Engineers, but since most data analysts/scientists work in start-ups (in India ;D), we often have to do this job ourselves, using some tools or Python code that executes and fetches data from the client's data source.
In Python we have APIs available to connect to, interact with, and fetch data from the client's database.
Initially it takes time to learn and establish a data pipeline from the data source (database server, cloud-based servers, etc.) to the data destination using Python code, but with some experience in data pipeline development you will take less time to establish new pipelines for new projects, and you will also start to enjoy the process ;D
In this phase you need to develop expertise in handling Python APIs, authentication processes, different operations with different APIs, etc. This will come with experience and your projects' needs.
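A minimal sketch of the "connect, query, hand off to Pandas" pattern. Here the standard-library `sqlite3` stands in for a client's database; a real pipeline would use the appropriate driver (e.g. `psycopg2` for PostgreSQL) plus proper authentication, but the shape of the code is the same:

```python
import sqlite3
import pandas as pd

# sqlite3 in-memory DB stands in for the client's database server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
conn.commit()

# Fetch query results straight into a DataFrame: the usual hand-off point
# between the pipeline and the analysis code
df = pd.read_sql_query("SELECT * FROM orders", conn)
conn.close()
print(df)
```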
Python for initial data study:
Now this is important: you should study the client's data even before starting the project, or at the time of creating use cases (if you can access the data), so that the data science use cases you propose to the client actually help grow their business.
During this initial phase we need to study a sample of the available data captured by the client: its quality, quantity, basic statistics, gaps in the data, and many more things, since if we don't have enough good data for analysis then it is not a good idea to proceed further.
Hence, in some part of the data study we might use Python to explore the data further, and for that you should have expertise in Python libraries such as Pandas, NumPy, and Matplotlib. And if the data is images, then OpenCV, SciPy, PIL, pydicom, etc.
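A sketch of that first look at quantity, quality, basic statistics, and gaps, on a small hypothetical sample with missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of client data, with deliberate gaps (NaN)
df = pd.DataFrame({
    "age":   [25.0, 31.0, np.nan, 40.0],
    "spend": [120.0, 80.5, 95.0, np.nan],
})

print(df.shape)         # quantity: rows x columns
print(df.dtypes)        # quality: column types
print(df.describe())    # basic statistics per numeric column
print(df.isna().sum())  # gaps: missing values per column
```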
Python for pre-processing:
Mostly in data science we use Python libraries such as Pandas, NumPy, scikit-learn, datetime, etc. for pre-processing, and if the dataset consists of images then we use OpenCV, Matplotlib, NumPy, SciPy, PIL, etc.
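One common pre-processing sketch combining Pandas and scikit-learn: fill missing values, then standardize the features (hypothetical data; real pipelines pick imputation and scaling to fit the problem):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "height": [150.0, 160.0, None, 170.0],
    "weight": [50.0, 60.0, 65.0, 70.0],
})

# Impute: replace missing values with the column mean
df = df.fillna(df.mean())

# Scale: transform each feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(df)
print(scaled.mean(axis=0))  # each column mean is now ~0
```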
Getting insights from the Data:
Data Analysis to gain insights from the data using Python:
In data analysis we use the Pandas, NumPy, and Matplotlib libraries to gain useful insights from the data that can directly be used as input for business decisions, for example the top-selling brands in the last 15 days, the latest trends in customer purchases, visits, etc.
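For instance, the "top-selling brands" insight is a one-line group-and-sort in Pandas (made-up sales data standing in for 15 days of transactions):

```python
import pandas as pd

# Hypothetical transactions over the period of interest
sales = pd.DataFrame({
    "brand": ["A", "B", "A", "C", "B", "A"],
    "units": [3, 1, 2, 5, 4, 1],
})

# Total units per brand, highest first: the business-ready answer
top = sales.groupby("brand")["units"].sum().sort_values(ascending=False)
print(top)
```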
Machine Learning to gain insights from the data using Python:
Here comes the most used library for machine learning algorithms: scikit-learn.
Using scikit-learn we can directly call models as functions to train and test. Scikit-learn also provides functions for pre-processing and for dividing data into training and testing datasets, which are given as input to the model.
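A minimal sketch of that split/fit/score workflow on scikit-learn's bundled iris dataset (logistic regression chosen only as an example model):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Divide the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train the model on the training set, then evaluate on the held-out test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```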
Deep Learning using Python:
Before jumping into deep learning it is important to get good exposure to Python and its libraries like NumPy, SciPy, Pandas, and Matplotlib; frameworks like Theano, TensorFlow, and Keras; an OS like Windows or any Linux distribution (e.g. Ubuntu); and prior basic knowledge of linear algebra, calculus, statistics, and basic machine learning techniques.
As deep learning algorithms require heavy computation, knowing some basics of the hardware requirements (RAM, GPU, etc.), depending on the data and algorithms, is also important.
Useful Features of Jupyter Notebook:
Progress Bars in Python (and pandas!) by Peter Nistrup. Link
Some other interesting reads related to Python that you may find useful:
5 Python features I wish I had known earlier by Eden Au. Link
How should I start learning Python? Link
Which is the best book for learning python for absolute beginners on their own? Link
Why do some people say that object-oriented programming in Python is a joke? Link
Cheet-sheets for AI by Stefan Kojouharov. Link
Machine Learning Books you should read in 2020 by Przemek Chojecki. Link
Combining Pandas DataFrames: The easy way by Benedikt Droste. Link
What is the difference between join and merge in Pandas? Link