Data Pre-processing

neovijayk
Jul 6, 2020

Raw data can be structured or unstructured. Pre-processing prepares the input data required for analysis. It is one of the most important tasks in data science and should be done carefully.

Diagram: Major steps in data science

Before Pre-processing:

Handling zip files: explanation of the steps
to create a zip file using shutil
to unzip an input .zip file using pyunpack in Python
Create a checklist
From previous project experience, you can create one or more checklists of items that need to be verified on raw data before it is used in a project
Items in the list can check data quantity, quality, gaps, etc.
For a new or existing project, we can then use some or all of the checklist items as a litmus test that new raw data must pass before it is used in the project
This can help to speed up the pre-processing and processing steps
This can expose hidden problems in the raw data, if any, at a very early stage, which saves time and resources in later stages of development
Data and Statistics: (coming soon)
mean, median, mode, 1st-2nd-3rd quartile, percentiles, standard deviation, range of the data
Data distributions and Probability distributions: (coming soon)
frequency distribution and probability distribution
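
The checklist and the summary statistics listed above can be sketched together with Python's standard statistics module. This is only an illustration under assumptions: the helper names summarize and run_checklist, the thresholds min_rows and max_missing_ratio, and the use of None to mark a gap are all hypothetical and not taken from the original posts.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    # quantiles(n=4) returns the three cut points: 1st, 2nd and 3rd quartile
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "std": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


def run_checklist(rows, min_rows, max_missing_ratio):
    """Apply simple quantity and gap checks; None marks a missing value."""
    missing = sum(1 for r in rows if r is None)
    return {
        "enough_rows": len(rows) >= min_rows,
        "few_gaps": bool(rows) and missing / len(rows) <= max_missing_ratio,
    }


raw = [4, 8, None, 15, 16, 23, 42, 8]
checks = run_checklist(raw, min_rows=5, max_missing_ratio=0.2)
stats = summarize([v for v in raw if v is not None])
print(checks)
print(stats)
```

Raw data that fails any check can be rejected or fixed before the later steps start. A real checklist would likely hold more items, chosen from your own project experience.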

Some useful Pre-processing techniques:

Diagram: Making raw data ready for analysis

Data Cleaning (Coming soon)
Data imputing
Useful String operations implementation examples. (Coming soon)
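
As a minimal sketch of the data imputing item above, the snippet below fills missing values with the mean of the observed values. It assumes that gaps are represented as None and that mean imputation suits the data; the function name impute_mean is hypothetical, and other strategies (median, most frequent value, model-based) may fit better in practice.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 30, 40, None]
imputed = impute_mean(ages)
print(imputed)
```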

Image Processing:

Some of the useful techniques in image processing:

Scale, blur, and convert an image to grayscale using OpenCV and PIL (coming soon)
Image rotation/tilting by an angle x using SciPy, imutils, and OpenCV functions.
Automatic detection and correction of the tilt (inclination or rotation angle) of an image using Canny edge detection and Hough line detection.
Automatic detection and cropping of a white border or padding in an image
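
To illustrate the last item, here is one possible way to detect and crop a white border using NumPy only. It is a sketch under assumptions, not necessarily the approach used in the linked post: the image is assumed to be a 2-D grayscale uint8 array (for example as returned by cv2.imread(path, cv2.IMREAD_GRAYSCALE)), and the function name crop_white_border and the threshold of 250 are hypothetical choices.

```python
import numpy as np


def crop_white_border(gray, threshold=250):
    """Crop away rows and columns that contain only near-white pixels.

    gray: 2-D uint8 array; threshold is an assumed cutoff for "white".
    """
    mask = gray < threshold  # True where the pixel is content, not border
    if not mask.any():
        return gray  # image is entirely white: nothing to crop to
    rows = np.flatnonzero(mask.any(axis=1))  # rows that contain content
    cols = np.flatnonzero(mask.any(axis=0))  # columns that contain content
    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Synthetic example: a white 6x8 image with a 2x3 black block inside.
image = np.full((6, 8), 255, dtype=np.uint8)
image[2:4, 3:6] = 0
cropped = crop_white_border(image)
print(cropped.shape)
```

A fixed threshold works for clean scans; noisy borders may need blurring or a tolerance on how many dark pixels a border row may contain.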

Divide features into training and testing:

random.seed(): What does it do? (coming soon)
Handle Categorical Features. (coming soon)
Feature selection (coming soon)
Divide features into training and testing sets. Get the index information of the train and test features.
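
The split with index tracking can be sketched with the standard library alone. The names split_with_indices, test_ratio, and seed are hypothetical. A local random.Random(seed) generator is used here in place of the global random.seed() call, with the same goal of making the shuffle reproducible.

```python
import random


def split_with_indices(features, test_ratio=0.25, seed=42):
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(len(features)))
    rng = random.Random(seed)  # fixed seed gives the same split every run
    rng.shuffle(indices)
    n_test = int(round(len(features) * test_ratio))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    train = [features[i] for i in train_idx]
    test = [features[i] for i in test_idx]
    return train, test, train_idx, test_idx


features = [[i, i * 2] for i in range(8)]
train, test, train_idx, test_idx = split_with_indices(features)
print("train indices:", train_idx)
print("test indices:", test_idx)
```

Keeping train_idx and test_idx makes it possible to trace each prediction back to its row in the original data.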