Data Pre-processing
- neovijayk
- Jul 6, 2020
- 2 min read
Raw data can be structured or unstructured. Pre-processing prepares the input data required for the analysis. It is one of the important tasks in data science and should be done carefully.

Dig: Major steps in Data Science
Before Pre-processing:
- Handling zip files: explanation of steps 
- Create a checklist
- From previous project experience, you can create one or more checklists of the checks you found necessary to run on raw data before using it in a project
- Items in the list can check data quantity, quality, gaps, etc.
- For a new or existing project, some or all of the checklist items can then serve as a litmus test that new raw data must pass before it is used in the project
- This can help speed up the pre-processing and processing steps
- It can expose hidden problems in the raw data at a very early stage, which saves time and resources in later stages of development
- Data and Statistics: (coming soon) 
- mean, median, mode, 1st, 2nd and 3rd quartiles, percentiles, standard deviation, range of the data
- Data distributions and Probability distributions: (coming soon) 
- frequency distribution and probability distribution 
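Until the detailed posts are available, here is a minimal sketch of the statistics and distributions listed above, using only the Python standard library. The sample values are hypothetical, and the "inclusive" quartile method is just one of several common conventions.

```python
import statistics
from collections import Counter

# Hypothetical sample; any list of numbers works here.
values = [2, 4, 4, 5, 7, 9, 10, 12]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# quantiles(n=4) returns three cut points: the 1st, 2nd and 3rd quartile.
# Using n=100 instead would return the 99 percentile cut points.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

stdev = statistics.stdev(values)          # sample standard deviation
data_range = max(values) - min(values)    # range of the data

# Frequency distribution: how often each distinct value occurs.
frequency = Counter(values)

# Empirical probability distribution: frequency divided by the sample size.
probability = {value: count / len(values) for value, count in frequency.items()}
```

The second quartile is the median, so `q2` and `median` should agree, which is a quick sanity check on any quartile computation.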
Some useful Pre-processing techniques:

Dig: Making Raw data ready for the analysis
- Data Cleaning (Coming soon) 
- Useful String operations implementation examples. (Coming soon) 
Image Processing:
Some of the useful techniques in image processing:
- Scale, blur and convert an image to grayscale using OpenCV and PIL (coming soon)
- Rotate or tilt an image by a given angle using SciPy, imutils and OpenCV functions
- Automatically detect and crop a white border or padding in an image
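The idea behind white border detection can be sketched without any image library. The example below is an illustration only, not the OpenCV approach the linked post may use: it assumes a grayscale image stored as a list of rows of 0-255 integers, and the function name `crop_white_border` and the threshold of 250 are hypothetical choices.

```python
def crop_white_border(image, threshold=250):
    """Return the image without its near-white border.

    `image` is a grayscale image given as a list of rows of 0-255 ints.
    Pixels with a value >= `threshold` are treated as white (an assumption).
    """
    # Indices of rows and columns that contain at least one non-white pixel.
    rows = [r for r, row in enumerate(image) if any(p < threshold for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] < threshold for row in image)]
    if not rows or not cols:
        return []  # the image is entirely white, nothing to keep

    # The bounding box of the content is spanned by the first and last hits.
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Hypothetical 4x5 image: dark content surrounded by white padding.
page = [
    [255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255],
    [255,  30, 255, 255, 255],
    [255, 255, 255, 255, 255],
]
cropped = crop_white_border(page)
```

White pixels inside the bounding box are kept, because only the outer padding is removed.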
Divide features into training and testing:
- random.seed(): What does it do? (coming soon) 
- Handle Categorical Features. (coming soon) 
- Feature selection (coming soon) 
- Divide features into training and testing. Get index information of train and test features. 
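As a rough sketch of the last item, the split below shuffles row indices with a seeded random generator and returns the indices themselves, so you always know which rows ended up in train and test. The helper name `train_test_split_indices`, the 25% test fraction and the seed value are assumptions made for illustration.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=42):
    """Return (train_indices, test_indices) for a dataset of n_samples rows."""
    # A private generator seeded with `seed` gives the same shuffle every run.
    # Calling random.seed(seed) on the module-level generator has the same
    # effect, but a private generator leaves global random state untouched.
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = int(round(n_samples * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# Hypothetical feature matrix with 8 rows.
features = [[i, i * i] for i in range(8)]

train_idx, test_idx = train_test_split_indices(len(features))
X_train = [features[i] for i in train_idx]
X_test = [features[i] for i in test_idx]
```

Calling the function again with the same seed returns the same indices, which is what makes the train and test sets reproducible across runs.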
Normalization & Standardization:
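A minimal sketch of the two techniques, assuming min-max scaling to the range 0 to 1 for normalization and z-scores based on the population standard deviation for standardization. The function names and sample data are hypothetical.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) std deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column, nothing to scale
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column.
heights = [150, 160, 170, 180, 190]

normalized = min_max_normalize(heights)
standardized = standardize(heights)
```

When the data is split into train and test, the minimum, maximum, mean and standard deviation are usually computed on the training set only and then reused for the test set.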
Generating same set of data (train, test)
Some useful methods, packages:
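Suppress a Python warning using the following lines of code:
import warnings
# `message` is a regular expression matched (case-insensitively) against the start of the warning text
warnings.filterwarnings("ignore", message="TYPE THE WARNING MESSAGE THAT YOU ARE GETTING")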









