Data Pipeline
- neovijayk
- Jul 6, 2020
- 2 min read
Why do we need to create a data pipeline?
To execute machine learning projects we need powerful computing platforms, which are available from cloud providers.
Several such cloud computing platforms are available today.
But to run a model on these platforms, the data must be accessible from them, both during prototype development and afterwards for testing and further improvement.
That data may reside on another machine or on different storage platforms (for example, a client’s databases). Hence a data pipeline can be built to bring the data for our model into one place.
On this page I will discuss a few useful data pipeline architectures, tools, and techniques that I have learned from experience and found useful.

Fig: Collecting captured data in one place using a data pipeline
Google Cloud:
Data pipeline using Python code: example
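The full Python example is in the linked post; as a rough illustration of the idea, here is a minimal sketch of pulling a data file from Google Cloud Storage onto the machine where the model runs, using the google-cloud-storage client. The bucket name, object name, and local path are placeholders, not the actual values from my projects.

```python
# Minimal sketch: copy a data file from Google Cloud Storage to the local
# machine so the model can read it. Bucket/object names are placeholders.
from google.cloud import storage  # pip install google-cloud-storage

def fetch_from_gcs(bucket_name: str, blob_name: str, local_path: str) -> None:
    client = storage.Client()              # uses the default credentials on the machine
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.download_to_filename(local_path)  # write the object to disk

if __name__ == "__main__":
    fetch_from_gcs("my-training-data", "exports/latest.csv", "/tmp/latest.csv")
```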
Data pipeline tools:
Stitch
About Stitch: features and limitations of Stitch (coming soon)
End-to-end automation of data extraction using Stitch, from data source to destination (coming soon)
Talend Open Studio (Open source)
Other important tools:
Apache Airflow (Open source)
Apache Airflow for data pipelines and machine learning pipelines (a minimal DAG sketch follows this list)
Scheduling and running a Talend data pipeline job with Apache Airflow to fetch data from Google Cloud Storage to a destination machine (coming soon)
Scheduling and running a Talend data pipeline job with Apache Airflow to write or store data files from a local machine directory to Google Cloud Storage (coming soon)
YouTube Video – Industrial Machine Learning Pipelines with Python & Airflow by Alejandro Saucedo
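The Airflow posts above are still marked coming soon; as a placeholder, here is a minimal sketch of a daily data-fetch DAG, using the Airflow 2.x import path. The DAG id, schedule, and the fetch_data callable are assumptions for illustration, not the actual Talend job described in those posts.

```python
# Minimal sketch of an Airflow DAG that runs a daily data-fetch task.
# The fetch_data callable, DAG id, and schedule are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_data():
    # Placeholder: download source files to the machine that runs the model.
    print("fetching data ...")

with DAG(
    dag_id="daily_data_pipeline",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    fetch_task = PythonOperator(
        task_id="fetch_data",
        python_callable=fetch_data,
    )
```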
Apache Kafka (Open source)
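Kafka is useful when the data arrives as a stream rather than as files. A minimal sketch with the kafka-python package is below; the topic name, broker address, and the record payload are placeholders.

```python
# Minimal sketch of publishing and reading records with kafka-python.
# Topic name and broker address are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-data", b'{"sensor_id": 1, "value": 42.0}')  # publish one record
producer.flush()

consumer = KafkaConsumer(
    "sensor-data",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read from the beginning of the topic
    consumer_timeout_ms=5000,       # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```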
MySQL Connector API in Python:
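When the source data sits in a client's MySQL database, the mysql-connector-python package can pull it directly. A minimal sketch is below; the connection details, table, and column names are placeholders for illustration.

```python
# Minimal sketch of reading rows with mysql-connector-python.
# Connection details and the table/column names are placeholders.
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost",
    user="ml_user",
    password="secret",
    database="training_data",
)
cursor = conn.cursor()
cursor.execute("SELECT id, feature, label FROM samples LIMIT 10")
for row in cursor.fetchall():
    print(row)  # each row is a tuple of column values
cursor.close()
conn.close()
```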
Some useful blogs from other websites:
Real Time Data Engineering Pipeline for Machine Learning by Engineering@ZenOfAI. Link