
Data Pipeline

  • neovijayk
  • Jul 6, 2020
  • 2 min read

Why do we need to create a data pipeline?

  1. Machine learning projects need powerful computing resources, and these are available on cloud computing platforms.

  2. Several cloud computing platforms are currently available.

  3. To run a model on one of these platforms, the data must be accessible from that platform, both during prototype development and afterwards for testing and further improvements.

  4. This data may reside on another machine or on different storage platforms (for example, in a client's databases). A data pipeline can therefore be built to bring the data for our model into one place.

  5. On this page I will discuss a few data pipeline architectures, tools and techniques that I have learned from experience and found useful.

Diagram: Collecting captured data in one place using a data pipeline
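
As a minimal sketch of the idea in points 3 and 4 above, the following Python snippet gathers rows from two sources into one destination table that a model could read from. It uses only the standard library. The CSV text, the in-memory SQLite databases, the table names `readings` and `training_data`, and the functions `extract_from_csv`, `extract_from_database`, `load` and `run_pipeline` are all assumptions made for illustration; they are not part of any tool discussed on this page.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export that lives on another machine.
CSV_EXPORT = "sensor_id,reading\na1,0.5\na2,0.7\n"


def extract_from_csv(text):
    """Yield (sensor_id, reading) rows from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row["sensor_id"], float(row["reading"])


def extract_from_database(conn):
    """Yield (sensor_id, reading) rows from a source database."""
    yield from conn.execute("SELECT sensor_id, reading FROM readings")


def load(dest, rows, source_name):
    """Append rows to the single destination table, tagged with their origin."""
    dest.executemany(
        "INSERT INTO training_data (sensor_id, reading, source) VALUES (?, ?, ?)",
        [(sensor_id, reading, source_name) for sensor_id, reading in rows],
    )
    dest.commit()


def run_pipeline():
    # Hypothetical source 2: stands in for a client's database.
    client_db = sqlite3.connect(":memory:")
    client_db.execute("CREATE TABLE readings (sensor_id TEXT, reading REAL)")
    client_db.execute("INSERT INTO readings VALUES ('b1', 1.2)")

    # Destination: the one place the model reads its data from.
    dest = sqlite3.connect(":memory:")
    dest.execute(
        "CREATE TABLE training_data (sensor_id TEXT, reading REAL, source TEXT)"
    )

    load(dest, extract_from_csv(CSV_EXPORT), "csv_export")
    load(dest, extract_from_database(client_db), "client_db")

    return dest.execute(
        "SELECT sensor_id, reading, source FROM training_data ORDER BY sensor_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

Calling `run_pipeline()` returns the combined rows from both sources. In a real setup the in-memory databases would presumably be replaced by connections to the actual source and destination systems, and a scheduler would run the job at regular intervals.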


Google Cloud:

Data Pipeline using Python code Example:

Data pipeline tools:

Stitch

  1. About Stitch: features and limitations of Stitch (coming soon)

  2. End-to-end automation of data extraction using Stitch, from data source to destination (coming soon)

Talend Open Studio (Open source)

Other important tools:

Apache Airflow (Open source)

  1. Apache Airflow for data pipelines and machine learning pipelines

  2. Apache Airflow installation steps on Ubuntu

  3. Apache Airflow single node and multi-node Architecture

  4. Scheduling and running a Talend data pipeline job using Apache Airflow to fetch data from Google Cloud Storage to a destination machine (coming soon)

  5. Scheduling and running a Talend data pipeline job using Apache Airflow to store data files from a local machine directory in Google Cloud Storage (coming soon)

  6. YouTube Video – Industrial Machine Learning Pipelines with Python & Airflow by Alejandro Saucedo

  7. Article – Problems faced by Bluecore Engineering with Apache Airflow operators and how they solved them using Kubernetes operators

Apache Kafka (Open source)

MySQL Connector API in Python:

Some useful blogs from other websites:

  1. Real Time Data Engineering Pipeline for Machine Learning by Engineering@ZenOfAI

© 2023 by my-learnings.   Copy rights Vijay@my-learnings.com
