# Data Pipelines
## What is a Data Pipeline?
“An arbitrarily complex chain of processes that manipulate data, where the output data of one process becomes the input to the next.”
“A process to take raw data and transform it in a way that is usable by the entire organization.”
“A data processing pipeline is a collection of instructions to read, transform, or write data, designed to be executed by a data processing engine.”
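To make the first definition concrete (each process's output becomes the next process's input), here is a minimal sketch in plain Python. The stage names `parse`, `clean`, and `summarize` are hypothetical, chosen only for illustration:

```python
# Each stage is an ordinary function; chaining the calls makes
# the output of one process the input to the next.

def parse(raw: str) -> list[dict]:
    """Read raw CSV-like text into records."""
    header, *rows = raw.strip().splitlines()
    keys = header.split(",")
    return [dict(zip(keys, row.split(","))) for row in rows]

def clean(records: list[dict]) -> list[dict]:
    """Drop records with missing values."""
    return [r for r in records if all(r.values())]

def summarize(records: list[dict]) -> int:
    """Reduce the records to a final count."""
    return len(records)

raw = "id,name\n1,ada\n2,\n3,grace"
result = summarize(clean(parse(raw)))  # parse -> clean -> summarize
print(result)  # 2 (the record with a missing name is dropped)
```

Keeping each stage a plain function makes it independently testable; real pipelines run the same kind of composition under an orchestrator or a processing engine.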
A data pipeline can have these characteristics (a code sketch follows the list):
- 1 or more data inputs.
- 1 or more data outputs.
- Optional filtering.
- Optional transformation, including schema changes (adding or removing fields) and transforming the format.
- Optional aggregation, including group by, joins, and statistics.
- Other robustness features, such as retries, checkpointing, and monitoring.
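Here is a minimal sketch of these characteristics in plain Python, assuming two in-memory sources and made-up field names (`user`, `country`, `amount`); a real pipeline would read from and write to external systems:

```python
from collections import defaultdict
from itertools import chain

# Two hypothetical input sources (1 or more data inputs); in practice
# these might be files, message queues, or database tables.
source_a = [
    {"user": "ada",   "country": "UK", "amount": "5"},
    {"user": "bot",   "country": "UK", "amount": "1"},
]
source_b = [
    {"user": "grace", "country": "US", "amount": "7"},
    {"user": "ada",   "country": "UK", "amount": "3"},
]

# Filtering: drop records we do not want downstream.
kept = (e for e in chain(source_a, source_b) if e["user"] != "bot")

# Transformation with a schema change: cast "amount" to int, drop "country".
transformed = ({"user": e["user"], "amount": int(e["amount"])} for e in kept)

# Aggregation: group by user and sum the amounts.
totals: dict[str, int] = defaultdict(int)
for record in transformed:
    totals[record["user"]] += record["amount"]

# Output: a real pipeline might write to a warehouse; here we just print.
for user, total in sorted(totals.items()):
    print(user, total)  # ada 8, then grace 7
```

Because the filtering and transformation stages are generator expressions, records stream through one at a time rather than materializing intermediate lists, which is the same shape that batch and streaming engines generalize.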
## Who Needs a Data Pipeline?
A data pipeline is most useful for organizations that:
- Generate, rely on, store, or maintain large amounts of data, or data from multiple sources.
- Require real-time or highly sophisticated data analysis.
- Store data in the cloud.
Most of the companies you interface with on a daily basis — and probably your own — would benefit from a data pipeline.