Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists – data engineers – to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists.
To understand data engineering in simple terms, let’s start with data sources. Within a large organization, there are usually many different types of operations management software (e.g., ERP, CRM, production systems, etc.), all of which contain different databases with varied information. In addition, data can be stored as separate files or even pulled from external sources in real time (such as various IoT devices). As the number of data sources multiplies, having data scattered all over in various formats prevents the organization from seeing the full and clear picture of its business state.
The process of moving data from one system to another, be it a SaaS application, a data warehouse (DW), or just another database, is maintained by data engineers (read on to learn more about the role and skillset of this specialist).
A data pipeline is basically a set of tools and processes for moving data from one system to another for storage and further handling. It captures datasets from multiple sources and inserts them into some form of database, another tool, or app, providing quick and reliable access to this combined data for teams of data scientists, BI engineers, data analysts, etc.
Constructing data pipelines is the core responsibility of data engineering. It requires advanced programming skills to design a program for continuous and automated data exchange. A data pipeline is commonly used for
moving data to the cloud or to a data warehouse,
wrangling the data into a single location for convenience in machine learning projects,
integrating data from various connected devices and systems in IoT,
copying databases into a cloud data warehouse, and
bringing data to one place in BI for informed business decisions.
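To make the idea more concrete, here is a minimal sketch of how such a pipeline might be wired up and scheduled. Apache Airflow appears here purely as an illustration; the article doesn’t prescribe any specific orchestrator, and the task names, schedule, and pipeline name are all made up:

```python
# A hypothetical daily pipeline: extract -> transform -> load, run on a schedule.
# Assumes a recent Apache Airflow (2.4+); every name below is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull new records from the source systems (placeholder)."""


def transform():
    """Standardize formats, units, and attributes (placeholder)."""


def load():
    """Write the cleaned records to the warehouse (placeholder)."""


with DAG(
    dag_id="daily_sales_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",              # run the jobs once a day
    catchup=False,
):
    (
        PythonOperator(task_id="extract", python_callable=extract)
        >> PythonOperator(task_id="transform", python_callable=transform)
        >> PythonOperator(task_id="load", python_callable=load)
    )
```

The point is simply that the three jobs run automatically, in order, on a schedule; what each of them actually does is unpacked in the ETL steps below.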
You can read our detailed explanatory post to learn more about data pipelines, their components, and types. Now, let’s explore what ETL stands for.
Extract — retrieving incoming data. At the start of the pipeline, we’re dealing with raw data from numerous separate sources. Data engineers write pieces of code – jobs – that run on a schedule, extracting all the data gathered during a certain period.
Transform — standardizing data. Data from disparate sources is often inconsistent. So, for efficient querying and analysis, it must be modified. Having extracted the data, engineers execute another set of jobs that transforms it to meet the format requirements (e.g., units of measure, dates, attributes like color or size). Data transformation is a critical function, as it significantly improves data discoverability and usability.
Load — saving data to a new destination. After bringing data into a usable state, engineers can load it to the destination, which is typically a relational database management system (RDBMS), a data warehouse, or Hadoop. Each destination has its specific practices to follow for performance and reliability.
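As a deliberately simplified illustration of these three steps, the sketch below reads a hypothetical CSV export, standardizes dates and amounts, and saves the result into a local SQLite file standing in for the real destination. The file name, column names, and formats are all invented for the example:

```python
# A toy ETL job: the source file, columns, and target table are assumptions.
import csv
import sqlite3
from datetime import datetime


def extract(path: str) -> list[dict]:
    """Extract: read the raw rows gathered since the last run."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: standardize dates and amounts so records are consistent."""
    cleaned = []
    for row in rows:
        order_date = datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat()
        amount = round(float(row["amount"]), 2)  # unify precision
        cleaned.append((row["order_id"], order_date, amount))
    return cleaned


def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Load: save the standardized records to the destination store."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(order_id TEXT, order_date TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


if __name__ == "__main__":
    load(transform(extract("daily_orders.csv")))
```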
Once the data is transformed and loaded into a single storage location, it can be used for further analysis and business intelligence operations such as generating reports and creating visualizations.
NB: Despite being automated, a data pipeline must be constantly maintained by data engineers. They repair failures, update the system by adding/deleting fields, or adjust the schema to the changing needs of the business.
Setting up secure and reliable data flow is a challenging task. There are so many things that can go wrong during data transportation: data can be corrupted, hit bottlenecks causing latency, or data sources may conflict, generating duplicate or incorrect records. Getting data into one place requires careful planning and testing to filter out junk data, eliminate duplicates and incompatible data types, and obfuscate sensitive information, all without missing critical data.
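As a small, hypothetical example of what that planning boils down to in code, a pipeline might validate every batch before loading it: rejecting rows with missing fields, dropping duplicates, and masking sensitive values. All the rules and field names below are invented for illustration:

```python
# Hypothetical pre-load validation for a batch of records.
# The required fields, the dedup key, and the masking rule are assumptions.
import hashlib

REQUIRED_FIELDS = {"order_id", "order_date", "amount"}


def mask_email(email: str) -> str:
    """Obfuscate a sensitive field (hash instead of plaintext)."""
    return hashlib.sha256(email.lower().encode()).hexdigest()[:16]


def validate_batch(rows: list[dict]) -> list[dict]:
    """Drop junk and duplicate rows, mask sensitive data, keep the rest."""
    seen_ids = set()
    clean = []
    for row in rows:
        if not REQUIRED_FIELDS.issubset(row):  # junk: required fields missing
            continue
        if row["order_id"] in seen_ids:        # duplicate: keep first occurrence
            continue
        seen_ids.add(row["order_id"])
        if "email" in row:                     # sensitive: never load raw emails
            row = {**row, "email": mask_email(row["email"])}
        clean.append(row)
    return clean
```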
“The importance of a healthy and relevant metrics system is that it can inform us of the status and performance of each pipeline stage, while with underestimating the data load, I am referring to building the system in such a way that it won’t face any overload if the product experiences an unexpected surge of users,” elaborates Juan.
Having an ETL pipeline that is only connected to a standard transactional database is ineffective, as the volume and variety of data require a dedicated storage design. So, besides an ETL pipeline, there’s a need to build a data warehouse that supports and facilitates BI activities.
A data warehouse (DW) is a central repository where data is stored in query-able forms. From a technical standpoint, a data warehouse is a relational database optimized for reading, aggregating, and querying large volumes of data. Traditionally, DWs only contained structured data, or data that can be arranged in tables. However, modern DWs can combine both structured and unstructured data where unstructured refers to a wide variety of forms (such as images, pdf files, audio formats, etc.) that are harder to categorize and process.
That said, a DW isn’t just a regular database. How so?
First of all, they differ in terms of data structure. A regular database normalizes data, excluding any redundancies and separating related data into tables. This takes up a lot of computing resources, as a single query combines data from many tables. In contrast, a DW uses simple queries with few tables to improve performance and analytics.
Second, aimed at day-to-day transactions, standard transactional databases don’t usually store historic data, while for warehouses, collecting data from multiple periods is their main purpose. A DW simplifies a data analyst’s job, allowing them to manipulate all the data from a single interface and derive analytics, visualizations, and statistics.
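To make the structural difference tangible, compare how the same question (say, revenue per region for a quarter) might be answered against a normalized transactional schema versus a denormalized warehouse table. The schema and names below are entirely made up, and the queries are shown as plain SQL strings for readability:

```python
# Normalized (transactional) schema: the answer requires joining many tables.
oltp_query = """
SELECT r.region_name, SUM(oi.quantity * p.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p     ON p.id = oi.product_id
JOIN customers c    ON c.id = o.customer_id
JOIN regions r      ON r.id = c.region_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY r.region_name;
"""

# Warehouse-style denormalized fact table: one wide table, a much simpler read.
dw_query = """
SELECT region_name, SUM(revenue) AS revenue
FROM sales_fact
WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY region_name;
"""
```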
To construct a data warehouse, four basic components are combined.
Data warehouse storage. The foundation of data warehouse architecture is a database that stores all enterprise data, allowing business users to access it and draw valuable insights.
Metadata. Adding business context to data, metadata helps transform it into comprehensible knowledge. Metadata defines how data can be changed and processed. It contains information about any transformations or operations applied to source data while loading it into the data warehouse.
Data warehouse access tools. Designed to facilitate interactions with DW databases for business users, access tools need to be integrated with the warehouse. They have different functions. For example, query and reporting tools are used for generating business analysis reports. Another type of access tool – data mining tools – automates the process of finding patterns and correlations in large amounts of data based on advanced statistical modeling techniques.
Data warehouse management tools. Spanning the enterprise, a data warehouse involves a number of management and administrative operations. That’s why managing a DW requires a solution that can facilitate all of them. Dedicated data warehouse management tools exist to accomplish this.
Data warehouses are a great step forward in enhancing your data architecture. However, if you have hundreds of users from different departments, DWs can be too bulky and slow to operate. In this case, data marts can be built and implemented to increase speed and efficiency.
Simply speaking, a data mart is a smaller data warehouse (usually less than 100 GB in size). Data marts become necessary when the company (and the amount of its data) grows and searching for information in an enterprise DW becomes too slow and inefficient. Instead, data marts are built to allow different departments (e.g., sales, marketing, C-suite) to access relevant information quickly and easily.
OLAP or Online Analytical Processing refers to the computing approach that allows users to perform multidimensional data analysis. It’s contrasted with OLTP or Online Transactional Processing, which is a simpler method of interacting with databases that isn’t designed for analyzing massive amounts of data from different perspectives.
It’s important to note that OLAP cubes have to be custom built for every particular report or analytical query. However, their usage is justified since, as we said, they enable advanced, multidimensional analysis that was previously too complicated to perform.
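To give a feel for what “multidimensional” means in practice, here is a toy rollup approximated with a pandas pivot table. The dataset, the dimensions (region, quarter, product), and the revenue measure are all invented; a real OLAP cube would precompute views like this over far larger data:

```python
# A toy "cube-like" view: one measure sliced along several dimensions at once.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US", "EU"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q1"],
    "product": ["A",  "A",  "A",  "B",  "B",  "B"],
    "revenue": [120,  90,   200,  150,  80,   60],
})

cube_view = sales.pivot_table(
    values="revenue",
    index=["region", "product"],  # two dimensions on the rows
    columns="quarter",            # one dimension on the columns
    aggfunc="sum",
    fill_value=0,
)
print(cube_view)
```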
In contrast to the ETL architecture we described above, a data lake uses the ELT approach, swapping the transform and load operations. Supporting large storage and scalable computing, a data lake starts loading data immediately after extraction, handling raw — often unstructured — data.
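A minimal sketch of that ELT ordering is shown below, with a local folder standing in for real object storage; the directory layout, file format, and field names are assumptions made for the example:

```python
# ELT in miniature: land raw records immediately, transform only on read.
import json
from datetime import datetime, timezone
from pathlib import Path

LAKE = Path("data_lake/raw/events")  # stand-in for object storage


def load_raw(records: list[dict]) -> Path:
    """Extract + Load: persist raw, unmodified records right away."""
    LAKE.mkdir(parents=True, exist_ok=True)
    path = LAKE / f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
    path.write_text(json.dumps(records))
    return path


def transform_on_read(path: Path) -> list[dict]:
    """Transform: shape the raw data only when an analysis needs it."""
    records = json.loads(path.read_text())
    return [r for r in records if r.get("event_type") == "purchase"]


if __name__ == "__main__":
    batch = load_raw([{"event_type": "purchase", "amount": 42},
                      {"event_type": "page_view"}])
    print(transform_on_read(batch))
```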
Data lakes are also a powerful tool for data scientists and ML engineers, who use this raw data to prepare it for predictive analytics and machine learning. Read more about data preparation in our separate article or watch this 14-minute explainer.
Now that we know what data engineering is concerned with, let’s delve into the role that specializes in creating software solutions around big data – a data engineer.
He also adds that a data engineer might collaborate with others when implementing or designing a data-related feature (or product) such as an A/B test, the deployment of a machine learning model, or the refinement of an existing data source.
Software engineering background. Data engineers use programming languages to enable clean, reliable, and performant access to data and databases. Juan points out their ability to work with the complete cycle of software development, including ideation, architecture design, deployment and DevOps, prototyping, testing, defining metrics, alerts, and monitoring systems. Data engineers are experienced programmers in at least Python or Scala/Java.
Data-related skills. “A data engineer should have knowledge of multiple kinds of databases (SQL and NoSQL), data platforms, concepts such as MapReduce, batch and stream processing, and even some basic theory of data itself, e.g., data types, and descriptive statistics,” underlines Juan.
Systems creation skills. Data engineers need to have experience with various data storage technologies and frameworks they can combine to build data pipelines.