Apache Airflow: What is it? How to use it? Ultimate Guide 2023
Apache Airflow is an open-source workflow scheduling platform widely used in data engineering. Find out everything you need to know about this Data Engineering tool: how it works, use cases, main components…
The story of Apache Airflow begins in 2015, in the offices of Airbnb. At the time, the vacation rental platform founded in 2008 was experiencing dazzling growth and was awash in an ever-increasing volume of data. The Californian company was recruiting Data Scientists, Data Analysts and Data Engineers at a rapid pace, and they had to automate many processes by writing scheduled batch jobs. To support them, data engineer Maxime Beauchemin created an open-source tool called Airflow.
This scheduling tool aims to let teams build, monitor, and iterate on batch data pipelines. In a few years, Airflow established itself as a standard in the field of Data Engineering. In April 2016, the project joined the official incubator of the Apache Foundation. It continued its development and received “top-level” project status in January 2019. Almost two years later, in December 2020, Airflow had more than 1,400 contributors, 11,230 contributions, and 19,800 stars on GitHub.
Airflow 2.0 has been available since December 17, 2020, and brings new features and many improvements. The tool is used by thousands of Data Engineers around the world.
What is Apache Airflow?
The Apache Airflow platform makes it possible to create, schedule and monitor workflows through computer programming. It is a completely open-source solution, very useful for architecting and orchestrating complex data pipelines and launching tasks. It has several advantages. First of all, it is a dynamic platform: anything that can be done with Python code can be done on Airflow.
It is also extensible, thanks to the many plugins that allow interaction with the most common external systems. It is also possible to create new plugins to meet specific needs. In addition, Airflow provides elasticity: data engineering teams can use it to run thousands of different tasks every day.
Workflows are structured and expressed in the form of Directed Acyclic Graphs (DAGs), each node of which represents a specific task. Airflow is designed as a “code-first” platform, allowing very fast iteration on workflows. This philosophy provides a high degree of extensibility compared to other pipeline tools.
What is Airflow used for?
Airflow can be used for any batch data pipeline, so its use cases are as numerous as they are diverse. Due to its extensibility, this platform excels particularly for the orchestration of tasks with complex dependencies on multiple external systems. By writing pipelines in code and using the various plugins available, it is possible to integrate Airflow with any dependent systems from a unified platform for orchestration and monitoring.
As an example, Airflow can be used to aggregate daily sales team updates from Salesforce and send a daily report to company executives. The platform can also organize and launch machine learning jobs running on external Spark clusters, or load website or application data into a Data Warehouse once per hour.
The Airflow architecture is based on several elements. Here are the main ones.
DAGs (Directed Acyclic Graphs)
In Airflow, pipelines are represented as DAGs (Directed Acyclic Graphs) defined in Python. A graph is a structure composed of objects (nodes) in which certain pairs of objects are related.
“Directed” means that the edges of the graph are oriented and therefore represent unidirectional links. “Acyclic” means that the graph contains no circuit: a node B downstream of a node A cannot also be upstream of node A. This ensures that pipelines do not have infinite loops.
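The acyclicity property can be illustrated with a short, framework-free sketch based on Kahn's topological-sort algorithm: if the sort cannot consume every node, the remaining nodes sit on a cycle. The helper below is purely illustrative and not part of Airflow's API:

```python
def has_cycle(edges):
    """Return True if the directed graph contains a cycle.

    edges maps each node to the list of nodes it points to,
    e.g. {"A": ["B"], "B": []} for A -> B.
    """
    # Count incoming edges for every node (Kahn's algorithm).
    indegree = {node: 0 for node in edges}
    for targets in edges.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    # Repeatedly remove nodes that have no incoming edges left.
    ready = [node for node, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for target in edges.get(node, []):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    # Any node never removed is part of a cycle.
    return visited < len(indegree)
```

Airflow performs an equivalent validation when it parses a DAG file, refusing graphs that contain a circuit.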
Each node in a DAG represents a task. The DAG is thus a representation of a sequence of tasks to be performed, which constitutes a pipeline. The jobs represented are defined by operators.
Operators
Operators are the building blocks of the Airflow platform. They determine the work actually carried out: each operator defines an individual task (a node of the DAG) and how that task will be executed.
The DAG helps ensure that operators are scheduled and performed in a specific order, while operators define the jobs to be performed at each stage of the process.
There are three main categories of operators:
- First, action operators perform a function; for example, the PythonOperator or the BashOperator.
- Transfer operators, on the other hand, allow data to be transferred from a source to a destination, like the S3ToRedshiftOperator.
- Finally, Sensors wait for a condition to be met. For example, the FileSensor can be used to wait for a file to be present in a given folder before continuing the execution of the pipeline.
Each operator is defined individually. However, operators can communicate information to each other using XComs.
Hooks
On Airflow, Hooks provide the interface to third-party systems. They allow you to connect to external APIs and databases such as Hive, S3, GCS, MySQL, Postgres…
Confidential information, such as login credentials, is kept out of the Hooks themselves. It is stored in the encrypted metadata database associated with the current Airflow instance.
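As a sketch, here is a hypothetical task function using the PostgresHook from the apache-airflow-providers-postgres package. The connection id and query are illustrative, and note that the credentials themselves live in Airflow's metadata database rather than in the code:

```python
def count_orders():
    # PostgresHook ships with the apache-airflow-providers-postgres package;
    # the import is deferred so the function only needs it when actually run.
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    # "warehouse_db" is an illustrative connection id; the credentials it
    # refers to are stored, encrypted, in Airflow's metadata database.
    hook = PostgresHook(postgres_conn_id="warehouse_db")
    rows = hook.get_records("SELECT count(*) FROM orders")
    return rows[0][0]
```

A function like this would typically be wired into a DAG through a PythonOperator.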
Plugins
Airflow plugins can be described as a combination of Hooks and Operators. They are used to accomplish specific tasks involving an external application.
This could be, for example, the transfer of data from Salesforce to Redshift. There is a large open-source collection of plugins created by the user community, and each user can create plugins to meet their specific needs.
Connections
“Connections” allow Airflow to store the information needed to connect to external systems, such as credentials or API tokens.
They are managed directly from the platform’s user interface. The data is encrypted and stored as metadata in a Postgres or MySQL database.
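Connections can also be supplied outside the user interface, for instance as an environment variable in URI form. The connection name and credentials below are placeholders:

```shell
# Airflow picks up any variable named AIRFLOW_CONN_<CONN_ID>.
# "my_warehouse" and the URI contents are placeholders.
export AIRFLOW_CONN_MY_WAREHOUSE='postgres://user:password@host:5432/dbname'

# The same connection could instead be created with the CLI:
#   airflow connections add my_warehouse --conn-uri 'postgres://...'
```

Environment-variable connections are convenient for containerized deployments, where secrets are injected at runtime rather than stored in the metadata database.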
ABOUT LONDON DATA CONSULTING (LDC)
We, at London Data Consulting (LDC), provide all sorts of Data Solutions. This includes Data Science (AI/ML/NLP), Data Engineering, Data Architecture, Data Analysis, CRM & Lead Generation, Business Intelligence and Cloud solutions (AWS/GCP/Azure).
For more information about our range of services, please visit: https://london-data-consulting.com/services
If you are interested in working for London Data Consulting, please visit our careers page at https://london-data-consulting.com/careers
More info on: https://london-data-consulting.com