Apache Airflow

Score 8.6 out of 10

46 Reviews and Ratings

What is Apache Airflow?

Apache Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring data pipelines using Python and SQL. Created at Airbnb as an open-source project in 2014, Airflow was brought into the Apache Software Foundation’s Incubator Program in 2016 and announced as a Top-Level Apache Project in 2019. It is used as a data orchestration solution, with over 140 integrations and community support.

Apache Airflow Key Features

Top Performing Features

Multi-platform scheduling
The ability to centrally manage a business process from end to end.
Category average: 9.2

Central monitoring
A central monitoring dashboard provides data on trends and forecasts.
Category average: 9

Logging
Logging and audit trails to ensure regulatory compliance.
Category average: 8.6

Areas for Improvement

Alerts and notifications
Alerts and notifications enabling management by exception.
Category average: 8.6

Analysis and visualization
Analysis and visualization tools provide a clear understanding of critical errors and help prioritize them.
Category average: 8.3

Application integration
Integration with a broad range of enterprise applications.
Category average: 8.4

Most Frequent Users

#1 most frequent: Professional, Scientific, and Technical Services, 26.8% (5,596 installations of 20,852)
#2 most frequent: Information, 18.7% (3,903 installations of 20,852)
#3 most frequent: Finance and Insurance, 11.7% (2,437 installations of 20,852)

Raghwendra Singh

SDE 4 in Information Technology at Meesho (1001-5000 employees)

Use Cases and Deployment Scope

I am part of the data platform team, which is responsible for building the platform for data ingestion, the aggregation system, and the compute engines. Apache Airflow is one of the core systems responsible for orchestrating pipelines and scheduled workflows. We have multiple deployments of Apache Airflow running for different use cases, each holding 5,000 to 9,000 DAGs and executing even more DAG runs. Apache Airflow now also offers high availability (HA) with scheduler replicas, which is a lifesaver, and it is well maintained by the community.

Pros

Apache Airflow is one of the best orchestration platforms and a go-to scheduler for teams building a data platform or pipelines.
Apache Airflow supports multiple operators, such as the Databricks, Spark, and Python operators. Together these give us the functionality to implement any business logic.
Apache Airflow is highly scalable, and we can run a large number of DAGs with ease. It provides HA and replication for workers. Maintaining Airflow deployments is very easy, even for smaller teams, and we also get lots of metrics for observability.

Cons

Achieving a production-ready deployment of Apache Airflow requires some level of expertise. A repository of officially maintained sample Helm chart configurations would be handy for a new team.
Because Airflow is used to build many data pipelines, a feature for building lineage from the queries run on different compute engines would help in developing the data catalog. Typically, multiple tools are required for this use case.
For a data pipeline running from upstream to downstream tables, using Airflow with lineage to trigger the downstream DAGs after a recovery would be helpful. Additionally, the ability to create dependencies between DAGs would be beneficial.

Return on Investment

By using Apache Airflow, we were able to build the data platform and migrate our workloads out of Hevo Data.
Airflow currently powers the datasets for the entire company, supporting analytics backends, data science, and data engineering use cases.
We were able to scale from fewer than 1,000 to currently more than 8,000 DAG runs per day using HA and worker scaling.

Alternatives Considered

Databricks Data Intelligence Platform

Other Software Used

Apache Spark, Databricks Data Intelligence Platform, Trino

Verified User

Consultant in Engineering (10,001+ employees)

Use Cases and Deployment Scope

Apache Airflow is the best orchestrator on the market. It gives us the flexibility to orchestrate our data engineering workflows, with various levels of customization possible through Python programming. It lets us connect to various cloud providers such as Google, AWS, and Azure, which enables teams to work in a cross-cloud environment.

Pros

Provides connections to different cloud providers
Good access management
Good user interface for users to interact with, for example when we need to pause, trigger manually, or mark a task as successful

Cons

A local "dry run" or IDE plugin that can validate and simulate DAG execution without needing a full environment.
Better feedback on DAG parse errors in the UI or CLI.
Navigating large DAGs with hundreds of tasks can be slow and hard to understand visually.
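
As a rough illustration of the kind of local check asked for above, a task graph can be validated for unknown task names and cycles before it ever reaches a scheduler. The sketch below uses only Python's standard `graphlib` module; it is not Airflow's API, and `validate_dag` and the task names in the usage note are hypothetical names chosen for this example.

```python
from graphlib import TopologicalSorter, CycleError


def validate_dag(tasks, deps):
    """Return a list of problems found in a task graph (empty list = looks valid).

    tasks: set of declared task names (hypothetical, for illustration).
    deps:  dict mapping each task to the set of tasks it depends on.
    """
    problems = []

    # Every task mentioned in the dependency map must be declared.
    referenced = set(deps) | {u for ups in deps.values() for u in ups}
    for name in sorted(referenced - set(tasks)):
        problems.append(f"unknown task: {name}")

    # prepare() raises CycleError if the graph is not acyclic.
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as exc:
        problems.append(f"cycle detected: {exc.args[1]}")

    return problems
```

Calling `validate_dag({"extract", "load"}, {"load": {"extract"}})` returns an empty list, while a graph in which two tasks depend on each other yields a "cycle detected" entry. A real dry run would also need to simulate task execution, which this sketch does not attempt.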

Return on Investment

Apache Airflow offers various options for interacting with the different databases controlled by the business, since we have the flexibility to write in Python.
Because Apache Airflow requires Python programming, onboarding people and data pipelines takes time, as it requires some development effort.
Apache Airflow makes monitoring easy for all stakeholders, as the business can see its pipelines running in the UI.

Alternatives Considered

AWS Step Functions

Other Software Used

Docker, Kubernetes, DBeaver

Anshuman Varshney

Engineering Manager in Engineering at Gameskraft (501-1000 employees)

Use Cases and Deployment Scope

We use Apache Airflow as an orchestration tool for the data engineering workflows in our gaming product.
We schedule multiple jobs at hourly, daily, weekly, and monthly intervals.
We have many requirements for dependent jobs, i.e. job1 must run before job2, and Apache Airflow handles this very smoothly. We use multiple Apache Airflow integrations with webhooks and APIs. Additionally, we do a lot of job monitoring and tracking of SLA misses via Apache Airflow features.
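
The "job1 must run before job2" requirement is, in essence, a topological ordering of a dependency graph. A minimal sketch of that idea using Python's standard `graphlib` module follows; it is not Airflow code, and the job names (including the extra `report` job) and the `run_order` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
dependencies = {
    "job2": {"job1"},    # job1 must run before job2
    "report": {"job2"},  # report runs only after job2
}


def run_order(deps):
    """Return one execution order in which every job follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

An orchestrator additionally handles scheduling intervals, retries, and alerting on top of this ordering, which is the part the review credits Airflow with.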

Pros

Job scheduling
Dependent job workflows
Failure handling and rerun of workflows

Cons

Better User Interface

Return on Investment

Good in job scheduling and dependency management between jobs
Robust framework to monitor jobs and alert in case of failure and SLA misses
Great integration with multiple open source tools

Alternatives Considered

Prefect

Other Software Used

DataHub, Grafana, Bitbucket

Alok Pabalkar

Co-Founder & CTO in Research & Development at Mini Venture Lab Private Limited (1-10 employees)

Use Cases and Deployment Scope

Used Airflow for Analytics & Reporting

Pros

Reports
Sending Bulk Email/Notification
Processing from different data sources

Cons

Improve the GUI Control Panel
Provide more examples and documentation
Improvement in debugging

Return on Investment

Impact depends on the number of workflows. With a lot of workflows the use case is stronger and the implementation is justified, because it needs resources such as dedicated VMs and a database, all of which have a cost.
Do not use it if you have very few use cases.

Other Software Used

Apache Kafka, Redis™, PostgreSQL

Victor Tay

Engineer in Product Management at IronNet CyberSecurity (201-500 employees)

Use Cases and Deployment Scope

We use Apache Airflow as our DAG scheduler and health monitoring tool. It serves as a core component in ensuring our scheduled jobs run, lets us inspect job successes and failures, and acts as a troubleshooting tool in the event of job errors or failures. It has been a core tool and we are happy with what it does.

Pros

Job scheduling - Pretty straightforward in terms of UI.
Job monitoring - Dashboard is as straightforward as it gets.
Troubleshooting jobs - ability to dive into detailed errors and navigate the job workflow.

Cons

The UI/dashboard could be made customisable, with a jobs summary grouped by errors, failures, and successes instead of listed per job, so that the error summary can serve as a starting point for reviewing them.
Navigation is a bit dated. It could do with more modern web navigation UX, e.g. sidebar navigation instead of the browser's back/forward buttons.
Again, a core functional reorganization of the UX: navigation to core functions could be made direct rather than left to discovery.

Return on Investment

It is a good workflow job scheduler.
It meets most, if not all, of our organization's product requirements.
Airflow's stability and reliability as a product are unmatched.

Alternatives Considered

Jenkins and Apache Kafka

Other Software Used

Jenkins, Apache Kafka, Redis™
