Apache Airflow for Orchestration and Monitoring of Apache Druid
Part 1: Technical Integration for Workflow
Scott Cohen
Rohith Reddy
on
October 18, 2021

Lessons from Maintaining Mission Critical Services at Scale

Operational analytics systems require always-on data, which means high expectations on availability and data latency. Monitoring the health of data pipelines and the underlying infrastructure supporting applications and dashboards is mission critical for these use cases. At Rill, we’ve fielded many questions on best practices for implementing health checks and proactive observability for both open source Apache Druid and Rill’s managed Apache Druid.

In this post, we answer those questions, covering both the technical setup for effective data monitoring and a suggested approach for comprehensive alerts. Specifically:

Part 1: Technical Integration for Workflow

  • Setting up Airflow to monitor pipelines and Druid data ingestion
  • Integrating DAGs with systems such as Slack and Opsgenie

Part 2: Observability Logic for High Uptime SLAs

  • Lifecycle Observability: Alerts to Maintain Mission Critical Systems

Background 

As background to our monitoring approach, we started with a few guiding principles:

  • Be proactive: err towards early warnings to avoid downstream issues
  • Catch as many issues as possible … without alerting overkill
  • Integrate alerts into existing workflows to avoid yet another tool to monitor
  • Where possible, learn from mistakes and automate fixes
  • Avoid repeated work and create a system that is low maintenance

At Rill, we have built out a variety of tools to proactively mitigate potential pipeline issues and warn against unexpected system challenges. We believe it is important to think about the process for notifications, exceptions and resolving issues as much as the technology. Our data teams spend time in Jira and Slack, so we selected those as our notification tools. Additionally, to “dog food” our own reporting, we use Rill Explore dashboards for root cause analysis of real-time Druid metrics. Holding all of that together is Apache Airflow, which we selected to orchestrate pipelines and health checks.

Using the above, we settled on the architecture below, which fits well within our workflow, requires minimal ongoing maintenance, and has flexibility to create comprehensive alerting.

High-Level Rill Data Health Workflow

In addition to the alerting architecture, we built our observability framework around two dimensions: the data lifecycle and the type of alert (rule-based vs. data-based).

Data Lifecycle Alerts

  • Data completeness
  • Data quality
  • Pipeline/ingestion failures
  • Performance

Alert Types

  • Rule-Based (data types, nulls, known limits/minimums)
  • Data-Based (volume abnormalities, expected counts, key metric distribution)
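To make the two alert types concrete, here is a minimal sketch in plain Python. It is an illustration under our own assumptions rather than the checks Rill runs: the function names, the row format (a list of dicts), and the three-sigma cutoff are all hypothetical.

```python
from statistics import mean, stdev


def rule_based_check(rows, column, minimum=None, maximum=None):
    """Rule-based: flag nulls and values outside known limits.

    Returns a list of (row_index, reason) tuples. The row format
    (list of dicts) is an assumption made for this sketch.
    """
    violations = []
    for index, row in enumerate(rows):
        value = row.get(column)
        if value is None:
            violations.append((index, "null"))
        elif minimum is not None and value < minimum:
            violations.append((index, "below minimum"))
        elif maximum is not None and value > maximum:
            violations.append((index, "above maximum"))
    return violations


def data_based_check(history, current, max_sigma=3.0):
    """Data-based: flag a volume that sits far from its historical mean.

    Returns True when `current` is more than `max_sigma` standard
    deviations away from the mean of `history`.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return current != history[0]
    return abs(current - mean(history)) / spread > max_sigma
```

A rule-based check encodes something known in advance (a column must never be null or negative), while a data-based check learns what "normal" looks like from previous runs.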

This post lays out the architecture above; a second post covers the lifecycle alerts in more detail.


Setting up Airflow to monitor pipelines and Druid data ingestion

Below are the basic steps for getting Airflow started. For more details and examples, see: https://github.com/gorillio/airflow-druid-examples

Step 1: Run Services

Deep Cleanup

$ make clean-all

Start Airflow

$ make run

Step 2: Populate Connections, Variables and Pools

Source environment variables

$ export INITDB_FIRSTNAME=admin
$ export INITDB_LASTNAME=admin
$ export INITDB_EMAIL=admin
$ export INITDB_PASSWORD=admin
$ export INITDB_USERNAME=admin
$ export INITDB_ROLE=Admin
$ export DRUID_HOST=xxxxxxxxxxxxx
$ export DRUID_LOGIN_USER=xxxxxxxxxxxxx
$ export DRUID_PASSWORD=xxxxxxxx

Note: wait a couple of minutes for the scheduler to come up before running:

$ make connections

Step 3: Airflow Web Login

Note: wait a couple of minutes for the webserver to come up, then log in to Airflow at http://localhost:8080

Clean Setup

$ make clean

Access the Flower web UI, the web-based tool for monitoring and managing Celery clusters, at http://localhost:5555
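Once Airflow is running, a health check on Druid ingestion can be a small Python callable scheduled in a DAG. The sketch below shows one possible freshness check that uses only the standard library and Druid's SQL endpoint (/druid/v2/sql). The function names, the 15-minute lag limit, and the omission of authentication are assumptions made for illustration; the example repository linked above may take a different approach.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_latest_time_query(datasource):
    """Druid SQL payload asking for the newest event time in a datasource."""
    return {"query": f'SELECT MAX(__time) AS latest FROM "{datasource}"'}


def parse_druid_time(value):
    """Parse an ISO timestamp such as 2021-10-18T12:00:00.000Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def fetch_latest_time(druid_host, datasource, timeout=30):
    """POST the query to Druid's SQL endpoint and return the latest time.

    Authentication is left out of this sketch; a secured cluster would
    need credentials such as DRUID_LOGIN_USER and DRUID_PASSWORD.
    """
    request = urllib.request.Request(
        f"{druid_host}/druid/v2/sql",
        data=json.dumps(build_latest_time_query(datasource)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        rows = json.load(response)
    return parse_druid_time(rows[0]["latest"])


def is_fresh(latest, now, max_lag=timedelta(minutes=15)):
    """True when the newest event is no older than max_lag."""
    return now - latest <= max_lag
```

Wrapped in a task that raises an exception when is_fresh returns False, a check like this fails the Airflow task and can trigger the alerting described in the next section.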

Integrating DAGs with systems such as Slack and Opsgenie

Set up Slack alerts

The easiest way to set up a Slack connection is with Slack incoming webhooks. Every webhook is associated with an app and a channel.

1. Create a Slack app. Choose to create the app from scratch and select the relevant workspace.

  • Open the app and go to Incoming Webhooks
  • Create a Slack channel for alerts, or use an existing channel
  • Add a new webhook to the workspace and assign it to that channel

2. Install airflow.providers.slack from PyPI

pip install apache-airflow-providers-slack

The package provides hooks to connect to Slack and operators to include in the DAG.


3. Add an Airflow connection

airflow connections add 'slack' \
--conn-type 'http' \
--conn-description 'slack connection' \
--conn-host 'https://hooks.slack.com/services' \
--conn-password '/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
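With the webhook in place, a failure notification comes down to posting a JSON body with a "text" field to the webhook URL. The sketch below uses only the standard library and is written in the shape of an Airflow failure callback. The function names and message layout are hypothetical, and it assumes the callback context exposes "task_instance" (with dag_id, task_id and log_url) and "exception". The Slack provider installed above also ships a webhook operator and hook, which is the more Airflow-native route.

```python
import json
import urllib.request


def format_failure_message(context):
    """Build a short Slack message from an Airflow-style callback context."""
    ti = context["task_instance"]
    return (
        f"Task failed: {ti.dag_id}.{ti.task_id}\n"
        f"Error: {context.get('exception')}\n"
        f"Logs: {ti.log_url}"
    )


def post_to_slack(webhook_url, text, timeout=10):
    """POST a plain-text message to a Slack incoming webhook.

    webhook_url is the full URL: the host from the connection above
    followed by the token path stored as the connection password.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A callback would then call post_to_slack(url, format_failure_message(context)), with the URL read from the Airflow connection rather than hard-coded in the DAG.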


Set up Opsgenie alerts

1. Create an Opsgenie API integration

Add "API" from the Opsgenie integrations list and fill in the relevant fields. When the integration is created, Opsgenie generates an API key; use that key in the Airflow connection below.


2. Install airflow.providers.opsgenie from PyPI

pip install apache-airflow-providers-opsgenie

The package provides hooks to connect to Opsgenie and operators to include in the DAG.


3. Add an Opsgenie connection

airflow connections add 'opsgenie' \
--conn-type 'opsgenie' \
--conn-description 'opsgenie connection' \
--conn-host '' \
--conn-password 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'

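We use OpsgenieAlertOperator, originally defined in airflow.contrib.operators.opsgenie_alert_operator. [source] With the provider package installed above, the operator is imported from the airflow.providers.opsgenie namespace instead; the exact module and class name depend on the provider version.

Here we pass the alert description from an upstream task using xcom_pull.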

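# Import path for provider versions that still ship OpsgenieAlertOperator;
# newer provider releases may rename the module and class, so check yours.
from airflow.providers.opsgenie.operators.opsgenie_alert import OpsgenieAlertOperator

OpsgenieAlertOperator(
  task_id='notify',
  message='Mission Control Checks',
  opsgenie_conn_id='opsgenie',
  # Templated field: pulls the value returned by the upstream "alert_summary" task
  description='{{ ti.xcom_pull(task_ids="alert_summary") }}',
  # Route the alert to the on-call team and make it visible to them
  responders=[{"name":"on-call", "type":"team"}],
  visible_to=[{"name":"on-call", "type":"team"}],
)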
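The description above is pulled from an upstream task named alert_summary. As an illustration of what such a task might return, here is a hypothetical summary builder. The function name and the (name, passed, detail) tuple format are assumptions of this sketch; it relies on the fact that the return value of a Python task is pushed to XCom, which is what xcom_pull reads by default.

```python
def build_alert_summary(check_results):
    """Summarize failed checks as text for the alert description.

    check_results is a list of (name, passed, detail) tuples; this
    format is hypothetical. Returned from a Python task, the string
    becomes that task's XCom return value.
    """
    failed = [(name, detail) for name, passed, detail in check_results if not passed]
    if not failed:
        return "All checks passed"
    lines = [f"{len(failed)} of {len(check_results)} checks failed:"]
    lines += [f"- {name}: {detail}" for name, detail in failed]
    return "\n".join(lines)
```

Keeping the summary logic in a plain function makes it easy to unit test without a running Airflow instance.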


Setup Summary

Integrating Airflow with Slack and Opsgenie is relatively straightforward. The more important step for our observability was determining priority/routing, notification details, and the resolution process. A few lessons learned from our experience:

Priority/Routing

  • Consider which alerts must be resolved in real-time. Make sure those go to the loudest alert channel (in our case, Opsgenie). 
  • Push resolution to the teams closest to the business logic where possible. In our case, that means the Customer Success teams, who work directly with clients and know the details of client pipelines and business needs. This routing has also allowed engineers to focus on core development and devops maintenance rather than support ticket resolution.

Notification Details

  • System errors and failures can be opaque. We enrich our alerts with pointers learned over time that direct teams to quicker resolutions.
  • Many alerts are threshold-based. For those, we also include a comparison against historical patterns to show the degree of change.
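As an example of adding historical context to a threshold alert, the sketch below formats a message that reports the change against the median of recent runs. The function name, message wording, and choice of the median as the baseline are illustrative assumptions, not a description of Rill's alerts.

```python
from statistics import median


def describe_threshold_breach(metric, current, threshold, history):
    """Build alert text that includes the change versus recent history."""
    baseline = median(history)
    if baseline == 0:
        change = "n/a"  # a percentage change is undefined for a zero baseline
    else:
        change = f"{(current - baseline) / baseline:+.0%}"
    return (
        f"{metric}={current} crossed threshold {threshold} "
        f"({change} vs. median {baseline} of last {len(history)} runs)"
    )
```

For instance, describe_threshold_breach("row_count", 500, 800, [1000, 900, 1100]) reports a change of -50% against a median of 1000, which tells the responder how far the metric moved and not just that it crossed a line.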

Resolution Process

  • Link runbooks and/or relevant customer tickets directly in notification channels.
  • Make sure to document major issues with incident reports and learnings.
  • Regularly implement new checks. We have a monthly incident review to identify new patterns and notifications that may be required.


Continue on to Part 2: Observability Logic for High Uptime SLAs.
