Lessons from Maintaining Mission Critical Services at Scale
Operational analytics systems require always-on data, which means high expectations on availability and data latency. Monitoring the health of data pipelines and the underlying infrastructure supporting applications and dashboards is mission critical for these use cases. At Rill, we’ve fielded many questions on best practices for implementing health checks and proactive observability for both open source Apache Druid and Rill’s managed Apache Druid.
In this post, we will seek to provide answers to those questions—both the technical set-up for effective data monitoring along with a suggested approach for comprehensive alerts. Specifically:
Part 1: Technical Integration for Workflow
- Setting up Airflow to monitor pipelines and Druid data ingestion
- Integrating DAGs with systems such as Slack and Opsgenie
Part 2: Observability Logic for High Uptime SLAs
- Lifecycle Observability: Alerts to Maintain Mission Critical Systems
As background to our monitoring approach, we started with a few guiding principles:
- Be proactive: err towards early warnings to avoid downstream issues
- Catch as many issues as possible … without alerting overkill
- Integrate alerts into existing workflows to avoid yet another tool to monitor
- Where possible, learn from mistakes and automate fixes
- Avoid repeated work and create a system that is low maintenance
At Rill, we have built out a variety of tools to proactively mitigate potential pipeline issues and warn against unexpected system challenges. We believe it is important to think about the process for notifications, exceptions and resolving issues as much as the technology. Our data teams spend time in Jira and Slack, so we selected those as our notification tools. Additionally, to “dog food” our own reporting, we use Rill Explore dashboards for root cause analysis of real-time Druid metrics. Holding all of that together is Apache Airflow, which we selected to orchestrate pipelines and health checks.
Using the above, we settled on the architecture below, which fits well within our workflow, requires minimal ongoing maintenance, and has flexibility to create comprehensive alerting.
In addition to alerting architecture, we created our observability framework on two levers: data lifecycle and type of alerts (rule-based vs. data-based).
We begin this post laying out the architecture above and will cover the lifecycle alerts in more detail in a second post.
Setting up Airflow to monitor pipelines and Druid data ingestion
Below, we included the basic steps for getting Airflow started. For more details and examples, please check out: https://github.com/gorillio/airflow-druid-examples
Step 1: Run Services
Step 2: Populate Connections, Variables and Pools
source environment variables
note: wait for a couple of minutes for scheduler to come up
Step 3: Airflow Web Login
Access flower web uri—the web-based tool for managing clusters: http://localhost:5555
Integrating DAGs with systems such as Slack and Opsgenie
Setup Slack alerts
The easiest way to set up a slack connection is using slack incoming webhooks. Any webhook that is created is associated with an app and a channel.
1. Create a slack app by using the link. Choose to create an app from scratch and select the relevant workspace.
- Create a Slack app
- Click on the app and go to the webhooks. Choose incoming webhooks and add new webhook to the workspace
- Create a slack channel for alerts or use existing slack channel.
- Add new webhook to the workspace by assigning webhook to the channel created in above step
2. Install airflow.providers.slack [pypi-repository]
The package provides hooks to connect to Slack and operators to include in the dag.
3. Add airflow connection
Setup Opsgenie alerts
1. Create a custom opsgenie API
Add “API” from opsgenie “integrations list” and fill in the relevant fields. When API is created a key will be generated and should be used in airflow connections
2. Install airflow.providers.opsgenie [pypi-repository]
The package provides hooks to connect to opsgenie and operators to include in the dag.
3. Add opsgenie connection
We are using OpsgenieAlertOperator defined in airflow.contrib.operators.opsgenie_alert_operator. [source]
Here we are passing the description from upstream using xcom_pull.
Integrating Airflow and Opsgenie is a relatively straightforward task. The important step for our observability was then determining priority/routing, notification details, and resolution process. A few lessons learned from our experience:
- Consider which alerts must be resolved in real-time. Make sure those go to the loudest alert channel (in our case, Opsgenie).
- Push resolution to the teams closest to business logic where possible. In our case, that means Customer Success teams working directly with clients who know the details of client pipelines and business needs. This routing has also allowed engineers to focus on core development and devops maintenance versus support ticket resolution.
- System errors and failures can be opaque. We have orchestrated our alerts with additional pointers learned over time to direct teams to quicker resolutions.
- Many alerts are threshold based - we also provide details versus the historical patterns to understand degree of change.
- Link runbooks and/or relevant customer tickets directly in notification channels.
- Make sure to document major issues with incident reports and learnings.
- Regularly implement new checks. We have a monthly incident review to identify new patterns and notifications that may be required.
Continue on to Part 2: Observability Logic for High Uptime SLAs.