Apache Airflow for Orchestration and Monitoring of Apache Druid
Part 2: Observability Logic for High Uptime SLAs
Scott Cohen
Rohith Reddy
on October 21, 2021

In Part 1: Technical Integration for Workflow, we covered our observability architecture using Apache Airflow integrated with Opsgenie and Slack. For more details, you'll want to start there. Otherwise, jump into our coverage of Rill’s data health checks below! 

Lifecycle Observability: Alerts to Maintain Mission Critical Systems

For pipeline management, we suggest considering the lifecycle of your data, attempting to identify potential issues as early in the lifecycle as possible. To meet our own goals, we identified a series of health checks to run as data moves from raw files or events into Druid and then from Druid datasets into analysis. In this section, we'll provide suggestions for types of tests to apply at each stage of the data lifecycle based on our experience. These data checks run in parallel with our always-on listening for our Druid service running on Kubernetes.

High-Level Rill Data Lifecycle


Data Completeness

To start, we run comparisons of data versus expected values to identify gaps. The goal of this first step is to avoid loading incomplete or malformed files. As we are a downstream consumer of client data, these alerts allow us to proactively notify customers of potential issues in their own ETL processes. 

An important trade-off here is cost and timeliness: more exact checks require loading data for comparison, which means more data in memory and a potential delay in processing. Where possible, we limit initial completeness checks to metadata characteristics that can be verified without loading data (a limited number of quality checks run on sampled data, as outlined in the next section).

At this stage, we are primarily concerned with catching corrupted files and/or data completely outside the norm. Tests include:

  • Min row count/events
  • Min file size
  • Expected partitions/files
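As a rough sketch, checks like these can be expressed as small Python functions and wrapped in Airflow tasks. The thresholds and partition naming below are illustrative assumptions, not Rill's actual values:

```python
# Minimal sketches of metadata-only completeness checks. None of these
# require loading the data itself -- just row counts, file sizes, and
# partition listings reported by the delivery system.

def check_min_row_count(row_count: int, min_rows: int = 10_000) -> bool:
    """True when an interval meets the minimum expected event count."""
    return row_count >= min_rows

def check_min_file_size(size_bytes: int, min_bytes: int = 1_000_000) -> bool:
    """True when a delivered file is at least the minimum expected size."""
    return size_bytes >= min_bytes

def missing_partitions(found: set, expected: set) -> set:
    """Return partitions that were expected for the interval but never arrived."""
    return expected - found
```

A failing check would short-circuit the DAG before any load begins and route an alert back to the data owner.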


Data Quality: Business Logic Tests

Once we begin processing data, we move into the heart of data quality checks based on business logic for each datasource. We broke those into two types: static (rule-based checks) and dynamic (data-driven tests). To avoid processing files with major issues, we run a subset of the tests (particularly those focused on priority metrics) on a sample of the data to alert before processing the full interval. Another advantage of this approach is that it enables better iteration and testing during pipeline builds, as well as the ability to estimate processing costs during that initial implementation stage.

In creating quality tests, we found it is easy to get lost in the weeds of creating too many alerts, trying to capture every potential scenario. After creating a lengthy list of validations, we pared that back to a more manageable number of tests. Also, we erred more on the side of data-driven validation (based on the previous 30 days of data) versus rule-based to maximize flexibility and reduce potential rework. Finally, we would also suggest starting small and adding as you learn more with pipelines in production. 

Below, we included illustrative examples of the tests executed in Airflow for each validation type:

Rule-Based Validation

Missing values
  • Not Null - null
  • Not Empty - ""
Data types
  • Is Encrypted - base64
  • Is Type
Expected data values
  • Regex Match
  • Is/Is Not Length
  • Integer/Real/String Range

Data-Based Validation

Missing data
  • Median Value Range
  • Total Counter
  • Distinct Counter
  • Min/Max Value Below % Range (vs. same hour in previous days)
Unexpected data
  • Value Distribution
  • Approx Quantiles
  • Max
  • Median Rule
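To make the two styles concrete, here are hedged Python sketches of one rule-based check and one data-driven check. The function names, the regex, and the tolerance band are hypothetical; the 30-day baseline mirrors the approach described above:

```python
import re
import statistics

def rule_not_null_or_empty(values):
    """Rule-based: return indexes of rows that are null or empty strings."""
    return [i for i, v in enumerate(values) if v is None or v == ""]

def rule_regex_match(values, pattern):
    """Rule-based: return values that fail to match the expected pattern."""
    compiled = re.compile(pattern)
    return [v for v in values if not compiled.fullmatch(v)]

def data_driven_median_range(today_median, history, tolerance=0.5):
    """Data-driven: compare today's median against the median of the
    trailing history (e.g. previous 30 days); True while it stays
    inside the tolerance band, False when it drifts out."""
    baseline = statistics.median(history)
    low, high = baseline * (1 - tolerance), baseline * (1 + tolerance)
    return low <= today_median <= high
```

The data-driven variant needs no per-datasource rules, which is what makes it cheaper to maintain as pipelines evolve.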


Pipeline/Ingestion Failures

Once the data is flowing, we also encounter potential issues with ingestion. There are a variety of errors that we've learned to look for based on characteristics of the pipelines over time. Rather than identifying isolated data issues, the focus for this step is to identify root causes of issues as quickly as possible and routing the alerts to the appropriate stakeholders. Another important distinction is identifying not only errors/failures but also unusual characteristics in the data post-processing that might indicate re-processing is required. Once we built some experience triaging each issue, we also created automated Airflow DAGs on certain triggers where we could clearly identify the cause. This change had the effect of greatly reducing any data lag that failures might introduce into our pipelines. 

Below are examples of tests we execute to reduce any latency in production data:

Missing Interval
  • Data Gap by Period (caused by missing data but also identifies "quiet" failures)
Processing Issues
  • Delivery Lag (from client)
  • Long Running Task
  • Failed Task (routed by error code)
Data Lag
  • Latest Data Time vs. Expected Lag
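The gap and lag checks above can be sketched as plain functions (in practice these would run inside Airflow tasks, with failures routed to Opsgenie/Slack as described in Part 1). Bucket granularity and the two-hour lag threshold are illustrative assumptions:

```python
from datetime import datetime, timedelta

def find_data_gaps(event_hours, start, end):
    """Return hourly buckets between start and end (inclusive) that saw no
    events -- catches missing data, and also 'quiet' failures that never
    raise an error upstream."""
    seen = set(event_hours)
    gaps, current = [], start
    while current <= end:
        if current not in seen:
            gaps.append(current)
        current += timedelta(hours=1)
    return gaps

def data_lag_exceeded(latest_event_time, now, max_lag=timedelta(hours=2)):
    """True when the newest ingested data is older than the expected lag."""
    return now - latest_event_time > max_lag
```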

Performance: Query Latency

Finally, we monitor end user performance to make sure that we are delivering on what Druid does best: sub-second queries on massive datasets. We have found that end user experiences can vary depending on the workloads and queries executed, so we created monitoring for edge cases and worst case scenarios (e.g. timeouts) to identify issues before they impact user experience. Also, as Druid has multiple tiers for storage (for real-time vs. regular queries), we separate monitoring by data sources and data tiers. We then use our Rill Explore dashboards to narrow further into specific query patterns, time intervals, dimensions, etc. that could help identify potential fixes (in Druid configuration, lookup logic, ETL optimization, and so on). Below are a few examples:

Query Performance
  • Average Query Time
  • P95, P99 (by data tier)
  • Total Query Count by datasource/user
Negative Query Experiences
  • Long Running Query Alerts
  • Max Query Time (by user)
  • Timeouts
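A minimal sketch of the per-tier latency rollup, assuming query times (in milliseconds) have already been pulled from Druid's query metrics. The nearest-rank percentile and the one-second breach threshold are our own illustrative choices:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile over a list of query times (ms)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def latency_report(query_times_by_tier, p95_threshold_ms=1000):
    """Compute average/P95/P99 per data tier and flag tiers whose P95
    breaches the sub-second target."""
    report = {}
    for tier, times in query_times_by_tier.items():
        p95 = percentile(times, 95)
        report[tier] = {
            "avg": statistics.mean(times),
            "p95": p95,
            "p99": percentile(times, 99),
            "breach": p95 > p95_threshold_ms,
        }
    return report
```

A breaching tier would trigger the long-running-query alert path, while the full report feeds the Rill Explore dashboards for drill-down.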

We have also seen recurring patterns in less performant queries (those taking more than 5 seconds) that you may want to monitor or investigate when issues arise.

Conclusion

Our experience with Apache Druid observability is that maintaining an always-on system is a journey, not a destination. We have continually added to our logic and refined our workflows with the goal of automating fixes or mitigating issues as quickly as possible. 

A few key takeaways from that journey:

  • Integrate business stakeholders into the alerting process. We found that end user experience may differ from what system metrics suggest and requires an updated approach to monitoring.
  • Meet people where they work. Our ability to remedy issues improved significantly by integrating observability into established workflows.
  • Have post-mortems after hypothesis testing and failures. We found that each test or failure surfaced several other unique ways to avoid potential issues, meaning less rework and fewer future issues.



Hopefully, our experience provides a bit of insight into monitoring your Druid cluster and the reporting available to customers at Rill. We’d love to continue the conversation or answer any questions: contact us

