In Part 1: Technical Integration for Workflow, we covered our observability architecture using Apache Airflow integrated with Opsgenie and Slack. For more details, you'll want to start there. Otherwise, jump into our coverage of Rill’s data health checks below!
Lifecycle Observability: Alerts to Maintain Mission-Critical Systems
For pipeline management, we suggest considering the lifecycle of your data and attempting to identify potential issues as early in that lifecycle as possible. To meet our own goals, we identified a series of health checks to run as data moves from raw files or events into Druid, and then from Druid datasets into analysis. In this section, we’ll provide suggestions for the types of tests to apply at each stage of the data lifecycle based on our experience. These data checks run in parallel with our always-on monitoring of our Druid service running on Kubernetes.
To start, we run comparisons of data versus expected values to identify gaps. The goal of this first step is to avoid loading incomplete or malformed files. As we are a downstream consumer of client data, these alerts allow us to proactively notify customers of potential issues in their own ETL processes.
An important trade-off here is cost versus timeliness: more exact checks require loading data for comparison, which means more memory usage and a potential delay in processing. Where possible, we limit initial completeness checks to metadata characteristics that can be verified without loading the data (a limited number of quality checks run on sampled data, outlined in the next section).
At this stage, we are primarily concerned with guarding against corrupted files and data completely outside the norm. Tests include:
- Min row count/events
- Min file size
- Expected partitions/files
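As a rough sketch, these checks can be expressed as a single pre-ingestion task. The function, thresholds, and messages below are illustrative assumptions rather than our production logic; the point is that only metadata (file counts and sizes, e.g. from a storage listing call) is inspected, so no data needs to be loaded:

```python
# Illustrative completeness check run before ingestion (e.g. inside an
# Airflow task). Thresholds are hypothetical defaults; only metadata is
# inspected, so the check stays cheap and fast.
def check_completeness(file_sizes, min_bytes=1_000_000, expected_partitions=24):
    """file_sizes: mapping of file path -> size in bytes (from a listing call).
    Returns a list of human-readable failures; an empty list means healthy."""
    failures = []
    # Expected partitions/files: did the full batch arrive?
    if len(file_sizes) < expected_partitions:
        failures.append(
            f"expected {expected_partitions} partitions/files, "
            f"found {len(file_sizes)}")
    # Min file size: catch truncated or malformed files.
    for path, size in file_sizes.items():
        if size < min_bytes:
            failures.append(
                f"{path}: {size} bytes is below the {min_bytes}-byte minimum")
    return failures
```

In practice, a non-empty failure list would fail the task and fire the Opsgenie/Slack alerting described in Part 1, so the customer can be notified before an incomplete batch is loaded.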
Data Quality: Business Logic Tests
Once we begin processing data, we move into the heart of data quality checks based on the business logic for each datasource. We break these into two types: static (rule-based) checks and dynamic (data-driven) tests. To avoid processing files with major issues, we run a subset of the tests (particularly those focused on priority metrics) on a sample of the data to alert before processing the full interval. This approach also enables faster iteration and testing during the pipeline build, as well as the ability to estimate processing costs during that initial implementation stage.
In creating quality tests, we found it easy to get lost in the weeds, creating too many alerts in an attempt to capture every potential scenario. After creating a lengthy list of validations, we pared it back to a more manageable number of tests. We also erred on the side of data-driven validation (based on the previous 30 days of data) over rule-based checks to maximize flexibility and reduce potential rework. Finally, we suggest starting small and adding tests as you learn more from pipelines in production.
Below, we included illustrative examples of the tests executed in Airflow for each validation type:
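For instance, one static check and one dynamic check might look like the following sketch. The field names, thresholds, and the three-standard-deviation band are our illustrative assumptions, not the actual Rill tests:

```python
# Two illustrative validation types (hypothetical fields and thresholds):
# a static rule check and a dynamic, data-driven check.
import statistics

def static_rule_check(rows):
    """Static rule: every row must have a non-negative revenue and a country.
    Returns the offending rows; empty list means the rule passed."""
    return [r for r in rows if r["revenue"] < 0 or not r.get("country")]

def dynamic_check(today_total, history, n_stdev=3):
    """Dynamic rule: flag today's metric total if it falls outside n standard
    deviations of a trailing window (e.g. the previous 30 days of data)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today_total - mean) > n_stdev * stdev
```

Run against a sample of the interval first, a failed check can alert before the full (and more expensive) processing run is launched.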
Once the data is flowing, we also encounter potential issues with ingestion. There are a variety of errors we've learned to look for based on the characteristics of our pipelines over time. Rather than identifying isolated data issues, the focus at this step is to identify the root causes of issues as quickly as possible and route alerts to the appropriate stakeholders. Another important distinction is identifying not only errors and failures but also unusual characteristics in the post-processed data that might indicate re-processing is required. Once we had built some experience triaging each issue, we also created automated Airflow DAGs fired on certain triggers where we could clearly identify the cause. This change greatly reduced the data lag that failures might otherwise introduce into our pipelines.
Below are examples of tests we execute to reduce any latency in production data:
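One sketch of the triage step: classify ingestion task statuses (such as those returned by the Druid Overlord's `/druid/indexer/v1/tasks` endpoint) into transient errors that an automated re-ingestion DAG can retry versus ones that need a human. The error patterns and routing labels here are illustrative assumptions, not an exhaustive list of Druid failure modes:

```python
# Illustrative triage of Druid ingestion task statuses. The substrings
# below are hypothetical examples of errors we treat as transient.
RETRYABLE_PATTERNS = ("Connection reset", "Lost task", "timed out")

def triage(task):
    """task: a status dict for one ingestion task.
    Returns 'ok', 'retry' (trigger the automated re-ingestion DAG),
    or 'page' (route to Opsgenie for a human)."""
    if task["status"] == "SUCCESS":
        return "ok"
    error = task.get("errorMsg", "")
    if any(pattern in error for pattern in RETRYABLE_PATTERNS):
        return "retry"
    return "page"
```

Anything classified as `retry` feeds the automated remediation DAGs mentioned above, which is where most of the reduction in data lag came from.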
Performance: Query Latency
Finally, we monitor end user performance to make sure we are delivering on what Druid does best: sub-second queries on massive datasets. We have found that the end user experience can vary depending on the workloads and queries executed, so we created monitoring for edge cases and worst-case scenarios (e.g. timeouts) to identify issues before they impact users. Also, because Druid has multiple storage tiers (for real-time vs. regular queries), we separate monitoring by datasource and data tier. We then use our Rill Explore dashboards to drill further into specific query patterns, time intervals, dimensions, etc. that could help identify potential fixes (in Druid configuration, lookup logic, ETL optimization, and so on). Below are a few examples:
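As an illustrative sketch (the log fields, percentile choice, and 5-second threshold are our assumptions), a latency check grouped by datasource and tier could look like:

```python
# Illustrative p95 latency check over query logs, grouped by datasource
# and storage tier. Field names and the threshold are hypothetical.
from collections import defaultdict

def slow_query_alerts(query_logs, p95_threshold_ms=5000):
    """query_logs: iterable of dicts with 'datasource', 'tier', 'duration_ms'.
    Returns the (datasource, tier) groups whose p95 latency breaches the SLO."""
    groups = defaultdict(list)
    for q in query_logs:
        groups[(q["datasource"], q["tier"])].append(q["duration_ms"])
    breaches = []
    for key, durations in groups.items():
        durations.sort()
        p95 = durations[int(0.95 * (len(durations) - 1))]  # nearest-rank p95
        if p95 > p95_threshold_ms:
            breaches.append(key)
    return breaches
```

Monitoring a tail percentile rather than the average is what surfaces the worst-case experiences (e.g. timeouts) before they show up as user complaints.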
We have seen a few patterns with Druid in less performant queries (more than 5 seconds) that you may want to monitor or investigate when issues arise:
- Overly large Druid query time lookups (>50MB)
- Datasketches and uniques
- Compaction and overall segment size
- Dimension stripping (automated via Druid to drop dimensions no longer needed for historical data)
- Query cache sizing
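To illustrate investigating the compaction/segment-size pattern above, the `sys.segments` system table can be queried over Druid SQL (e.g. POSTed to the Broker's `/druid/v2/sql` endpoint). The helper and the ~300 MB target below are our illustrative assumptions, loosely based on Druid's general guidance of roughly 300-700 MB per segment:

```python
# Druid SQL to summarize published segment sizes per datasource; rows
# would come back from POSTing this to the Broker's /druid/v2/sql endpoint.
SEGMENT_SQL = """
SELECT datasource, AVG("size") AS avg_bytes, COUNT(*) AS segments
FROM sys.segments
WHERE is_published = 1
GROUP BY datasource
"""

def compaction_candidates(rows, min_avg_bytes=300 * 1024 * 1024):
    """rows: result rows from the query above.
    Returns datasources whose average segment size is below the target,
    i.e. likely candidates for (auto-)compaction."""
    return [r["datasource"] for r in rows if r["avg_bytes"] < min_avg_bytes]
```

Datasources dominated by many undersized segments tend to show up in the slow-query patterns above, and compaction is usually the first fix we try.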
Our experience with Apache Druid observability is that maintaining an always-on system is a journey, not a destination. We have continually added to our logic and refined our workflows with the goal of automating fixes or mitigating issues as quickly as possible.
A few key takeaways from that journey:
- Integrate business stakeholders into the alerting process. We found that the end user experience can differ from what system metrics show, which requires updating the monitoring approach.
- Meet people where they work. Our ability to remedy issues improved significantly by integrating observability into established workflows.
- Hold post-mortems after hypothesis tests and failures. We found that each test or failure surfaced several other unique ways to avoid potential issues, meaning less rework and fewer future incidents.
Hopefully, our experience provides a bit of insight into monitoring your Druid cluster and the reporting available to customers at Rill. We’d love to continue the conversation or answer any questions: contact us.