When should I use Apache Druid? Try this checklist.
Apache Druid is purpose-built to deliver high performance at low cost for an increasingly common set of use cases known as Operational Analytics. Operational Analytics, also referred to as “Continuous Analytics”, involves analyzing real-time data to make real-time decisions. It’s “analytics on the fly”: business signals are processed in real time and then fed to decision makers, who make appropriate adjustments to optimize the business activity.
When your use case meets a set of characteristics that make it optimal for Druid, Druid’s data engine can generate extremely fast results which enable the fast decision making required by operational analytics.
Moving from the data warehouse to Operational Intelligence
We are living in an era where data rules. No matter what products or services your company produces, you are almost certainly using data-driven analytics to understand your customers and the performance of those products or services. As the sheer volume of data and the need for analytics have grown, we’ve seen analytics quickly transition from first-gen on-prem RDBMSs such as MySQL, Postgres, and Oracle to second-gen SaaS data warehouse services such as Google BigQuery, Snowflake, and Amazon Redshift. These SaaS data services provide excellent NoOps solutions to a company’s data management problems, removing the need for DevOps and allowing a business to focus on its performance rather than the management of its data.
But as the data has grown, so have the use cases. While many data teams breathe a sigh of relief once they transition their data into the NoOps world of BigQuery or Snowflake, the reality is that the diversity of use cases calls for a variety of solutions.
To give an example, think about the difference between analyzing:
- ten years of retail data
- this week’s social media campaign
- clickstreams of a mobile app
Each has its own intricacies. Retail data analytics requires a database that supports updates, and the goal may be to generate weekly reports. The social media marketing campaign requires the ability to make decisions based on data generated over the past 24 hours. Finally, clickstream analytics must support streamed-in data with sub-second decision making.
When is Apache Druid most cost-effective?
Before we discuss the characteristics that make your use case optimal for Druid, let’s talk about performance. You might be wondering, will I really see a difference in performance if I move from my 2nd generation cloud warehouse (Snowflake, BigQuery, Redshift) to Druid? Will it really cost me less? If your dataset and usage meet the “Druid Optimal Characteristics” test, the answer is a resounding YES.
Taking a bird’s-eye view, and assuming your dataset and query characteristics meet the test, you can expect that at the same cost Druid will respond to queries at least 10 times faster than your 2nd-gen cloud warehouse. Benchmark data is too detailed for this short article, but the reality is that if your analytics use case meets the “Druid Optimal Characteristics” test, queries that take many seconds in a 2nd-gen cloud warehouse will take less than a second on Druid. The price-performance advantage grows with the number of queries you perform.
For example, if you perform 1,000 queries per month and need sub-second query latency, you might find that Druid costs about 70% of what a 2nd-gen cloud warehouse costs. If you perform 10,000,000 queries a month, Druid’s cost will drop to 0.1% of the warehouse’s cost. Yes, that’s 0.1%! If your analytics requirements are in the sweet spot of Druid, you can achieve high-speed queries across very large data at a small fraction of what the same operations cost with a 2nd-gen cloud warehouse. The takeaway here is that if you do lots of queries, the price per query will be much less with Druid than with a 2nd-gen warehouse such as Snowflake or BigQuery.
Apache Druid optimal characteristics checklist
OK, so let’s talk about the characteristics that enable Druid’s very significant price-performance advantage over 2nd-gen cloud warehouses.
- Insertion rates are very high. High insertion rates generally mean millions or even billions of events every day. Druid will perform on smaller amounts of data, but so will a 2nd-gen cloud warehouse, so your Druid “standup costs” may not pay off if you aren’t loading a lot of data.
- Updates are infrequent and, when they do occur, can be batched and applied on an infrequent schedule such as weekly or monthly. Updating data in Druid is costly and can impact performance. If you need to make frequent updates to your data, Druid may not be for you unless you can adjust your update process.
- You are targeting sub-second query latencies. Druid is all about performance. If you don’t need performance, it’s unlikely that you would want Druid.
- Your data is time series data and your analytical queries are mainly focused on evaluating the data over time. Druid optimizes by segmenting the data based on time. Druid does have other optimizations that increase performance (indexing, intersection support), but if you don’t have time series data, you will want to look further to ensure that Druid will give you a price-performance advantage. If you do have time series data (and meet the other requirements), Druid’s time series optimization will impress you in terms of speed and the cost of that speed.
- Your queries typically “group by” one or more of your dimensions. Druid uses indexing to make “group by” operations very fast.
- You want counts of very high-cardinality dimensions (for example, user ids or order ids), or you want to do set operations that generate counts, and you are willing to sacrifice the detail values associated with those dimensions. This is important to understand, so let’s use an example. Imagine you have customers with customer ids. Do you want to list all of the orders for a particular customer? If so, then you need the exact customer id for each customer. This is not optimal for Druid. You can do it, but you are not going to get the high performance at low cost that makes Druid shine. On the other hand, if you just want to know the number of customers that arrived during a particular date range, or the number of customers in the intersection of two date ranges, now you are working in Druid’s sweet spot!
- You are able to “pre-join” your data so that in Druid it is organized as a single large distributed table. Small lookup tables are available, but in general, the data must be de-normalized during ingestion. This can be done with a variety of ETL tools prior to bringing your data into Druid.
- Your data can be ingested as a stream or in batches. Many Druid applications are streaming applications, and Druid can ingest data directly from Kafka streams. It can also ingest from HDFS, flat files, or object storage.
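To make the time series and “group by” items on the checklist more concrete, here is a minimal Python sketch of the general idea behind time-based segmenting. It is a toy model, not Druid’s implementation: the events, the one-bucket-per-day layout, and the function names `build_segments` and `count_by_country` are all assumptions made up for illustration. The point is simply that when data is partitioned by time, a query over a date range only has to read the partitions that fall inside that range.

```python
from collections import Counter, defaultdict
from datetime import date, datetime

# Hypothetical clickstream events: (timestamp, country, page).
EVENTS = [
    (datetime(2021, 6, 1, 9, 15), "US", "/home"),
    (datetime(2021, 6, 1, 17, 40), "DE", "/pricing"),
    (datetime(2021, 6, 2, 8, 5), "US", "/pricing"),
    (datetime(2021, 6, 2, 21, 30), "US", "/home"),
    (datetime(2021, 6, 3, 11, 0), "FR", "/home"),
]


def build_segments(events):
    """Partition events into one bucket per calendar day (a toy stand-in for time-based segments)."""
    segments = defaultdict(list)
    for ts, country, page in events:
        segments[ts.date()].append((ts, country, page))
    return dict(segments)


def count_by_country(segments, start, end):
    """Count events per country for days in [start, end], reading only the buckets in range."""
    counts = Counter()
    scanned = 0
    for day, rows in segments.items():
        if start <= day <= end:  # time pruning: buckets outside the range are never read
            scanned += 1
            for _ts, country, _page in rows:
                counts[country] += 1
    return dict(counts), scanned


segments = build_segments(EVENTS)
result, scanned = count_by_country(segments, date(2021, 6, 1), date(2021, 6, 2))
print(result, "buckets scanned:", scanned)
```

In this toy dataset the query touches two of the three daily buckets. On real data with years of history, skipping everything outside the requested time range is where most of the savings would come from.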
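The trade-off in the high-cardinality item above (fast counts in exchange for giving up the individual ids) can be illustrated with a small approximate-counting sketch. The Python below implements a simple “K minimum values” estimator. This is a stand-in chosen for brevity and is not Druid’s actual code; the class name `KMVSketch`, the helper `hash01`, and the `user-N` ids are hypothetical. It keeps only the k smallest hash values it has seen, so memory stays fixed no matter how many distinct ids arrive, but the ids themselves cannot be recovered from it.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a string to a roughly uniform float between 0 and 1 using a stable hash."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keep only the k smallest hash values seen; the raw ids are discarded."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # same hashes, for duplicate checks

    def add(self, value: str) -> None:
        h = hash01(value)
        if h in self._members:
            return  # already tracked, so duplicates change nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept hash: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(50_000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")  # repeated ids are ignored
estimate = sketch.estimate()
print("approximate distinct users:", round(estimate))
```

With k = 1024 the estimate is typically within a few percent of the true count of 50,000, while storing about a thousand numbers instead of fifty thousand ids. Listing the orders for one specific customer, by contrast, requires the exact ids, which is why that kind of query falls outside the sweet spot described above.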
The future of operational intelligence is bright and “hot”
The transition from 1st-gen on-prem RDBMSs to 2nd-gen SaaS data warehouses was driven by the need to lower operational overhead. While these 2nd-gen SaaS data warehouses do a phenomenal job of reducing ops overhead, they were developed to power BI reporting tools and “cold analytics”. Query latency can be long (minutes), and scaling improves the ability to support higher query load but does not improve query latency. These 2nd-gen data warehouses are architected to scale horizontally: more disk and processing supports more data and users, but they are not built for sub-second query latency across big data with hundreds of simultaneous query requests.
Data analytics is a fast-moving domain. The 2000s saw the transition from long-turnaround, IT-driven insights to fast-turnaround analyst insights. The past ten years have seen the transition from HighOps, where you have a DevOps team dedicated to your analytics infrastructure, to NoOps, where you focus on your data and no infrastructure management is required.
But what’s next? Gartner’s Top 10 data and analytics trends for 2021 predicts that by 2023, more than 50% of analytics will be driven by data created, managed, and analyzed in edge environments. This trend indicates that data will more often be generated by devices and processing of that data will be more time sensitive.
As companies unlock their data, they will look for new ways to leverage the power of their data in the moment. This will require making massive amounts of data available for real-time, analytics-based decision making. 2nd-gen data warehouses will continue to play an important role in cold analytics, but processing this time-sensitive data will require solutions that support real-time streaming and computing on this massive “hot” operational data.
Give Apache Druid a spin on your data
At Rill, we are working to combine the ease and NoOps of 2nd-gen data warehouses with the performance and scale needed for operational analytics. If you want to complement your current 2nd-gen data warehouse with operational speed on your real time data, I invite you to try Druid on Rill to see whether Rill’s fully managed cloud service for Apache Druid is a fit for your operational analytics needs.