Apache Druid is purpose-built to deliver high performance at low cost for a set of use cases that is becoming increasingly common, known as Operational Analytics. Operational Analytics, also referred to as “Continuous Analytics”, involves analyzing real-time data to make real-time decisions. It’s “analytics on the fly”: business signals are processed in real time and then fed to decision makers, who make appropriate adjustments to optimize the business activity.
When your use case has the characteristics that make it optimal for Druid, Druid’s data engine can return extremely fast results, enabling the rapid decision making that operational analytics requires.
We are living in an era where data rules, and no matter what products or services your company offers, you are almost certainly using data-driven analytics to understand your customers and the performance of those products or services. As the sheer volume of data and the need for analytics have grown, analytics has quickly transitioned from first-gen on-prem RDBMSs such as MySQL, Postgres, and Oracle to second-gen SaaS data warehouse services such as Google BigQuery, Snowflake, and Amazon Redshift. These SaaS data services provide excellent NoOps solutions to a company’s data management problems, removing the need for DevOps and allowing a business to focus on its performance rather than the management of its data.
But as the data has grown, so have the use cases. While many data teams breathe a sigh of relief once they transition their data into the NoOps world of BigQuery or Snowflake, the reality is that the diversity of use cases calls for a variety of solutions.
To give an example, think about the difference between analyzing:

- retail sales data to produce weekly reports
- the performance of a social media marketing campaign over the past 24 hours
- a live clickstream that must drive sub-second decisions
Each has its own intricacies. Retail data analytics requires a database that supports updates, and the goal may be to generate weekly reports. The social media marketing campaign requires the ability to make decisions based on data generated over the past 24 hours. Finally, the clickstream analytics must support streamed-in data with sub-second decision making.
Before we discuss the characteristics that make your use case optimal for Druid, let’s talk about performance. You might be wondering: will I really see a difference in performance if I move from my 2nd-gen cloud warehouse (Snowflake, BigQuery, Redshift) to Druid? Will it really cost me less? If your dataset and usage meet the “Druid Optimal Characteristics” test, the answer is a resounding YES.
Taking a bird’s-eye view, and assuming your dataset and query characteristics meet the test, you can expect that at the same cost, Druid will respond to queries at least 10 times faster than your 2nd-gen cloud warehouse. Full benchmark data is beyond the scope of this short article, but if your analytics use case meets the “Druid Optimal Characteristics” test, queries that take many seconds in a 2nd-gen cloud warehouse will take less than a second in Druid. The price-performance advantage grows with the number of queries you perform.
For example, if you perform 1,000 queries per month, hitting sub-second query latency with Druid might cost about 70% of what a 2nd-gen cloud warehouse would. If you perform 10,000,000 queries a month, Druid’s cost can drop to 0.1% of the warehouse’s. Yes, 0.1%! If your analytics requirements are in Druid’s sweet spot, you can achieve high-speed queries across very large data at a small fraction of the cost of the same operations in a 2nd-gen cloud warehouse. The takeaway here is that if you run lots of queries, the price per query will be much lower with Druid than with a 2nd-gen warehouse such as Snowflake or BigQuery.
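To see why per-query price diverges like this, here is a minimal sketch of the underlying dynamic. The dollar figures are entirely hypothetical (not benchmarks of any real warehouse or Druid deployment), and a purely fixed-cost Druid model is an oversimplification, since a real cluster’s cost also grows with load, just far more slowly; the point is only the direction of the effect when a per-query bill is compared against a mostly flat cluster cost.

```python
# Hypothetical cost model: a warehouse that bills roughly per query
# versus a provisioned Druid cluster with a mostly fixed monthly cost.
# All dollar amounts are made up for illustration.

WAREHOUSE_COST_PER_QUERY = 0.01    # assumed $/query for the warehouse
DRUID_MONTHLY_CLUSTER_COST = 7.0   # assumed flat $/month for a small cluster

def monthly_costs(queries_per_month: int) -> tuple[float, float]:
    """Return (warehouse_cost, druid_cost) in dollars for one month."""
    warehouse = WAREHOUSE_COST_PER_QUERY * queries_per_month
    druid = DRUID_MONTHLY_CLUSTER_COST  # flat, regardless of query count
    return warehouse, druid

for q in (1_000, 10_000_000):
    wh, dr = monthly_costs(q)
    print(f"{q:>10,} queries/month: Druid is {dr / wh:.2%} of the warehouse cost")
```

At 1,000 queries the two bills are comparable, while at 10,000,000 the flat cluster cost is amortized over ten thousand times as many queries, which is the shape of the advantage described above.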
OK, so let’s talk about the magic characteristics that enable Druid’s very significant price-performance advantage over 2nd-gen cloud warehouses.
The transition from 1st-gen on-prem RDBMSs to 2nd-gen SaaS data warehouses was driven by the need to lower operational overhead. While these 2nd-gen SaaS data warehouses do a phenomenal job of reducing ops overhead, they were developed to power BI reporting tools and “cold analytics”. Query latency can be long (minutes), and scaling improves the ability to support higher query load but does not improve query latency. These 2nd-gen data warehouses are architected to scale horizontally: more disk and processing supports more data and more users, but they are not built for sub-second query latency across big data with hundreds of simultaneous query requests.
Data analytics is a fast-moving domain. The 2000s saw the transition from long-turnaround, IT-driven insights to fast-turnaround analyst insights. The past ten years have seen the transition from HighOps, where a DevOps team is dedicated to your analytics infrastructure, to NoOps, where you focus on your data and no infrastructure management is required.
But what’s next? Gartner’s Top 10 data and analytics trends for 2021 predicts that by 2023, more than 50% of analytics will be driven by data created, managed, and analyzed in edge environments. This trend indicates that data will more often be generated by devices and processing of that data will be more time sensitive.
As companies unlock their data, they will look for new ways to leverage its power in the moment. This will require making massive amounts of data available for real-time, analytics-based decision making. 2nd-gen data warehouses will continue to play an important role in cold analytics, but processing this time-sensitive data will require solutions that support real-time streaming and computation over this massive “hot” operational data.
At Rill, we are working to combine the NoOps ease of 2nd-gen data warehouses with the performance and scale needed for operational analytics. If you want to complement your current 2nd-gen data warehouse with operational speed on your real-time data, I invite you to try Druid on Rill to see whether Rill’s fully managed cloud service for Apache Druid is a fit for your operational analytics needs.