Why We Built Rill with DuckDB
The obsession with DuckDB within the data community has at times bordered on a cult following, often to the bafflement of outsiders. Here at Rill, we’ve had a longstanding crush on this analytics database, and in late 2021 we decided it was the perfect engine to power a conversation-fast data profiling and dashboard tool, Rill Developer.
If you’re considering leveraging DuckDB for your own application, what follows is a level-headed perspective on why we felt its features were a unique fit for Rill’s product requirements. We hope this might provide some pragmatic insights for other developers evaluating DuckDB for their own data applications.
What is DuckDB?
DuckDB is a doppelganger of sorts to SQLite, the world’s most widely deployed database. Like SQLite, DuckDB has a lightweight install footprint, no external dependencies, and runs in-process – making it easy to embed into other applications. However, unlike SQLite, which is optimized for transactional workloads (OLTP), DuckDB is optimized for read-oriented, analytical workloads (OLAP).
To say that DuckDB is optimized for analytics is an understatement. DuckDB was originally created by Mark Raasveldt and Hannes Mühleisen in 2019, researchers at Amsterdam’s CWI, which has pioneered many of the advances in vectorized execution engines and column-oriented architectures, going back to MonetDB in the 1990s. DuckDB was created by giantly-talented developers standing on the shoulders of giants.
Rill’s Requirements for a Database Engine
Rill Developer is an open-source, code-first BI tool that enables software engineers to profile data sets, design data models, and deploy dashboards to Rill Cloud.
Our mission is to make it possible for a developer to get from a data lake to a cloud dashboard in a few minutes. To achieve that goal, every step on this journey must be fast. Specifically, our requirements for a database engine in our product were:
- Be easily embeddable - Rill Developer is a standalone binary the developers can download and run locally, enabling a “close-to-the-metal” experience. We needed a database engine that was lightweight and easy to embed.
- Profile 10 GB data sets in seconds - Rill Developer automatically profiles data sets as large as 10 GB, to save developers time, firing off 100s of queries calculating cardinality, nullity, top-K frequent elements, quantile distribution, and other summary statistics. For the experience to feel interactive, each of these profiling queries must return in milliseconds.
- Execute complex SQL in milliseconds - Rill Developer’s data modeler has no run button, so complex SQL models involving JOINs or GROUP BYs – as well as chains of these models – must be validated and executed on every keystroke.
- Support moderate query concurrency - Rill Cloud dashboards lack profiling or complex modeling features, but unlike the single-player-mode of Rill Developer, they have a concurrency requirement. We estimate this “moderate” concurrency requirement would need to support dozens of users, each generating tens of queries every few seconds with exploration actions.
Given these requirements, why did Rill choose DuckDB?
DuckDB is the Fastest Database for Single-Node Data Applications
The biggest differentiator for us in choosing DuckDB was its uniquely high performance for analytics queries. SQLite is a more mature database engine, and arguably more easily embeddable than DuckDB. (Postgres and MySQL are certainly lightweight engines, but they are not embeddable in the sense of an in-process, loadable database library).
For single-node applications operating on less than 100GB of data (which covers the vast majority of real-world use cases), DuckDB is routinely benchmarked as being the fastest analytics database anywhere. Hex uses DuckDB internally for its dataframe SQL, and Mode Analytics recently switched to DuckDB for its in-memory Helix engine. Our internal benchmarking confirmed previous reports that DuckDB outperforms SQLite on various analytics queries by an order of magnitude (ranging from 3x to 30x).
For Rill Developer, this performance breakthrough enabled us to fulfill our requirements for automatically profiling hundreds of columns and providing keystroke-by-keystroke updates of SQL models – without a run button.
For Rill Cloud dashboards, we’ve been pleased with the performance of DuckDB, although we have yet to benchmark performance at the upper bounds of scale and query concurrency (more on that in a future blog post).
DuckDB has a Delectable SQL Dialect
Rill’s BI-as-code philosophy means SQL is at the center. In Rill Developer, our users spend the bulk of their time editing the SQL-defined data models that power dashboards. These typically include joins across fact and dimension tables, and aggregate expressions for metrics like sums, averages, quantiles, and distinct counts.
DuckDB’s SQL dialect is built on the Postgres standard but with a delectable sprinkle of syntactic sugar for common analytical expressions.
DuckDB has the potential to become a universal data connector
One of the largest obstacles to using any database system is simply getting data into the tool. DuckDB has expansive support for loading data from the most commonly used file formats like CSV, JSON, and Parquet. Notably, DuckDB continues to deepen its support for parsing nested JSON, since nested JSON data – for better or worse – is what most source systems emit. Postgres and SQLite databases can also be queried directly through DuckDB, through the use of its scanner extensions.
DuckDB’s architecture enables it to operate on external data sources without needing to copy this data into DuckDB’s native storage format. This approach offers enhanced consistency, as there is no need to sync data between systems. It also offers increased efficiency, since DuckDB can push predicates, like a ‘WHERE’ clause, all the way down to source systems. Practically speaking this means that DuckDB can query Parquet files stored in a data lake, while only processing a narrow subset of the data.
As additional file format and database scanners emerge, DuckDB’s capabilities offer a universal portal through which data applications can connect with vast varieties of data. This is similar to the connector capabilities that tools like Apache Trino offer, but without the massive overhead. Rill Developer and others need not reinvent connectors, but rather float on the rising tide of DuckDB’s ingestion capabilities.
DuckDB has a Vibrant Community, Embraces Open Standards, and is Open Source
As an early adopter of any new technology, one of the biggest considerations is not only its present state but a sense of where it’s going. For proprietary technologies, the perceived health of the company matters, but for open source tools, it’s the health of the community. In this regard, DuckDB stands out for the vibrancy and vitality of its developer community. The pace of innovation in the project is breathtaking, with features proposed and implemented in a matter of weeks.
Moreover, DuckDB has embraced many other open standards in the data ecosystem, in its prioritization of file formats and databases (mentioned above) as well as its client API languages. Finally, DuckDB’s choice to use the open-source license (MIT License) enables worry-free embedding for developers.
Pushing the Limits of Scale
Choosing DuckDB has been one of the easiest design decisions for our team. It’s a natural fit for both the user experience requirements of auto-profiling and interactive modeling in Rill Developer, and exploratory dashboards in Rill Cloud.
In the coming months, we look forward to pushing the limits for what’s possible with DuckDB when it comes to speed at scale, while continuing to actively participate within an already impressive community of developers.
Want to see how fast and simple it is to create dashboards with Rill? You can get started in less than 2 minutes with our install script (Mac, Linux).
Gaffney K. (2022). SQLite: Past, Present, and Future. Proceedings of the VLDB Endowment, 15(9), 1057-1060.
Fraser, G. (2021). How Fast is DuckDB Really?, Fivetran Blog.