Data Talks on the Rocks 7 - Kishore Gopalakrishna, StarTree

By Michael Driscoll | April 4, 2025 | 5 minute read
Data Talks on the Rocks is a series of interviews with thought leaders and founders discussing the latest trends in data and analytics.

Data Talks on the Rocks 1 features: 

  • Edo Liberty, founder & CEO of Pinecone
  • Erik Bernhardsson, founder & CEO of Modal Labs
  • Katrin Ribant, founder & CEO of Ask-Y, and co-founder of Datorama

Data Talks on the Rocks 2 features: 

  • Guillermo Rauch, founder & CEO of Vercel
  • Ryan Blue, founder & CEO of Tabular, which was recently acquired by Databricks

Data Talks on the Rocks 3 features:

  • Lloyd Tabb, creator of Malloy, and the former founder of Looker

Data Talks on the Rocks 4 features:

  • Alexey Milovidov, co-founder & CTO of ClickHouse

Data Talks on the Rocks 5 features:

  • Hannes Mühleisen, creator of DuckDB

Data Talks on the Rocks 6 features:

  • Simon Späti, technical author & data engineer

Data Talks on the Rocks 8 features:

  • Joe Reis, author

Data Talks on the Rocks 9 features:

  • Matthaus Krzykowski, co-founder & CEO of dltHub

Data Talks on the Rocks 10 features:

  • Wes McKinney, creator of Pandas & Arrow

We've had the opportunity to speak in depth with founders of cutting-edge technologies, and our recent interviews have featured the creators of real-time analytical databases. For our seventh installment of Data Talks on the Rocks, we found it fitting to interview Kishore Gopalakrishna, co-founder and CEO of StarTree and creator of the wildly popular database Apache Pinot. Kishore and I had a deep technical discussion covering:

  • The three things Pinot set out to differentiate on: freshness, latency, and concurrency
  • Pinot's real-time benefits and real-world use cases across verticals: Uber, Stripe, Walmart
  • Pinot's unique architectural decisions: "index is a first class citizen"
  • Understanding users' needs in product development

I’ve noted some of my favorite highlights below:

On real-time data: (15:42)

“There is this amazing potential in data in the first few seconds, or even first minutes, and if you are able to leverage that and see the value, then it's endless.”
“Uber Freight [is] tracking the progress of truck drivers… providing real time insights to them on like, hey, this is the route you should take, you shouldn't be spending time here. They saw a huge improvement in business in terms of being on time, and then they saved millions of dollars.”

On the value of the open-source community: (39:54)

“When you hear from the users… if you keep an open mind, there is so much value that you can actually get from them… I still spend a lot of time on the Pinot Slack... We wouldn't have done upserts if it was not for Uber.”

On index support: (26:30)

“Most [real-time analytics databases] use the regular indexes, which is like the inverted index or bloom filter… some of them like ClickHouse, for example, have skip indexes, but it doesn't even have the row level indexes. We go way beyond that.”

On query efficiency: (24:51)

“Look at all these databases… what is the actual work that is being done [per query]? Pinot does the least amount of work, and that was the goal that I was actually shooting for… being able to have that [low] latency curve maintained, as you add more data, as you add more queries per second was a challenge that I took on while most people said… don't try to solve that problem.”

On updates: (20:27)

“[Update support] took multiple tries for us to actually get this architecture. Especially on the upserts… we did one version that didn't actually work out well, so we had to redesign that again… we are glad that we finally got it right.”

On SQL join support: (35:00)

“One of the things that we focused heavily on [was] user facing, external facing applications. But over the last 2 years we have seen huge pull on all the internal applications as well… all of these started with us adding the support for joins. [This] has become a huge strength, and we are pretty much beating every other system out there in terms of join performance.”

In a conversation with StarTree founder Kishore Gopalakrishna, one theme kept surfacing: real-time analytics is no longer just about faster dashboards. It is about changing what software can do.

For years, analytics systems were built around a familiar compromise: collect data now, understand it later. Maybe later meant minutes. Maybe hours. Sometimes a day. For internal reporting, that was often good enough.

But modern applications have changed the stakes.

When a user opens Uber Eats, the delivery estimate should reflect what is happening right now, not what happened 30 minutes ago. When a Stripe merchant checks their business, they expect live visibility into payments and operations. When a logistics company routes drivers or a bank monitors transactions, the value of information decays by the second.

That is the world Apache Pinot was built for.

In a recent episode of Data Talks on the Rocks, Rill CEO Michael Driscoll sat down with Kishore Gopalakrishna, founder and CEO of StarTree and creator of Apache Pinot, to talk about why Pinot exists, what makes it different, and how real-time analytics is reshaping both product design and data infrastructure.

Why build a new database at all?

Database origin stories often sound inevitable in hindsight. This one did not.

As Kishore put it, he did not want to build another database. By the time Pinot emerged at LinkedIn, he had already built distributed systems there, including Espresso, a NoSQL key-value store. The team already had an initial solution based on Elasticsearch. The instinct, naturally, was to extend what existed rather than start over.

But there is a moment in many engineering projects when incremental fixes stop working. You can add hacks. You can push a system a bit further. But eventually the fundamentals push back.

That is what happened.

The core challenge was not simply building “faster analytics.” LinkedIn wanted something much more demanding: OLAP at OLTP scale. That meant combining analytical queries with the kind of latency and concurrency users normally expect from transactional systems. In practical terms, the bar was not measured in seconds. It was measured in milliseconds.

For a product like Who Viewed My Profile, sub-100ms latency was not aspirational. It was table stakes.

That requirement led Pinot in a different direction from the beginning.

The Pinot thesis: freshness, latency, concurrency

The interview makes clear that Pinot’s architecture was shaped by three interlocking goals.

The first was data freshness. If an event lands in Kafka, it should become queryable immediately. Not after micro-batching. Not after being staged somewhere else. The whole point of streaming infrastructure is to preserve the value of data in motion.

The second was query latency. Pinot was designed around the idea that user-facing analytics must feel interactive. Not merely “fast for BI,” but fast enough to sit behind an application experience.

The third was concurrency. It is one thing to answer a dashboard query quickly for one analyst. It is another to answer hundreds of thousands of concurrent queries for real users in production systems.

That combination is what distinguishes Pinot’s role in the ecosystem. It is not just an OLAP engine for internal exploration. It is infrastructure for operational, embedded, customer-facing analytics.

And that changes everything.

Real-time data changes the product, not just the dashboard

One of the strongest ideas in the conversation is that a technology like Pinot does not simply improve analytics. It can change the way a business operates because it changes what product teams believe is possible.

Kishore described how this played out beyond LinkedIn.

At Uber Eats, live query infrastructure powers delivery estimates based on current orders, not stale historical averages. At Stripe, merchants can view live operational metrics through analytics powered behind the scenes by Pinot. At Walmart, order and fulfillment tracking flows through Pinot as orders move through state transitions like placed, inventoried, and shipped.

In each case, the use case begins with real-time data infrastructure, but it ends in a very different place: better product experiences.

That is an important shift. The business value of real-time systems is not only internal visibility. It is external responsiveness.

And once teams experience that shift, they rarely go backward.

As Kishore noted, many companies start by adopting Kafka or another streaming system as a better ETL backbone. At first, they move data quickly into a slower environment like a data lake. Eventually they realize they have created a contradiction: a fast data pipeline feeding a slow analytics surface.

That is often the moment when the next question appears: what if we could act on the data while it still matters?

Working smarter, not harder

A major theme in the discussion is that Pinot was designed not just to run fast, but to do less work.

That distinction matters.

Many analytical systems get faster by relying on better hardware, faster SSDs, stronger networks, or brute-force scanning at larger scale. Pinot’s philosophy was different: if a query does not need to scan a row, it should not scan it. If a segment does not need to be touched, it should not be touched. If work can be avoided physically, not just logically, it should be avoided.

This is where indexing becomes central.

Pinot treats indexes as a first-class architectural principle, not an afterthought. That includes standard structures like inverted indexes and bloom filters, but it goes further with capabilities like range indexes, JSON indexing on nested fields, geospatial support using H3, and the distinctive StarTree index.

The StarTree index is especially interesting because it behaves like a highly selective, intelligent materialized view. Rather than precomputing everything, it selectively aggregates the parts of the data most likely to create expensive queries at runtime. In other words, it is not just about pre-aggregation. It is about pre-aggregation with judgment.

That design choice reflects a broader Pinot instinct: optimize for predictable performance under real-world workload variability.

A query for a high-cardinality or high-volume slice of data should not suddenly collapse the user experience just because more rows match. Pinot’s indexing model tries to flatten that variability and create more consistent latency across workloads.
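The "pre-aggregation with judgment" idea can be sketched in a few lines of Python. This toy is illustrative only — the data, the `MAX_LEAF_RECORDS` threshold, and the single-dimension rollup are invented here and are not Pinot's actual star-tree implementation — but it shows the core instinct: pre-aggregate only the groups large enough to make runtime scans expensive, and let everything else stay a cheap raw scan.

```python
from collections import defaultdict

# Hypothetical event data: (country, browser, revenue)
events = [
    ("US", "chrome", 10), ("US", "chrome", 20), ("US", "safari", 5),
    ("DE", "chrome", 7), ("DE", "firefox", 3), ("US", "chrome", 4),
]

MAX_LEAF_RECORDS = 2  # threshold: only "expensive" groups get a rollup

# Pass 1: count raw rows per country. Only countries whose row count
# exceeds the threshold get a pre-aggregated node built at ingest time.
rows_per_country = defaultdict(int)
for country, _, _ in events:
    rows_per_country[country] += 1

rollups = {
    country: sum(r for c, _, r in events if c == country)
    for country, n in rows_per_country.items()
    if n > MAX_LEAF_RECORDS
}

def total_revenue(country):
    # Serve from the rollup when one exists; otherwise scan raw rows,
    # which is fine because the group was too small to be expensive.
    if country in rollups:
        return rollups[country]
    return sum(r for c, _, r in events if c == country)

print(total_revenue("US"))  # answered from the pre-aggregated node
print(total_revenue("DE"))  # answered by scanning the raw rows
```

The payoff is the flattened latency curve described above: the queries that would otherwise scan the most rows are exactly the ones that get answered from a precomputed node.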

The contrarian bets that paid off

Every meaningful system design starts with a few bets that look risky at the time.

For Pinot, one of those was committing early to a columnar architecture while still insisting on real-time freshness. That raised obvious objections: how do you handle updates? How do you avoid freshness penalties? How do you reconcile a column store with mutable, messy event streams?

Another bet was support for upserts without sacrificing read performance.

Kishore explained that many systems resolve these conflicts during reads, using LSM-style tradeoffs that defer some work until query time. Pinot made the opposite decision: resolve the conflict during writes so reads stay fast. That is a harder path operationally, but it preserves the low-latency behavior Pinot was built for.

The result is not only support for the latest state of a record, but also a changelog-like history of how that record evolved over time.

That matters because real-world data is not clean or immutable. Late-arriving updates happen. Corrections happen. Operational systems change state. In practice, modern analytics infrastructure has to be able to handle that messiness without giving up on speed.
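As a rough illustration of write-time conflict resolution (the class and field names here are hypothetical, and Pinot's real upsert metadata management is far more involved), ingestion can maintain a primary-key map pointing at the current version of each row, so queries read the latest state without any merge-on-read work:

```python
class UpsertTable:
    """Toy table: conflicts resolved during writes, not during reads."""

    def __init__(self):
        self.rows = []    # append-only log (the changelog-like history)
        self.latest = {}  # primary key -> index of the current version

    def ingest(self, key, value):
        # The upsert conflict is resolved here, at write time...
        self.rows.append((key, value))
        self.latest[key] = len(self.rows) - 1

    def query_current(self):
        # ...so a read is a plain scan over current versions, with no
        # LSM-style reconciliation deferred to query time.
        return {k: self.rows[i][1] for k, i in self.latest.items()}

    def history(self, key):
        # Every prior version is still in the log.
        return [v for k, v in self.rows if k == key]

t = UpsertTable()
t.ingest("ride-1", {"fare": 10})
t.ingest("ride-2", {"fare": 8})
t.ingest("ride-1", {"fare": 12})  # a late-arriving fare correction
print(t.query_current())
print(t.history("ride-1"))
```

Note that both properties from the interview fall out of the same structure: reads see only the latest state, while the append-only log preserves how each record evolved.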

Joins, object storage, and Pinot’s next chapter

The conversation also touched on how Pinot is evolving.

For years, Pinot’s core reputation was tied to user-facing real-time analytics. More recently, it has seen growing demand for internal analytical workflows too, including product analytics and funnel analysis. A major enabler there has been improved join support.

Again, the interesting part is not just that Pinot added joins, but how it approached them. Instead of copying the default distributed SQL playbook, the team focused on reducing the two things that make joins expensive: scanning and shuffling. By leaning on indexes, placement strategies, and dynamic filtering, Pinot aims to make joins interactive rather than merely tolerable.
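The scan-and-shuffle reduction can be illustrated with a toy dynamic filter: resolve the selective side of the join first, then use its keys to prune the large side before any join work happens. The tables and names below are invented for illustration, and Pinot's actual join execution is distributed, but the pruning principle is the same.

```python
orders = [  # big fact "table": (order_id, customer_id, amount)
    (1, "c1", 30), (2, "c2", 15), (3, "c1", 20), (4, "c3", 99), (5, "c2", 5),
]
customers = [  # small dimension "table": (customer_id, region)
    ("c1", "EU"), ("c2", "US"), ("c3", "EU"),
]

def join_eu_total():
    # Step 1: evaluate the small, selective side of the join first.
    eu_keys = {cid for cid, region in customers if region == "EU"}
    # Step 2: push the resulting key set down as a runtime filter on
    # the big side, standing in for an index lookup that lets whole
    # non-matching segments be skipped instead of scanned and shuffled.
    return sum(amount for _, cid, amount in orders if cid in eu_keys)

print(join_eu_total())
```

In a distributed engine the same idea means the large table's rows that cannot possibly match never leave the servers they live on, which is where most join cost hides.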

The same architectural mindset shows up in Pinot’s relationship to object storage.

As the broader data ecosystem moves toward Iceberg, Delta Lake, and S3-backed architectures, Pinot is not standing apart from the trend. But it is engaging with it from a very particular angle: how do you preserve low-latency analytical performance without pulling unnecessary data across the network?

Kishore described StarTree’s tiered storage work as a way to answer that question. Rather than lazily loading entire files, Pinot can use indexes to fetch precise byte ranges from object storage. That means its earlier decisions about segment structure and index separation now pay unexpected dividends in a very different deployment model.
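A minimal sketch of that idea, with an in-memory buffer standing in for object storage: a per-column index records byte offsets at segment build time, so a read touches only the exact range it needs. The segment layout and index shape here are invented for illustration; a real implementation would translate the seek into an HTTP `Range` request against S3 rather than a local read.

```python
import io

# A fake "segment file" in object storage: header, two columns, footer.
segment = io.BytesIO(b"HEADERclicks:1,4,2views:9,9,8FOOTER")

column_index = {         # (byte offset, length), recorded at build time
    "clicks": (6, 12),
    "views": (18, 11),
}

def fetch_column(name):
    # Fetch only the bytes the index points at, instead of lazily
    # pulling the whole file across the network.
    offset, length = column_index[name]
    segment.seek(offset)  # stands in for a ranged GET on the object
    return segment.read(length).decode()

print(fetch_column("clicks"))
```

The reason this works at all is the earlier architectural choice the interview highlights: because segments already keep columns and indexes physically separable, the offsets needed for precise fetches were available long before tiered storage existed.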

It is a reminder that good systems architecture often creates optionality you do not fully appreciate until much later.

The best roadmap signal might be your users

Toward the end of the conversation, Michael brought up one of Kishore’s memorable remarks: you don’t need a product manager when you have a community.

Taken literally, that line is easy to debate. Taken seriously, it points to something important.

Pinot’s roadmap has been deeply shaped by the people using it. Not just by feature requests, but by the problems hiding underneath them. Uber’s need for mutable ride fares helped push Pinot toward upserts. Community questions on Slack continue to expose new use cases and new edges of the product.

The lesson is not that product management does not matter. It is that the strongest product intuition often comes from listening carefully to the surprise in the room.

Users do not always know the right solution. But they usually know the shape of the problem. And in infrastructure especially, the “why” behind the request often matters more than the exact feature being asked for.

That openness to surprise seems deeply tied to Pinot’s evolution. It is part of why the system has expanded beyond its original use cases and into a broader real-time analytics platform.

Real-time analytics is becoming product infrastructure

What makes this conversation so timely is that Pinot’s story is no longer unusual. Across the industry, teams are discovering that analytics is moving closer to the application surface.

Users expect live operational context. Product teams want metrics embedded directly into workflows. Businesses want systems that react in seconds, not after the fact.

That does not mean every workload needs Pinot. It does mean the market increasingly needs systems built around the assumptions Pinot embraced early: fresh data, low latency, high concurrency, and physical efficiency.

For years, analytics was something you did after the system ran.

Now, increasingly, analytics is part of how the system runs.

That is the shift Apache Pinot helped pioneer — and it is still unfolding.

Ready for faster dashboards?

Try for free today.