
DataStax cofounder on evolving Cassandra for modern workloads





Today, DataStax is a database-as-a-service company that offers to store “massive data on multiclouds.” But its roots were a bit more humble. The startup was born more than a decade ago, just after Facebook released a NoSQL key-value database called Cassandra as an open source project. At the time, Cassandra was just an internal tool Facebook engineers had built in Java for managing some of the endless data collections that were part of its mission. It was built to be a simple and very fast way to squirrel away key-value pairs. In open source, Cassandra quickly took on a new life as the NoSQL movement willingly traded strict consistency for speed and availability.

DataStax nurtured Cassandra, helping build an open source community while also expanding the tool to work across multiple platforms and clouds. Added features included enhanced synchronization and replication across datacenters and clouds. Now it’s not just a tool for handling big datasets on a few machines in your datacenter. It’s designed to juggle massive sets spread across the world.

Late last spring, the company welcomed another round of investment from Goldman Sachs. We sat down to talk with Jonathan Ellis, one of DataStax’s founders, to learn a bit more about where Cassandra began and just how DataStax is leading it into the future. We wanted to know what makes Cassandra a good fit for this big job and what DataStax needed to do to bolster its value.


This interview has been edited for clarity and brevity.

VentureBeat: So you’ve been working on Cassandra for a while now, right?

Jonathan Ellis: I got involved with Cassandra shortly after Facebook open-sourced it in 2008. I was the first non-Facebook committer on the Apache Cassandra project, and basically Facebook said, ‘Hey, it does what we needed to do, you know, so enjoy.’ And so I did community development as well as writing code for super-early Cassandra. About a year and a half into that, I started DataStax to raise investment to accelerate the progress and commercialize it.

Before that, I had built an object storage system for a backup software company. We created an in-house version that was specialized for handling backups. That means it’s OK for it to be high latency as long as it’s high throughput. We were optimizing for writes more than reads. You only need to read from the backup if you lose your local copy. And so the object storage part was relatively straightforward, and so was scaling it.

But the challenge we ran into is: How do you map hundreds of thousands of user accounts to billions of files? You couldn’t do that in 2005-2006 with off-the-shelf database software. This was a problem that effectively everyone was going to need to solve: How do I scale my system of record? How do I handle an entire country’s worth of users? How do I handle an entire world? So that’s why I wanted to get into this kind of distributed database space.

VentureBeat: You couldn’t run an off-the-shelf database, but you could build something from open source parts, right?

Ellis: Rackspace hired me to work on that for them, and I got to evaluate the options. At the time there was MongoDB and some systems that you don’t hear about anymore like Voldemort from LinkedIn. I got to evaluate these, and I thought Cassandra had really the best foundation to run with for the next 10-plus years. So that’s how I got involved with building an enterprise Cassandra.

VentureBeat: Now that you’ve explored that path and brought Cassandra into the future, where are you going now?

Ellis: During the second stage we’ve transitioned from being a Cassandra company to being an open data stack company. We don’t just want to give you the database piece of your infrastructure. We want to give you everything that you need to build cloud-based, microservice-based applications. And so we call that the open data stack, and we’re the first — and so far only, to my knowledge — company to be doing this on top of Kubernetes. We have our core expertise in Cassandra and from that we’re expanding.

VentureBeat: How does Kubernetes help? Doesn’t the basic multi-node Cassandra offer you the ability to split the load into multiple machines?

Ellis: No. We’ve basically pulled Cassandra apart and put it back together on top of Kubernetes. We’ve split the Cassandra reads and writes apart from the storage system. Classic Cassandra does both of these on a single node, and it deals with local storage in a high-performance way. The problem with that is that when you need to add compute capacity, you have to add storage capacity as well, because they’re coupled together. They’re literally on the same node.

VentureBeat: And if they’re forever bound together, scaling gets expensive, right?

Ellis: If you have a workload that’s very compute bound, then you’re basically wasting money by having to scale the storage investment as well. And vice versa. It’s actually more common to be storage bound than compute bound, but you can’t add more storage without also adding compute. So as part of decoupling this for the Astra DB service [DataStax’s database-as-a-service built on Cassandra], we’ve rebuilt the storage layer to take advantage of object storage like S3 in the Amazon cloud and similar services in the Google cloud and the Microsoft cloud, where we also offer Astra. Now we take advantage of this very cost-effective storage layer, but we can also scale the front-end compute independently from that storage.

VentureBeat: Why do you need to have Kubernetes? What does it add?

Ellis: The ability to give operators an intelligent console to run Cassandra with. In other words, Cassandra has had — justifiably — a reputation of being challenging to run. Not in the sense of being unreliable, but there’s a high learning curve. There are periodic tasks that you need to run to keep the cluster healthy. There’s a checklist to go through if a node fails and you need to replace it. There’s another checklist for what happens if you need to grow your cluster, what happens if you need to shrink it, and how you add a new datacenter and replicate to it. Kubernetes allows us to build an operator that automates all these things and reduces the chance that things can go wrong. When you’re a human following a checklist, there’s always that opportunity to make a mistake. Kubernetes is the best way to accomplish that automation.
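The operator pattern Ellis describes is typically driven by a declarative Kubernetes custom resource that the operator reconciles. As a rough sketch, based on the open-source cass-operator that K8ssandra builds on (field names here may differ by version and should be checked against the current CRD), declaring a three-node datacenter could look something like:

```yaml
# Hedged sketch: follows the cass-operator CassandraDatacenter CRD
# (cassandra.datastax.com/v1beta1); exact fields may vary by release.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: demo-cluster
  serverType: cassandra
  serverVersion: "4.0.0"
  size: 3                      # desired node count; edit to grow or shrink
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
```

Growing the cluster then becomes editing `size` and letting the operator run the checklist, rather than a human following it step by step.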

VentureBeat: Let’s say a single Cassandra node starts to get overburdened and has too many incoming queries. Kubernetes will just scale it and create a new peer and handle the replication and everything across it?

Ellis: Right. So we’re using Kubernetes in two ways. One is we’re kind of eating our own dog food, with our Astra service that is based on Kubernetes, and then we’ve open-sourced a lot of this in a project called K8ssandra.

VentureBeat: How do you add to this core product?

Ellis: We’re expanding horizontally, but we’re expanding vertically as well. What I mean by horizontally, we’re expanding to adjacent technology. You need a message bus to connect your services together, so we acquired a company called Kesque that had built Pulsar as a service. And we’ve rebuilt that, and we released it as Astra Streaming that’s in open beta as of last month. Now we have the database and we have the message bus. And we’re also moving up the stack. We’re providing an API gateway to this infrastructure. If you’re building a React application or if you’re building a JAMstack application, you can make a REST API call. You can make a GraphQL call rather than having to use the traditional heavyweight Cassandra drivers to access your data. And at the bottom of the stack, like I said, we’re building this on Kubernetes, and we’re building it to take advantage of cloud infrastructure in a native way.
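To make the gateway idea concrete, here is an illustration of reading the same row over HTTP in both REST and GraphQL styles instead of through a native driver. The endpoint URL, keyspace, table, and token below are hypothetical placeholders, not DataStax’s actual Astra API; the requests are only constructed, not sent.

```python
# Illustration only: the host, paths, and token are made-up placeholders.
import json
import urllib.request

ASTRA_ENDPOINT = "https://example-db.apps.astra.example.com"  # hypothetical
AUTH_TOKEN = "AstraCS:placeholder-token"                      # hypothetical

# A REST-style read: fetch one row from a hypothetical "orders" table by key.
rest_req = urllib.request.Request(
    url=f"{ASTRA_ENDPOINT}/api/rest/v2/keyspaces/shop/orders/12345",
    headers={"X-Cassandra-Token": AUTH_TOKEN},
    method="GET",
)

# The same read expressed as a GraphQL query against the gateway.
graphql_query = {
    "query": '{ orders(value: {id: "12345"}) { values { id total } } }'
}
graphql_req = urllib.request.Request(
    url=f"{ASTRA_ENDPOINT}/api/graphql/shop",
    data=json.dumps(graphql_query).encode(),
    headers={
        "X-Cassandra-Token": AUTH_TOKEN,
        "Content-Type": "application/json",
    },
    method="POST",
)

print(rest_req.full_url)
print(graphql_req.get_method())
```

The point is that a browser-side JAMstack app can issue these with `fetch()` and never load a heavyweight Cassandra driver at all.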

VentureBeat: You mentioned Pulsar, a message bus tool from Apache. How does adding this to the DataStax product line help the developer?

Ellis: So fundamentally this class of system is a building block that people need. You want your services to be decoupled from each other rather than making direct API calls from one to another. Decoupling them with the message bus gives you burst capacity. Now service B doesn’t have to be able to process events at exactly the same rate as service A. If there’s a load spike, the message bus can absorb that and let the other services catch up asynchronously. You get a whole bunch of other benefits too, like being able to replay events. Maybe there was a bug in your service and now you roll that out in a new version that has the bug fix. You can replay those events to it. Recovery from that kind of error is just much, much easier.
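The burst-absorption idea can be sketched with a plain in-memory queue standing in for the bus (a real deployment would use Pulsar; the service names here are invented for illustration):

```python
# Minimal stand-in for message-bus decoupling: "service A" bursts events
# onto a queue, and "service B" drains them at its own slower pace.
import queue
import threading
import time

bus = queue.Queue()   # stands in for a Pulsar topic
processed = []

def service_b():
    # Consumes at its own rate; it never sees service A's burst directly.
    while True:
        event = bus.get()
        if event is None:        # sentinel: shut down
            break
        processed.append(event)
        time.sleep(0.001)        # simulate slower processing

worker = threading.Thread(target=service_b)
worker.start()

# Service A emits a burst far faster than B can process; the queue absorbs it.
for i in range(100):
    bus.put(f"event-{i}")
bus.put(None)
worker.join()

print(len(processed))  # → 100: every event arrives, just asynchronously
```

Replay, in this picture, is simply feeding retained events back onto the queue for a fixed version of the consumer to process again.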

And then you can do realistic testing. You can take the events from your live system, anonymize them, and then pull them into a test environment. Now you have an actual real-world test scenario to run against. The big name in the space was Kafka, and Kafka doesn’t do a number of the things that were important to us — in particular, the ability to span multiple datacenters. So we’ve embraced Pulsar.

VentureBeat: Are there any forward-looking product plans for DataStax you want to tell us about?

Ellis: I did want to mention that Cassandra 4.0 will be released after something like four years of development. It’s a really big release in the Cassandra world. We put a ton of work into making it observable and adding audit logging and that kind of thing. And it’s also updated to be able to take advantage of the latest JVM improvements around garbage collection.

We’ve done a series of tests with one of our performance engineers. The slowest 0.1% of requests have gone down from, say, 60 milliseconds to something like six milliseconds, which is roughly an order of magnitude improvement. So that means huge performance improvements and also great new features built around observability and auditing.
