Full-stack scalability - The benefits of building on a highly elastic data layer

Alex Stride

Senior Machine Learning Engineer

Alex is a Senior Machine Learning Engineer in Featurespace Labs. He joined the team in July 2021 and works on designing and building the next generation of products for Featurespace. Before joining he worked on question answering systems for Amazon Alexa, and has experience from other start-ups in the area of Natural Language Processing. Alex holds a degree in Linguistics & Law from Cambridge University and started programming when he chose to specialise in Computational Linguistics. He also briefly taught Computer Science in a secondary school before becoming an engineer. Outside of work Alex enjoys tinkering with new programming languages, cooking, cycling and walking his dog, Bella.

Connect

Scalability has to start from the data

The gradual movement of all software from dedicated hardware to virtual machines in the cloud has massively changed the way that software is written. It is now the norm to divide up services into containerized microservices and deploy them in an environment like Kubernetes, which can automatically scale those services up and down, provisioning new containers and requesting new virtual machines when necessary to meet demand. Completely managed, “serverless” offerings are also an increasingly common part of the modern technology stack. However, many applications still confine dynamic scaling behavior to the stateless parts of their systems, while reaching for non-scalable databases like MongoDB or PostgreSQL when they need to store state.

The issue with deploying a non-scalable (or at least difficult to scale) database alongside otherwise elastic infrastructure is that the data layer can easily bottleneck the whole system, making the elasticity of the rest of the system redundant. There is no point having an infinitely scalable, containerized API if all those containers are sitting around waiting for the database to respond. A truly elastic service needs a truly elastic data layer and serverless cloud databases provide exactly such a data layer, plus several benefits besides.

Why are scalable databases not the norm?

Aside from pure familiarity, I believe there are a few reasons that people do not look towards fully scalable, serverless databases more often.

Reliance on SQL

The most common scalable databases are all NoSQL (e.g. Bigtable, DynamoDB, Cosmos DB & Cassandra). Distributed and scalable SQL offerings are available, but they are much less common, and they are usually not drop-in replacements for databases like PostgreSQL and MySQL. Therefore, companies which lean heavily on SQL queries find it much more difficult to adopt more distributed and resilient solutions.

Data modelling for NoSQL databases is difficult and less widely understood than modelling relational database schemas. This leaves a lot of people thinking that their application requires the ability to run SQL queries at a fundamental level when it is not really the case. Those engineers could likely increase reliability and scalability by taking the time to remap their query patterns to use a NoSQL solution.

Concerns about vendor lock-in:

Companies, particularly large ones, tend to be wary of leaning on technology which is specific to a single vendor. This is a valid general concern, but for the most part it should not be an issue with serverless NoSQL databases. From the point of view of the user, the most popular cloud databases are predominantly simple, distributed key-value stores. Of course, they each have more advanced features which may be more idiosyncratic, but the core of a data layer built on, say, AWS DynamoDB should be fairly simple to port over to Azure Cosmos DB, self-managed Apache Cassandra, or any other NoSQL data store. One of the key benefits of a NoSQL data model is its portability.

Concerns about performance:

In our experience of using DynamoDB, we see read and write latencies which are reliably around 5milliseconds. While there are applications in which lower database latency is critical, in most well-architected systems, this level of performance is perfectly adequate. Cloud offerings often also have well-integrated caching solutions for those who need to drive down latency further.

Concerns about billing:

Much has been written (some of it by rival proprietary data store providers) about the danger of cloud database costs spiraling out of control. In my view, these are fears generated based on a few examples of companies misunderstanding pricing models or misunderstanding how you are supposed to use a NoSQL database. People who are generally uncomfortable with cloud billing models will also be uncomfortable with the way that cloud databases are billed, but it is my strong belief that most applications would see a large reduction in price by moving to a serverless cloud database, even in the case of large, global software services. This is due to the ability to easily manage elastic scaling in the way that I will describe below.

Leveraging complete scalability

At Featurespace, we design fraud systems with highly variable throughput requirements. We mostly need to process events coming into the system at a baseline rate, but occasionally we need to run a batch operation which performs a lot of database reads and writes in a short amount of time. We face this problem because we have to update records for a large number of customers or merchants from batches of data, but the problem of highly variable throughput is common to almost all applications. Many applications have to perform some sort of bulk operations periodically and it is important that the database can continue to serve production traffic during these times of peak activity.

If you are using a non-scalable database, then the only way to avoid overloading that database during high-throughput batch operations is to throttle those operations to a level where you know standard operations are not in jeopardy. In addition, you must make sure that your database has been set up with enough capacity to support regular production traffic plus some extra to accommodate some batch operations happening in the background. Not only does this result in having to foot the bill for large database deployments running 24/7, but it is also likely you will have to significantly slow down the processing of batches to avoid overwhelming the database.

With a completely scalable, serverless database these problems can be solved. We have recently been developing a proof-of-concept system which stores state in AWS DynamoDB, spread across a number of separate tables. For each of the tables we dynamically turn the provisioned throughput up and down, based on how much throughput each table needs for its current state. For example, we might have a table of customer states, supporting a single classifier model, using only 20 reads and 10 writes per second. But, if a batch of data lands in the system and we need to update the state for every single customer, we can simply increase the provisioned capacity for that table to 20,000 reads and 10,000 writes per second and keep it provisioned at that level until the batch has finished being ingested.

DynamoDB is highly unlikely to ever have capacity issues caused by a single client, because behind the scenes all the infrastructure for all DynamoDB customers comes from a huge common pool and AWS can spread each customer’s data over as many of those servers as required to support the requested capacity.

We can, therefore, effectively set our throughput arbitrarily high for short periods, ensuring that we maintain the capacity we need to ingest the batch of data quickly without jeopardising the regular operation of the system. Of course, we will still throttle requests to the table to some extent, but we can reach a much higher throughput and complete the job in a fraction of the time needed by a non-scalable database provisioned for normal traffic levels.

Note also that because we can use different tables for different types of data, the increase in load on one customer table cannot affect the throughput of other tables serving different types of data in production because each DynamoDB table has its own metered throughput.

Unlocking value for the business

By applying this type of highly dynamic scaling behaviour at the database level, we expect to be able to create systems which have database costs in orders of magnitude lower than equivalent systems using non-dynamically scalable databases.

At the same time, because of the huge throughput capabilities of DynamoDB, we expect to speed up certain batch operations by an order of magnitude. Moreover, the serverless nature of DynamoDB, and the fact that it is vended by AWS itself, means that the operational and administrative burden of working with this database is significantly less than even a managed cluster from a 3rd party vendor.

thong-vo-Maf7wdHCmvo-unsplash-glitched-12-10-2021-14-37-02

Scaling databases

Altogether, a huge amount of business value can be gained by dynamically scaling databases and removing them as a bottleneck in the system, at times when they need to do large amounts of work. The power of building systems on top of serverless offerings is that we can not only leverage the potential for highly elastic infrastructure, but we can do so without having to manage any of that infrastructure ourselves, safe in the knowledge that those systems are built for extremely high levels of reliability.

Impact of infrastructure

The impact of infrastructure moving to the cloud has been the start of the software industry’s move along the path of building flexible architectures which grow with business demand, but we still have a long way to go. Leaning on architectures which are fully scalable, from top to bottom, means that engineers will be able to deliver more value with less effort, less maintenance and lower cost.

Full-stack scalability – The benefits of building on a highly elastic data layer