Signify contacted us because they needed a new system that could publish their marketing information to multiple output channels, for example a website. This new system had to push content updates at irregular intervals.
After a few brainstorming sessions, we proposed a solution that was mostly serverless and pay-as-you-go. It still had to overcome a number of challenges, but designing the system as a serverless one dealt with several of them right from the start:
The number of updates each month is very unpredictable, yet we didn't want to pay for expensive servers running all the time.
One of Signify's requirements was an easily extendable system. We designed the architecture so that adding a new output channel takes minimal effort and can be done in a matter of weeks.
On top of that, we had to ensure that the system, although highly distributed, never overwrites new updates with old ones. All of this had to be accomplished without a performance penalty.
Building the system with a serverless architecture didn't only solve challenges, it also created one: logging the entire solution. Instead of consulting each microservice to read its logs, we needed centralized logging for the distributed system.
The image below represents a high-level overview of the architecture. The first thing to notice is that we receive data from a master data management (MDM) system called Stibo STEP. This MDM system is set up to detect changes to the product data using events and to export these changes to the 'Ingestion layer' of our product content publication system. When the raw data arrives from this PIM/MDM system, we process it and transform it into standard models. These models in turn trigger the creation of new product data for the different output channels.
One of the early challenges we had to deal with was making the system fully event-driven while ensuring in-order processing, to guarantee that a new update is never overwritten by an old one. On top of that, the process had to support an unpredictable load.
As mentioned before, the PIM system exports data to our ingestion layer by writing it into a few DynamoDB tables. These tables in turn automatically trigger an AWS Lambda function that executes the first transformations, resulting in a standard model. The architecture depicted below was our first iteration towards this goal.
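As a rough illustration (not the actual implementation), such an ingestion Lambda could look like the sketch below. It assumes the DynamoDB stream is configured to deliver new images, and the table and field names are placeholders.

```python
import boto3
from boto3.dynamodb.types import TypeDeserializer

# Illustrative table name; the real tables and models differ.
STANDARD_MODEL_TABLE = "standard-model"

deserializer = TypeDeserializer()
dynamodb = boto3.resource("dynamodb")
standard_model_table = dynamodb.Table(STANDARD_MODEL_TABLE)


def handler(event, context):
    """Triggered by a DynamoDB stream on an ingestion table."""
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue  # deletions are out of scope for this sketch

        # DynamoDB streams deliver items in the low-level attribute format,
        # so convert them to plain Python dictionaries first.
        raw_item = {
            key: deserializer.deserialize(value)
            for key, value in record["dynamodb"]["NewImage"].items()
        }

        standard_model = transform_to_standard_model(raw_item)
        standard_model_table.put_item(Item=standard_model)


def transform_to_standard_model(raw_item):
    # Placeholder transformation: map raw PIM fields onto the standard model.
    return {
        "objectId": raw_item["id"],
        "type": raw_item.get("objectType", "product"),
        "payload": raw_item,
    }
```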
We quickly discovered this wasn't sufficient to deliver the performance we expected, because a DynamoDB stream only triggers a single concurrent Lambda execution and won't start multiple Lambdas in parallel. We improved the architecture by adding a routing pattern combined with Kinesis streams, which solved this issue. The result is shown in the image below.
As you can see, we're using a Lambda to put data on a Kinesis stream. If there is a burst in the number of events, these streams scale accordingly by adding more shards. Each shard triggers its own Lambda execution, resulting in as many Lambda containers as there are shards, which makes the system very scalable. To ensure in-order processing of the events, it's important to choose your Kinesis partition key wisely. With this key you ensure that the data transformation events of a particular object are consistently sent to the same shard; otherwise, you could break the order of the updates. Since a Kinesis stream follows the first-in-first-out (FIFO) principle per shard, events are guaranteed to be processed in order as long as events concerning the same object are sent to the same shard. The router Lambda contains very little logic, so it is extremely fast.
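A minimal sketch of such a router Lambda is shown below, assuming the object identifier is available in the DynamoDB stream record's keys; the stream name and key schema are placeholders.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Illustrative stream name; the real system uses several streams.
STREAM_NAME = "product-updates"


def handler(event, context):
    """Thin router: forward DynamoDB stream records to Kinesis.

    The partition key is the object identifier, so all updates for the
    same object land on the same shard and keep their order (FIFO per shard).
    """
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        object_id = keys["objectId"]["S"]  # assumed key schema

        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(record["dynamodb"]).encode("utf-8"),
            PartitionKey=object_id,
        )
```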
With this new architecture, the performance was much better, and we had overcome the first challenge, or so we believed. After running some performance tests with this implementation, we discovered we were hitting AWS limits: we had so many concurrent Lambda executions writing custom metrics to CloudWatch that we exceeded the CloudWatch API limits. We decided to cap the number of shards for the Kinesis streams to prevent this. The problem could also have been solved by requesting a limit increase via AWS support; which option is better depends on the requirements of the system. In our case, the performance with the limited number of shards was sufficient, so we could state that this first challenge was solved for real.
Another challenge we faced was creating an easily extendable system. For this system, 'easily extendable' means easily adding new output channels: going back to the high-level overview, you can see that we go from the standard model to the different output channels. Because we're using an event-based serverless system, this was straightforward to achieve by ensuring enough separation of concerns in the serverless functions. To reach this goal, we made use of DynamoDB streams: when there's an update in a DynamoDB table with streams enabled, the change is streamed to another endpoint, which in our case is simply another Lambda function that executes a process. In practice, this means we just add new DynamoDB stream subscribers on the standard model, so no changes to the existing code are required.
Diving a little deeper: the standard model is saved in a DynamoDB table whose stream triggers a Lambda function written specifically for a particular output channel. Adding a new output channel is therefore just a matter of attaching a new Lambda function to that stream, without modifying existing code.
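To make this concrete, below is a minimal sketch of how a new output channel could be wired up with boto3. The function name, table name, and stream ARN are placeholders, and in practice this wiring would live in infrastructure-as-code (for example CloudFormation or the Serverless Framework) rather than in a script.

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical identifiers for a brand-new output channel.
NEW_CHANNEL_FUNCTION = "publish-to-new-channel"
STANDARD_MODEL_STREAM_ARN = (
    "arn:aws:dynamodb:eu-west-1:123456789012:table/standard-model"
    "/stream/2019-01-01T00:00:00.000"
)

# Wiring the existing standard-model stream to the new channel's Lambda is
# all that is needed; the existing code stays untouched.
lambda_client.create_event_source_mapping(
    EventSourceArn=STANDARD_MODEL_STREAM_ARN,
    FunctionName=NEW_CHANNEL_FUNCTION,
    StartingPosition="LATEST",
    BatchSize=100,
)
```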
With all these different Lambda functions it's quite hard to manage the logging of the system and to get an overview of the states of the product updates. When using multiple compute engines it's almost always better to have a central logging system where all logs are collected. In our product content publication system, we use Elasticsearch as our central logging system, combined with a front-end dashboard where you can track the update state of an object or see the status of the entire system. To write logs to Elasticsearch, we've connected all of our CloudWatch log groups to a Kinesis stream, which streams the data to a dedicated Lambda function. This Lambda function was specifically written to send only meaningful logs about product states and the state of the compute engine to Elasticsearch.
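As an illustration, the log-shipping Lambda could look roughly like the sketch below. CloudWatch Logs subscription payloads arrive gzip-compressed and base64-encoded on the Kinesis record; the Elasticsearch endpoint, index name, and use of the requests library are assumptions, and request signing/authentication is left out for brevity.

```python
import base64
import gzip
import json

import requests  # assumed HTTP client; the real function may use another

# Hypothetical Elasticsearch endpoint; auth/signing is omitted in this sketch.
ES_ENDPOINT = "https://search-example.eu-west-1.es.amazonaws.com"


def handler(event, context):
    """Consume CloudWatch Logs data from Kinesis and forward useful entries."""
    for record in event["Records"]:
        # CloudWatch Logs subscriptions deliver gzip-compressed JSON payloads.
        payload = gzip.decompress(base64.b64decode(record["kinesis"]["data"]))
        log_data = json.loads(payload)

        if log_data.get("messageType") != "DATA_MESSAGE":
            continue  # skip control messages

        for log_event in log_data["logEvents"]:
            doc = parse_meaningful_log(log_event["message"], log_data["logGroup"])
            if doc is None:
                continue  # drop logs that say nothing about product state
            requests.post(f"{ES_ENDPOINT}/product-logs/_doc", json=doc, timeout=5)


def parse_meaningful_log(message, log_group):
    # Placeholder filter: only keep structured log lines about product state.
    try:
        parsed = json.loads(message)
    except ValueError:
        return None
    if "productId" not in parsed:
        return None
    parsed["source"] = log_group
    return parsed
```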
This challenge is a direct result of using a serverless architecture, because it is also a microservice architecture. However, the advantages of serverless microservices still outweigh the problem they create, especially since it can be solved without impacting the performance of the system. Now we can visualize the state of a product and the journey of its updates throughout the entire system.
Our product content publication system lends itself to continuous improvement: new or changed AWS features can easily be adopted. This means we still change the architecture from time to time to increase performance and reduce costs. Looking at the costs of the system over the last few months, we can already see them decreasing thanks to new insights and features! Below are some important findings from our search for continuous improvements.
One of the interesting improvements we looked into is the relationship management between different kinds of objects. If, for example, an attribute gets an update, we also have to regenerate all the products that use this attribute. Previously, we had a complex setup to maintain the relationships between objects and propagate updates from one object to another. If you're dealing with complex relationships between objects and wondering how to handle them, a graph database is often the answer to your prayers. In this case, we used AWS Neptune, a fully managed graph database released in 2018 that is fast and reliable, to replace our old and difficult-to-manage relationship management microservice. We first built a proof of concept, which we subsequently refined into a production-worthy system.
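To give an idea of what such relationship queries look like, here is a sketch using the gremlinpython client against Neptune. The endpoint, vertex labels ('product', 'attribute'), edge label ('uses'), and property names are illustrative assumptions, not the actual graph model.

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Hypothetical Neptune endpoint and graph layout (product -uses-> attribute).
NEPTUNE_ENDPOINT = "wss://my-neptune-cluster.eu-west-1.neptune.amazonaws.com:8182/gremlin"

connection = DriverRemoteConnection(NEPTUNE_ENDPOINT, "g")
g = traversal().withRemote(connection)

# When an attribute changes, find every product that uses it so those
# products can be regenerated for the output channels.
products_to_regenerate = (
    g.V().has("attribute", "attributeId", "ATTR-123")
     .in_("uses")
     .hasLabel("product")
     .values("productId")
     .toList()
)

connection.close()
```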
In the picture below you can see a sneak preview of what the graph database looks like.
At the end of 2018, AWS released a new CloudWatch feature called CloudWatch Logs Insights, which lets you query your logs to extract information. We compared this feature with our Elasticsearch solution to see whether it could replace it. Functionally, the replacement is straightforward: the Lambda function that previously sent all the data to Elasticsearch can simply aggregate the logging data in its own log group, and CloudWatch Logs Insights can then query that log group. In terms of pricing, CloudWatch Logs Insights uses a pay-as-you-use model, whereas a managed Elasticsearch cluster is billed per hour. In short, with CloudWatch Logs Insights the price goes down, but so does the usability, because of the pricing formula: you pay per query, and each query costs a fixed amount per GB of log data scanned. The price therefore increases when you query over a longer period. Depending on your use case, the pay-as-you-use model can turn out either cheap or expensive.
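As a sketch of how such a query could be run programmatically, the snippet below starts a CloudWatch Logs Insights query through boto3 and polls for the results. The log group name and query string are assumptions for illustration.

```python
import time

import boto3

logs = boto3.client("logs")

# Hypothetical log group holding the aggregated product-state logs.
LOG_GROUP = "/aws/lambda/aggregated-product-logs"

# Query the last hour of aggregated logs for a single (hypothetical) product.
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /PRODUCT-123/ "
        "| sort @timestamp desc "
        "| limit 20"
    ),
)["queryId"]

# Poll until the query finishes; keep in mind that every query is billed
# per GB of log data scanned, so wide time ranges get expensive.
results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Running", "Scheduled"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query_id)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})
```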
After one month in production, we reflected on the performance of the system. That first month saw quite an intensive workload, as you can see in the following image.
We processed around 22 million Lambda invocations. Lambda scaled flawlessly and handled these requests with an average duration of 200 ms per invocation. Each invocation contains a batch of data, so it's not just a single product update per invocation. In this application we use multiple Lambda functions, all of which perform data transformations. Combining Lambda functions with DynamoDB and Kinesis streams lets us pay only for what we use while still having a very scalable and fast system. You're probably wondering what 22 million Lambda invocations cost. We were pleasantly surprised: only around 250 dollars. This shows how nice it is to pay only for what you use while still having a really fast system. Because AWS adds and improves services continuously, we sometimes redesign parts of the system, which often lets us increase performance and decrease costs further.
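For the curious, here is a rough back-of-envelope sketch of the Lambda portion of that bill. The memory setting is an assumption, the pricing figures are the public rates at the time of writing, and the real invoice also covers DynamoDB, Kinesis, and other services, so this won't reproduce the 250 dollars exactly; it only illustrates the pay-as-you-use formula.

```python
# Back-of-envelope Lambda cost estimate; all inputs below are assumptions
# except the invocation count and average duration mentioned in the text.
INVOCATIONS = 22_000_000
AVG_DURATION_S = 0.2          # 200 ms average, as measured
MEMORY_GB = 1.5               # assumed memory setting, not confirmed

PRICE_PER_REQUEST = 0.20 / 1_000_000   # $0.20 per million requests
PRICE_PER_GB_SECOND = 0.0000166667

request_cost = INVOCATIONS * PRICE_PER_REQUEST
compute_cost = INVOCATIONS * AVG_DURATION_S * MEMORY_GB * PRICE_PER_GB_SECOND

print(f"Request charges: ${request_cost:,.2f}")
print(f"Compute charges: ${compute_cost:,.2f}")
print(f"Total (Lambda only): ${request_cost + compute_cost:,.2f}")
```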
We've explained how we dealt with most of the challenges by using a serverless architecture, as well as with the ones that followed from that decision. We hope you've learned something from these challenges and that you'll become a serverless believer, like us!
Don’t hesitate to get in touch and let’s talk!