Cost per Tenant - Lessons learned

2024-02-24

When supporting ISVs, one question keeps coming up: how can we calculate the cost per tenant in our system? Years ago I set out to find the secret to cost per tenant, and I want to share the lessons I have learned so far.

Why do we need it?

First of all, it is important to understand the end goal: why do you want this information? The answer differs from customer to customer, and different motives may call for different solutions.

Capacity planning

Some companies are interested in understanding the cost per tenant for capacity planning, so they can understand the impact of adding tenants or removing tenants.

Charge back

Some companies have an internal charge back mechanism led by IT, so that usage costs come out of each department’s budget.

Back to back pricing strategy

Many ISVs start by planning their pricing based on usage and want to do it as accurately as possible. There are several issues with this approach:

  • There are cost anomalies, which I cover in Lesson 3.
  • Leaving a small margin might come back to haunt you later. For example, the patches for Intel Meltdown in 2018 caused performance issues that forced software companies to launch ~30% more compute power, increasing their costs instantly and causing loss of profits and even debt.

But that is beside the point. Let’s go back to cost per tenant and the lessons learned:

Lesson 0 - Silos are easier

I call it Lesson 0 because, as obvious as it was to me, it seems that not everyone knows this: if you have a single-tenant system, meaning that each of your customers runs on separate infrastructure, it is very easy to understand the cost per tenant. But we build multi-tenant SaaS for a more efficient architecture.

There are different levels of separation:

  1. Completely separated - Each customer runs in a separate account. In this case you can understand the cost per tenant using Cost Explorer or Cost and Usage Reports.

  2. Semi-separated - Some of the infrastructure is separated and some is shared, or it is separated but running in the same account. In this case, the cost per tenant can be found using tags and visualized in Cost Explorer or CUDOS.

  3. Fully multi-tenant - All infrastructure is shared by the customers. In this case, stay tuned.

Lesson 1 - We only think we know the value

At some point someone will come up with the following statement (phrasing may vary): “Cost per tenant is easy: all we need to do is get the ratio of the tenant’s requests from the load balancer and divide the costs accordingly.”

So why isn’t it that simple?

Let’s take the example of a (hypothetical) software company that transcodes videos and streams them for its end customers (B2B).

They analyzed their customers’ costs and found the following usage patterns:

  • Tenant 1 was using heavy transcoding on short videos
  • Tenant 2 was using simple transcoding for very short videos
  • Tenant 3 was hardly using their system but was storing massive videos on their platform for the past several years.

So their usage looked a bit like this:

[Figure: Cost per tenant]

If they had tried to derive the costs from the ratio of requests, they would have gotten a very different picture from the actual cost per tenant. So lesson 1: understand the usage patterns of your customers. They might surprise you.
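To make the gap concrete, here is a minimal sketch with made-up numbers. The request counts and the per-resource costs are hypothetical and only mimic the three usage patterns above; they do not come from a real system.

```python
# Hypothetical monthly figures for the three tenants described above.
# All numbers are invented for illustration only.
requests = {"tenant1": 1_000, "tenant2": 8_000, "tenant3": 1_000}
actual_cost = {
    "tenant1": {"compute": 600.0, "storage": 20.0},   # heavy transcoding
    "tenant2": {"compute": 150.0, "storage": 10.0},   # many simple requests
    "tenant3": {"compute": 10.0, "storage": 400.0},   # mostly storage
}


def allocate_by_request_ratio(requests, total_cost):
    """Split the total bill in proportion to each tenant's request count."""
    total_requests = sum(requests.values())
    return {t: total_cost * n / total_requests for t, n in requests.items()}


total_cost = sum(sum(parts.values()) for parts in actual_cost.values())
by_ratio = allocate_by_request_ratio(requests, total_cost)

for tenant, parts in actual_cost.items():
    print(f"{tenant}: by request ratio ${by_ratio[tenant]:.2f}, "
          f"by resource usage ${sum(parts.values()):.2f}")
```

With these invented numbers, the request ratio assigns most of the bill to the tenant that sends many cheap requests and almost nothing to the tenant whose cost is driven by storage.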

Lesson 2 - There are different types of costs

In order to understand costs we need to understand the different types of cost.

  • Fixed - These costs are not affected by the number of tenants (e.g. EKS control plane or EMR master node). No matter how many tenants are added or removed, these costs do not change.

  • Tiered - Tenant usage increases the costs in steps (e.g. EC2, EKS, RDS), but only if the resources scale based on usage. For example, in one system all the customers are served by the same RDS instance, which never adds read replicas or changes instance type; it remains exactly the same. In a different system, each new customer requires adding a read replica or sharding the data across separate database clusters. In the former example RDS is a fixed cost, while in the latter it is a tiered cost.

  • Linear - Costs are correlated directly with tenant usage (e.g. Lambda, networking, DynamoDB, any serverless service).

[Figure: Cost types]

Understanding how different customers impact the different services would help us understand the cost per tenant better.
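The three cost types can be sketched as three small functions. This is an illustrative model only; the function names, parameters and example prices are assumptions, not AWS pricing.

```python
import math


def fixed_cost(monthly_cost, tenant_count):
    """Fixed: the bill is the same regardless of how many tenants exist."""
    return monthly_cost


def tiered_cost(unit_cost, unit_capacity, load):
    """Tiered: cost grows in steps, one step per capacity unit that is needed."""
    units = max(1, math.ceil(load / unit_capacity))
    return units * unit_cost


def linear_cost(price_per_unit, usage):
    """Linear: cost is directly proportional to usage."""
    return price_per_unit * usage


# Hypothetical numbers: a unit costs $5 and serves 25 requests.
print(tiered_cost(5.0, 25, 50))  # two units are enough
print(tiered_cost(5.0, 25, 51))  # one extra request requires a third unit
```

The tiered function is the interesting one: a single extra request can add a whole unit of cost, while the fixed function ignores its tenant count entirely.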

Lesson 3 – Cost anomalies

To understand cost anomalies, let’s take a simple architecture of a system with two EC2 instances. Each instance can handle 25 requests per hour. We will assume all requests are identical for simplicity. Each instance costs $5 per hour. When we get to 51 requests we must launch another instance to handle the load.

[Figure: Architecture]

And in this system we have two tenants.

  • Tenant 2 is sending 30 requests every hour
  • Tenant 1 is starting with 15 requests in the first hour, then 20 requests in the second hour, 21 requests in the third hour and 25 requests in the fourth hour.

[Figure: Requests per tenant]

In the third hour the entire system crossed the 50-request threshold and a third EC2 instance was launched.

What would the cost per tenant look like?

[Figure: Cost per tenant]

So let’s look at it from Tenant 1’s perspective: in the third hour they increased their usage by only 1 request (from 20 to 21 requests), yet their cost jumped from $4 to $6.18.

From Tenant 2’s perspective, they are sending exactly the same number of requests, yet their cost changes for no apparent reason.
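The numbers above can be reproduced with a short sketch. It assumes the hourly bill is split between the tenants by their share of requests, which is consistent with the $4 and $6.18 figures; the constant and function names are mine.

```python
import math

INSTANCE_COST = 5.0      # dollars per EC2 instance per hour (from the example)
INSTANCE_CAPACITY = 25   # requests one instance can handle per hour
MIN_INSTANCES = 2        # the system starts with two instances

tenant1 = [15, 20, 21, 25]   # requests per hour
tenant2 = [30, 30, 30, 30]


def hourly_costs(t1, t2):
    """Return (tenant1_cost, tenant2_cost) per hour, split by request ratio."""
    rows = []
    for r1, r2 in zip(t1, t2):
        total = r1 + r2
        instances = max(MIN_INSTANCES, math.ceil(total / INSTANCE_CAPACITY))
        bill = instances * INSTANCE_COST
        rows.append((round(bill * r1 / total, 2), round(bill * r2 / total, 2)))
    return rows


for hour, (c1, c2) in enumerate(hourly_costs(tenant1, tenant2), start=1):
    print(f"hour {hour}: tenant 1 ${c1:.2f}, tenant 2 ${c2:.2f}")
```

Tenant 2's cost moves between $6.00 and $8.82 even though its request count never changes.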

Why does it happen?

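Because of the instances’ utilization. The third instance serves a single extra request, yet its full cost is split between the tenants.

[Figure: Capacity]

And in case this was not clear: ANY tiered cost has this issue.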

With fixed costs it is worse, but it is more expected and obvious. Let’s take for example an MSK cluster that costs $100 per month and never scales, so we can consider it a fixed cost. These are the tenants’ usage patterns:

[Figure: Requests]

And this would be the cost:

[Figure: Cost]

What can be done?

First of all, reduce the impact of the anomalies by reducing the size of the tiers: move to smaller instances, containers, or serverless. The more efficient your system is, the less impact cost anomalies have.

Or, as a different approach, factor in the utilization.

Let’s consider again the requests per tenant and the utilization:

[Figure: Requests per tenant]

The capacity is 25 requests per EC2 instance. Therefore, at $5 per instance, the cost is $0.20 per request. So if we charge each tenant only for the capacity they actually use, we get:

[Figure: Cost per tenant]
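As a sketch of this utilization-based approach: each tenant pays $0.20 per request, and whatever is left of the bill is reported separately as unallocated (idle) capacity. The function name and the separate "unallocated" bucket are my own framing of the idea.

```python
INSTANCE_COST = 5.0       # dollars per EC2 instance per hour
INSTANCE_CAPACITY = 25    # requests per instance per hour
COST_PER_REQUEST = INSTANCE_COST / INSTANCE_CAPACITY  # $0.20


def utilization_based(requests_by_tenant, instances):
    """Charge tenants per request; return (costs, cost of idle capacity)."""
    bill = instances * INSTANCE_COST
    costs = {t: round(n * COST_PER_REQUEST, 2)
             for t, n in requests_by_tenant.items()}
    unallocated = round(bill - sum(costs.values()), 2)
    return costs, unallocated


# Third hour of the example: 51 requests on three instances.
costs, unallocated = utilization_based({"tenant1": 21, "tenant2": 30}, instances=3)
print(costs, unallocated)
```

Tenant 2 now pays the same amount every hour, and the cost of the mostly idle third instance becomes visible as its own line item instead of being hidden in the tenants' shares.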

So why can’t we just do that?

It seems so simple now. Why can’t we just calculate the cost per tenant like that? A few reasons:

First, it requires knowing the capacity of each EC2 instance. We assumed we know how many requests an instance can take. That is not always the case.

Also, not all requests are created equal. What is the capacity for different types of requests? How can we tell? We assumed here that all the requests require the same compute resources, but as we learned in Lesson 1, that is not always the case.

Finally, who would take the hit for the unutilized resources? In some cases the resources are highly underutilized. In some cases the services were selected for reasons other than cost, such as legacy constraints or existing knowledge. Imagine an MSK cluster that is hardly utilized: how would you absorb the cost of the cluster if the tenants use only a small part of it?

Lesson 4 – Focus on the value

Pursuing accuracy requires an investment of effort, so it is important to understand the value we would get and verify that it is worth the effort.

To understand the different levels of accuracy, see below:

[Figure: Accuracy vs Effort]

  1. Estimation by product

This means calculating the cost per tenant based on the ratio of requests at the system’s entry point: the load balancer or API Gateway. We mentioned before why this is not accurate: some requests are more costly than others, and some costs, such as storage, do not depend on requests at all. However, in some systems this can be a “good enough” solution for understanding the cost per tenant.

[Figure: Product]

  2. Estimation by microservice

Similar to estimation by product, but in this case the calculation is based on the ratio of usage for each microservice in the system.

[Figure: Microservices]

  3. Estimation by AWS service

In this case the estimation is done for each AWS service in the system. For containers, we see customers allocating containers based on the tenant’s namespace and calculating the cost accordingly.

[Figure: AWS Service]

  4. Estimation by AWS service usage type

In some cases, differences in usage require going down to the granularity of the AWS usage type. For example, one tenant may store large files on S3, driving storage costs, while another stores little but sends many requests to S3.

[Figure: AWS Service usage type]
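As a rough sketch of estimation by microservice: each microservice's cost is split by that microservice's own usage ratio, and the shares are summed per tenant. The service names, costs and usage units below are hypothetical.

```python
def allocate_by_microservice(service_costs, usage):
    """Split each service's cost by its own per-tenant usage, then sum.

    service_costs: {service: cost}
    usage: {service: {tenant: usage_units}}
    """
    totals = {}
    for service, cost in service_costs.items():
        service_usage = usage.get(service, {})
        total_units = sum(service_usage.values())
        if total_units == 0:
            continue  # nothing to attribute; this cost stays unallocated
        for tenant, units in service_usage.items():
            totals[tenant] = totals.get(tenant, 0.0) + cost * units / total_units
    return totals


# Hypothetical example: two services with very different usage ratios.
service_costs = {"transcoder": 800.0, "streaming": 200.0}
usage = {
    "transcoder": {"tenant_a": 3, "tenant_b": 1},
    "streaming": {"tenant_a": 1, "tenant_b": 3},
}
print(allocate_by_microservice(service_costs, usage))
```

The same shape works for the finer levels: replace the microservice keys with AWS services or usage types, at the price of collecting usage data at that granularity.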

In any case, before diving into the most accurate estimation, verify that the effort is justified by your business case. Imagine you have a restaurant and you are trying to understand the cost per customer. It makes sense to charge customers based on the dishes they ordered, but not so much to track each customer’s temperature to check their impact on the AC system…

Make sure you are measuring actual business value.

Lesson 5 – Find your anchors

Now that we understand the different factors that come into play, let’s see how to measure the cost per tenant in different architecture components:

Compute

As mentioned before, different request types can consume different amounts of compute resources, so it is important to understand your customers’ usage patterns.

  • Map different types

Group request types based on their resource consumption, for example simple/complex transcoding requests, long/short sessions, etc.

  • Test test test

You need to understand your system’s capacity. Once you have mapped the different request types, how many requests of each type can your compute layer support before it needs to scale? I once had a manager of a load testing team ask me, “If the cloud is elastic, what is the purpose of load testing?” This is it: to understand the impact of load on costs.

  • Use logs whenever possible

Identify the tenant’s activity in the logs for easy querying.

  • Trace the user’s journey

Use tools like AWS X-Ray to trace the user’s journey through the system and understand the different request types.
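One way to combine the request-type mapping with load-test results is to convert every request into weighted units. The sketch below assumes hypothetical load-test capacities per request type; the type names and numbers are invented.

```python
# Hypothetical load-test results: how many requests of each type a single
# instance sustained before the compute layer had to scale.
capacity_per_instance = {"simple_transcode": 100, "complex_transcode": 20}


def request_weights(capacity):
    """Weight each request type relative to the cheapest (highest-capacity) one."""
    baseline = max(capacity.values())
    return {kind: baseline / cap for kind, cap in capacity.items()}


def weighted_units(tenant_requests, weights):
    """Convert per-type request counts into comparable capacity units."""
    return {
        tenant: sum(weights[kind] * count for kind, count in counts.items())
        for tenant, counts in tenant_requests.items()
    }


weights = request_weights(capacity_per_instance)
units = weighted_units(
    {
        "tenant_a": {"simple_transcode": 200},
        "tenant_b": {"complex_transcode": 40},
    },
    weights,
)
print(weights, units)
```

In this made-up case the two tenants consume the same capacity even though one sends five times as many requests, which is exactly what a plain request ratio would miss.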

Compute - Containers

  • If possible, separate tenants into different namespaces, as long as it does not impact efficiency.
  • Test capacity based on different request types
  • Monitor node utilization
  • Use tools such as Kubecost or OpenCost

Databases

Before calculating the database cost per tenant, first check how the database scales per tenant. If it doesn’t, what would be the reason for calculating the cost per tenant? Is the DB shared? Are the tables shared? Is the question about storage or compute? Do the table sizes vary?

In most cases I found that the effort of understanding the Database cost per tenant does not have much business value in a shared database.

Having said that, there can be value in understanding the different queries sent to the DB based on user activity. You can use X-Ray for query sampling to get a better understanding of DB sessions that are in the 99th percentile in terms of latency. This would give you better insight into user activity.

Analytics

Glue, Athena, and Redshift Spectrum all access S3 and can be identified in S3 access logs by their user agent.

S3 access logs can be queried to get statistics on the requests per user agent:

The requests to S3 grouped by user agent:

SELECT "bucket_name", "useragent", count(*)
FROM s3_access_logs_db.mybucket_logs
GROUP BY 1,2;

The requests to S3 grouped by user agent and S3 folder:

SELECT "bucket_name", "useragent", substr(key, 1, strpos(key, '/')), count(*)
FROM s3_access_logs_db.mybucket_logs
GROUP BY 1,2,3;

The requests to S3 grouped by hour, bucket and user agent:

SELECT date_format(parse_datetime(requestdatetime, 'dd/MMM/yyyy:HH:mm:ss Z'), '%Y/%m/%d %H'), bucket_name, substr(useragent, 1, strpos(useragent, '/')), count(*)
FROM s3_access_logs_db.mybucket_logs
GROUP BY 1,2,3
ORDER BY 1;

Networking

Transit Gateway traffic can be analyzed using VPC Flow Logs.
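As a minimal sketch of what such an analysis could look like, the function below sums bytes per tenant from flow log records in the default (version 2) format. The mapping from source IP to tenant is a hypothetical input you would have to build yourself, and the addresses in the sample lines are placeholders.

```python
from collections import defaultdict

# Field order of the default VPC Flow Logs record format (version 2).
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def bytes_by_tenant(lines, tenant_by_ip):
    """Sum transferred bytes per tenant, keyed by the record's source address."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # custom format or malformed record
        record = dict(zip(FIELDS, parts))
        if record["log_status"] != "OK":
            continue  # NODATA / SKIPDATA records carry "-" instead of counters
        tenant = tenant_by_ip.get(record["srcaddr"], "unattributed")
        totals[tenant] += int(record["bytes"])
    return dict(totals)


sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]
print(bytes_by_tenant(sample, {"172.31.16.139": "tenant_a"}))
```

At real volumes you would more likely run the equivalent aggregation in Athena, but the logic is the same: pick a field that identifies the tenant and sum the bytes.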

When in doubt

If you were not able to find other anchors, use the application logs. You can use CloudWatch Logs Insights to query them easily.

[Figure: Application logs]
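If your application writes structured logs, the aggregation itself is simple. The sketch below assumes JSON log lines with hypothetical `tenant_id` and `duration_ms` fields; your field names and the metric you sum will differ.

```python
import json
from collections import Counter


def usage_by_tenant(log_lines, tenant_field="tenant_id", weight_field="duration_ms"):
    """Sum a usage metric per tenant from JSON-formatted application log lines."""
    usage = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict) or tenant_field not in event:
            continue  # skip events that cannot be attributed to a tenant
        usage[event[tenant_field]] += event.get(weight_field, 0)
    return dict(usage)


# Hypothetical log lines.
logs = [
    '{"tenant_id": "tenant_a", "duration_ms": 120, "path": "/transcode"}',
    '{"tenant_id": "tenant_b", "duration_ms": 30, "path": "/stream"}',
    '{"tenant_id": "tenant_a", "duration_ms": 80, "path": "/transcode"}',
    "plain text line without a tenant",
]
print(usage_by_tenant(logs))
```

The important design decision is upstream of this code: make sure every log event carries a tenant identifier, otherwise there is nothing to group by.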

Conclusions

To summarize:

  • Lesson 0 - Silos are easier, but that’s not why you build SaaS.
  • Lesson 1 - We only think we know the value. Get a better understanding of your customers’ activity and the value they get from the system.
  • Lesson 2 - There are different types of costs. It is important to understand how scaling impacts your costs before attempting to calculate cost per tenant.
  • Lesson 3 – Cost anomalies. Verify you understand them before communicating the cost to your customers.
  • Lesson 4 – Focus on the value. Chasing accuracy might be counterproductive. Verify you are calculating only the required metrics.
  • Lesson 5 – Find your anchors, but when in doubt, refer to application logs.

I hope this helps. In any case, feel free to contact me with any questions :)