What ISVs should know - Part 3 - Operations
A lot changes for the operations teams of Independent Software Vendors (ISVs) when moving to the cloud. In this post I will focus on the issues that should be addressed as a first priority; later on, I'll probably write a post on operations models in the cloud.
Before we get to operating models, I want to start with the basics: making the operations team tenant-aware, meaning you need a clear understanding of your operational status, risks, and impact from a per-tenant point of view.
While every organization that runs workloads needs to verify that the workload is running as expected and according to best practices, ISVs also need to view their operations through the prism of their customers.
When you work with customers, you also commit to a service level agreement (SLA) that defines your uptime percentage and your definition of uptime. The percentage determines how long your system is allowed to be down:
- 99% means the system can be down about 14 minutes a day without breaching the SLA
- 99.9% means the system can be down about 1.5 minutes a day (roughly 43 minutes a month) without breaching the SLA
- 99.99% means the system can be down less than 5 minutes a month without breaching the SLA
- 99.999% means the system can be down around 5 minutes a year without breaching the SLA
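To make these numbers concrete, here is a minimal sketch that derives the allowed downtime from an SLA percentage (plain arithmetic, no AWS dependencies):

import math

MINUTES_PER_DAY = 24 * 60                  # 1,440
MINUTES_PER_MONTH = MINUTES_PER_DAY * 30   # using a 30-day month
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365

def allowed_downtime_minutes(sla_percent: float, period_minutes: int) -> float:
    """Return how many minutes the system may be down in the given period."""
    return period_minutes * (1 - sla_percent / 100)

for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}%: {allowed_downtime_minutes(sla, MINUTES_PER_DAY):.1f} min/day, "
          f"{allowed_downtime_minutes(sla, MINUTES_PER_MONTH):.1f} min/month, "
          f"{allowed_downtime_minutes(sla, MINUTES_PER_YEAR):.1f} min/year")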
The definition specifies what "down" actually means. Is it when more than 50% of the users cannot log in? When they cannot perform the main functionalities? Which functionalities? Are they the same for each of your customers?
What about peripheral services like reporting, are those included in the SLA?
So now that we know what these are, let’s see how they impact our operations procedures.
Maintenance
Occasionally we have to install OS patches, upgrade the system, restart the instances, and so on.
Not all these activities necessarily cause downtime, but they should be confined to a specific maintenance window that is communicated to the customer.
"MaintenanceWindow": {
"Type": "AWS::SSM::MaintenanceWindow",
"Properties": {
"Description": "Maintenance window for operation tasks of tenant1",
"AllowUnassociatedTargets": false,
"Cutoff": 2,
"Schedule": "cron(0 6 * * sun)",
"Duration": 6,
"Name": "tenant1-maintenance-window"
}
Now, if you have a multi-tenant system, all tenants must be aligned to the same maintenance window. If you have single-tenant deployments (the tenants are isolated), you can create a separate maintenance window per tenant and assign it to the relevant instances, for example by tagging instances per customer:
"MaintenanceWindowTarget": {
"Type": "AWS::SSM::MaintenanceWindowTarget",
"Properties": {
"WindowId": "MaintenanceWindow",
"ResourceType": "INSTANCE",
"Targets": [
{
"Key": "tag:customer",
"Values": [
"Tenant1"
]
}
],
"OwnerInformation": "CloudOps",
"Name": "tenant1-patch-target",
"Description": "A target for tenant1 instances maintenance"
},
"DependsOn": "MaintenanceWindow"
}
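If you manage many isolated tenants, you may prefer to create the windows programmatically rather than in a template. A minimal sketch using boto3 (the tenant list and schedule below are assumptions for illustration):

import boto3

ssm = boto3.client("ssm")

# Hypothetical tenant list; in practice this would come from your tenant registry.
tenants = ["tenant1", "tenant2"]

for tenant in tenants:
    window = ssm.create_maintenance_window(
        Name=f"{tenant}-maintenance-window",
        Description=f"Maintenance window for operation tasks of {tenant}",
        Schedule="cron(0 6 ? * SUN *)",   # every Sunday at 06:00
        Duration=6,                        # window length in hours
        Cutoff=2,                          # stop scheduling new tasks 2h before the end
        AllowUnassociatedTargets=False,
    )
    ssm.register_target_with_maintenance_window(
        WindowId=window["WindowId"],
        ResourceType="INSTANCE",
        Targets=[{"Key": "tag:customer", "Values": [tenant]}],
        OwnerInformation="CloudOps",
        Name=f"{tenant}-patch-target",
    )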
In case of an event, which customer was impacted? What was the impact per customer? Who should be notified? Is there a programmatic notification to the customer?
If there is a vendor event (can be AWS or a 3rd-party vendor), how long would it take you to map the impacted customers and notify them?
Do you have automation in place to troubleshoot an incident for one customer without risking other customers?
Do you have an elevated permissions solution for troubleshooting?
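Consistent tenant tagging makes that mapping fast. As a sketch, assuming every resource carries a `customer` tag like the maintenance targets above, the Resource Groups Tagging API can list a tenant's resources during an incident:

import boto3

tagging = boto3.client("resourcegroupstaggingapi")

def resources_for_tenant(tenant: str) -> list:
    """Return the ARNs of all resources tagged with the given customer tag."""
    arns = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate(
        TagFilters=[{"Key": "customer", "Values": [tenant]}]
    ):
        arns.extend(item["ResourceARN"] for item in page["ResourceTagMappingList"])
    return arns

print(resources_for_tenant("tenant1"))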
Observability
In order to verify the system is up and running for all tenants, you should have:
- A dashboard that visualizes the system status, lets you quickly identify if something is wrong, and gives you a general direction for where to start your investigation (much like a car dashboard, which lets you spot an issue fast without overwhelming you with alerts)
- Metrics, logs, and traces that let you identify both the root cause and the tenants impacted, so you can quickly see that 100% of tenant 1's transactions were impacted, 40% of tenant 2's, while the rest of the tenants were not significantly impacted
To get this visibility, you need the tenants’ attributes in the metrics, the traces, and the logs.
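As a minimal sketch (the namespace, metric, and dimension names are illustrative assumptions), emitting a CloudWatch metric with a tenant dimension looks like this; the same TenantId attribute should also appear in your structured log lines and trace annotations so the three signals can be correlated:

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_transaction(tenant_id: str, success: bool) -> None:
    """Publish a per-tenant transaction metric ('SaaSMetrics'/'TenantId' are illustrative names)."""
    cloudwatch.put_metric_data(
        Namespace="SaaSMetrics",
        MetricData=[{
            "MetricName": "TransactionSuccess",
            "Dimensions": [{"Name": "TenantId", "Value": tenant_id}],
            "Value": 1.0 if success else 0.0,
            "Unit": "Count",
        }],
    )

record_transaction("tenant1", success=True)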
You can find a code example for tenant observability here.
In addition, consider that in most cases you would want to know about an issue before customers are impacted. For that, you can use Amazon CloudWatch Synthetics, which lets you simulate customer activity and identify issues quickly.
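A canary is essentially a scheduled script. A minimal heartbeat sketch for the Python/Selenium canary runtime might look like the following (the URL is a placeholder, and the aws_synthetics imports are provided by the canary runtime itself, not installed via pip):

from aws_synthetics.selenium import synthetics_webdriver as syn_webdriver
from aws_synthetics.common import synthetics_logger as logger

def main():
    # Placeholder endpoint; in a tenant-aware setup you might run one canary per tenant URL.
    url = "https://tenant1.example.com/health"
    browser = syn_webdriver.get_driver()
    browser.get(url)
    logger.info(f"Loaded {url} successfully")

def handler(event, context):
    return main()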
Vendors, partners, and customers
Your system might be working well, but it relies on 3rd parties to provide service. You need to verify that all dependencies are fully operational, and if one is not, quickly understand the impact on your customers.
Some vendors have mechanisms in place to inform you of an ongoing event. For those that don't, you need to create monitoring mechanisms to identify issues yourself.
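A simple approach is to poll a dependency's health endpoint and publish the result as a CloudWatch metric you can alarm on. A minimal sketch, assuming a hypothetical vendor status URL:

import urllib.request
import boto3

cloudwatch = boto3.client("cloudwatch")
VENDOR_STATUS_URL = "https://status.example-vendor.com/health"  # hypothetical endpoint

def check_vendor() -> None:
    """Publish 1 if the vendor responds with HTTP 200, 0 otherwise."""
    try:
        with urllib.request.urlopen(VENDOR_STATUS_URL, timeout=5) as response:
            healthy = response.status == 200
    except Exception:
        healthy = False
    cloudwatch.put_metric_data(
        Namespace="Dependencies",
        MetricData=[{
            "MetricName": "VendorHealthy",
            "Dimensions": [{"Name": "Vendor", "Value": "example-vendor"}],
            "Value": 1.0 if healthy else 0.0,
        }],
    )

check_vendor()  # run this on a schedule, e.g. from a Lambda function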
AWS provides alerts on operational events as well as abuse events. You can implement alerts using the template provided here, and if you want a tenant-aware version (one that identifies which tenant was impacted and alerts accordingly), it is available here.
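If you want to query those events directly, the AWS Health API can list open events and their affected resources, which you can then map back to tenants via your tagging scheme. A sketch (note the Health API requires a Business or Enterprise support plan):

import boto3

# The AWS Health API is served from us-east-1.
health = boto3.client("health", region_name="us-east-1")

events = health.describe_events(
    filter={"eventStatusCodes": ["open", "upcoming"]}
)["events"]

for event in events:
    entities = health.describe_affected_entities(
        filter={"eventArns": [event["arn"]]}
    )["entities"]
    # Each entity value is a resource ID/ARN; map it back to a tenant
    # via your tagging scheme (e.g. the 'customer' tag used above).
    for entity in entities:
        print(event["eventTypeCode"], entity.get("entityValue"))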
Now that you see how AWS notifies its customers, take the time to consider how you want to notify your customers of an ongoing event or planned maintenance. And if not programmatically to your customers, at least alert internally to the relevant account stakeholders.
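One option is a per-tenant SNS topic that customers (or your internal account teams) can subscribe to. A minimal sketch; the topic naming convention is an assumption:

import boto3

sns = boto3.client("sns")

def notify_tenant(tenant_id: str, subject: str, message: str) -> None:
    """Publish an event notification to a hypothetical per-tenant SNS topic."""
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    region = sns.meta.region_name
    topic_arn = f"arn:aws:sns:{region}:{account_id}:{tenant_id}-notifications"
    sns.publish(TopicArn=topic_arn, Subject=subject, Message=message)

notify_tenant("tenant1", "Planned maintenance",
              "Maintenance window starts Sunday 06:00 UTC; expect brief interruptions.")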
Runbooks
It is better to prepare for an event before it happens. Although you might not know what the issue will be, you can probably anticipate the steps required to troubleshoot it: you will need to go over the logs, metrics, and traces, and you might need to attempt to replicate the issue. If you are running in the cloud, those steps can be automated, and an identical environment can be created.
Create a runbook to replicate an environment, and use automation to copy data from production while omitting PII. Systems Manager has runbooks to restore RDS from snapshots, and DMS can be used to omit the PII data regardless of the underlying database platform. Having those ready would help you respond to an issue quickly, as sketched below.
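As a minimal sketch of the environment-replication step (instance identifiers and sizing are placeholders), restoring a troubleshooting copy of an RDS instance from the latest production snapshot might look like this:

import boto3

rds = boto3.client("rds")

def restore_latest_snapshot(source_db: str, target_db: str) -> None:
    """Restore the most recent automated snapshot of source_db into a new instance."""
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier=source_db, SnapshotType="automated"
    )["DBSnapshots"]
    latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=target_db,
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
        DBInstanceClass="db.t3.medium",  # placeholder sizing for a troubleshooting copy
        Tags=[{"Key": "purpose", "Value": "troubleshooting"}],
    )

restore_latest_snapshot("prod-tenant1-db", "debug-tenant1-db")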
To better anticipate issues and test the team's readiness, you can use AWS Fault Injection Simulator. You can find an example of use here.
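Once an experiment template exists (for example, "stop tenant1's instances for 5 minutes"), starting it is a single API call; the template ID below is a placeholder:

import boto3

fis = boto3.client("fis")

# The experiment template is defined ahead of time; the ID here is a placeholder.
experiment = fis.start_experiment(
    experimentTemplateId="EXT123456789012",
    tags={"customer": "tenant1"},
)
print(experiment["experiment"]["id"], experiment["experiment"]["state"]["status"])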
Looking at this post, I know it seems like a lot, but remember that operations are an iterative effort that keeps evolving. I highly recommend looking into the AWS Well-Architected Tool for some pointers on Reliability and Operational Excellence.
I intentionally ignored security-related incidents. Those deserve their own blog post…