Any pair of programmers can write some code in a garage, but once that code ships to real users you have a product, and that’s a different thing entirely.
No matter if you’re a software vendor or a packaging manufacturer building software to support your business, that software needs support, change management, hosting, integrations and documentation. “Just build it!” is often too easily said. Once it is built, you will have that software in your IT landscape for years to come.
Hiring a product owner will help you with the following things
Setting a vision your product should achieve
Drive change in the product with a team of developers
Collect requirements from users and stakeholders
Help users and stakeholders understand your product’s brilliance
Maybe you don’t need a product owner for every VBA script written in Excel, but any system with sufficient amount of users should have a product owner.
Here are some of the qualities I find important in a product owner
An excellent communicator to gather requirements and communicate plans
An ambassador that will make people interested in your product
Comfortable with drawing up plans and executing on them
A source of great values from where the team can inherit their culture
An internal marketer to make sure the product has continued funding
The product owner doesn’t need to be a tech wizard. Its much more important to get a good in-house marketer for your product.
Shit happens, it is inevitable. We work so hard to keep things running, with redundancy, automatic fail-over, 99.999% availability, but most of the time outages happen because someone screwed up.
In an unhealthy organization you hang that person and move on. The organization learns nothing and is doomed to repeat the mistake.
In an healthy organization the system is at fault for allowing the person to make the mistake. The system needs to be fixed and each outage is an excellent learning opportunity.
Having a playbook of what to do in an event of an outage is basic. You need to determine what kind of outage is considered an incident, how to discover an incident and how to collect the response team. One thing most teams forget, is that the playbook is useless if
Nobody knows it exists, or where to find it during an incident
This is why it’s imperative to have fire drills and to practice incidents. Some go as far as actually bringing down a system, to practice a live incident.
Here’s how I would plan a fire drill
Set a fixed time and date for the drill and inform the team so they can prepare
Schedule a service window during the fire drill so the organization and its users can prepare
Book a session with the team to present the incident playbook and make sure they know it
Break the system at the start of the service window. Automatically restore the system at the end of the service window if the team has failed to find the fault
Book a postmortem to evaluate the incident response
After an incident you should always conduct a postmortem. The point is to identify the root cause of the incident, find new systems, solutions, processes, routines to make sure the incident doesn’t reoccur.
The purpose is to create a learning organization, where you setup safe-guards for reoccurrence, which protection will remain long after the people involved in the incident are gone.
Things to consider with a postmortem
Putting blame on a person or a team, doesn’t prevent the incident to reoccur
Taking responsibility for the incident also won’t prevent it from happening again
The actions coming out from the retrospective meeting, must prevent the incident from happing again, or you have failed to identify the root cause
I have during January spent a lot of time thinking about, reading about and setting up a Service Level Agreement. The purpose is to agree on measurable metrics like uptime, responsiveness and responsibilities with your paying clients.
If it’s done right, it will influence how those clients prefer to interface to you. If they do it synchronously, asynchronously, put a cache in-between or have a failsafe.
Here I will write some general insights that I got from this process. If you want my complete SLA convention, you should check out my wiki. There I’ve also posted a sample SLA that you can reuse for your own purposes.
Always Start with Metrics
Before you dig into availability and 99.99999% you must start with metrics. What does availability mean to you? How do you measure it? What is an error? Is http status 404 an error? Does errors during maintenance count towards your metric? How is request latency measured? Is it measured on the client or the server? Do you measure the average on all the requests? How does a cold start latency affect your metric?
There are a lot of things to unpack before you can start thinking about objectives.
Not as Available as you Think
Everywhere you look businesses offer a 99,95% availability. Translated, it means 5 minutes and 2 seconds downtime weekly. A common misconception from developers is that it’s easy – All our deploys are automated anyway and if one fails, we’ll just rollback.
Before you set that objective you should consider
When the service goes down in the middle of the night, how much time does it take to wake somebody up to take look at the problem?
When the service goes down Saturday morning, do you have people working through the weekend to get the service up and running again?
Set an objective that promises availability within business hours, when you have developers awake that can work on the problem.
Pay people to be on-call when you need to offer availability off-hours.
Multiply availability of your dependent services with each other, and then with your own availability to reach a reasonable number. And then give yourself some slack. An objective should not be impossible or even challenging.
Azure Kubernetes = 99.95%
Azure MySQL = 99.9%
Azure API Management = 99.95%
My availability = 99%
Total Availability = 99.95% * 99.9% * 99.95% * 99% = 98.8%
Every Metric must be Measured
This sound so obvious, how can you know that you meet the objective unless you measure the metric? Still I rarely see anyone measuring their service level indicators. Maybe they don’t want to know.
If you are using a cloud provider like Microsoft Azure, you can setup workbooks to measure your metrics. I’m a proponent of giving my clients access to these workbooks so they can see that we live up to the SLA.
The Client Also have Responsibilities
An agreement goes both ways, and in order for you as a vendor to fulfil your part of the agreement you need to put some requirements on the client.
Define a reasonable workload that the client is allowed to put on your service for the objectives to be obtainable. You can set a limit of 100 requests/second and refuse excess requests. Those errors do not count towards your error budget.
The client should be responsible for adjusting their service clients to updates in your API. You don’t want to maintain a 5 year old version of your system.
Reparations should Repair not Bankrupt
I’ve seen so many service license agreements that include a fine if the objectives are not met, and often those fines are quite high. They seldom define how often a client can request a payout, and together with badly defined objectives, a client could drive a service provider into bankruptcy.
That is not beneficial to anyone, so please stop writing SLAs with harsh penalties. You should try to repair and not bankrupt
How much damage was caused by the outage?
Can we update the service level objectives to become more reasonable?
Can the client adjust their use of our service to better fit our new objectives?
Is the client open to paying more so we can have a service technician on-call?
Writing an SLA is hard. It requires experience from both the legal team and IT operations. Availability is not an objective that a client can demand of your service. It must be negotiated and carefully weighed between IT operations environment, support organization and costs.