Developing Solutions for Microsoft Azure

Today I passed my AZ-204: Developing Solutions for Microsoft Azure exam and became an Azure Developer Associate. I’ve done some certifications in my day, but this was by far the hardest. The breadth of knowledge required (Azure SDKs, data storage, data connections, APIs, authentication, authorisation, compute, containers, deployment, performance and monitoring), combined with the extreme detail in the questions, made this really hard. I didn’t think I had passed until I got my result.

These were the kinds of questions that were asked

  • Case studies: Read up on a case study and answer questions on how to solve the client’s particular problems with Azure services. Questions like: what storage technology is appropriate, what service tier should you recommend, and so on.
  • Many questions about the capabilities of different services. Like: what event-passing service should you use if you need guaranteed FIFO (first-in, first-out)?
  • How to set up a particular scenario. Like what order you should create services in to solve the problem at hand. Some of these questions were down to CLI commands, so make sure that you’ve dipped your toes into Azure CLI.
  • Code questions where you need to fill in the blanks on how to connect and send messages on a service bus, or provision a set of services with an ARM template. You also get code questions where you must predict the result of the code (for a taste, see the sketch after this list).
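
To give you a taste of the code questions, here is a minimal sketch of sending a message on a service bus with the @azure/service-bus SDK (v7). The connection string and queue name are placeholders.

const { ServiceBusClient } = require("@azure/service-bus");

async function sendOrder() {
  // Placeholder connection string and queue name -- substitute your own.
  const client = new ServiceBusClient(process.env.SERVICE_BUS_CONNECTION_STRING);
  const sender = client.createSender("orders");
  try {
    // The exam expects you to know the shape of these calls by heart.
    await sender.sendMessages({ body: "Hello, Service Bus!" });
  } finally {
    await sender.close();
    await client.close();
  }
}

sendOrder().catch(console.error);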

Because of the huge area of expertise and the extreme level of detail in the questions, I don’t think you could study and pass the exam without hands-on development experience. If I were to give advice on what to study it would be

  • Go through the Online – Free preparation material. Make sure you remember the capabilities of each service, how they differ, and what features the higher pricing tiers enable. Those questions are guaranteed.
  • Do some exercises on connecting Azure Functions, blob storage, service bus, queue storage, event grid and event hub. These were central to the exam.
  • Make sure you know how to manage authorisation to services like blob storage, and the benefits of the different ways to do it. Know your Azure Key Vault, as the security questions emphasise this.

Be prepared that it is much harder than AZ-900: Microsoft Azure Fundamentals. Go slow and use all the time that you get. Good luck!

Product Ownership

Any pair of programmers can write some code in a garage, but once that code ships to real users you have a product, and that’s a different thing entirely.

Whether you’re a software vendor or a packaging manufacturer building software to support your business, that software needs support, change management, hosting, integrations and documentation. “Just build it!” is often too easily said. Once it is built, you will have that software in your IT landscape for years to come.

Hiring a product owner will help you with the following things

  • Setting a vision for your product to achieve
  • Driving change in the product with a team of developers
  • Collecting requirements from users and stakeholders
  • Helping users and stakeholders understand your product’s brilliance

Maybe you don’t need a product owner for every VBA script written in Excel, but any system with a sufficient number of users should have one.

Here are some of the qualities I find important in a product owner

  • An excellent communicator to gather requirements and communicate plans
  • An ambassador that will make people interested in your product
  • Comfortable with drawing up plans and executing on them
  • A source of great values from which the team can inherit their culture
  • An internal marketer to make sure the product has continued funding

The product owner doesn’t need to be a tech wizard. It’s much more important to get a good in-house marketer for your product.

Responding to Incidents

Shit happens, it is inevitable. We work so hard to keep things running, with redundancy, automatic fail-over, 99.999% availability, but most of the time outages happen because someone screwed up.

In an unhealthy organization you hang that person and move on. The organization learns nothing and is doomed to repeat the mistake.

In a healthy organization the system is at fault for allowing the person to make the mistake. The system needs to be fixed, and each outage is an excellent learning opportunity.

Incident Playbook

Having a playbook of what to do in the event of an outage is basic. You need to determine what kind of outage is considered an incident, how to discover an incident and how to assemble the response team. One thing most teams forget is that the playbook is useless if

  • Nobody knows it exists, or where to find it during an incident

This is why it’s imperative to have fire drills and to practice incidents. Some go as far as actually bringing down a system to practice a live incident.

Here’s how I would plan a fire drill

  1. Set a fixed time and date for the drill and inform the team so they can prepare
  2. Schedule a service window during the fire drill so the organization and its users can prepare
  3. Book a session with the team to present the incident playbook and make sure they know it
  4. Break the system at the start of the service window. Automatically restore the system at the end of the service window if the team has failed to find the fault
  5. Book a postmortem to evaluate the incident response

Postmortem

After an incident you should always conduct a postmortem. The point is to identify the root cause of the incident and find new systems, solutions, processes and routines to make sure the incident doesn’t recur.

The purpose is to create a learning organization, where you set up safeguards against recurrence, protection that will remain long after the people involved in the incident are gone.

Things to consider with a postmortem

  • Putting blame on a person or a team doesn’t prevent the incident from recurring
  • Taking responsibility for the incident also won’t prevent it from happening again
  • The actions coming out of the retrospective meeting must prevent the incident from happening again, or you have failed to identify the root cause

Here’s my template for a postmortem retrospective to help you ask the right questions and identify the root causes.

Document your Code

I was told this week that the code doesn’t need documentation because the developers are good at naming things. So I thought it was time to revisit what kind of documentation should be included in code.

Code Comments

There are two common objections to code comments

  1. They are not very useful because the code tells us what the program does
  2. They are often wrong because the code changes but not the comments

This is just the talk of lazy, “low effort” developers. I think the Agile Manifesto’s “working software over comprehensive documentation” has done more harm than good.

Well written comments are invaluable. I’ve never come across an outdated comment that threw me off in a way that I couldn’t just delete it. 🤷‍♂️

Here are some examples of code comments I find useful

1. Adding context that is not in the code

This code was written because of a behaviour in macOS.

// On macOS it's common to re-create a window in the app when the
// dock icon is clicked and there are no other windows open.
if (BrowserWindow.getAllWindows().length === 0) createWindow();

2. Adding intention to the code

There are some things that only work in this order.

// This method will be called when Electron has finished
// initialization and is ready to create browser windows.
// Some APIs can only be used after this event occurs.
app.whenReady().then(() => {
  createWindow();
});

3. Rabbit holes you went down and want to warn others of

Warning, here be dragons. 🐉

// THE OBJECT POLYFILL WILL NOT WORK ON THE WEBKIT 1.0.3 PLATFORM
// import "core-js/es/object";

4. Explaining what is going on that the code doesn’t communicate clearly

Why must the public URL be the same as the window location?

if (publicUrl.origin !== window.location.origin) {
  // Our service worker won't work if PUBLIC_URL is on a different origin
  // from what our page is served on. This might happen if a CDN is used to
  // serve assets; see https://github.com/facebook/create-react-app/issues/2374
  return;
}

5. Add a reference to the bug or issue that prompted the change

Go check the bug description to find more information about why the code looks like this.

Sentry.init({
  // BUG AB#3133 Decrease sample rate in production
  // Decreasing sample rate to keep costs down.
  tracesSampleRate: 0.1,
});

6. Description of public modules and functions

This gives you nice IntelliSense when using the module or function from elsewhere in the code.

/**
 * A button that lets you copy the current value to clipboard.
 *
 * @param {object} props
 * @param {string} props.text - The text to display on the button.
 * @param {string} props.value - The value to copy to clipboard.
 * @param {boolean} [props.isDisabled] - Whether the button should be disabled.
 */
function CopyButton({ text, value, isDisabled = false }) {
}

7. Source of Information

Not going to explain all this crap here. Go read up!

/**
 * The source for these abbreviations is here.
 * https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/resource-abbreviations
 */
let abbreviations = ["aks", "appcs", "ase", "plan", "appi", "apim" /* ... */];

8. Source of Copy/Pasted Code

(we all do it sometimes)

// source: https://stackoverflow.com/a/15289883
const _MS_PER_DAY = 1000 * 60 * 60 * 24;

function dateDiffInDays(a, b) {
  // Discard the time and time-zone information.
  const utc1 = Date.UTC(a.getFullYear(), a.getMonth(), a.getDate());
  const utc2 = Date.UTC(b.getFullYear(), b.getMonth(), b.getDate());

  return Math.floor((utc2 - utc1) / _MS_PER_DAY);
}

9. In order to understand this code you’ll need to know more about this special topic

We are not making up the rules, they are!

// Official DCC Schema documentation
// https://github.com/ehn-dcc-development/ehn-dcc-schema
function parseDccSchema(dcc) {
}

10. What kind of result you can expect from a module or function

/**
 * Calculator screen. It is divided into a left and right part, where the left part 
 * is the input form and the right part presents the result. If the screen width is
 * less than 768px the left part becomes top and the right part becomes the bottom.
 */
function Calculator() {
  /** implementation.. */
}

Summary

Anyone can write code that computers understand. The challenge is writing code that humans also understand.

If you want to know more about how I document code, check out the convention on my wiki.

Insights on setting up a Service Level Agreement

During January I have spent a lot of time thinking about, reading about and setting up a Service Level Agreement. The purpose is to agree with your paying clients on measurable metrics like uptime, responsiveness and responsibilities.

If it’s done right, it will influence how those clients prefer to interface with you: whether they do it synchronously or asynchronously, put a cache in between, or have a failsafe.

Here I will write some general insights that I got from this process. If you want my complete SLA convention, you should check out my wiki. There I’ve also posted a sample SLA that you can reuse for your own purposes.

Always Start with Metrics

Before you dig into availability and 99.99999% you must start with metrics. What does availability mean to you? How do you measure it? What is an error? Is HTTP status 404 an error? Do errors during maintenance count towards your metric? How is request latency measured? Is it measured on the client or the server? Do you measure the average over all requests? How does a cold start latency affect your metric?

There are a lot of things to unpack before you can start thinking about objectives.
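
To make this concrete, here is a hypothetical availability indicator sketched in JavaScript. Every choice encoded in it (404 doesn’t count as an error, maintenance windows are excluded) is exactly the kind of decision you need to write down before setting objectives; the request shape is made up for illustration.

// A hypothetical availability SLI.
function isError(request) {
  return request.status >= 500; // here, 404 is not an error
}

function availability(requests, isInMaintenance) {
  // Errors during maintenance don't count towards the metric.
  const counted = requests.filter((r) => !isInMaintenance(r.timestamp));
  const errors = counted.filter(isError).length;
  return 1 - errors / counted.length;
}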

Should an 8-second cold start in the middle of the night affect whether you reach your SLA objectives?

Not as Available as you Think

Everywhere you look businesses offer 99.95% availability. Translated, it means 5 minutes and 2 seconds of downtime weekly. A common misconception among developers is that it’s easy: all our deploys are automated anyway, and if one fails we’ll just roll back.

Before you set that objective you should consider

  • When the service goes down in the middle of the night, how much time does it take to wake somebody up to take a look at the problem?
  • When the service goes down Saturday morning, do you have people working through the weekend to get the service up and running again?
  • Your availability is dependent on the availability of all the services you depend on. If you host on Azure Kubernetes Service, which offers 99.95% availability, you cannot offer the same, because Microsoft will eat up your whole failure budget.

Be kind to yourself. Don’t overpromise

  • Set an objective that promises availability within business hours, when you have developers awake that can work on the problem.
  • Pay people to be on-call when you need to offer availability off-hours.
  • Multiply availability of your dependent services with each other, and then with your own availability to reach a reasonable number. And then give yourself some slack. An objective should not be impossible or even challenging.

Azure Kubernetes = 99.95%
Azure MySQL = 99.9%
Azure API Management = 99.95%
My availability = 99%

Total Availability = 99.95% * 99.9% * 99.95% * 99% ≈ 98.8%
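
If you want to play with the numbers, here is a small JavaScript sketch (the figures are just the example above) that compounds availabilities and translates the result into allowed downtime per week.

// Availabilities as fractions: AKS, MySQL, API Management, our own.
const availabilities = [0.9995, 0.999, 0.9995, 0.99];

const total = availabilities.reduce((product, a) => product * a, 1);

// Translate total availability into allowed downtime per week.
const minutesPerWeek = 7 * 24 * 60;
const allowedDowntime = (1 - total) * minutesPerWeek;

console.log(`Total availability: ${(total * 100).toFixed(2)}%`); // 98.80%
console.log(`Allowed downtime: ${allowedDowntime.toFixed(0)} min/week`); // 121 min/week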

Every Metric must be Measured

This sounds so obvious: how can you know that you meet the objective unless you measure the metric? Still, I rarely see anyone measuring their service level indicators. Maybe they don’t want to know.

If you are using a cloud provider like Microsoft Azure, you can set up workbooks to measure your metrics. I’m a proponent of giving my clients access to these workbooks so they can see that we live up to the SLA.

A dashboard that is automatically updated with the metrics from our service level agreement.

The Client Also Has Responsibilities

An agreement goes both ways, and in order for you as a vendor to fulfil your part of the agreement you need to put some requirements on the client.

  • Define a reasonable workload that the client is allowed to put on your service for the objectives to be attainable. You can set a limit of 100 requests/second and refuse excess requests. Those errors do not count towards your error budget.
  • The client should be responsible for adjusting their service clients to updates in your API. You don’t want to maintain a 5 year old version of your system.

Reparations should Repair not Bankrupt

I’ve seen so many service level agreements that include a fine if the objectives are not met, and often those fines are quite high. They seldom define how often a client can request a payout, and together with badly defined objectives, a client could drive a service provider into bankruptcy.

That is not beneficial to anyone, so please stop writing SLAs with harsh penalties. You should try to repair, not bankrupt. Instead ask

  • How much damage was caused by the outage?
  • Can we update the service level objectives to become more reasonable?
  • Can the client adjust their use of our service to better fit our new objectives?
  • Is the client open to paying more so we can have a service technician on-call?

Summary

Writing an SLA is hard. It requires experience from both the legal team and IT operations. Availability is not an objective that a client can simply demand of your service. It must be negotiated and carefully weighed against the IT operations environment, the support organization and costs.

Taking Control of Azure Access Control

This is another post in the unintended series about untangling your Azure account. My first post was about naming and grouping your Azure Resources. The second was about writing conventions and following them. This third post is about managing Azure Access Control.

Developers, Developers, Developers

There’s nothing inherently wrong with handing developers access to each resource and resource group they need. You will have a mess of access rights spread all over, but you can easily revoke access by removing developers from the subscription.

If you need to manage privileges in a structured way, it is less than ideal. That is why I have developed a convention for managing access in our Azure subscription. It’s quite easy.

For each resource group, create one user group with contributor role, and use the following name format.

{project-name}-{component-name}-{environment}-{role}-ug

The name format of a user group.

Let’s break it down

  • Project Name and Component Name should be exactly the same as the resource group name.
  • Environment, I usually go with dev and prod. I have never come across a situation where I needed to hand out access specifically to test or stage, so dev means dev & test while prod means stage & prod.
  • Contributor, the role, is useful to have in the name if you need to hand out access for more roles later. For me, the most common access role after contributor has been monitoring.
  • UG is the user group suffix, which helps you recognise these groups in Azure API scenarios.

There will be one user group for every resource group.
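
As an illustration, here is the naming rule as a tiny JavaScript helper (the example values are hypothetical):

// Compose a user group name from the parts of the convention.
function userGroupName(project, component, environment, role) {
  return [project, component, environment, role, "ug"].join("-");
}

userGroupName("klabbet", "web", "dev", "contributor");
// => "klabbet-web-dev-contributor-ug"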

Assigning Access

You can now assign access to the user groups instead of the Azure resources directly.

Assigning user access to user groups instead of direct access to resource groups.

Managing access will become much easier.

Groups of Groups

Doing this will unlock the potential of combining user groups into larger user groups. If the project “Klabbet” has both a web and an API component, we can create a user group that gives developers access to both.

  • klabbet-dev-contributor-ug is a member of klabbet-web-dev-contributor-ug and klabbet-api-dev-contributor-ug, granting both API and Web dev access.
  • klabbet-prod-contributor-ug is a member of klabbet-web-prod-contributor-ug and klabbet-api-prod-contributor-ug, granting both API and Web prod access.

We can combine user groups into more permissive user groups.

By combining user groups into larger user groups we will get better control of what kind of access a user has, without investing too much effort.

One user group assignment will give the user access to 4 resource groups.

Summary

I’ve presented a format for access control that does not require much effort to set up, but provides lots of flexibility to take control of your access control.

If you’re interested in my convention for access control you can find the specification here.

Project Conventions are Crucial

In my previous blog post I wrote about the importance of having a convention for grouping and naming resources in Azure. In this article I will explain how to set up conventions for your project. Writing a convention is easy. Making sure it is followed is much harder.

Purpose of Conventions

We write conventions for our projects for the following two reasons

  1. If everyone does things their own way we’ll have a mess. Messes are hard to maintain.
  2. Jotting it down saves a lot of time when we introduce new developers to the team.

Don’t write a lot of conventions up front, but also don’t wait until the absence of conventions turns into a mess.

The Art of Writing Conventions

Writing conventions is the easy part. Here is a sample of how I do it.

My convention for naming resources on Azure.

Don’t feel obliged to add all the bells and whistles if you don’t need them. Here is what I do

  • Versioning makes governance much easier. A simple document version table at the top will do.
  • Each statement is short and precise. Don’t give room for interpretations.
  • I use RFC2119 to make it clear what’s a rule (MUST), an injunction (SHOULD) or a suggestion (MAY).
  • One statement per line makes it easier to skim through.
  • Numbering each statement will make it easier to reference individual statements. Anchor links are nice.
  • Link to external resources for further reading.
  • You don’t need to justify a statement. Leave it for team discussion.

Keeping it Alive

Conventions that aren’t followed are useless.

Include project conventions in your peer review process.

Take 15 minutes each week (after the daily stand-up on Fridays) with the team to discuss one of the conventions. Update the convention during the meeting. This will make the whole team aware of the conventions and keep them updated. If you have 26 conventions you will go through them all every 6 months.

I find these sessions very valuable for the team, and if you replace people often in your team they are a necessity.

Moving on From Here

I’ve posted my conventions from Bring Order to your Azure Account publicly under a Creative Commons license. Go ahead and steal those conventions for your own project wiki and update them to fit your team.

Bring Order to your Azure Account

I’ve had the benefit of stepping into other developers’ Azure accounts, and it’s very much like opening up someone’s brain and taking a peek inside.

There are some bits ordered here and there, but it’s mostly chaos.

If it’s set up by someone with experience you will notice an intention of following conventions, because they’ve seen first hand how quickly it gets out of hand. When there is more than one contributor, you will start noticing different patterns depending on which person set things up.

That is why it is so important to document a convention.

Why is it like this?

  • Most Azure resources are hard to move and rename. If you do something wrong you will need to delete and recreate the resource which often isn’t worth the trouble.

Here follows my convention. If you don’t like it, feel free to create your own. The most important thing is not how resources are named, but that the convention is followed so they’re named the same way.

Subscriptions

Microsoft will suggest that you put production and non-production resources into different subscriptions, in order to take advantage of the Azure Dev/Test offer. I’m not in favour of this approach.

I want you to think about the subscription as “who gets the invoice?”. Not which department is going to bear the cost, but the actual invoice. Depending on your organization structure you’ll need 1 subscription, or you’ll need 10.

It is not uncommon to have one financial department that will receive the invoice and split the costs across different cost units. It is also not uncommon to have 4 different legal entities and invoice them separately.

If you as a consultant are doing work for a client that doesn’t have their own subscription, create a subscription for them. You will have all the costs neatly gathered, and you can easily hand it over when the project is done.

I have two subscriptions, one for each of my companies.

Resource Groups

I structure my resource groups like this.

{project-name}-{component-name}-{environment}-rg

Let’s unpack this a bit

  • Project Name, is the name of the application, project or website. I place this first, because when I’m searching for a resource group I will always start looking for the name of the application.
  • Component Name, is optional. Sometimes you don’t need it, but most of the time there are several components in a project. A web, an API, a BI component.
  • Environment, will be dev, test, stage or prod. I place different environments in different resource groups to make it easier to tear down one environment and rebuild another. It helps with keeping environments isolated and access control on the environment level.
  • Rg, to make it easier to recognise a resource group in az cli and other API scenarios.

A note on isolating environments: it drives cost up when you don’t share app service plans between dev and test. In my experience dev/test resources are cheap and it doesn’t cost much to duplicate them. The benefit of completely isolated environments is greater than the cost of split resources.

In my small Klabbet subscription it looks like this.

Sometimes Azure creates resource groups for you like AzureBackupRG_westeurope_1. Just leave it. It is not worth the trouble to change it for OCD purposes.

This scales well to hundreds of resource groups thanks to the filter functionality. Writing vaccinated in the filter box will help me quickly identify all resource groups belonging to that project.

Resource Names

Now we’ve made sure each resource group only contains resources for a specific application, component and environment. This keeps the number of resources in each group much smaller and easier to overview.

I have a naming convention for Azure resources as well (it looks very much like the previous one).

{project-name}-{component-name}-{environment}-{abbreviation}

Project name, component name and environment in the resource name are useful when working with az cli and Azure APIs. The resource name abbreviation should be picked from the Microsoft recommended abbreviations. If your resource doesn’t have an official abbreviation, make up your own and document it in your convention.

Don’t worry about Microsoft having special naming constraints; just remove the dashes for storage accounts.
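
As a sketch, assuming the format above with the abbreviation as a suffix (“app” for a web app, “st” for a storage account, per the Microsoft abbreviation list), a name builder might look like this:

// Compose a resource name from the convention parts.
// Storage accounts don't allow dashes, so strip them for that type.
function resourceName(project, component, environment, abbreviation) {
  const name = [project, component, environment, abbreviation].join("-");
  return abbreviation === "st" ? name.replace(/-/g, "") : name;
}

resourceName("vaccinated", "web", "prod", "app"); // => "vaccinated-web-prod-app"
resourceName("vaccinated", "web", "prod", "st"); // => "vaccinatedwebprodst"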

Here’s what a resource group might look like.

Don’t worry about Azure creating resources that don’t follow your naming convention. That is fine.

Tagging

Tagging is often forgotten, even though you’re reminded every time you create a new resource. *hint* *hint* Tagging is extremely useful. Here are my standard tags that I put on all resources.

  • Application, e.g. Vaccinated. The boss comes in and says: “I want to know exactly how much this project has cost us.” You filter cost analysis on the Application tag.
  • Environment, e.g. prod. Being able to split cost analysis on different environments is golden.
  • Organization, e.g. Klabbet. You will always host only one organization within this subscription, until you’re not. When you’re going to create that first cost report, you’ll be glad that you tagged Organization from the start.

Other suggested tags are “Business Unit”, “Component” and “Country”, if your org has different departments, works on different components or is multi-national.

Tagging is for reporting, so think about what your boss might come and ask of you when you decide which tags are mandatory.
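
As a minimal sketch, the standard tags can live as a shared constant that you merge into every resource you create (the values are just the examples from the list above):

// Standard tags applied to every resource. Example values.
const standardTags = {
  Application: "Vaccinated",
  Environment: "prod",
  Organization: "Klabbet",
};

// Merge with resource-specific tags when creating a resource.
const resourceTags = { ...standardTags, Component: "web" };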

Now we can filter or group our expense reports on these tags.

Vaccinated was one failure of a project. How much should we blame the marketing department for that!? Well, 600 SEK worth!

Summary

This blog post has gone through my preferred way of bringing order to an Azure account: making it easy to find things, and tagging resources to make reporting easier.

This might not be your preferred way, but it has proven itself useful to me.

Rigor Mortis

Disclaimer.

This blog post is about the overuse of code quality methods. If you’re not using good practices for code quality, the advice against overuse doesn’t apply to you. Don’t stop doing something you never did just because someone on the internet has overdone it and believes they should do it less.

What is Code Quality

Software quality is a very wide concept, but when it comes to code quality I find it quite easy to pin down. The following aspects are often mentioned in regard to code quality

  • Low cyclomatic complexity
  • Easy to read
  • Easy to test
  • Can be reused
  • No side effects

These aspects come down to making the code easy to change. That is why my definition of high-quality code is: code that is easy to change.

Code Smell: Rigor Mortis

Code can be easy to read, well tested, documented, reviewed and still be hard to change. Code quality practices can work in a way that locks down the code and makes it harder to change.

If you strive for your code to be perfect, it will also become hard to change. Once you try to change it, tests will break, code quality tools will complain about loose ends, and the compiler will want you to fix 20 compilation errors.

I call this code smell Rigor Mortis.

Too Much Quality Slows you Down

If you apply too many quality methods, the code will become harder to change. It will look great, but if it cannot be changed it is dead. Businesses that depend on software being easy to change will be impeded by code too rigid to change.

Unit Testing

The practice of writing unit tests is a quick way to increase quality. Done the right way it will help you refactor a program and introduce changes while keeping track of unintended side effects.

Too many tests, or tests written in the wrong way, will make the program harder to change.

  • Tests fail when they have high coupling to the implementation of the system under test. Those tests are brittle.
  • Tests fail when you alter the behavior of a program. These tests were intended to fail.

Tests that fail require work to fix. Each failing test makes it harder to change the code. Good tests only fail when you break your program; all other tests make your code smell of rigor mortis.
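
A hypothetical illustration of the difference, using Node’s built-in assert and a trivial function under test:

const assert = require("node:assert");

// A trivial function under test.
function add(a, b) {
  return a + b;
}

// Good test: asserts observable behaviour, fails only if the program breaks.
assert.strictEqual(add(2, 2), 4);

// Brittle test: coupled to the implementation. Refactoring `add` to
// return b + a keeps the behaviour but breaks this test.
assert.ok(add.toString().includes("a + b"));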

Static Typing

In statically typed languages like C# and TypeScript the compiler will check your program for errors. This provides quick feedback on syntax errors, spelling errors and logical errors. It helps you code faster.

Static Typing will also slow you down. The compiler can be so strict that you spend more time trying to satisfy it than making the desired change of your program.

Making a change in a statically typed program may require you to update code and models in 10 different places to satisfy the compiler. If this helps you prevent errors, that is a good practice, but if it only slows you down it smells of rigor mortis.

Linting & Static Code Analysis

Tools that automatically review your code may help greatly in avoiding common mistakes that could take hours to troubleshoot. They are very helpful in teaching us quirks of the language that we should watch out for.

Linting tools also work the other way around. They will complain and stop you from making unsafe changes that are needed during troubleshooting. Commenting out a line of code can lead you down a rabbit hole of making sure no references are left unused, just to satisfy the linter.

Watch out when your linter makes your code harder to change. It smells like teen spirit... I mean rigor mortis.

Summary

I’ve introduced a code smell, the smell of too much quality, and I’ve given you some examples of where this smell applies.

All of these are good code practices by today’s standards, and you should apply them to your projects. But you also need to take care that your quality methods don’t impede your ability to change your code.

You don’t want your code to reach a state of rigor mortis.

New Blog

I’ve been writing about software on and off since 2008. My first blog was on WordPress. I hosted it myself somewhere and it was hell to keep updating it to avoid getting hacked.

I migrated my blog over to an Orchard CMS site that I developed myself. This was both an experiment to learn Orchard CMS which was new back then, and playing around with Azure. It turns out that I don’t have time to maintain an Orchard CMS site.

So I migrated the Orchard CMS site to Jekyll. I could deploy it to AWS S3 and it was so fast! Except that time flows by: the Ruby version gets out of date, the ImageMagick library gets old, and soon I had to create a Docker container just for writing and publishing new blog posts.

So now I’m back on WordPress. I will not migrate all the content as I’ve done before, but I will move over what I think still brings value, and the rest will evaporate into cyberspace.