Developing Solutions for Microsoft Azure

Today I passed my AZ-204: Developing Solutions for Microsoft Azure exam and became an Azure Developer Associate. I’ve done some certifications in my day, but this was by far the hardest. The breadth of knowledge required (Azure SDKs, data storage, data connections, APIs, authentication, authorisation, compute, containers, deployment, performance and monitoring), combined with the extreme detail in the questions, made this really hard. I didn’t think I had passed until I got my result.

These were the kinds of questions that were asked

  • Case studies: Read up on a case study and answer questions on how to solve the client’s particular problems with Azure services. Questions like: what storage technology is appropriate, what service tier should you recommend, and so on.
  • Many questions about the capabilities of different services. Like: what event-passing service should you use if you need guaranteed FIFO (first-in, first-out)?
  • How to set up a particular scenario. Like what order you should create services in to solve the problem at hand. Some of these questions were down to CLI commands, so make sure that you’ve dipped your toes into Azure CLI.
  • Code questions where you need to fill in the blanks on how to connect and send messages on a service bus, or provision a set of services with an ARM template. You also get code questions where you must predict the result of the code (for a taste, see the sketch after this list).
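
To give you a taste of the code questions, here is a minimal sketch of sending a message on a service bus with the @azure/service-bus SDK (v7). The connection string and queue name are placeholders.

const { ServiceBusClient } = require("@azure/service-bus");

async function sendOrder() {
  // Placeholder connection string and queue name -- substitute your own.
  const client = new ServiceBusClient(process.env.SERVICE_BUS_CONNECTION_STRING);
  const sender = client.createSender("orders");
  try {
    // The exam expects you to know the shape of these calls by heart.
    await sender.sendMessages({ body: "Hello, Service Bus!" });
  } finally {
    await sender.close();
    await client.close();
  }
}

sendOrder().catch(console.error);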

Because of the huge area of expertise and the extreme level of detail in the questions, I don’t think you could study and pass the exam without hands-on development experience. If I were to give advice on what to study it would be

  • Go through the Online – Free preparation material. Make sure you remember the capabilities of each service, how they differ, and what features the higher pricing tiers enable. Those questions are guaranteed.
  • Do some exercises on connecting Azure Functions, blob storage, service bus, queue storage, event grid and event hub. These were central to the exam.
  • Make sure you know how to manage authorisation to services like blob storage, and the benefits of the different ways to do it. Know your Azure Key Vault, as the security questions emphasise this.

Be prepared that it is much harder than AZ-900: Microsoft Azure Fundamentals. Go slow and use all the time that you get. Good luck!

Product Ownership

Any pair of programmers can write some code in a garage, but once that code ships to real users you have a product, and that’s a different thing entirely.

Whether you’re a software vendor or a packaging manufacturer building software to support your business, that software needs support, change management, hosting, integrations and documentation. “Just build it!” is often too easily said. Once it is built, you will have that software in your IT landscape for years to come.

Hiring a product owner will help you with the following things

  • Setting a vision for your product to achieve
  • Driving change in the product with a team of developers
  • Collecting requirements from users and stakeholders
  • Helping users and stakeholders understand your product’s brilliance

Maybe you don’t need a product owner for every VBA script written in Excel, but any system with a sufficient number of users should have one.

Here are some of the qualities I find important in a product owner

  • An excellent communicator to gather requirements and communicate plans
  • An ambassador that will make people interested in your product
  • Comfortable with drawing up plans and executing on them
  • A source of great values from which the team can inherit their culture
  • An internal marketer to make sure the product has continued funding

The product owner doesn’t need to be a tech wizard. It’s much more important to get a good in-house marketer for your product.

Responding to Incidents

Shit happens, it is inevitable. We work so hard to keep things running, with redundancy, automatic fail-over, 99.999% availability, but most of the time outages happen because someone screwed up.

In an unhealthy organization you hang that person and move on. The organization learns nothing and is doomed to repeat the mistake.

In a healthy organization the system is at fault for allowing the person to make the mistake. The system needs to be fixed, and each outage is an excellent learning opportunity.

Incident Playbook

Having a playbook of what to do in the event of an outage is basic. You need to determine what kind of outage is considered an incident, how to discover an incident and how to assemble the response team. One thing most teams forget is that the playbook is useless if

  • Nobody knows it exists, or where to find it during an incident

This is why it’s imperative to have fire drills and to practice incidents. Some go as far as actually bringing down a system to practice a live incident.

Here’s how I would plan a fire drill

  1. Set a fixed time and date for the drill and inform the team so they can prepare
  2. Schedule a service window during the fire drill so the organization and its users can prepare
  3. Book a session with the team to present the incident playbook and make sure they know it
  4. Break the system at the start of the service window. Automatically restore the system at the end of the service window if the team has failed to find the fault
  5. Book a postmortem to evaluate the incident response

Postmortem

After an incident you should always conduct a postmortem. The point is to identify the root cause of the incident and find new systems, solutions, processes and routines to make sure the incident doesn’t recur.

The purpose is to create a learning organization, where you set up safeguards against recurrence, protection that will remain long after the people involved in the incident are gone.

Things to consider with a postmortem

  • Putting blame on a person or a team doesn’t prevent the incident from recurring
  • Taking responsibility for the incident also won’t prevent it from happening again
  • The actions coming out of the retrospective meeting must prevent the incident from happening again, or you have failed to identify the root cause

Here’s my template for a postmortem retrospective to help you ask the right questions and identify the root causes.

Document your Code

I was told this week that the code doesn’t need documentation because the developers are good at naming things. So I thought it was time to revisit what kind of documentation should be included in code.

Code Comments

There are two common objections to code comments

  1. They are not very useful because the code tells us what the program does
  2. They are often wrong because the code changes but not the comments

This is just the talk of lazy, “low effort” developers. I think the Agile Manifesto’s “working software over comprehensive documentation” has done more harm than good.

Well written comments are invaluable. I’ve never come across an outdated comment that threw me off in a way that I couldn’t just delete it. 🤷‍♂️

Here are some examples of code comments I find useful

1. Adding context that is not in the code

This code was written because of a behaviour in macOS.

// On macOS it's common to re-create a window in the app when the
// dock icon is clicked and there are no other windows open.
if (BrowserWindow.getAllWindows().length === 0) createWindow();

2. Adding intention to the code

There are some things that only work in this order.

// This method will be called when Electron has finished
// initialization and is ready to create browser windows.
// Some APIs can only be used after this event occurs.
app.whenReady().then(() => {
  createWindow();
});

3. Rabbit holes you went down and want to warn others of

Warning, here be dragons. 🐉

// THE OBJECT POLYFILL WILL NOT WORK ON THE WEBKIT 1.0.3 PLATFORM
// import "core-js/es/object";

4. Explaining what is going on that the code doesn’t communicate clearly

Why must the public URL be the same as the window location?

if (publicUrl.origin !== window.location.origin) {
  // Our service worker won't work if PUBLIC_URL is on a different origin
  // from what our page is served on. This might happen if a CDN is used to
  // serve assets; see https://github.com/facebook/create-react-app/issues/2374
  return;
}

5. Add a reference to the bug or issue that prompted the change

Go check the bug description to find more information about why the code looks like this.

Sentry.init({
  // BUG AB#3133 Decrease sample rate in production
  // Decreasing sample rate to keep costs down.
  tracesSampleRate: 0.1,
});

6. Description of public modules and functions

This gives you nice IntelliSense when using the module or function from elsewhere in the code.

/**
 * A button that lets you copy the current value to clipboard.
 *
 * @param {object} props
 * @param {string} props.text - The text to display on the button.
 * @param {string} props.value - The value to copy to clipboard.
 * @param {boolean} [props.isDisabled] - Whether the button should be disabled.
 */
function CopyButton({ text, value, isDisabled = false }) {
}

7. Source of Information

Not going to explain all this crap here. Go read up!

/**
 * The source for these abbreviations is here.
 * https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/resource-abbreviations
 */
let abbreviations = ["aks", "appcs", "ase", "plan", "appi", "apim" /* ... */];

8. Source of Copy/Pasted Code

(we all do it sometimes)

// source: https://stackoverflow.com/a/15289883
const _MS_PER_DAY = 1000 * 60 * 60 * 24;

function dateDiffInDays(a, b) {
  // Discard the time and time-zone information.
  const utc1 = Date.UTC(a.getFullYear(), a.getMonth(), a.getDate());
  const utc2 = Date.UTC(b.getFullYear(), b.getMonth(), b.getDate());

  return Math.floor((utc2 - utc1) / _MS_PER_DAY);
}

9. In order to understand this code you’ll need to know more about this special topic

We are not making up the rules, they are!

// Official DCC Schema documentation
// https://github.com/ehn-dcc-development/ehn-dcc-schema
function parseDccSchema(dcc) {
}

10. What kind of result you can expect from a module or function

/**
 * Calculator screen. It is divided into a left and right part, where the left part 
 * is the input form and the right part presents the result. If the screen width is
 * less than 768px the left part becomes top and the right part becomes the bottom.
 */
function Calculator() {
  /** implementation.. */
}

Summary

Anyone can write code that computers understand. The challenge is writing code that humans also understand.

If you want to know more about how I document code, check out the convention on my wiki.

Insights on setting up a Service Level Agreement

During January I have spent a lot of time thinking about, reading about and setting up a Service Level Agreement. The purpose is to agree with your paying clients on measurable metrics like uptime, responsiveness and responsibilities.

If it’s done right, it will influence how those clients prefer to interface with you: whether they do it synchronously or asynchronously, put a cache in between, or have a failsafe.

Here I will write some general insights that I got from this process. If you want my complete SLA convention, you should check out my wiki. There I’ve also posted a sample SLA that you can reuse for your own purposes.

Always Start with Metrics

Before you dig into availability and 99.99999% you must start with metrics. What does availability mean to you? How do you measure it? What is an error? Is HTTP status 404 an error? Do errors during maintenance count towards your metric? How is request latency measured? Is it measured on the client or the server? Do you measure the average over all requests? How does a cold start latency affect your metric?

There are a lot of things to unpack before you can start thinking about objectives.
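
To make this concrete, here is a hypothetical availability indicator sketched in JavaScript. Every choice encoded in it (404 doesn’t count as an error, maintenance windows are excluded) is exactly the kind of decision you need to write down before setting objectives; the request shape is made up for illustration.

// A hypothetical availability SLI.
function isError(request) {
  return request.status >= 500; // here, 404 is not an error
}

function availability(requests, isInMaintenance) {
  // Errors during maintenance don't count towards the metric.
  const counted = requests.filter((r) => !isInMaintenance(r.timestamp));
  const errors = counted.filter(isError).length;
  return 1 - errors / counted.length;
}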

Should an 8-second cold start in the middle of the night affect whether you reach your SLA objectives?

Not as Available as you Think

Everywhere you look businesses offer 99.95% availability. Translated, it means 5 minutes and 2 seconds of downtime weekly. A common misconception among developers is that it’s easy: all our deploys are automated anyway, and if one fails we’ll just roll back.

Before you set that objective you should consider

  • When the service goes down in the middle of the night, how much time does it take to wake somebody up to take a look at the problem?
  • When the service goes down Saturday morning, do you have people working through the weekend to get the service up and running again?
  • Your availability is dependent on the availability of all the services you depend on. If you host on Azure Kubernetes Service, which offers 99.95% availability, you cannot offer the same, because Microsoft will eat up your whole failure budget.

Be kind to yourself. Don’t overpromise

  • Set an objective that promises availability within business hours, when you have developers awake that can work on the problem.
  • Pay people to be on-call when you need to offer availability off-hours.
  • Multiply availability of your dependent services with each other, and then with your own availability to reach a reasonable number. And then give yourself some slack. An objective should not be impossible or even challenging.

Azure Kubernetes = 99.95%
Azure MySQL = 99.9%
Azure API Management = 99.95%
My availability = 99%

Total Availability = 99.95% * 99.9% * 99.95% * 99% ≈ 98.8%
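
If you want to play with the numbers, here is a small JavaScript sketch (the figures are just the example above) that compounds availabilities and translates the result into allowed downtime per week.

// Availabilities as fractions: AKS, MySQL, API Management, our own.
const availabilities = [0.9995, 0.999, 0.9995, 0.99];

const total = availabilities.reduce((product, a) => product * a, 1);

// Translate total availability into allowed downtime per week.
const minutesPerWeek = 7 * 24 * 60;
const allowedDowntime = (1 - total) * minutesPerWeek;

console.log(`Total availability: ${(total * 100).toFixed(2)}%`); // 98.80%
console.log(`Allowed downtime: ${allowedDowntime.toFixed(0)} min/week`); // 121 min/week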

Every Metric must be Measured

This sounds so obvious: how can you know that you meet the objective unless you measure the metric? Still, I rarely see anyone measuring their service level indicators. Maybe they don’t want to know.

If you are using a cloud provider like Microsoft Azure, you can set up workbooks to measure your metrics. I’m a proponent of giving my clients access to these workbooks so they can see that we live up to the SLA.

A dashboard that is automatically updated with the metrics from our service level agreement.

The Client Also Has Responsibilities

An agreement goes both ways, and in order for you as a vendor to fulfil your part of the agreement you need to put some requirements on the client.

  • Define a reasonable workload that the client is allowed to put on your service for the objectives to be attainable. You can set a limit of 100 requests/second and refuse excess requests. Those errors do not count towards your error budget.
  • The client should be responsible for adjusting their service clients to updates in your API. You don’t want to maintain a 5 year old version of your system.

Reparations should Repair not Bankrupt

I’ve seen so many service level agreements that include a fine if the objectives are not met, and often those fines are quite high. They seldom define how often a client can request a payout, and together with badly defined objectives, a client could drive a service provider into bankruptcy.

That is not beneficial to anyone, so please stop writing SLAs with harsh penalties. You should try to repair, not bankrupt. Instead ask

  • How much damage was caused by the outage?
  • Can we update the service level objectives to become more reasonable?
  • Can the client adjust their use of our service to better fit our new objectives?
  • Is the client open to paying more so we can have a service technician on-call?

Summary

Writing an SLA is hard. It requires experience from both the legal team and IT operations. Availability is not an objective that a client can simply demand of your service. It must be negotiated and carefully weighed against the IT operations environment, the support organization and costs.

Taking Control of Azure Access Control

This is another post in the unintended series about untangling your Azure account. My first post was about naming and grouping your Azure Resources. The second was about writing conventions and following them. This third post is about managing Azure Access Control.

Developers, Developers, Developers

There’s nothing inherently wrong with handing developers access to each resource and resource group they need. You will have a mess of access rights spread all over, but you can easily revoke access by removing developers from the subscription.

If you need to manage privileges in a structured way, it is less than ideal. That is why I have developed a convention for managing access in our Azure subscription. It’s quite easy.

For each resource group, create one user group with contributor role, and use the following name format.

{project-name}-{component-name}-{environment}-{role}-ug

The name format of a user group.

Let’s break it down

  • Project Name and Component Name should be exactly the same as the resource group name.
  • Environment, I usually go with dev and prod. I have never come across a situation where I needed to hand out access specifically to test or stage, so dev means dev & test while prod means stage & prod.
  • Contributor, the role, is useful to have in the name if you need to hand out access for more roles later. For me, the most common access role after contributor has been monitoring.
  • UG is the user group suffix, which helps you recognise these groups in Azure API scenarios.

There will be one user group for every resource group.
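
As an illustration, here is the naming rule as a tiny JavaScript helper (the example values are hypothetical):

// Compose a user group name from the parts of the convention.
function userGroupName(project, component, environment, role) {
  return [project, component, environment, role, "ug"].join("-");
}

userGroupName("klabbet", "web", "dev", "contributor");
// => "klabbet-web-dev-contributor-ug"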

Assigning Access

You can now assign access to the user groups instead of the Azure resources directly.

Assigning user access to user groups instead of direct access to resource groups.

Managing access will become much easier.

Groups of Groups

Doing this will unlock the potential of combining user groups into larger user groups. If the project “Klabbet” has both a web and an API component, we can create a user group that gives developers access to both.

  • klabbet-dev-contributor-ug is a member of klabbet-web-dev-contributor-ug and klabbet-api-dev-contributor-ug, granting both API and Web dev access.
  • klabbet-prod-contributor-ug is a member of klabbet-web-prod-contributor-ug and klabbet-api-prod-contributor-ug, granting both API and Web prod access.

We can combine user groups into more permissive user groups.

By combining user groups into larger user groups we will get better control of what kind of access a user has, without investing too much effort.

One user group assignment will give the user access to 4 resource groups.

Summary

I’ve presented a format for access control that does not require much effort to set up, but provides lots of flexibility to take control of your access control.

If you’re interested in my convention for access control you can find the specification here.

Project Conventions are Crucial

In my previous blog post I wrote about the importance of having a convention for grouping and naming resources in Azure. In this article I will explain how to set up conventions for your project. Writing a convention is easy. Making sure it is followed is much harder.

Purpose of Conventions

We write conventions for our projects for the following two reasons

  1. If everyone does things their own way we’ll have a mess. Messes are hard to maintain.
  2. Jotting it down saves a lot of time when we introduce new developers to the team.

Don’t write a lot of conventions up front, but also don’t wait until the absence of conventions turns into a mess.

The Art of Writing Conventions

Writing conventions is the easy part. Here is a sample of how I do it.

My convention for naming resources on Azure.

Don’t feel obliged to add all the bells and whistles if you don’t need them. Here is what I do

  • Versioning makes governance much easier. A simple document version table at the top will do.
  • Each statement is short and precise. Don’t give room for interpretations.
  • I use RFC2119 to make it clear what’s a rule (MUST), an injunction (SHOULD) or a suggestion (MAY).
  • One statement per line makes it easier to skim through.
  • Numbering each statement will make it easier to reference individual statements. Anchor links are nice.
  • Link to external resources for further reading.
  • You don’t need to justify a statement. Leave it for team discussion.

Keeping it Alive

Conventions that aren’t followed are useless.

Include project conventions in your peer review process.

Take 15 minutes each week (after the daily stand-up on Fridays) with the team to discuss one of the conventions. Update the convention during the meeting. This will make the whole team aware of the conventions and keep them updated. If you have 26 conventions you will go through them all every 6 months.

I find these sessions very valuable for the team, and if you replace people often in your team they are a necessity.

Moving on From Here

I’ve posted my conventions from Bring Order to your Azure Account publicly under a Creative Commons license. Go ahead and steal those conventions for your own project wiki and update them to fit your team.

Bring Order to your Azure Account

I’ve had the benefit of stepping into other developers’ Azure accounts, and it’s very much like opening up someone’s brain and taking a peek inside.

There are some bits ordered here and there, but it’s mostly chaos.

If it’s set up by someone with experience you will notice an intention of following conventions, because they’ve seen first hand how quickly it gets out of hand. When there is more than one contributor, you will start noticing different patterns depending on which person set things up.

That is why it is so important to document a convention.

Why is it like this?

  • Most Azure resources are hard to move and rename. If you do something wrong you will need to delete and recreate the resource which often isn’t worth the trouble.

Here follows my convention. If you don’t like it, feel free to create your own. The most important thing is not how resources are named, but that the convention is followed so they’re named the same way.

Subscriptions

Microsoft will suggest that you put production and non-production resources into different subscriptions, in order to take advantage of the Azure Dev/Test offer. I’m not in favour of this approach.

I want you to think about the subscription as “who gets the invoice?”. Not which department is going to bear the cost, but the actual invoice. Depending on your organization structure you’ll need 1 subscription, or you’ll need 10.

It is not uncommon to have one financial department that will receive the invoice and split the costs across different cost units. It is also not uncommon to have 4 different legal entities and invoice them separately.

If you as a consultant are doing work for a client that doesn’t have their own subscription, create a subscription for them. You will have all the costs neatly gathered, and you can easily hand it over when the project is done.

I have two subscriptions, one for each of my companies.

Resource Groups

I structure my resource groups like this.

{project-name}-{component-name}-{environment}-rg

Let’s unpack this a bit

  • Project Name, is the name of the application, project or website. I place this first, because when I’m searching for a resource group I will always start looking for the name of the application.
  • Component Name, is optional. Sometimes you don’t need it, but most of the time there are several components in a project. A web, an API, a BI component.
  • Environment, will be dev, test, stage or prod. I place different environments in different resource groups to make it easier to tear down one environment and rebuild another. It helps with keeping environments isolated and access control on the environment level.
  • Rg, to make it easier to recognise a resource group in az cli and other API scenarios.

A note on isolating environments: it drives cost up when you don’t share app service plans between dev and test. In my experience dev/test resources are cheap and it doesn’t cost much to duplicate them. The benefit of completely isolated environments is greater than the cost of split resources.

In my small Klabbet subscription it looks like this.

Sometimes Azure creates resource groups for you like AzureBackupRG_westeurope_1. Just leave it. It is not worth the trouble to change it for OCD purposes.

This scales well to hundreds of resource groups thanks to the filter functionality. Writing vaccinated in the filter box will help me quickly identify all resource groups belonging to that project.

Resource Names

Now we’ve made sure each resource group only contains resources for a specific application, component and environment. This keeps the number of resources in each group much smaller and easier to overview.

I have a naming convention for Azure resources as well (it looks very much like the previous one).

{project-name}-{component-name}-{environment}-{abbreviation}

Project name, component name and environment in the resource name are useful when working with az cli and Azure APIs. The resource name abbreviation should be picked from the Microsoft recommended abbreviations. If your resource doesn’t have an official abbreviation, make up your own and document it in your convention.

Don’t worry about Microsoft having special naming constraints; just remove the dashes for storage accounts.
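
As a sketch, assuming the format above with the abbreviation as a suffix (“app” for a web app, “st” for a storage account, per the Microsoft abbreviation list), a name builder might look like this:

// Compose a resource name from the convention parts.
// Storage accounts don't allow dashes, so strip them for that type.
function resourceName(project, component, environment, abbreviation) {
  const name = [project, component, environment, abbreviation].join("-");
  return abbreviation === "st" ? name.replace(/-/g, "") : name;
}

resourceName("vaccinated", "web", "prod", "app"); // => "vaccinated-web-prod-app"
resourceName("vaccinated", "web", "prod", "st"); // => "vaccinatedwebprodst"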

Here’s what a resource group might look like.

Don’t worry about Azure creating resources that don’t follow your naming convention. That is fine.

Tagging

Tagging is often forgotten, even though you’re reminded every time you create a new resource. *hint* *hint* Tagging is extremely useful. Here are my standard tags that I put on all resources.

  • Application, e.g. Vaccinated. The boss comes in and says: “I want to know exactly how much this project has cost us.” You filter cost analysis on the Application tag.
  • Environment, e.g. prod. Being able to split cost analysis on different environments is golden.
  • Organization, e.g. Klabbet. You will always host only one organization within this subscription, until you’re not. When you’re going to create that first cost report, you’ll be glad that you tagged Organization from the start.

Other suggested tags are “Business Unit”, “Component” and “Country”, if your org has different departments, works on different components or is multi-national.

Tagging is for reporting, so think about what your boss might come and ask of you when you decide which tags are mandatory.
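
As a minimal sketch, the standard tags can live as a shared constant that you merge into every resource you create (the values are just the examples from the list above):

// Standard tags applied to every resource. Example values.
const standardTags = {
  Application: "Vaccinated",
  Environment: "prod",
  Organization: "Klabbet",
};

// Merge with resource-specific tags when creating a resource.
const resourceTags = { ...standardTags, Component: "web" };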

Now we can filter or group our expense reports on these tags.

Vaccinated was one failure of a project. How much should we blame the marketing department for that!? Well, 600 SEK worth!

Summary

This blog post has gone through my preferred way of bringing order to an Azure account: making it easy to find things, and tagging resources to make reporting easier.

This might not be your preferred way, but it has proven itself useful to me.

Rigor Mortis

Disclaimer.

This blog post is about the overuse of code quality methods. If you’re not using good practices for code quality, the advice against overuse doesn’t apply to you. Don’t stop doing something you never did just because someone on the internet has overdone it and believes they should do it less.

What is Code Quality

Software quality is a very wide concept, but when it comes to code quality I find it quite easy to pin down. The following aspects are often mentioned in regard to code quality

  • Low cyclomatic complexity
  • Easy to read
  • Easy to test
  • Can be reused
  • No side effects

These aspects come down to making the code easy to change. That is why my definition of high-quality code is: code that is easy to change.

Code Smell: Rigor Mortis

Code can be easy to read, well tested, documented, reviewed and still be hard to change. Code quality practices can work in a way that locks down the code and makes it harder to change.

If you strive for your code to be perfect, it will also become hard to change. Once you try to change it, tests will break, code quality tools will complain about loose ends, and the compiler will want you to fix 20 compilation errors.

I call this code smell Rigor Mortis.

Too Much Quality Slows you Down

If you apply too many quality methods, the code will become harder to change. It will look great, but if it cannot be changed it is dead. Businesses that depend on software being easy to change will be impeded by code too rigid to change.

Unit Testing

The practice of writing unit tests is a quick way to increase quality. Done the right way it will help you refactor a program and introduce changes while keeping track of unintended side effects.

Too many tests, or tests written in the wrong way, will make the program harder to change.

  • Tests fail when they have high coupling to the implementation of the system under test. Those tests are brittle.
  • Tests fail when you alter the behavior of a program. These tests were intended to fail.

Tests that fail require work to fix. Each failing test makes it harder to change the code. Good tests only fail when you break your program; all other tests make your code smell of rigor mortis.
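
A hypothetical illustration of the difference, using Node’s built-in assert and a trivial function under test:

const assert = require("node:assert");

// A trivial function under test.
function add(a, b) {
  return a + b;
}

// Good test: asserts observable behaviour, fails only if the program breaks.
assert.strictEqual(add(2, 2), 4);

// Brittle test: coupled to the implementation. Refactoring `add` to
// return b + a keeps the behaviour but breaks this test.
assert.ok(add.toString().includes("a + b"));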

Static Typing

In statically typed languages like C# and TypeScript the compiler will check your program for errors. This provides quick feedback on syntax errors, spelling errors and logical errors. It helps you code faster.

Static Typing will also slow you down. The compiler can be so strict that you spend more time trying to satisfy it than making the desired change of your program.

Making a change in a statically typed program may require you to update code and models in 10 different places to satisfy the compiler. If this helps you prevent errors, that is a good practice, but if it only slows you down it smells of rigor mortis.

Linting & Static Code Analysis

Tools that automatically review your code may help greatly in avoiding common mistakes that could take hours to troubleshoot. They are very helpful in teaching us quirks of the language that we should watch out for.

Linting tools also work the other way around. They will complain and stop you from making unsafe changes that are needed during troubleshooting. Commenting out a line of code can lead you down a rabbit hole of making sure no references are left unused, just to satisfy the linter.

Watch out when your linter makes your code harder to change. It smells like teen spirit... I mean rigor mortis.

Summary

I’ve introduced a code smell, the smell of too much quality, and I’ve given you some examples of where this smell applies.

All of these are good code practices by today’s standards, and you should apply them to your projects. But you also need to take care that your quality methods don’t impede your ability to change your code.

You don’t want your code to reach a state of rigor mortis.

New Blog

I’ve been writing about software on and off since 2008. My first blog was on WordPress. I hosted it myself somewhere and it was hell to keep updating it to avoid getting hacked.

I migrated my blog over to an Orchard CMS site that I developed myself. This was both an experiment to learn Orchard CMS which was new back then, and playing around with Azure. It turns out that I don’t have time to maintain an Orchard CMS site.

So I migrated the Orchard CMS site to Jekyll. I could deploy it to AWS S3 and it was so fast! Except that time flows by: the Ruby version gets out of date, the ImageMagick library gets old, and soon I had to create a Docker container just for writing and publishing new blog posts.

So now I’m back on WordPress. I will not migrate all the content as I’ve done before, but I will move over what I think still brings value, and the rest will evaporate into cyberspace.