David Frappart

10 March 2022

Azure’s Native Run Tools

Post co-written by David Frappart (Cellenza) & Florent Hilger and Sébastien Leroy (Squadra)

Now that cloud platforms have matured, production environments have inevitably arrived on them.

The first step was to digest the cloud model embedded in these platforms and adapt “legacy,” or non-cloud, architectures to these new platforms.

Now it’s time to move on to the next step: mastering cloud operations.

We’ll try to provide some clarifications and pointers in this article to help you achieve this goal, specifically on the Azure platform.

We won’t claim to be able to cover everything in one post, so we’ll focus on the following topics:

First, on the basis that the Run starts with knowing which operations must be performed, we’ll cover observability.
Next, we’ll look at application protection concepts, specifically backup and recovery.
Finally, we’ll describe our current approach based on our wealth of experience.

Observability Tools in Azure

Because the cloud has changed the way architectures are designed, it’s only natural that observability has also been affected.

Fortunately, the Azure platform natively has a rich and extensible ecosystem, allowing you to connect third-party observability systems.

Speaking of observability, we’ll need to define metrics and collect logs. Logs allow us to track system errors and define error rates, among other things. Metrics provide valuable information like website latency or microservice saturation.

We’ll also need alerts and dashboards to keep track of these logs and metrics.

Metrics

Azure Monitor Metrics provides access to metrics in Azure. According to the Microsoft documentation, they can support “near real-time scenarios” due to their lightweight nature.

Metrics can be explored on a live resource by navigating to the “Metrics” menu and adding the desired metric.

Metrics in Azure Monitor

This is a simple way to get started with resource observability.

Once out of exploration mode, you can use the Azure documentation to get all the metrics available for a resource type.

Microsoft.ContainerService/managedClusters

| Metric | Exportable via Diagnostic Settings? | Metric Display Name | Unit | Aggregation Type | Description | Dimensions |
| --- | --- | --- | --- | --- | --- | --- |
| apiserver_current_inflight_requests | No | Inflight Requests | Count | Average | Maximum number of currently used inflight requests on the API server per request kind in the last second | requestKind |
| cluster_autoscaler_cluster_safe_to_autoscale | No | Cluster Health | Count | Average | Determines whether or not cluster autoscaler will take action on the cluster | No dimension |
| cluster_autoscaler_scale_down_in_cooldown | No | Scale Down Cooldown | Count | Average | Determines if the scale down is in cooldown. No nodes will be removed during this time frame | No dimension |

Sample of Azure platform metrics
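Outside the portal, the same platform metrics can be read through the Azure Monitor REST API. As a minimal sketch, the Python below only builds the URL of such a request for one of the AKS metrics listed above. The subscription ID, resource group, and cluster name are placeholders we made up, the API version is one we assume is available in your environment, and authentication (a bearer token in the request header) is deliberately left out.

```python
from urllib.parse import urlencode

ARM_ENDPOINT = "https://management.azure.com"


def build_metrics_url(resource_id: str, metric_names: list[str], timespan: str,
                      interval: str = "PT1M", aggregation: str = "Average",
                      api_version: str = "2018-01-01") -> str:
    """Return the URL of an Azure Monitor 'list metrics' call for one resource."""
    query = urlencode({
        "api-version": api_version,
        "metricnames": ",".join(metric_names),
        "timespan": timespan,        # ISO 8601 start/end pair
        "interval": interval,        # ISO 8601 duration, here one minute
        "aggregation": aggregation,
    })
    return f"{ARM_ENDPOINT}{resource_id}/providers/Microsoft.Insights/metrics?{query}"


# Hypothetical AKS cluster: every identifier below is a placeholder.
AKS_ID = ("/subscriptions/00000000-0000-0000-0000-000000000000"
          "/resourceGroups/rg-demo/providers/Microsoft.ContainerService"
          "/managedClusters/aks-demo")

url = build_metrics_url(
    AKS_ID,
    ["apiserver_current_inflight_requests"],
    timespan="2022-03-10T00:00:00Z/2022-03-10T01:00:00Z",
)
print(url)
```

Keeping the URL construction in one function makes it easy to loop over several resources or metric names once the exploration phase in the portal is over.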

It’s possible to create alerts based on predefined thresholds, or to build a dashboard that aggregates the metrics of a targeted subset of resources (more on this in the “Alerting” and “Dashboarding” sections).

Logs

Let’s now look at the logs.

We have three types of logs:

activity logs
resource logs
Azure Active Directory logs, which are not strictly an Azure resource log type because they belong to the tenant rather than to a subscription

To summarize, we can use this table from the Azure documentation:

| Log | Layer | Description |
| --- | --- | --- |
| Resource logs | Azure Resources | Provide insight into operations that were performed within an Azure resource (the data plane), for example, getting a secret from a Key Vault or making a request to a database. The content of resource logs varies by the Azure service and resource type. Resource logs were previously referred to as diagnostic logs. |
| Activity log | Azure subscription | Provides insight into the operations on each Azure resource in the subscription from the outside (the management plane) in addition to updates on Service Health events. Use the Activity Log to determine the what, who, and when for any write operations (PUT, POST, DELETE) taken on the resources in your subscription. There is a single activity log for each Azure subscription. |
| Azure Active Directory logs | Azure Tenant | Contains the history of sign-in activity and audit trail of changes made in the Azure Active Directory for a particular tenant. |

Description of Azure log types
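To make the “what, who, and when” of the activity log concrete, here is a small sketch that filters a list of activity log events down to the successful write-type operations. The events are hypothetical and heavily simplified. We assume the field names (`eventTimestamp`, `caller`, `operationName.value`, `status.value`, `resourceId`) match what your export or API call returns, so check them against your own data before relying on this.

```python
from dataclasses import dataclass

# Management-plane operations that change something end with one of these.
WRITE_SUFFIXES = ("/write", "/delete", "/action")


@dataclass(frozen=True)
class Change:
    when: str
    who: str
    what: str
    resource: str


def extract_changes(events: list[dict]) -> list[Change]:
    """Keep successful write-type operations and return them oldest first."""
    changes = []
    for event in events:
        operation = event["operationName"]["value"]
        if not operation.lower().endswith(WRITE_SUFFIXES):
            continue
        if event["status"]["value"] != "Succeeded":
            continue
        changes.append(Change(event["eventTimestamp"],
                              event.get("caller", "unknown"),
                              operation,
                              event["resourceId"]))
    # Timestamps share one ISO 8601 format, so string order is time order.
    return sorted(changes, key=lambda change: change.when)


# Hypothetical, simplified events: all identifiers are placeholders.
RG = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
EVENTS = [
    {"eventTimestamp": "2022-03-10T09:15:00Z", "caller": "alice@example.com",
     "operationName": {"value": "Microsoft.KeyVault/vaults/write"},
     "status": {"value": "Succeeded"},
     "resourceId": RG + "/providers/Microsoft.KeyVault/vaults/kv-demo"},
    {"eventTimestamp": "2022-03-10T08:00:00Z", "caller": "bob@example.com",
     "operationName": {"value": "Microsoft.Compute/virtualMachines/delete"},
     "status": {"value": "Succeeded"},
     "resourceId": RG + "/providers/Microsoft.Compute/virtualMachines/vm-demo"},
    {"eventTimestamp": "2022-03-10T09:14:00Z", "caller": "alice@example.com",
     "operationName": {"value": "Microsoft.KeyVault/vaults/write"},
     "status": {"value": "Started"},
     "resourceId": RG + "/providers/Microsoft.KeyVault/vaults/kv-demo"},
]

changes = extract_changes(EVENTS)
for change in changes:
    print(change.when, change.who, change.what)
```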

Digging deeper into the Azure resource logs, which are also collected in Azure Monitor Logs, we can distinguish:

native resource logs
logs and performance data collected by virtual machine agents
application logs collected with Application Insights

Details of resource logs

Except for activity logs, which remain available for three months (90 days) directly from the Azure portal, all other logs are available only if you enable them in the diagnostic settings.

We can specify the storage location when configuring this collection:

An Azure storage account
A Log Analytics workspace
An Event Hub
Third-party solutions available on the Azure marketplace

Configuring logs for an Azure resource

Type of log and possible destination in Azure

The storage account is typically used for long retentions and is relatively inexpensive.

The Log Analytics workspace lets you query the logs directly using Microsoft’s Kusto Query Language (KQL). However, the cost of ingesting logs makes it unsuitable for long-term retention.

The most common use of an Event Hub is to send logs to external systems.

More recently, partner solutions available on the marketplace can also be selected as a destination. This can be an appealing way to lower the cost of adopting a new tool, for example by letting a team keep a query tool it already knows, such as Kibana, instead of switching to KQL.

Partner solution compatible as Azure log target
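The choice of destination ends up as a handful of properties on a diagnostic setting. The sketch below builds the request body one would send to the diagnostic settings REST API (or mirror in an IaC template). The helper name `diagnostic_setting_body`, the workspace ID, and the choice of the Key Vault `AuditEvent` category are our own illustrative assumptions. The available log categories differ for each resource type.

```python
import json


def diagnostic_setting_body(log_categories, *, workspace_id=None,
                            storage_account_id=None,
                            event_hub_rule_id=None, event_hub_name=None,
                            send_metrics=True):
    """Build the body of a diagnostic setting for the chosen destinations."""
    properties = {
        "logs": [{"category": c, "enabled": True} for c in log_categories],
        "metrics": [{"category": "AllMetrics", "enabled": send_metrics}],
    }
    if workspace_id:            # query with KQL
        properties["workspaceId"] = workspace_id
    if storage_account_id:      # cheap long-term retention
        properties["storageAccountId"] = storage_account_id
    if event_hub_rule_id:       # stream to external systems
        properties["eventHubAuthorizationRuleId"] = event_hub_rule_id
        if event_hub_name:
            properties["eventHubName"] = event_hub_name
    destinations = {"workspaceId", "storageAccountId", "eventHubAuthorizationRuleId"}
    if not destinations & properties.keys():
        raise ValueError("at least one destination is required")
    return {"properties": properties}


# Hypothetical Log Analytics workspace: the ID is a placeholder.
WORKSPACE_ID = ("/subscriptions/00000000-0000-0000-0000-000000000000"
                "/resourceGroups/rg-monitoring/providers"
                "/Microsoft.OperationalInsights/workspaces/law-demo")

body = diagnostic_setting_body(["AuditEvent"], workspace_id=WORKSPACE_ID)
print(json.dumps(body, indent=2))
```

Several destinations can be combined in one setting, for instance a workspace for day-to-day queries and a storage account for long-term retention.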

Alerting in Azure Monitor

Azure Monitor can, of course, be used to set up alerts.

Configuring alerts in Azure Monitor

Sample Azure Alerts

Signals available for Azure alerts

These alerts are based on metrics and activity log conditions. However, custom alerts based on KQL queries can be configured for logs ingested in a Log Analytics workspace.
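To illustrate what a static-threshold metric condition expresses, here is a simplified model in Python: the samples of an evaluation window are aggregated, then compared with a threshold. This is our own toy model for reasoning about rules during design, not a description of how Azure Monitor evaluates alerts internally. The latency values are invented.

```python
import operator
from statistics import mean

AGGREGATIONS = {"Average": mean, "Maximum": max, "Minimum": min, "Total": sum}
OPERATORS = {
    "GreaterThan": operator.gt,
    "GreaterThanOrEqual": operator.ge,
    "LessThan": operator.lt,
    "LessThanOrEqual": operator.le,
}


def should_fire(samples, aggregation, comparison, threshold):
    """Aggregate one window of samples and compare the result with a threshold."""
    if not samples:
        return False  # no data in the window: this sketch chooses not to fire
    value = AGGREGATIONS[aggregation](samples)
    return OPERATORS[comparison](value, threshold)


# Invented latency samples (milliseconds) for a single evaluation window.
window = [120, 180, 450]

print(should_fire(window, "Average", "GreaterThan", 200))
print(should_fire(window, "Maximum", "LessThan", 400))
```

Writing the rule down in this form during design (signal, aggregation, operator, threshold, window) makes the later translation into an Azure Monitor alert rule largely mechanical.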

Sample rule based on a KQL query

These alerts must be accompanied by an Action Group that determines how the alert is notified. Notifications can be sent by email or through Logic Apps, Azure Functions, or webhooks.

Actions available in an Action Group
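When an Action Group calls a webhook or a function, the receiving code has to read the alert payload. The sketch below parses a payload that follows the common alert schema and produces a one-line summary. It assumes the common alert schema is enabled on the action. The sample payload is trimmed to a few fields and its values (rule name, target resource) are invented.

```python
import json


def summarize_alert(raw_body: str) -> str:
    """Turn a common-alert-schema payload into a one-line summary."""
    payload = json.loads(raw_body)
    if payload.get("schemaId") != "azureMonitorCommonAlertSchema":
        raise ValueError("payload does not use the common alert schema")
    essentials = payload["data"]["essentials"]
    targets = ", ".join(essentials.get("alertTargetIDs", []))
    return (f"[{essentials['severity']}] {essentials['alertRule']} "
            f"is {essentials['monitorCondition']} on {targets}")


# Trimmed, invented sample: a real payload carries many more fields.
SAMPLE = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "high-latency",
            "severity": "Sev2",
            "signalType": "Metric",
            "monitorCondition": "Fired",
            "alertTargetIDs": ["/subscriptions/0000/resourcegroups/rg-demo"
                               "/providers/microsoft.web/sites/web-demo"],
        },
        "alertContext": {},
    },
})

summary = summarize_alert(SAMPLE)
print(summary)
```

The same function can sit behind an Azure Function or any HTTP endpoint, which is where remediation logic beyond plain notification would go.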

Dashboarding in Azure Monitor

Apart from simple notification, visualizing a system’s status is also important for managing operations. You can, of course, connect Azure Monitor to third-party dashboarding solutions.

However, dashboards can also be created in Azure directly. These objects behave like Azure resources in that they can be assigned access via Role-Based Access Control (RBAC) roles.

To create one, simply select a metric and click “Pin to Dashboard.”

A KQL query can be written from a Log Analytics workspace to generate charts that can be used to enrich a dashboard.

Example of a dashboard built from metrics and KQL queries

It’s also worth noting the libraries of workbooks, either built-in or custom. Workbooks are built from templates of specific KQL queries and render the results graphically, either from a resource’s workbook menu or, via the “Pin to Dashboard” button, as a complement to a dashboard view.

Some workbook templates

AKS workbook

Lastly, dashboards can be exported as JSON, which makes it possible to automate their creation.
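One way to industrialize an exported dashboard is to treat its JSON as a template and re-point it at another environment. The sketch below walks any JSON-like structure and swaps a resource-ID prefix, so it does not depend on the exact layout of the export. The template shown is a deliberately simplified, hypothetical stand-in for a real export, and the subscription IDs and resource names are placeholders.

```python
def retarget(node, old_prefix: str, new_prefix: str):
    """Return a copy of a JSON-like structure with resource-ID prefixes swapped."""
    if isinstance(node, str):
        if node.startswith(old_prefix):
            return new_prefix + node[len(old_prefix):]
        return node
    if isinstance(node, list):
        return [retarget(item, old_prefix, new_prefix) for item in node]
    if isinstance(node, dict):
        return {key: retarget(value, old_prefix, new_prefix)
                for key, value in node.items()}
    return node  # numbers, booleans and None are returned unchanged


DEV = "/subscriptions/11111111-1111-1111-1111-111111111111/resourceGroups/rg-dev"
PROD = "/subscriptions/22222222-2222-2222-2222-222222222222/resourceGroups/rg-prod"

# Hypothetical, simplified template: a real export contains far more detail.
TEMPLATE = {
    "type": "Microsoft.Portal/dashboards",
    "location": "westeurope",
    "tags": {"hidden-title": "Run overview"},
    "properties": {"lenses": [{"parts": [{"metadata": {"inputs": [
        {"name": "resourceId",
         "value": DEV + "/providers/Microsoft.ContainerService/managedClusters/aks-demo"},
    ]}}]}]},
}

prod_dashboard = retarget(TEMPLATE, DEV, PROD)
print(prod_dashboard["properties"]["lenses"][0]["parts"][0]["metadata"]["inputs"][0]["value"])
```

The comparison here is case-sensitive, which is a simplification: resource IDs in real exports may differ in casing, so a production version would likely normalize them first.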

Protecting Resources in Azure

When considering operations, we also think about backup and restore. Like monitoring, most PaaS solutions provide data protection configuration options. For example, an Azure Database for MySQL server can be configured with automated backups and Point in Time Restore options.
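Point-in-time restore only works inside the retention window, which is worth checking before promising a recovery objective. As a minimal sketch, the function below tests whether a requested restore point is reachable given a retention period. The parameter names and the dates are our own assumptions, and a real service also imposes a small delay before the most recent point becomes restorable, which this model ignores.

```python
from datetime import datetime, timedelta, timezone


def can_restore_to(target: datetime, now: datetime, retention_days: int,
                   earliest_backup: datetime) -> bool:
    """Check that a restore point falls within the retained backup window."""
    oldest_restorable = max(earliest_backup, now - timedelta(days=retention_days))
    return oldest_restorable <= target <= now


# Invented dates for illustration.
NOW = datetime(2022, 3, 10, 12, 0, tzinfo=timezone.utc)
FIRST_BACKUP = datetime(2022, 1, 1, 0, 0, tzinfo=timezone.utc)

print(can_restore_to(datetime(2022, 3, 8, 9, 0, tzinfo=timezone.utc),
                     NOW, retention_days=7, earliest_backup=FIRST_BACKUP))
print(can_restore_to(datetime(2022, 3, 1, 9, 0, tzinfo=timezone.utc),
                     NOW, retention_days=7, earliest_backup=FIRST_BACKUP))
```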

In other cases, backup relies on native solutions such as the Recovery Services vault, which, under certain conditions, protects virtual machines (VMs) in Azure or elsewhere, as well as storage account data such as blobs or file shares and, more recently, Azure Database for PostgreSQL.

Recovery Services vault

Prepare Operations from the Design and Build Phases

Observability

The adoption of infrastructure as code is a natural consequence of cloud adoption.

Resource log settings and alerts are themselves cloud resources, so they can be configured via infrastructure as code (IaC) and included in the build.

The log destination will be the fundamental building block for configuring the resource logs. Remember that the possible destinations are:

storage accounts
Log Analytics workspaces
or Event Hubs

These are all Azure resources, so they too can be configured via infrastructure as code.

Whether the log destination is shared across an Azure subscription or dedicated to a subset of resources, it is critical to define it upstream, during the landing zone or Azure project phases.

Example of an Azure Monitor database metric configured in Terraform

To summarize:

In the design phase:
- Define the signals that indicate a cloud platform’s/application’s state
- Define metric alerts for the above indicators
- Define resource logs to supplement these indicators
- Define destinations for these logs
- Define and standardize relevant dashboards and workbooks
In the build phase:
- Create the Azure Monitor alerts defined in design
- Configure each resource’s log destination
- Create KQL-based alerts if needed
- Create dashboards from templates
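The hand-off from design to build summarized above can be sketched as a small generator that turns a design-phase indicator into a metric alert definition. The Python below produces a dictionary shaped like an ARM `Microsoft.Insights/metricAlerts` resource. The rule name, the scope, the action group ID, and the choice of the `cpu_percent` metric on a MySQL server are illustrative assumptions, and the API version is one we assume is available. The same fields would map onto a Terraform or Bicep resource.

```python
import json


def metric_alert_resource(name, scope, metric_namespace, metric_name,
                          comparison, threshold, aggregation, action_group_id,
                          severity=2, frequency="PT5M", window="PT15M"):
    """Build an ARM-style definition of a static-threshold metric alert."""
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": name,
        "location": "global",
        "properties": {
            "severity": severity,
            "enabled": True,
            "scopes": [scope],
            "evaluationFrequency": frequency,  # how often the rule runs
            "windowSize": window,              # how much data each run looks at
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                "allOf": [{
                    "name": "criterion1",
                    "criterionType": "StaticThresholdCriterion",
                    "metricNamespace": metric_namespace,
                    "metricName": metric_name,
                    "operator": comparison,
                    "threshold": threshold,
                    "timeAggregation": aggregation,
                }],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }


# Placeholders: none of these identifiers refer to real resources.
RG = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
alert = metric_alert_resource(
    name="mysql-cpu-high",
    scope=RG + "/providers/Microsoft.DBforMySQL/servers/mysql-demo",
    metric_namespace="Microsoft.DBforMySQL/servers",
    metric_name="cpu_percent",
    comparison="GreaterThan",
    threshold=80,
    aggregation="Average",
    action_group_id=RG + "/providers/Microsoft.Insights/actionGroups/ag-run-team",
)
print(json.dumps(alert, indent=2))
```

Generating the definitions from a single list of indicators keeps the design document and the deployed alerts from drifting apart.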

Backup

If a component of an Azure architecture has a native protection solution, the configuration of that protection can be added to the code, along with any additional Azure resources it needs, such as Recovery Services vaults and their associated backup policies. As with Azure Monitor, the choice between shared and per-subscription protection services must be made during the design phase.

Example of a Terraform configuration with resource protection

Run in Azure: The Essentials

A word from Squadra, an Azure Run expert:

Run is the process of keeping the services implemented by the Build in operational condition. So, for the Run to be effective from the start of the project, it must be considered during the Build phase. Applying this to the automotive industry, it would be like entering the 24 Hours of Le Mans and hoping to win without any technical support, logistics, parts, or a driver.

This means workshops must be held alongside the Build to determine KPIs, key infrastructure metrics, alerts, and relevant countermeasures. For example:

- Maintenance windows that could affect the user service
- Adding a Software Development Kit (SDK) to the Azure web service to obtain metrics

The Run teams should conduct these workshops with all participants, especially the application managers. These application managers can determine their own critical KPIs (excluding infrastructure).

A well-defined Run allows you to be proactive and make more targeted improvements (e.g., capacity planning). Teams of “runners” must constantly monitor the components and adapt them to the customer’s context as the software vendor’s product or the application evolves.

All projects must use DevOps to maintain efficiency between the Build and the Run. With these practices, we can adapt to changes as they occur, thereby ensuring maximum availability.

We have seen in this post that Run in Azure can rely heavily on platform native solutions.

We also stressed the importance of defining indicators and tools for protecting Cloud assets as early as possible to prepare for cloud operations in IaC mode.

However, we have only scratched the surface of Action Groups, which can go beyond simple notification and trigger remediation or automated responses to events.
