David Frappart
10 March 2022

Azure’s Native Run Tools

Post co-written by David Frappart (Cellenza) & Florent Hilger and Sébastien Leroy (Squadra)

 

As cloud platforms have matured, production environments have inevitably moved onto them.

The first step was to digest the cloud model embedded in these platforms and adapt “legacy,” or non-cloud, architectures to these new platforms.

Now it’s time to move on to the next step: mastering cloud operations.

We’ll try to provide some clarifications and pointers in this article to help you achieve this goal, specifically on the Azure platform.

We won’t claim to be able to cover everything in one post, so we’ll focus on the following topics:

  • First, on the premise that the Run starts with knowing which operations must be performed, we’ll cover observability.
  • Next, we’ll look at application protection concepts, specifically backup and recovery.
  • Finally, we’ll describe our current approach based on our wealth of experience.

 

 

Observability Tools in Azure

 

Because the cloud has changed the way architectures are designed, it’s only natural that observability has also been affected.

Fortunately, the Azure platform natively has a rich and extensible ecosystem, allowing you to connect third-party observability systems.

Speaking of observability, we’ll need to define metrics and collect logs. Logs allow us to track system errors and define error rates, among other things. Metrics provide valuable information like website latency or microservice saturation.

We’ll also need alerts or dashboards to keep track of these logs and metrics.

 

Metrics

Azure Monitor Metrics provides access to metrics in Azure. According to the Microsoft documentation, they can support “near real-time scenarios” due to their lightweight nature.

Metrics can be explored on a live resource by navigating to the “Metrics” menu and adding the desired metric.

 

Metrics in Azure Monitor


 

This is a simple way to get started with resource observability.

Once out of exploration mode, you can use the Azure documentation to get all the metrics available for a resource type.

 

Microsoft.ContainerService/managedClusters

 

Metric | Exportable via Diagnostic Settings? | Metric Display Name | Unit | Aggregation Type | Description | Dimensions
apiserver_current_inflight_requests | No | Inflight Requests | Count | Average | Maximum number of currently used inflight requests on the API server per request kind in the last second | requestKind
cluster_autoscaler_cluster_safe_to_autoscale | No | Cluster Health | Count | Average | Determines whether or not cluster autoscaler will take action on the cluster | No dimension
cluster_autoscaler_scale_down_in_cooldown | No | Scale Down Cooldown | Count | Average | Determines if the scale down is in cooldown. No nodes will be removed during this time frame | No dimension

Sample of Azure platform metrics

 

It’s possible to create alerts or generate a dashboard to aggregate the metrics of a subset of targeted resources based on certain predefined thresholds (more on this in the “Alerting” and “Dashboarding” sections).
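To make the idea of a threshold on an aggregated metric concrete, here is a minimal Python sketch of the kind of check a metric alert performs: average the most recent samples and compare the result with a threshold. Azure Monitor performs this evaluation itself; the function name, the sample values, and the threshold below are assumptions chosen purely for illustration.

```python
from statistics import mean


def average_breaches(samples, threshold, window):
    """Return True if the average of the last `window` samples exceeds `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = samples[-window:]
    if not recent:
        # No data in the window: nothing to evaluate, so no breach.
        return False
    return mean(recent) > threshold


# Hypothetical per-minute values of apiserver_current_inflight_requests.
inflight = [12, 15, 14, 40, 55, 60]

# Average of the last three samples (40, 55, 60) compared with a threshold of 50.
print(average_breaches(inflight, threshold=50, window=3))
```

The aggregation used here is the average because that is the aggregation type listed for the sample metrics above; a real alert rule would also define an evaluation frequency and a severity.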

 

Logs

 

Let’s now look at the logs.

We have three types of logs:

  • activity logs
  • resource logs
  • Azure Active Directory logs, which are not really an Azure platform log type since they come from the tenant’s directory rather than from Azure resources

To summarize, we can use this table from the Azure documentation:

 

Log | Layer | Description
Resource logs | Azure Resources | Provide insight into operations that were performed within an Azure resource (the data plane), for example, getting a secret from a Key Vault or making a request to a database. The content of resource logs varies by the Azure service and resource type. Resource logs were previously referred to as diagnostic logs.
Activity log | Azure subscription | Provides insight into the operations on each Azure resource in the subscription from the outside (the management plane) in addition to updates on Service Health events. Use the Activity Log to determine the what, who, and when for any write operations (PUT, POST, DELETE) taken on the resources in your subscription. There is a single activity log for each Azure subscription.
Azure Active Directory logs | Azure Tenant | Contains the history of sign-in activity and audit trail of changes made in the Azure Active Directory for a particular tenant.

Description of Azure log types

 

Looking more closely at the Azure resource logs, which are also collected in Azure Monitor Logs, we find:

  • native resource logs
  • logs and performance data collected by virtual machine agents
  • application logs collected with Application Insights

Details of resource logs


 

Activity logs are available for three months directly from the Azure portal. All other logs are available only if you enable their collection in the Diagnostic settings.

We can specify the storage location when configuring this collection:

  • An Azure storage account
  • A Log Analytics workspace
  • An Event Hub
  • Third-party solutions available on the Azure marketplace

 


Azure resource log configuration

 


Log type and possible destination in Azure

 

The storage account is typically used for long retentions and is relatively inexpensive.

The Log Analytics workspace lets you query the logs directly using Microsoft’s Kusto Query Language (KQL). However, the cost of ingesting logs makes it unsuitable for long-term retention.

The most common use of an Event Hub is to send logs to external systems.

More recently, partner solutions available on the marketplace can also be selected as a destination. This can be an appealing way to avoid the cost of adopting a new tool (for example, a team can keep a query tool it already knows, such as Kibana, rather than switching to KQL).
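As an illustration of how these destinations fit together, the following Python sketch builds the JSON body of a diagnostic setting that sends selected log categories to one or more destinations. It only assembles the document; it does not call Azure. The property names follow our reading of the Diagnostic Settings REST API and should be checked against the current reference. The function name and the resource identifiers are hypothetical, and "kube-apiserver" is used as an example AKS log category.

```python
import json


def diagnostic_settings_body(log_categories, workspace_id=None,
                             storage_account_id=None,
                             event_hub_rule_id=None, event_hub_name=None):
    """Build the JSON body of a diagnostic setting for one resource."""
    if not (workspace_id or storage_account_id or event_hub_rule_id):
        raise ValueError("at least one log destination is required")

    properties = {
        "logs": [{"category": name, "enabled": True} for name in log_categories],
    }
    # Each destination is optional; several can be combined in one setting.
    if workspace_id:
        properties["workspaceId"] = workspace_id
    if storage_account_id:
        properties["storageAccountId"] = storage_account_id
    if event_hub_rule_id:
        properties["eventHubAuthorizationRuleId"] = event_hub_rule_id
        if event_hub_name:
            properties["eventHubName"] = event_hub_name
    return {"properties": properties}


# Hypothetical identifiers, for illustration only.
body = diagnostic_settings_body(
    ["kube-apiserver"],
    workspace_id="/subscriptions/<sub-id>/resourceGroups/rg-monitoring"
                 "/providers/Microsoft.OperationalInsights/workspaces/law-shared",
    storage_account_id="/subscriptions/<sub-id>/resourceGroups/rg-monitoring"
                       "/providers/Microsoft.Storage/storageAccounts/stlogsarchive",
)
print(json.dumps(body, indent=2))
```

Combining a Log Analytics workspace with a storage account, as in this example, reflects the split described above: the workspace for querying, the storage account for inexpensive long retention.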

 


Partner solution compatible as Azure log destination target

 

 

Alerting in Azure Monitor

 

Azure Monitor also lets you define alerts.

Configuring alerts in Azure Monitor


 


Sample Azure alerts

 


Available signals for Azure alerts

 

These alerts are based on metrics and activity log conditions. In addition, custom alerts based on KQL queries can be configured for logs ingested in a Log Analytics workspace.

 

Sample rule based on a KQL query


 

These alerts must be accompanied by an Action Group that determines the alert notification method. These notifications can be sent via email or through Logic Apps, Azure Functions, or webhooks.
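On the receiving side of a webhook, the notification arrives as a JSON document. The Python sketch below extracts a one-line summary from such a payload. It assumes the common alert schema is enabled on the action and that the fields used here ("data", "essentials", "alertRule", "severity", "monitorCondition") are named as in that schema; the function name and the sample payload are hypothetical.

```python
import json


def summarize_alert(payload: str) -> str:
    """Return a one-line summary of an alert notification payload."""
    essentials = json.loads(payload)["data"]["essentials"]
    return "[{severity}] {rule}: {condition}".format(
        severity=essentials.get("severity", "unknown"),
        rule=essentials.get("alertRule", "unknown"),
        condition=essentials.get("monitorCondition", "unknown"),
    )


# Hypothetical payload, reduced to the fields used above.
sample = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "aks-inflight-requests",
            "severity": "Sev2",
            "monitorCondition": "Fired",
        }
    },
})

print(summarize_alert(sample))  # prints "[Sev2] aks-inflight-requests: Fired"
```

A function like this could sit behind a webhook or in an Azure Function to forward the summary to a chat channel or a ticketing tool.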

 

Actions available in an Action Group


 

Dashboarding in Azure Monitor

 

Beyond simple notification, visualizing a system’s status is also important for managing operations. You can, of course, connect Azure Monitor to third-party dashboarding solutions.

However, dashboards can also be created in Azure directly. These objects behave like Azure resources in that access to them can be granted through Role-Based Access Control (RBAC) roles.

To create one, simply select a metric and click “Pin to Dashboard.”

A KQL query can be written from a Log Analytics workspace to generate charts that can be used to enrich a dashboard.

 


An example of a dashboard created using metrics and KQL queries

 

It’s also worth noting the workbook libraries, either built-in or customized. Workbooks are built from templates of specific KQL queries and render the results graphically, either from the workbook menu of a resource or, via the “Pin to Dashboard” button, as a supplement to a dashboard view.

Some workbook templates


 


AKS workbook

 

Lastly, dashboards can be exported, which makes it possible to industrialize their creation.

 

Protecting Resources in Azure

 

When considering operations, we also think about backup and restore. Like monitoring, most PaaS solutions provide data protection configuration options. For example, an Azure Database for MySQL server can be configured with automated backups and Point in Time Restore options.

In some cases, the backup will rely on native solutions such as a Recovery Services vault, which under certain conditions supports the protection of virtual machines (VMs) in Azure or elsewhere, as well as storage services such as Blob or Files and, more recently, Azure Database for PostgreSQL.

 

Recovery Services vault

 

Prepare Operations from the Design and Build Phases

 

Observability

 

The adoption of infrastructure as code is a natural consequence of cloud adoption.

Resource log settings and alerts are themselves cloud resources, so they can be configured via infrastructure as code (IaC) and included in the build.

The log destination will be the fundamental building block for configuring the resource logs. Remember that the possible destinations are:

  • storage accounts
  • Log Analytics workspaces
  • or Event Hubs

These are themselves Azure resources, so they too can be configured via infrastructure as code.

Whether the log destination is shared across an Azure subscription or dedicated to a subset of resources, it is critical to define it upstream, during the landing zone or Azure project phase.

 

Example of an Azure Monitor database metric configured in Terraform

 

To summarize:

  • In the design phase:
    • Define the signals that indicate a cloud platform’s/application’s state
    • Define metric alerts for the above indicators
    • Define resource logs to supplement these indicators
    • Define destinations for these logs
    • Define and standardize relevant dashboards and workbooks
  • In the build phase:
    • Create the Azure Monitor alerts defined in design
    • Configure each resource’s log destination
    • Create KQL-based alerts if needed
    • Create dashboards from templates
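The link between the two phases can be sketched in Python: the design decisions are captured as data, and the build derives its list of tasks from them. This is only an illustration of the principle, not a replacement for an infrastructure as code tool; all class names, field names, and values below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricAlert:
    metric: str
    threshold: float


@dataclass
class ObservabilityDesign:
    """Decisions taken in the design phase."""
    metric_alerts: List[MetricAlert] = field(default_factory=list)
    log_categories: List[str] = field(default_factory=list)
    log_destination: str = ""
    dashboards: List[str] = field(default_factory=list)


def build_tasks(design: ObservabilityDesign) -> List[str]:
    """Derive the build-phase tasks from the design."""
    tasks = [
        f"create metric alert on {alert.metric} (threshold {alert.threshold})"
        for alert in design.metric_alerts
    ]
    if design.log_categories:
        # Resource logs are useless without a destination chosen in design.
        if not design.log_destination:
            raise ValueError("resource logs need a destination defined in design")
        categories = ", ".join(design.log_categories)
        tasks.append(f"send logs {categories} to {design.log_destination}")
    tasks += [f"create dashboard from template {name}" for name in design.dashboards]
    return tasks


design = ObservabilityDesign(
    metric_alerts=[MetricAlert("apiserver_current_inflight_requests", 50)],
    log_categories=["kube-apiserver"],
    log_destination="law-shared",
    dashboards=["aks-overview"],
)
for task in build_tasks(design):
    print(task)
```

The check on the log destination mirrors the point made above: a destination that was not decided during design blocks the build rather than being improvised later.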

 

 

Backup

 

If a building block of an Azure architecture has a native protection solution, its protection settings can be added to the code, along with additional Azure resources such as Recovery Services vaults and their associated backup policies where needed. As with Azure Monitor, the choice between shared and per-subscription protection services must be made during the design phase.

 

Example of a Terraform configuration with resource protection

 

Example Terraform backup policy configuration

 

Run in Azure: The Essentials

 

A word from Squadra, an Azure Run expert:

 

Run is the process of keeping the services implemented by the Build in operational condition. So, for the Run to be effective from the start of the project, it must be considered during the Build phase. To borrow an analogy from motor racing, doing otherwise would be like entering the 24 Hours of Le Mans and hoping to win without any technical support, logistics, parts, or a driver.

 

This means workshops must be held alongside the Build to determine KPIs, key infrastructure metrics, alerts, and relevant countermeasures. For example:

  • Maintenance windows that could affect the user service
  • Adding a Software Development Kit (SDK) to the Azure web service to obtain metrics

 

The Run teams should conduct these workshops with all participants, especially the application managers. These application managers can determine their own critical KPIs (excluding infrastructure).

 

A well-defined Run allows you to be proactive and make more targeted improvements (e.g., capacity planning). Teams of “runners” must constantly monitor the components and adapt them to the customer’s context as the vendor’s software or the application evolves.

 

All projects must use DevOps to maintain efficiency between the Build and the Run. With these mechanics, we can adapt to changes as they occur, thereby ensuring maximum availability.

 

We have seen in this post that Run in Azure can rely heavily on platform native solutions.

We also stressed the importance of defining indicators and tools for protecting Cloud assets as early as possible to prepare for cloud operations in IaC mode.

However, we have only scratched the surface of the Action Group topic, which can be used to integrate remediation or responses to automated events beyond simple notification.

Would you like to learn more about how the Build affects the Run? See our new series of posts on the subject:
