Azure Well-Architected Framework

Microsoft's opinionated guide to building production-grade workloads on Azure. Five pillars, a self-assessment tool, and landing zone patterns that enforce the rules automatically.

What the framework is and why Microsoft built it

The Azure Well-Architected Framework is a set of architecture principles, design reviews, and assessment tools that Microsoft published to help teams build workloads that don't fall apart under production pressure. Microsoft released the first version in 2020, modeled loosely on AWS's Well-Architected Framework but restructured around Azure-native services and Microsoft's own operational experience running Office 365, Xbox Live, and Azure itself. The framework isn't a checklist you run once before go-live. It's a continuous evaluation model.

Microsoft designed it because they kept seeing the same patterns of failure across enterprise Azure deployments: teams provisioning single-instance VMs with no availability sets, exposing storage accounts to the public internet, ignoring cost anomalies until the monthly bill arrived, and deploying without any monitoring beyond default Azure Monitor metrics. Each of these failures maps to one of the framework's five pillars.

The framework sits at the intersection of architecture guidance and tooling. It's not just documentation. Microsoft built the Azure Well-Architected Review tool, integrated Advisor recommendations into the portal, and created Azure Policy definitions that enforce framework principles at the resource level.

The Azure Architecture Center hosts reference architectures for over 40 workload types, each annotated with pillar-specific guidance. For example, the reference architecture for an AKS baseline cluster includes specific recommendations for pod disruption budgets under Reliability, network policies under Security, reserved instances under Cost Optimization, Flux GitOps under Operational Excellence, and horizontal pod autoscaling under Performance Efficiency. The framework also introduced the concept of workload personas, recognizing that a machine learning training pipeline has different architectural requirements than a customer-facing web app. Each persona gets tailored pillar guidance rather than generic advice.

The five pillars and what each actually requires

Reliability is about keeping your workload running when components fail. On Azure, this means deploying across Availability Zones using zone-redundant services like ZRS Storage Accounts, Azure SQL with zone-redundant HA, and AKS clusters with node pools spread across three zones. It means setting up Azure Front Door or Traffic Manager for multi-region failover with health probes that check application logic, not just TCP connectivity. The framework calls for defining a recovery time objective and recovery point objective for every tier of your application, then validating those targets with chaos engineering using Azure Chaos Studio.

Security requires defense in depth starting from identity. The framework insists on Microsoft Entra ID (formerly Azure AD) with conditional access policies, Privileged Identity Management for just-in-time admin access, and managed identities for service-to-service authentication so no credentials exist in configuration files. Network security means private endpoints for PaaS services, Azure Firewall or third-party NVAs for east-west traffic inspection, and Azure DDoS Network Protection (formerly DDoS Protection Standard) for internet-facing workloads. Microsoft Defender for Cloud provides the continuous assessment.

Cost Optimization starts with right-sizing and commitment discounts. The framework recommends Azure Advisor cost recommendations, Azure Reservations for predictable workloads (one-year or three-year terms for VMs, SQL, Cosmos DB), and Azure Spot VMs for fault-tolerant batch processing. It also calls for tagging every resource with a cost center and environment tag, then using Azure Cost Management budgets with action groups that alert or trigger an automated shutdown when spending exceeds thresholds.

Operational Excellence covers deployment practices, monitoring, and incident response. The framework prescribes Infrastructure as Code via Bicep or Terraform, CI/CD through Azure DevOps or GitHub Actions, and observability through Azure Monitor, Log Analytics workspaces, and Application Insights with distributed tracing enabled.

Performance Efficiency addresses scaling and latency. This pillar recommends Azure Autoscale for compute, Azure CDN or Front Door for static content, Azure Cache for Redis to offload database reads, and Azure Load Testing to establish performance baselines before production launches.
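The tagging and budget rules under Cost Optimization can be expressed as two small checks. The following is a minimal Python sketch, not an Azure SDK call: the tag keys `cost-center` and `environment`, the `Resource` class, and the 80% and 100% thresholds are all assumptions chosen for illustration. In a real environment the same logic would normally live in Azure Policy and Azure Cost Management budgets rather than in application code.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical tag keys; substitute whatever your organization standardizes on.
REQUIRED_TAGS = ("cost-center", "environment")


@dataclass
class Resource:
    """Simplified stand-in for an Azure resource and its tags."""

    name: str
    tags: dict[str, str] = field(default_factory=dict)


def missing_tags(resource: Resource, required: tuple[str, ...] = REQUIRED_TAGS) -> list[str]:
    """Return the required tag keys that are absent or blank on the resource."""
    return [key for key in required if not resource.tags.get(key, "").strip()]


def budget_action(spend: float, budget: float,
                  alert_at: float = 0.8, shutdown_at: float = 1.0) -> str:
    """Map current spend against a budget to 'ok', 'alert', or 'shutdown'.

    The threshold ratios are illustrative assumptions, not Azure defaults.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    if ratio >= shutdown_at:
        return "shutdown"
    if ratio >= alert_at:
        return "alert"
    return "ok"


if __name__ == "__main__":
    vm = Resource("vm-batch-01", {"cost-center": "cc-42"})
    print(missing_tags(vm))          # tag keys still to be added
    print(budget_action(850, 1000))  # spend at 85% of budget
```

Keeping the thresholds as parameters lets a development subscription alert early while a production subscription alerts only, without ever shutting anything down.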

The Azure Well-Architected Review tool and how to use it

The Azure Well-Architected Review is a web-based self-assessment that asks you 50 to 80 questions across the five pillars, tailored to the workload type you select. You access it at aka.ms/well-architected/review. The tool isn't a generic questionnaire. When you select a workload type like 'Web application on App Service,' the questions focus on App Service-specific configurations: whether you use deployment slots for zero-downtime releases, whether you have configured health check probes at the App Service level, and whether your App Service Plan runs on a Premium v3 SKU for zone redundancy support. Each question offers multiple-choice answers and links to the relevant Azure documentation.

After completing the assessment, the tool generates a prioritized list of recommendations organized by pillar and severity. It scores each pillar from 0 to 100 and highlights the gaps that carry the most risk. The recommendations aren't vague. They point to specific Azure services, configurations, and sometimes even Azure CLI commands or Bicep templates you can apply directly.

Teams at organizations like Maersk and Volkswagen have reported using the review tool as part of their architecture governance process, running assessments quarterly and tracking pillar scores over time. Microsoft also built Azure Advisor integration that maps individual Advisor recommendations to Well-Architected pillars, so you get continuous automated assessment alongside the periodic manual review.

The review tool generates a shareable report in PDF format that architects present to leadership as evidence of due diligence. It's particularly useful during compliance audits where regulators want documented proof that you've evaluated your cloud architecture against a recognized framework. The tool also supports custom lenses, so organizations can add their own questions and best practices on top of Microsoft's defaults. This matters for enterprises with specific regulatory requirements like HIPAA or PCI DSS that need additional architecture constraints beyond what the standard framework covers.
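Tracking pillar scores from one quarterly assessment to the next comes down to comparing two sets of 0 to 100 scores. The sketch below is a hedged illustration in Python: it does not read any export from the review tool, and the score values, the function names, and the idea of entering scores by hand into a dictionary are all assumptions.

```python
from __future__ import annotations


def weakest_pillar(scores: dict[str, int]) -> str:
    """Return the pillar with the lowest score (first one wins on a tie)."""
    return min(scores, key=scores.get)


def regressions(previous: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return the score change for every pillar that dropped since the last review."""
    return {
        pillar: current[pillar] - previous[pillar]
        for pillar in current
        if pillar in previous and current[pillar] < previous[pillar]
    }


# Made-up scores for two consecutive quarterly reviews, for illustration only.
q1 = {
    "Reliability": 62,
    "Security": 71,
    "Cost Optimization": 55,
    "Operational Excellence": 68,
    "Performance Efficiency": 74,
}
q2 = {
    "Reliability": 70,
    "Security": 66,
    "Cost Optimization": 58,
    "Operational Excellence": 68,
    "Performance Efficiency": 74,
}

if __name__ == "__main__":
    print("Weakest pillar this quarter:", weakest_pillar(q2))
    print("Pillars that regressed:", regressions(q1, q2))
```

With these sample numbers, Cost Optimization remains the weakest pillar and Security is the only one that dropped, which is the kind of signal a governance review would follow up on.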

Landing zones and how they enforce the framework automatically

Azure Landing Zones are pre-configured Azure environments that encode Well-Architected principles into the infrastructure itself. Instead of hoping that every team reads the framework documentation and follows it voluntarily, landing zones enforce it through Azure Policy, management group hierarchy, and RBAC assignments. Microsoft's Azure Landing Zone Accelerator deploys a hub-spoke topology with Azure Firewall in the connectivity subscription, a shared services subscription for Log Analytics and Microsoft Defender for Cloud, and separate workload subscriptions that inherit policies from parent management groups.

The management group structure typically looks like this: a root tenant group, then platform management groups for Identity, Management, and Connectivity, then landing zone management groups for Corp (internal workloads connected to the hub) and Online (internet-facing workloads). Each management group has Azure Policy assignments that enforce specific Well-Architected principles. The Corp management group might enforce that all storage accounts deny public blob access. The Connectivity management group might enforce that all virtual networks peer to the hub. The Online management group might require Azure DDoS Protection on every public IP.

Landing zones solve the governance problem that the framework alone can't. You can document best practices all day, but without enforcement, teams under deadline pressure will skip them. Policies assigned with the Deny effect are evaluated at deployment time and block non-compliant resources from being created. A developer trying to create a storage account with public access enabled in a subscription governed by the Corp landing zone will get a deployment error, not a warning email three weeks later.

The Bicep and Terraform modules for Azure Landing Zones are open-source on GitHub under the Azure organization. The Terraform module (caf-enterprise-scale) has over 1,500 stars and is maintained by Microsoft's Cloud Adoption Framework team. These modules parameterize the landing zone configuration, so you can customize which policies to enforce, which regions to allow, and how many workload subscriptions to create.
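Policy inheritance through the management group hierarchy is easier to reason about with a small model. The Python sketch below is illustrative only: it does not call Azure, the intermediate "Platform" and "Landing Zones" groups are an assumed way to arrange the hierarchy described above, and policy names such as `deny-public-blob-access` and `allowed-regions` are hypothetical labels rather than real built-in policy definition names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ManagementGroup:
    """A node in a management group hierarchy (illustrative model only)."""

    name: str
    parent: ManagementGroup | None = None
    policies: list[str] = field(default_factory=list)

    def effective_policies(self) -> list[str]:
        """Policies assigned here plus those inherited from every ancestor, root first."""
        inherited = self.parent.effective_policies() if self.parent else []
        return inherited + [p for p in self.policies if p not in inherited]


# Assumed hierarchy; policy names are hypothetical labels.
root = ManagementGroup("Tenant Root", policies=["allowed-regions"])
platform = ManagementGroup("Platform", parent=root)
connectivity = ManagementGroup("Connectivity", parent=platform,
                               policies=["require-hub-peering"])
landing_zones = ManagementGroup("Landing Zones", parent=root)
corp = ManagementGroup("Corp", parent=landing_zones,
                       policies=["deny-public-blob-access"])
online = ManagementGroup("Online", parent=landing_zones,
                         policies=["require-ddos-protection"])


def evaluate_storage_account(group: ManagementGroup, public_blob_access: bool) -> str:
    """Mimic a Deny-effect policy: reject the request up front instead of warning later."""
    if public_blob_access and "deny-public-blob-access" in group.effective_policies():
        return "denied"
    return "allowed"


if __name__ == "__main__":
    print(corp.effective_policies())
    print(evaluate_storage_account(corp, public_blob_access=True))
```

A subscription placed under Corp picks up both the root-level and the Corp-level assignments without any per-team configuration, which is why moving a subscription between management groups changes what it is allowed to deploy.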

Common Azure anti-patterns the framework catches

The framework identifies anti-patterns that appear in almost every Azure environment that hasn't been through a Well-Architected Review.

The single-region deployment is the most dangerous. Teams deploy everything to West Europe or East US and assume Azure's SLA covers them. It doesn't. Azure's compute SLA for a single VM is 99.9% (with Premium SSD or Ultra Disk on all disks), which allows about 8.76 hours of downtime per year (0.1% of 8,760 hours). Spreading instances across Availability Zones with a zone-redundant load balancer brings this to 99.99%, or roughly 53 minutes per year. Adding a second region with active-passive failover can approach five-nines territory, provided the failover path itself is tested.

The overprovisioned VM anti-pattern appears when teams select D-series VMs with 32 vCPUs and 128 GB RAM for workloads that average 12% CPU utilization. The framework recommends reviewing Azure Advisor right-sizing recommendations weekly and considering B-series burstable VMs for workloads with spiky CPU profiles.

The shared-database anti-pattern puts multiple microservices on a single Azure SQL database. This creates coupling between services, makes independent scaling impossible, and turns schema migrations into coordinated deployments across teams. The framework recommends database-per-service with Azure SQL Elastic Pools or Cosmos DB for services that need independent data stores.

The unmonitored dependency anti-pattern occurs when teams instrument their own services with Application Insights but don't monitor third-party dependencies. If your application calls a payment provider's API and that API starts responding in 3 seconds instead of 300 milliseconds, you need alerting on that dependency latency. Application Insights dependency tracking captures this automatically, but teams still need to set alert rules.

The manual deployment anti-pattern still persists in enterprises. Teams deploy Azure resources through the portal by clicking through wizards instead of using Bicep, Terraform, or ARM templates. Portal deployments are unreproducible, unauditable, and impossible to roll back cleanly. The framework requires Infrastructure as Code for every environment.

The unrestricted network access anti-pattern leaves Azure SQL, Storage Accounts, and Key Vaults accessible from the public internet. The framework mandates private endpoints for all PaaS services, and service endpoints or Private Link for service-to-service communication within the virtual network.
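The downtime figures behind the single-region anti-pattern follow directly from the SLA percentage. The Python sketch below shows the arithmetic; the function name is hypothetical, and the calculation assumes a 365-day year and treats the SLA as a strict availability bound, which is a simplification of how SLAs are actually measured and credited.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(sla_percent: float) -> float:
    """Convert an availability percentage into permitted downtime hours per year."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for sla in (99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(sla)
        print(f"{sla}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional nine cuts the permitted downtime by a factor of ten: roughly 8.76 hours at 99.9%, about 53 minutes at 99.99%, and about 5 minutes at 99.999%.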

Diagramming Azure architectures that follow the framework

Architecture diagrams are the primary artifact for communicating how a workload aligns with the Well-Architected Framework. A well-drawn Azure diagram shows not just the services in use, but the patterns that enforce each pillar. For Reliability, the diagram should show Availability Zone placement for every stateful component, failover arrows between primary and secondary regions, and health probe endpoints on load balancers. For Security, it should show private endpoint connections as distinct from public traffic paths, network security group boundaries around subnets, and the Azure Firewall in the hub network inspecting east-west traffic. For Cost Optimization, annotate reserved instance commitments and auto-scale ranges directly on compute resources. For Operational Excellence, include the CI/CD pipeline flow from GitHub to Azure Container Registry to AKS, and show the Log Analytics workspace collecting diagnostics from every resource. For Performance Efficiency, mark CDN edge locations and cache layers in the data path.

Drawing these diagrams manually in Draw.io means placing Azure icons from Microsoft's official icon set, routing connections through the correct network topology, and aligning everything to show the logical groupings like subscriptions, resource groups, and virtual networks.

Diagrams.so generates Azure architecture diagrams from text descriptions using Microsoft's official icon library. Describe your workload, select Azure as the cloud provider, and the AI places each service in the correct network topology with subscription and resource group boundaries. The output is native .drawio XML that opens in Draw.io, VS Code, or Confluence. Architecture warnings flag common framework violations like single-AZ deployments or public-facing databases without network restrictions. The diagram becomes a living reference that teams update as the workload evolves, rather than a static slide that drifts from reality within weeks.