20 April 2021

Azure Well-Architected Framework Best Practices




Microsoft Azure Well-Architected Framework

The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a workload. The framework consists of five pillars of architecture excellence: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security.

To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the Microsoft Azure Well-Architected Review.



Cost Optimization

When you are designing a cloud solution, focus on generating incremental value early. Apply the principles of Build-Measure-Learn to accelerate your time to market while avoiding capital-intensive solutions. Use the pay-as-you-go strategy for your architecture, and invest in scaling out rather than delivering a large-investment first version. Consider opportunity costs in your architecture, and the balance between first-mover advantage versus "fast follow".

Cost Guidance

  • Review cost principles
  • Develop a cost model
  • Create budgets and alerts
  • Review the cost optimization checklist

A cost-effective workload is driven by business goals and the return on investment (ROI) while staying within a given budget. The principles of cost optimization are a series of important considerations that can help achieve both business objectives and cost justification.

Use the Azure Well-Architected Framework to optimize your workload

To assess your workload using the tenets found in the Azure Well-Architected Framework, see the Microsoft Azure Well-Architected Review.

Keep within the cost constraints

Every design choice has cost implications. Before choosing an architectural pattern, Azure service, or a price model for the service, consider the budget constraints set by the company. As part of design, identify acceptable boundaries on scale, redundancy, and performance against cost. After estimating the initial cost, set budgets and alerts at different scopes to measure the cost. One common cost driver is unrestricted resources, which typically need to scale and consume more cost to meet demand.
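The budget-and-alert step above can be sketched as follows. This is a toy model, not the Azure Cost Management API: the scope names, amounts, and thresholds are hypothetical, and in practice you would configure budgets and alerts in Azure Cost Management rather than in code like this.

```python
def check_budgets(spend_by_scope, budgets, alert_thresholds=(1.0, 0.8)):
    """Return (scope, threshold) pairs for scopes whose spend has crossed
    an alert threshold. Only the highest crossed threshold is reported."""
    alerts = []
    for scope, budget in budgets.items():
        spend = spend_by_scope.get(scope, 0.0)
        for threshold in sorted(alert_thresholds, reverse=True):
            if spend >= budget * threshold:
                alerts.append((scope, threshold))
                break  # highest threshold crossed; skip the lower ones
    return alerts

# Hypothetical scopes and monthly budgets:
alerts = check_budgets(
    spend_by_scope={"rg-web": 850.0, "rg-db": 400.0},
    budgets={"rg-web": 1000.0, "rg-db": 1000.0},
)
# rg-web has crossed 80% of its budget; rg-db has not.
```

Setting alerts below 100% (here, at 80%) gives the team time to react before the budget is actually exhausted.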

Aim for scalable costs

A key benefit of the cloud is the ability to scale dynamically. The workload cost should scale linearly with demand. You can save cost through automatic scaling. Consider the usage metrics and performance to determine the number of instances. Choose smaller instances for a highly variable workload, and scale out rather than up to get the required level of performance. This choice makes your cost calculations and estimates granular.
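The granularity argument can be made concrete with a quick calculation. The instance capacities and prices below are made-up numbers for illustration, not Azure pricing:

```python
import math

def instances_needed(requests_per_sec, capacity_per_instance):
    """Number of identical instances needed to serve the load."""
    return max(1, math.ceil(requests_per_sec / capacity_per_instance))

def hourly_cost(requests_per_sec, capacity_per_instance, price_per_hour):
    """Hourly cost when scaling out identical instances to meet the load."""
    return instances_needed(requests_per_sec, capacity_per_instance) * price_per_hour

# Hypothetical sizes: a small instance serves 100 req/s at $0.10/h,
# a large instance serves 400 req/s at $0.50/h.
small = hourly_cost(150, capacity_per_instance=100, price_per_hour=0.10)  # 2 x $0.10
large = hourly_cost(150, capacity_per_instance=400, price_per_hour=0.50)  # 1 x $0.50
```

At 150 req/s the small-instance fleet costs $0.20/h versus $0.50/h for one large instance: smaller units track variable demand more closely, so you pay closer to what you actually use.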

Pay for consumption

Adopt a leasing model instead of owning infrastructure. Azure offers many SaaS and PaaS resources that simplify overall architecture. The cost of hardware, software, development, operations, security, and data center space is included in the pricing model.

Also, choose pay-as-you-go over fixed pricing. That way, as a consumer, you're charged for only what you use.

Right resources, right size

Choose the right resources that are aligned with business goals and can handle the performance of the workload. An inappropriate or misconfigured service can impact cost. For example, building a multi-region service when the service levels don't require high-availability or geo-redundancy will increase cost without any reasonable business justification.

Certain infrastructure resources are delivered as fixed-size building blocks. Ensure that these blocks are adequately sized to meet capacity demand and deliver the expected performance without wasting resources.

Monitor and optimize

Treat cost monitoring and optimization as a process, rather than a point-in-time activity. Conduct regular cost reviews and measure and forecast the capacity needs so that you can provision resources dynamically and scale with demand. Review the cost management recommendations and take action.

If you're just starting in this process, review Enable success during a cloud adoption journey.

Operational Excellence

This pillar covers the operations processes that keep an application running in production. Deployments must be reliable and predictable. They should be automated to reduce the chance of human error. They should be a fast and routine process, so they don't slow down the release of new features or bug fixes. Equally important, you must be able to quickly roll back or roll forward if an update has problems.

Monitoring and diagnostics are crucial. Cloud applications run in a remote data-center where you do not have full control of the infrastructure or, in some cases, the operating system. In a large application, it's not practical to log into VMs to troubleshoot an issue or sift through log files. With PaaS services, there may not even be a dedicated VM to log into. Monitoring and diagnostics give insight into the system, so that you know when and where failures occur. 


All systems must be observable. Use a common and consistent logging schema that lets you correlate events across systems.
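A common schema can be as simple as every service emitting JSON events that share a correlation ID, so one request can be traced across systems. The field names below are a hypothetical schema for illustration, not a Microsoft standard:

```python
import json
import uuid
from datetime import datetime, timezone

def make_log_event(service, level, message, correlation_id=None):
    """Build a log event in a shared schema; events from different
    services can later be joined on correlation_id."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }

# The front end creates the correlation ID and passes it downstream,
# so both events can be correlated during troubleshooting:
cid = str(uuid.uuid4())
frontend_event = make_log_event("web-frontend", "INFO", "received order", cid)
backend_event = make_log_event("order-service", "ERROR", "database write failed", cid)
print(json.dumps(backend_event))
```

Querying the collected logs for one correlation ID then reconstructs the full path of a single request, even when it crossed several services.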

The monitoring and diagnostics process has several distinct phases:

  • Instrumentation. Generating the raw data, from application logs, web server logs, diagnostics built into the Azure platform, and other sources.
  • Collection and storage. Consolidating the data into one place.
  • Analysis and diagnosis. To troubleshoot issues and see the overall health.
  • Visualization and alerts. Using telemetry data to spot trends or alert the operations team.

Use the DevOps checklist to review your design from a management and DevOps standpoint.

Performance efficiency

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. The main ways to achieve this are by using scaling appropriately and implementing PaaS offerings that have scaling built in.

There are two main ways that an application can scale. Vertical scaling (scaling up) means increasing the capacity of a resource, for example by using a larger VM size. Horizontal scaling (scaling out) is adding new instances of a resource, such as VMs or database replicas.


Horizontal scaling has significant advantages over vertical scaling:

  • True cloud scale. Applications can be designed to run on hundreds or even thousands of nodes, reaching scales that are not possible on a single node.
  • Horizontal scale is elastic. You can add more instances if load increases, or remove them during quieter periods.
  • Scaling out can be triggered automatically, either on a schedule or in response to changes in load.
  • Scaling out may be cheaper than scaling up. Running several small VMs can cost less than a single large VM.
  • Horizontal scaling can also improve resiliency, by adding redundancy. If an instance goes down, the application keeps running.

An advantage of vertical scaling is that you can do it without making any changes to the application. But at some point you'll hit a limit, where you can't scale up any more. At that point, any further scaling must be horizontal.

Horizontal scale must be designed into the system. For example, you can scale out VMs by placing them behind a load balancer. But each VM in the pool must be able to handle any client request, so the application must be stateless or store state externally (say, in a distributed cache). Managed PaaS services often have horizontal scaling and autoscaling built in. The ease of scaling these services is a major advantage of using PaaS services.
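The statelessness requirement can be sketched as follows. The dict-backed store is a stand-in for an external distributed cache such as Azure Cache for Redis; the class and function names are illustrative, not SDK code:

```python
# Sketch: keep session state in an external store so any instance in the
# load-balanced pool can serve any request.

class ExternalSessionStore:
    """Stand-in for a distributed cache shared by all instances."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, state):
        self._data[session_id] = state

def handle_request(store, session_id, item):
    """Any instance can run this handler: no state lives on the instance
    itself, so the load balancer is free to route each request anywhere."""
    state = store.get(session_id)
    cart = state.get("cart", [])
    cart.append(item)
    store.put(session_id, {"cart": cart})
    return cart
```

Because every request round-trips state through the shared store, two consecutive requests from the same user can land on different VMs and still see a consistent shopping cart.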

Just adding more instances doesn't mean an application will scale, however. It might simply push the bottleneck somewhere else. For example, if you scale a web front end to handle more client requests, that might trigger lock contentions in the database. You would then need to consider additional measures, such as optimistic concurrency or data partitioning, to enable more throughput to the database.
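As an illustration of the optimistic-concurrency measure mentioned above, here is a minimal ETag-style sketch: writers must present the version they read, and a stale write is rejected instead of holding a lock. The class is illustrative, not a real database API:

```python
class ConflictError(Exception):
    """Raised when a write is based on a stale read."""

class VersionedStore:
    """ETag-style optimistic concurrency: a write succeeds only if the
    caller saw the current version; otherwise it must re-read and retry.
    No locks are held, so readers never block each other."""
    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def write(self, new_value, expected_version):
        if expected_version != self.version:
            raise ConflictError("stale read; re-read and retry")
        self.value, self.version = new_value, self.version + 1
```

Under contention, most writers succeed on the first try and the occasional loser simply retries, which usually scales better than pessimistic locking for read-heavy workloads.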

Always conduct performance and load testing to find these potential bottlenecks. The stateful parts of a system, such as databases, are the most common cause of bottlenecks, and require careful design to scale horizontally.

Resolving one bottleneck may reveal other bottlenecks elsewhere.


Reliability

A reliable workload is one that is both resilient and available. Resiliency is the ability of the system to recover from failures and continue to function. The goal of resiliency is to return the application to a fully functioning state after a failure occurs. Availability is whether your users can access your workload when they need to.

In traditional application development, there has been a focus on increasing the mean time between failures (MTBF). Effort was spent trying to prevent the system from failing. In cloud computing, a different mindset is required, due to several factors:

  • Distributed systems are complex, and a failure at one point can potentially cascade throughout the system.
  • Costs for cloud environments are kept low through the use of commodity hardware, so occasional hardware failures must be expected.

  • Applications often depend on external services, which may become temporarily unavailable or throttle high-volume users.
  • Today's users expect an application to be available 24/7 without ever going offline.


All of these factors mean that cloud applications must be designed to expect occasional failures and recover from them. Azure has many resiliency features already built into the platform. For example:

  • Azure Storage, SQL Database, and Cosmos DB all provide built-in data replication, both within a region and across regions.
  • Azure managed disks are automatically placed in different storage scale units to limit the effects of hardware failures.
  • VMs in an availability set are spread across several fault domains. A fault domain is a group of VMs that share a common power source and network switch. Spreading VMs across fault domains limits the impact of physical hardware failures, network outages, or power interruptions.

That said, you still need to build resiliency into your application. Resiliency strategies can be applied at all levels of the architecture. Some mitigations are more tactical in nature — for example, retrying a remote call after a transient network failure. Other mitigations are more strategic, such as failing over the entire application to a secondary region. Tactical mitigations can make a big difference. While it's rare for an entire region to experience a disruption, transient problems such as network congestion are more common — so target these first. Having the right monitoring and diagnostics is also important, both to detect failures when they happen, and to find the root causes.
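A tactical mitigation like the retry mentioned above typically uses exponential backoff with jitter, so that many clients retrying at once don't hammer a recovering service in lockstep. In this sketch, `ConnectionError` stands in for whatever transient fault your remote call can raise:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry a call that may fail transiently, doubling the delay each
    attempt and adding random jitter. Re-raises after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

Only retry faults that are actually transient; retrying a permanent error (say, a 403) just wastes time and load. The `sleep` parameter is injected to make the function testable without real waiting.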

When designing an application to be resilient, you must understand your availability requirements. How much downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business?

How much should you invest in making the application highly available?
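The "how much downtime is acceptable" question translates directly into a downtime budget. The arithmetic is standard; the 30-day period below is just a convenient month length:

```python
def downtime_allowed_minutes(availability_pct, period_hours=30 * 24):
    """Minutes of downtime per period that an availability target permits."""
    return (1 - availability_pct / 100.0) * period_hours * 60

# Over a 30-day month:
#   99.9%  ("three nines") allows roughly 43.2 minutes of downtime;
#   99.99% ("four nines")  allows only about 4.3 minutes.
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the cost of high availability grows so steeply and should be weighed against what downtime actually costs the business.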

Security

Think about security throughout the entire lifecycle of an application, from design and implementation to deployment and operations. The Azure platform provides protections against a variety of threats, such as network intrusion and DDoS attacks. But you still need to build security into your application and into your DevOps processes.

Here are some broad security areas to consider.

Identity management

Consider using Azure Active Directory (Azure AD) to authenticate and authorize users. Azure AD is a fully managed identity and access management service. You can use it to create domains that exist purely on Azure, or integrate with your on-premises Active Directory identities. Azure AD also integrates with Office 365, Dynamics CRM Online, and many third-party SaaS applications. For consumer-facing applications, Azure Active Directory B2C lets users authenticate with their existing social accounts (such as Facebook, Google, or LinkedIn), or create a new user account that is managed by Azure AD.

If you want to integrate an on-premises Active Directory environment with an Azure network, several approaches are possible, depending on your requirements. For more information, see our Identity Management reference architectures.

Protecting your infrastructure

Control access to the Azure resources that you deploy. Every Azure subscription has a trust relationship with an Azure AD tenant. Use Azure role-based access control (Azure RBAC) to grant users within your organization the correct permissions to Azure resources. Grant access by assigning Azure roles to users or groups at a certain scope. The scope can be a subscription, a resource group, or a single resource. Audit all changes to infrastructure.

Application security

In general, the security best practices for application development still apply in the cloud. These include things like using SSL everywhere, protecting against CSRF and XSS attacks, preventing SQL injection attacks, and so on.

Cloud applications often use managed services that have access keys. Never check these into source control. Consider storing application secrets in Azure Key Vault.
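One way to keep keys out of source control is to resolve them at runtime. In this sketch the secret comes from an environment variable as a stand-in; in production the value would be fetched from a secret store such as Azure Key Vault (the function name and error message are hypothetical):

```python
import os

def get_secret(name, env=os.environ):
    """Resolve a secret at runtime instead of hardcoding it in source.
    Reading the environment here is a stand-in for a secret store such
    as Azure Key Vault; the code itself never contains the value."""
    try:
        return env[name]
    except KeyError:
        raise RuntimeError(f"secret {name!r} not configured") from None

# Usage: the connection string lives outside the repository entirely.
# conn = get_secret("STORAGE_CONNECTION_STRING")
```

Failing loudly when a secret is missing is deliberate: a silent empty default tends to surface later as a confusing authentication error far from the real cause.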

Data sovereignty and encryption

Make sure that your data remains in the correct geopolitical zone when using Azure data services. Azure's geo-replicated storage uses the concept of a paired region in the same geopolitical region.

Use Key Vault to safeguard cryptographic keys and secrets. By using Key Vault, you can encrypt keys and secrets by using keys that are protected by hardware security modules (HSMs). Many Azure storage and DB services support data encryption at rest, including Azure Storage, Azure SQL Database, Azure Synapse Analytics, and Cosmos DB.



Operational excellence principles

Considering and improving how software is developed, deployed, operated, and maintained is one part of achieving a higher competency in operations. Equally important is providing a team culture of experimentation and growth, solutions for rationalizing the current state of operations, and incident response plans. The principles of operational excellence are a series of considerations that can help achieve excellent operational practices.


To assess your workload using the tenets found in the Azure Well-Architected Framework, see the Microsoft Azure Well-Architected Review.

DevOps methodologies

The contraction of "Dev" and "Ops" refers to replacing siloed Development and Operations to create multidisciplinary teams that now work together with shared and efficient practices and tools. Essential DevOps practices include agile planning, continuous integration, continuous delivery, and monitoring of applications.

Separation of roles

A DevOps model positions the responsibility of operations with developers. Still, many organizations do not fully embrace DevOps and maintain some degree of team separation between operations and development, either to enforce clear segregation of duties for regulated environments or to share operations as a business function.

Team collaboration

It is essential to understand whether developers are responsible for production deployments end to end, or whether a handover point exists where responsibility passes to a separate operations team, potentially to ensure strict segregation of duties (for example, under the Sarbanes-Oxley Act, developers cannot touch financial reporting systems).

It is crucial to understand how operations and development teams collaborate to address operational issues and what processes exist to support and structure this collaboration. Moreover, mitigating issues might require various teams outside of development or operations, such as networking and external parties. The processes to support this collaboration should also be understood.

Workload isolation

The goal of workload isolation is to associate an application's specific resources to a team to independently manage all aspects of those resources.

Operational lifecycles

Reviewing operational incidents where the response and remediation to issues either failed or could have been optimized is vital to improving overall operational effectiveness. Failures provide a valuable learning opportunity, and in some cases these learnings can also be shared across the entire organization. Finally, operational procedures should be updated based on outcomes from frequent testing.

Operational metadata

Azure tags provide the ability to associate critical metadata with Azure resources, resource groups, and subscriptions as name-value pairs, such as billing information (for example, a cost center code) or environment information (for example, an environment type). See Tagging Strategies for best practices.


Optimize build and release processes

Provision with infrastructure as code, build and release with CI/CD pipelines, automate testing, and embrace software engineering disciplines across your entire environment. This approach ensures that the creation and management of environments throughout the software development lifecycle is consistent and repeatable, and enables early detection of issues.

Monitor the entire system and understand operational health

Implement systems and processes to monitor build and release processes, infrastructure health, and application health. Telemetry is critical to understanding the health of a workload and whether the service is meeting the business goals.

Rehearse recovery and practice failure

Run DR drills on a regular cadence and use engineering practices to identify and remediate weak points in application reliability. Regular rehearsal of failure will validate the effectiveness of recovery processes and ensure teams are familiar with their responsibilities.

Embrace operational improvement

Continuously evaluate and refine operational procedures and tasks while striving to reduce complexity and ambiguity. This approach enables an organization to evolve processes over time, optimizing inefficiencies, and learning from failures.

Use loosely coupled architecture

Enable teams to independently test, deploy, and update their systems on demand without depending on other teams for support, services, resources, or approvals.

Incident management

When incidents occur, have well thought out plans and solutions for incident management, incident communication, and feedback loops. Take the lessons learned from each incident and build telemetry and monitoring elements to prevent future occurrences.


Overview of the performance efficiency pillar

Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. Before the cloud became popular, when it came to planning how a system would handle increases in load, many organizations intentionally provisioned workloads to be oversized to meet business requirements. This might make sense in on-premises environments because it ensured capacity during peak usage. Capacity reflects resource availability (CPU and memory). This was a major consideration for processes that would be in place for a number of years.

Just as you needed to anticipate increases in load in on-premises environments, you need to anticipate increases in cloud environments to meet business requirements. One difference is that you may no longer need to make long-term predictions for anticipated changes to ensure that you will have enough capacity in the future. Another difference is in the approach used to manage performance.

What is scalability and why is it important?

An important consideration in achieving performance efficiency is to consider how your application scales and to implement PaaS offerings that have built-in scaling operations. Scalability is the ability of a system to handle increased load. Services covered by Azure autoscale can scale automatically to match demand. They scale out to ensure capacity during workload peaks, and scale back in automatically when the peak drops.

In the cloud, the ability to take advantage of scalability depends on your infrastructure and services. Some platforms, such as Kubernetes, were built with scaling in mind. Virtual machines, on the other hand, may not scale as easily, although scale operations are possible. With virtual machines, you may want to plan ahead to avoid scaling infrastructure in the future to meet demand. Another option is to select a different platform, such as Azure virtual machine scale sets.

When using scalability, you need only predict the current average and peak times for your workload. Payment plan options allow you to manage this: depending on the service, you pay per minute or per hour for a designated time period.

Principles

Follow these principles to guide you through improving performance efficiency:

  • Become Data-driven - Embrace a data-driven culture to deliver timely insights to everyone in your organization across all your data. To harness this culture, get the best performance from your analytics solution across all your data, ensure data has the security and privacy needed for your business environment, and make sure you have tools that enable everyone in your organization to gain insights from your data.
  • Avoid antipatterns - A performance antipattern is a common practice that is likely to cause scalability problems when an application is under pressure. For example, you can have an application that behaves as expected during performance testing. However, when it is released to production and starts to handle live workloads, performance decreases. Scalability problems such as rejecting user requests, stalling, or throwing exceptions may arise. To learn how to identify and fix these antipatterns, see Performance antipatterns for cloud applications.
  • Perform load testing to set limits - Load testing helps ensure that your applications can scale and do not go down during peak traffic. Load test each application to understand how it performs at various scales. To learn about Azure service limits, see Managing limits.
  • Understand billing for metered resources - Your business requirements will determine the tradeoffs between cost and level of performance efficiency. Azure doesn't directly bill based on the resource cost. Charges for a resource, such as a virtual machine, are calculated by using one or more meters. Meters are used to track a resource’s usage over time. These meters are then used to calculate the bill.
  • Monitor and optimize - A lack of monitoring of new services and of the health of current workloads is a major inhibitor of workload quality. The overall monitoring strategy should consider not only scalability, but also resiliency (infrastructure, application, and dependent services) and application performance. For purposes of scalability, looking at the metrics allows you to provision resources dynamically and scale with demand.
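The load-testing principle above can be sketched with a minimal in-process harness that reports latency percentiles. This is a toy for illustrating the idea; real load tests use dedicated tooling (for example, Azure Load Testing or JMeter) against deployed endpoints:

```python
import time

def load_test(handler, n_requests):
    """Call handler repeatedly and report simple latency statistics
    in seconds (median, 95th percentile, and worst case)."""
    latencies = []
    for i in range(n_requests):
        start = time.perf_counter()
        handler(i)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50": latencies[len(latencies) // 2],
        "p95": latencies[int(len(latencies) * 0.95)],
        "max": latencies[-1],
    }

# Hypothetical request handler standing in for a real endpoint call:
stats = load_test(lambda i: sum(range(1000)), n_requests=200)
```

Watching the p95 and max values as you raise `n_requests` (or add concurrency) is what reveals the knee in the curve where a bottleneck starts to dominate.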



Principles of the reliability pillar

Building a reliable application in the cloud is different from traditional application development. While historically you may have purchased levels of redundant higher-end hardware to minimize the chance of an entire application platform failing, in the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component.

Application framework

These critical principles are used as lenses to assess the reliability of an application deployed on Azure. They provide a framework for the application assessment questions that follow.

To assess your workload using the tenets found in the Microsoft Azure Well-Architected Framework, see the Microsoft Azure Well-Architected Review.

Define and test availability and recovery targets - Availability targets, such as Service Level Agreements (SLA) and Service Level Objectives (SLO), and Recovery targets, such as Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), should be defined and tested to ensure application reliability aligns with business requirements.
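When an availability target covers an application built from several serially dependent services, the composite SLA is the product of the individual SLAs, which is always lower than the weakest link. The service SLAs below are hypothetical numbers chosen for the example:

```python
def composite_sla(*slas):
    """Composite availability of serially dependent services, each SLA
    given as a fraction (e.g. 0.999 for 99.9%). Every dependency in the
    chain must be up for the application to be up, so SLAs multiply."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result

# A web tier at 99.95% depending on a database at 99.99%:
combined = composite_sla(0.9995, 0.9999)  # about 99.94%
```

This is why a business-level SLO must be checked against the chain of dependencies, not just the front-end service: stacking even high individual SLAs erodes the end-to-end target.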

Design applications to be resistant to failures - Resilient application architectures should be designed to recover gracefully from failures in alignment with defined reliability targets.

Ensure required capacity and services are available in targeted regions - Azure services and capacity can vary by region, so it's important to understand if targeted regions offer required capabilities.

Plan for disaster recovery - Disaster recovery is the process of restoring application functionality in the wake of a catastrophic failure. It might be acceptable for some applications to be unavailable or partially available with reduced functionality for a period of time, while other applications may not be able to tolerate reduced functionality.

Design the application platform to meet reliability requirements - Designing application platform resiliency and availability is critical to ensuring overall application reliability.

Design the data platform to meet reliability requirements - Designing data platform resiliency and availability is critical to ensuring overall application reliability.

Recover from errors - Resilient applications should be able to automatically recover from errors by leveraging modern cloud application code patterns.
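One widely used cloud code pattern of this kind is the circuit breaker, which stops hammering a failing dependency and fails fast instead. This is a deliberately minimal sketch (no half-open state or timed recovery, which a production implementation would add):

```python
class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures
    the circuit opens and further calls fail fast until reset()."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, operation):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # any success resets the failure count
        return result

    def reset(self):
        self.failures = 0
        self.open = False
```

Failing fast protects both sides: the caller gets an immediate, handleable error instead of a slow timeout, and the struggling dependency gets breathing room to recover.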

Ensure networking and connectivity meets reliability requirements - Identifying and mitigating potential network bottlenecks or points of failure supports a reliable and scalable foundation over which resilient application components can communicate.

Allow for reliability in scalability and performance - Resilient applications should be able to automatically scale in response to changing load to maintain application availability and meet performance requirements.

Address security-related risks - Identifying and addressing security-related risks helps to minimize application downtime and data loss caused by unexpected security exposures.

Define, automate, and test operational processes - Operational processes for application deployment, such as roll-forward and roll-back, should be defined, sufficiently automated, and tested to help ensure alignment with reliability targets.

Test for fault tolerance - Application workloads should be tested to validate reliability against defined reliability targets.

Monitor and measure application health - Monitoring and measuring application availability is vital to qualifying overall application health and progress towards defined reliability targets.




Overview of the security pillar


Information security has always been a complex subject, and it evolves quickly with the creative ideas and implementations of attackers and security researchers. Security vulnerabilities originally stemmed from identifying and exploiting common programming errors and unexpected edge cases. However, over time, the attack surface that an attacker may explore and exploit has expanded well beyond that. Attackers now freely exploit vulnerabilities in system configurations, operational practices, and the social habits of the systems' users. As system complexity, connectedness, and the variety of users increase, attackers have more opportunities to identify unprotected edge cases and to "hack" systems into doing things they were not designed to do.

Security is one of the most important aspects of any architecture. It provides confidentiality, integrity, and availability assurances against deliberate attacks and abuse of your valuable data and systems. Losing these assurances can negatively impact your business operations and revenue, as well as your organization’s reputation in the marketplace. In the following series of articles, we’ll discuss key architectural considerations and principles for security and how they apply to Azure.

Security design principles

These principles support these three key strategies and describe a securely architected system hosted on cloud or on-premises datacenters (or a combination of both). Application of these principles will dramatically increase the likelihood your security architecture will maintain assurances of confidentiality, integrity, and availability.

Each recommendation in this document includes a description of why it is recommended, which maps to one or more of these principles:

Align Security Priorities to Mission – Security resources are almost always limited, so prioritize efforts and assurances by aligning security strategy and technical controls to the business using classification of data and systems. Security resources should be focused first on people and assets (systems, data, accounts, etc.) with intrinsic business value and those with administrative privileges over business critical assets.

Build a Comprehensive Strategy – A security strategy should consider investments in culture, processes, and security controls across all system components. The strategy should also consider security for the full lifecycle of system components including the supply chain of software, hardware, and services.

Drive Simplicity – Complexity in systems leads to increased human confusion, errors, automation failures, and difficulty of recovering from an issue. Favor simple and consistent architectures and implementations.

Design for Attackers – Your security design and prioritization should be focused on the way attackers see your environment, which is often not the way IT and application teams see it. Inform your security design and test it with penetration testing to simulate one-time attacks. Use red teams to simulate long-term persistent attack groups. Design your enterprise segmentation strategy and other security controls to contain attacker lateral movement within your environment. Actively measure and reduce the potential attack surface that attackers target for exploitation of resources within the environment.

Leverage Native Controls – Favor native security controls built into cloud services over external controls from third parties. Native security controls are maintained and supported by the service provider, eliminating or reducing effort required to integrate external security tooling and update those integrations over time.

Use Identity as Primary Access Control – Access to resources in cloud architectures is primarily governed by identity-based authentication and authorization for access controls. Your account control strategy should rely on identity systems for controlling access rather than relying on network controls or direct use of cryptographic keys.
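The idea of identity as the primary access control can be sketched in a few lines. This is a hypothetical illustration, not an Azure API: permissions are resolved from roles bound to an authenticated identity (as an identity provider such as Azure AD would assert them), never from the caller's network location or possession of a shared key.

```python
from dataclasses import dataclass

# Hypothetical sketch: roles come from the identity system, not from the caller.
ROLE_PERMISSIONS = {
    "reader": {"read"},
    "contributor": {"read", "write"},
    "owner": {"read", "write", "delete"},
}

@dataclass
class Identity:
    subject: str     # stable user or service principal ID from the identity provider
    roles: tuple     # roles asserted by the identity system after authentication

def is_authorized(identity: Identity, action: str) -> bool:
    """Grant access only when a role bound to the authenticated identity allows it."""
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in identity.roles)

# A reader identity can read but cannot write or delete.
alice = Identity(subject="alice@contoso.example", roles=("reader",))
```

The point of the sketch is what is absent: no IP allowlist and no embedded key, only the authenticated identity drives the decision.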

Accountability – Designate clear ownership of assets and security responsibilities and ensure actions are traceable for nonrepudiation. You should also ensure entities have been granted the least privilege required (to a manageable level of granularity).

Embrace Automation – Automating tasks decreases the chance of human error that can create risk. Both IT operations and security best practices should be automated as much as possible to reduce human error, while ensuring skilled humans govern and audit the automation.

Focus on Information Protection – Intellectual property is frequently one of the biggest repositories of organizational value, and this data should be protected anywhere it goes, including cloud services, mobile devices, workstations, and collaboration platforms, without impeding the collaboration that allows for business value creation. Your security strategy should be built around classifying information and assets to enable security prioritization, applying strong access control and encryption technology, and meeting business needs such as productivity, usability, and flexibility.
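Classification-driven protection can be sketched as a simple mapping from sensitivity labels to required controls. The labels and control names below are hypothetical examples, not a Microsoft taxonomy; the idea is that the classification of the data, not its storage location, decides the minimum protection it must carry.

```python
# Hypothetical sketch: minimum controls required for each classification level.
REQUIRED_CONTROLS = {
    "public":       set(),
    "internal":     {"access_control"},
    "confidential": {"access_control", "encryption_at_rest"},
    "restricted":   {"access_control", "encryption_at_rest", "encryption_in_transit"},
}

def missing_controls(classification: str, applied_controls: set) -> set:
    """Return the controls still required before data of this class may be stored."""
    return REQUIRED_CONTROLS[classification] - applied_controls
```

A deployment gate built on such a check would block, for example, storing confidential data in a location that lacks encryption at rest, regardless of which service hosts it.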

Design for Resilience – Your security strategy should assume that controls will fail and design accordingly. Making your security posture more resilient requires several approaches working together:

Balanced Investment – Invest across core functions spanning the full NIST Cybersecurity Framework lifecycle (identify, protect, detect, respond, and recover) to ensure that attackers who successfully evade preventive controls are detected and lose access through response and recovery capabilities.

Ongoing Maintenance – Maintain security controls and assurances to ensure that they don’t decay over time with changes to the environment or neglect.

Ongoing Vigilance – Ensure that anomalies and potential threats that could pose risks to the organization are addressed in a timely manner.

Defense in Depth – Consider additional controls in the design to mitigate risk to the organization in the event a primary security control fails. This design should consider how likely the primary control is to fail, the potential organizational risk if it does, and the effectiveness of the additional control (especially in the likely cases that would cause the primary control to fail).

Least Privilege – This is a form of defense in depth to limit the damage that can be done by any one account. Accounts should be granted the least amount of privilege required to accomplish their assigned tasks. Restrict the access by permission level and by time. This helps mitigate the damage of an external attacker who gains access to the account and/or an internal employee who inadvertently or deliberately (for example, insider attack) compromises security assurances.
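The two dimensions of least privilege named above, permission level and time, can be sketched together. This is a hypothetical illustration (the class and permission names are invented, not an Azure API): each grant carries exactly one permission and an expiry, so access is narrow and decays rather than accumulating as standing privilege.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: a grant is scoped to one subject, one permission,
# and a limited time window.
class Grant:
    def __init__(self, subject: str, permission: str, duration: timedelta):
        self.subject = subject
        self.permission = permission
        self.expires_at = datetime.now(timezone.utc) + duration

    def allows(self, subject: str, permission: str) -> bool:
        return (
            subject == self.subject
            and permission == self.permission                  # exact permission, no wildcards
            and datetime.now(timezone.utc) < self.expires_at   # time-boxed access
        )
```

An attacker who compromises the account holding such a grant can do only one thing, and only until the grant expires, which is precisely the damage-limiting property the principle aims for.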

Baseline and Benchmark – To ensure your organization considers current thinking from outside sources, evaluate your strategy and configuration against external references (including compliance requirements). This helps validate your approach, minimize the risk of inadvertent oversight, and reduce the risk of punitive fines for noncompliance.

Drive Continuous Improvement – Systems and existing practices should be regularly evaluated and improved so that they remain effective against attackers, who continuously improve, and keep pace with the ongoing digital transformation of the enterprise. This should include processes that proactively integrate learnings from real-world attacks, realistic penetration testing and red team activities, and other sources as available.

Assume Zero Trust – When evaluating access requests, all requesting users, devices, and applications should be considered untrusted until their integrity can be sufficiently validated. Access requests should be granted conditionally based on the requestor's trust level and the target resource's sensitivity. Reasonable attempts should be made to offer means to increase trust validation (for example, requesting multi-factor authentication) and to remediate known risks (changing a known-leaked password, remediating a malware infection) to support productivity goals.
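The conditional-access logic described above can be sketched as comparing the requestor's validated trust level against the resource's sensitivity. The levels and outcomes below are hypothetical, not an Azure AD Conditional Access policy: a shortfall of one level triggers step-up (for example, an MFA challenge) rather than a hard deny, which is the "increase trust validation" behavior the principle recommends.

```python
# Hypothetical trust and sensitivity scales; higher means more trusted / more sensitive.
TRUST = {"unverified": 0, "authenticated": 1, "mfa_verified": 2}
SENSITIVITY = {"public": 0, "internal": 1, "confidential": 2}

def evaluate_access(trust_level: str, sensitivity: str) -> str:
    """Decide allow / step_up / deny from the gap between trust and sensitivity."""
    gap = SENSITIVITY[sensitivity] - TRUST[trust_level]
    if gap <= 0:
        return "allow"      # trust already meets or exceeds the resource's sensitivity
    if gap == 1:
        return "step_up"    # ask for stronger proof, e.g. multi-factor authentication
    return "deny"           # gap too large to remediate in-session
```

The step-up branch is what distinguishes zero trust from a simple allow/deny gate: the user is offered a path to raise their trust level instead of being blocked outright.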

Educate and Incentivize Security – The humans that are designing and operating the cloud workloads are part of the whole system. It is critical to ensure that these people are educated, informed, and incentivized to support the security assurance goals of the system. This is particularly important for people with accounts granted broad administrative privileges.

Azure Well-Architected Framework Security Overview

Overview of a hybrid workload

Customer workloads are becoming increasingly complex, with many applications often running on different hardware across on-premises, multicloud, and the edge. Managing these disparate workload architectures, ensuring uncompromised security, and enabling developer agility are critical to success.

Azure uniquely helps you meet these challenges, giving you the flexibility to innovate anywhere in your hybrid environment while operating seamlessly and securely. The Well-Architected Framework includes a hybrid description for each of the five pillars: cost optimization, operational excellence, performance efficiency, reliability, and security. These descriptions create clarity on the considerations needed for your workloads to operate effectively across hybrid environments.

For those adopting a hybrid model, Azure offers multiple solutions that enable you to confidently deliver hybrid workloads: run Azure data services anywhere, modernize applications anywhere, and manage your workloads anywhere.

Extend Azure management to any infrastructure

 Tip

Applying the principles in this article series to each of your workloads will better prepare you for hybrid adoption. For larger or centrally managed organizations, hybrid and multicloud are commonly part of a broader strategic objective. If you need to scale these principles across a portfolio of workloads using hybrid and multicloud environments, you may want to start with the Cloud Adoption Framework's hybrid and multicloud scenario and best practices. Then return to this series to refine each of your workload architectures.


Use Azure Arc-enabled infrastructure to extend Azure management to any infrastructure in a hybrid environment. Key features of Azure Arc-enabled infrastructure are:

Unified Operations

  • Organize resources such as virtual machines, Kubernetes clusters, and Azure services deployed across your entire IT environment.
  • Manage and govern resources from a single pane of glass in Azure.
  • Integrate with Azure Lighthouse for managed service provider support.

Adopt cloud practices

  • Easily adopt DevOps techniques, such as infrastructure as code.
  • Empower developers with self-service and a choice of tools.
  • Standardize change control with configuration management systems, such as GitOps and DSC.
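The core idea behind GitOps and DSC change control is a reconciliation loop: desired state lives in version control, and anything different in the live environment is drift to be reported or reverted. A minimal, tool-agnostic sketch of the drift-detection step (the setting names are hypothetical examples):

```python
# Hypothetical sketch of the drift-detection step in a GitOps/DSC loop:
# the desired configuration comes from version control; the actual
# configuration is read from the live environment.
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return the settings whose live value differs from the declared value."""
    return {
        key: {"desired": value, "actual": actual.get(key)}
        for key, value in desired.items()
        if actual.get(key) != value
    }
```

In a real pipeline this comparison runs continuously, and the report either triggers an alert or an automatic re-apply of the declared configuration, so manual, out-of-band changes cannot silently persist.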


More Information:

https://www.microsoft.com/en-us/us-partner-blog/tag/well-architected-framework/

https://docs.microsoft.com/en-us/learn/paths/azure-well-architected-framework/

https://www.microsoft.com/azure/partners/well-architected#well-architected-framework

https://azure.microsoft.com/en-us/blog/introducing-the-microsoft-azure-wellarchitected-framework/

https://docs.microsoft.com/en-us/azure/architecture/framework/

https://www.microsoft.com/en-us/us-partner-blog/2021/01/26/an-introduction-to-azures-well-architected-framework/

https://www.capgemini.com/2020/10/microsoft-azure-well-architected-framework/

https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/innovate/considerations/adoption

https://azure.microsoft.com/en-us/features/reliability/#features

https://docs.microsoft.com/en-us/azure/cost-management-billing/

https://azureinfohub.azurewebsites.net/Service/Videos?serviceTitle=Azure%20Cost%20Management

https://docs.microsoft.com/en-us/azure/architecture/

https://cloudsecurityalliance.org/blog/2020/08/26/shared-responsibility-model-explained/

https://docs.microsoft.com/en-us/azure/architecture/browse/

https://visualstudiomagazine.com/articles/2020/08/04/azure-well-architected-framework.aspx

https://docs.microsoft.com/en-us/azure/architecture/framework/scalability/test-checklist

https://docs.microsoft.com/en-us/learn/modules/azure-well-architected-security/

