AWS Well-Architected Framework
Excellence Pillars
Operational Excellence Pillar - continuously evaluate whether the system is running on the correct architecture, uses the correct services, and supports the required level of reliability and security.
Security Pillar - organisations should design their systems with security in mind. Several security strategies can help with that.
Defense in Depth - several recommendations for keeping an application secure: keep each layer of the application in a separate subnet; keep subnets inaccessible from the internet; require authentication and authorisation for communication between layers; route interaction between subnets through a firewall; give application components the minimal possible permissions; encrypt data in transit and at rest.
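A minimal boto3 sketch of one of these ideas: the app layer is reachable only from the web layer's security group, never directly from the internet. The VPC and security-group IDs below are hypothetical placeholders.

```python
# Restrict the app layer so only the web layer's security group can reach it.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

app_sg = ec2.create_security_group(
    GroupName="app-layer-sg",
    Description="App layer: reachable only from the web layer",
    VpcId="vpc-0123456789abcdef0",  # hypothetical VPC id
)["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=app_sg,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        # Allow only the web layer's security group, not 0.0.0.0/0
        "UserIdGroupPairs": [{"GroupId": "sg-0aaaabbbbccccdddd"}],  # hypothetical
    }],
)
```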
Reliability Pillar - it is important to understand the desired availability level of each service. Too low will cause significant disruption to the business, and too high requires significant effort to achieve: specific architecture, redundancy, very cautious releases. If the system depends on a single service without redundancy, that is a hard dependency. If the dependency is itself redundant (and therefore highly available), it can be called a soft dependency.
- How to calculate the availability of a system consisting of two services that are hard dependencies of each other, each with 99.95% availability:
  - Calculate the chance of failure of each service: 1 - 0.9995 = 0.0005
  - Sum the failure chances: 0.0005 + 0.0005 = 0.001
  - Subtract from 1: 1 - 0.001 = 0.999, which is 99.90%
- How to calculate the availability of a system with two redundant areas, each with 99.95% availability:
  - Calculate the chance of failure of each area: 1 - 0.9995 = 0.0005
  - Multiply the failure chances (both areas must fail at the same time): 0.0005 * 0.0005 = 0.00000025
  - Subtract from 1: 1 - 0.00000025 = 0.99999975, which is 99.999975%
Performance Efficiency Pillar - the cloud helps with tuning system performance in many ways. It is easy to scale infrastructure vertically by changing the specs of EC2 instances and other services to a more powerful configuration. It is also relatively easy to scale horizontally - you can spin up more instances on demand, and many AWS services provide multi-node configurations with read-only and fully operational instances.
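A sketch of horizontal scaling on demand: an Auto Scaling group with a target-tracking policy. The launch template name and subnet IDs are hypothetical placeholders.

```python
# Keep between 2 and 10 instances running, scaling on average CPU utilisation.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},  # hypothetical template
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0aaa,subnet-0bbb",  # hypothetical subnets across AZs
)

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,  # add/remove instances to hold ~50% average CPU
    },
)
```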
Cost Optimisation Pillar - moving to the cloud should be a cost-saving operation, not the opposite. AWS provides many cost calculators, including machine-learning-based ones, to project future expenses. When moving from on-prem to the cloud it is important to understand the cost of such an investment and the projected break-even point.
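As a starting point for such projections, historical spend can be pulled programmatically. A sketch using the Cost Explorer API; the dates are illustrative.

```python
# Pull three months of spend, grouped by service, from Cost Explorer.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-04-01"},  # illustrative range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)
for period in response["ResultsByTime"]:
    print(period["TimePeriod"]["Start"], period["Groups"][:3])  # top of each month's breakdown
```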
Sustainability Pillar - keep sustainability in mind. Prefer deploying applications to regions whose electricity grid has lower CO2 emissions. Try to use the minimum required amount of resources, and watch CloudWatch metrics to ensure the right amount of resources is being used.
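A sketch of one such check: reading average CPU utilisation for an instance from CloudWatch to spot resources that are oversized for their workload. The instance ID is a hypothetical placeholder.

```python
# Hourly average CPU utilisation over the last day for one EC2 instance.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```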
SLA
Services should have defined SLIs, SLOs, and possibly SLAs.
SLI - Service-Level Indicator - the exact metrics chosen for measuring the performance of the service. Examples: p99 latency, requests/s, errors/hr, CPU utilisation, memory utilisation.
SLO - Service-Level Objective - targets assigned to SLIs. SLOs are more granular than an SLA and cannot be weaker than the SLA. Examples: p99 latency should be less than 500 ms, 0 errors per hour, CPU utilisation between 30% and 75%.
SLA - Service-Level Agreement - a commitment to provide a service at the expected level of quality. Often legally binding.
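A small worked example of what an availability SLO implies in practice: how much downtime it leaves room for over a 30-day month.

```python
# Downtime allowance implied by an availability target over a 30-day month.
def monthly_downtime_allowance_minutes(slo: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

print(monthly_downtime_allowance_minutes(0.9995))  # ~21.6 minutes per 30 days
print(monthly_downtime_allowance_minutes(0.999))   # ~43.2 minutes per 30 days
```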
In addition to service-level goals, it is beneficial to define recovery targets for the case of a disaster.
RTO - Recovery Time Objective - the maximum amount of time that may be taken to recover the service.
RPO - Recovery Point Objective - the maximum amount of data (usually measured as a time window) that may be lost during a disaster.
Geographical placement
Remember about the geographical placement of servers. The closer users are to the servers, the better the latency. Also, some countries and provinces have strict data locality laws, which should also be taken into consideration.
Key parts of the architecture whose geographical placement can affect user experience:
- Host location
- Data caching (CDN) - AWS CloudFront, Cloudflare
- Data replication - within an availability zone, between availability zones inside the primary region, or between regions
- Load balancing - Route 53. Configure routing based on node metrics, liveness checks, session stickiness and affinity (a latency-based routing sketch follows this list).
- Failover recovery scenarios - Components should automatically detect failed nodes and redirect traffic to healthy ones.
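A sketch of the routing idea from the load-balancing item above: latency-based records in Route 53 that send users to the closer of two regional endpoints. The hosted zone ID, domain and IP addresses are hypothetical placeholders.

```python
# Two latency-based A records for the same name; Route 53 answers with the
# endpoint that gives the querying user the lower latency.
import boto3

route53 = boto3.client("route53")

def upsert_latency_record(region: str, set_id: str, ip: str) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",  # hypothetical hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": set_id,
                "Region": region,          # enables latency-based routing
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )

upsert_latency_record("eu-west-1", "eu", "203.0.113.10")
upsert_latency_record("ap-southeast-1", "apac", "203.0.113.20")
```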
The Twelve-Factor App
- One versioned codebase used for many deployments - AWS provides CodeCommit, a managed Git repository for code and binary files.
- Dependencies - Explicitly declare and control dependencies. Again, AWS provides several tools for building images and distributing artifacts.
- Configuration in environment - while the code stays the same, settings that change from environment to environment should be provided separately. AWS has several services that help manage different aspects of configuration: Secrets Manager (credentials, API keys, OAuth tokens), Certificate Manager (SSL/TLS, X.509), Key Management Service (encryption keys), CloudHSM (single-tenant hardware security modules), Systems Manager Parameter Store (config storage, similar to Consul). A configuration-loading sketch follows after this list.
- Treat Backing Services as Attached Resources - resources such as a DB, S3, or an MQ should always be treated as remote, swappable resources.
- Separate Build and Run Stages - build and run should be separate stages connected by artifacts. AWS has several CI/CD tools that help build the pipeline: CodePipeline, CodeDeploy, CodeBuild.
- Execute App as Stateless Processes - processes should be stateless and share nothing; any data that needs to persist belongs in a backing service.
- Export Services via Port Binding - the app should be self-contained and expose its services by binding to a port, rather than relying on a web server injected into the environment at runtime.
- Scale out via Process Model - scale the application by running more jobs as OS processes. Share nothing between jobs; this makes it easy to scale the application out to separate nodes when needed.
- Maximize Robustness - applications should start up fast, which improves scalability and maintenance. The app should also be able to shut down quickly and gracefully: in-flight jobs should be finished, if designed so. The app should be able to handle a sudden process kill: all jobs should be returned to the queue, so jobs should be transactional and reentrant. AWS can help here with serverless functions as jobs and SQS as the queue that holds them (a worker sketch follows after this list).
- Keep All Environments the Same - keep LOCAL, DEV, PREVIEW and PROD as similar as possible. Use the same libraries and the same databases.
- Treat Logs as Event Streams - don't bother with writing logs to the filesystem. Logs should be emitted to STDOUT, handled by the app's environment and stored in an appropriate backend such as an ELK stack, Hadoop, or S3. In AWS, take a look at CloudWatch Logs.
- Ad-Hoc Tasks Should Be Done the Same Way as in Production - sometimes there is a need for administrative or ad-hoc tasks. Ideally, make such functionality part of the application, available via admin endpoints or a GUI. In the worst case, keep admin scripts in the same repository as the main code and deliver them with the automated deployment.
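A sketch of the "configuration in environment" item above: the same code runs in every environment and reads per-environment values from Secrets Manager and Systems Manager Parameter Store at startup. The secret and parameter names, and the APP_ENV variable, are hypothetical placeholders.

```python
# Load per-environment configuration at startup; the code itself never changes.
import os
import boto3

ENV = os.environ.get("APP_ENV", "dev")  # hypothetical environment switch

secrets = boto3.client("secretsmanager", region_name="eu-west-1")
ssm = boto3.client("ssm", region_name="eu-west-1")

db_password = secrets.get_secret_value(SecretId=f"{ENV}/db-password")["SecretString"]
feature_flag = ssm.get_parameter(
    Name=f"/{ENV}/app/feature-x-enabled",
    WithDecryption=True,
)["Parameter"]["Value"]
```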
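And a sketch of the robustness and process-model items: a share-nothing worker that takes jobs from SQS and deletes a message only after processing it, so a suddenly killed worker simply lets the job reappear for another worker. The queue URL and process_job are hypothetical placeholders.

```python
# Share-nothing SQS worker: jobs survive a sudden process kill because a
# message is deleted only after it has been processed.
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/jobs"  # hypothetical

def process_job(body: str) -> None:
    print("processing", body)  # stand-in for real, reentrant work

while True:
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,    # long polling
        VisibilityTimeout=60,  # message returns to the queue if this worker dies
    ).get("Messages", [])
    for msg in messages:
        process_job(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```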