How to Choose the Right Services in AWS — An Example Case Study

Madhava Keshavamurthy
8 min read · Jan 26, 2021

If you are building a new application in the cloud, or planning to migrate an existing application to the public cloud, you can easily be overwhelmed by the many options and offerings from the different cloud players. Adding to the woes, there are often several competing offerings from the same vendor. So the overall process of choosing the right set of resources can become challenging and confusing.

The purpose of this write-up is to throw some light on the various factors that need to be considered while building or migrating an application to the public cloud. By the end of this read, you should be able to make informed decisions whenever you build something in the public cloud. Throughout, I will use AWS as the public cloud and compare different offerings from AWS.

AWS today offers a wide variety of services, and most of them are available across different regions. However, when it comes to choosing the right set of resources to run your application in the cloud, you need clear answers to the questions below.

  1. Is the service the right choice for my application?
  2. Is it fault tolerant?
  3. Is it secure?
  4. Is it available globally or across regions?
  5. Does it scale seamlessly?
  6. Oh gosh!! Is it easy to manage?
  7. Last but not least, is it priced within our capex and opex?
  8. And many more!

Before you actually get into the design aspects of your application, answering the above questions from different perspectives will enable you to arrive at a better design and better code.

Business Use Case — An Example Case Study

In this article, I will focus on one use case and write about the various options and offerings in AWS that can serve the purpose.

The use case here is an application in the cloud that exposes an API to customers. The application should support:

  1. Multi-tenancy (more than one customer)
  2. High availability
  3. Scalability based on need
  4. Ease of management and operation
  5. A cost that does not burn a huge hole in opex and capex

With the above set of requirements in hand, below are some of the standard things that need to be done at the beginning.

  1. Get the current and future traffic pattern

Assume that, to start with, there is a requirement to support around 500 RPS (requests per second), and that traffic might grow as high as 10,000 RPS.
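To make that range concrete, here is a rough back-of-the-envelope sizing in Python. The per-container throughput and the headroom figure are illustrative assumptions, not numbers from any load test:

```python
import math

# Illustrative assumptions (replace with your own load-test numbers):
RPS_BASELINE = 500      # launch traffic from the requirement
RPS_PEAK = 10_000       # projected peak traffic
RPS_PER_TASK = 200      # assumed sustained throughput of one app container
HEADROOM = 0.30         # keep 30% spare capacity for spikes

def tasks_needed(rps: float) -> int:
    """Containers needed to serve `rps` while keeping the headroom free."""
    return math.ceil(rps / (RPS_PER_TASK * (1 - HEADROOM)))

print(f"baseline: {tasks_needed(RPS_BASELINE)} tasks")   # baseline: 4 tasks
print(f"peak:     {tasks_needed(RPS_PEAK)} tasks")       # peak:     72 tasks
```

Even a crude estimate like this tells you whether you are planning for a handful of containers or hundreds, which feeds directly into the cost and manageability questions above.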

  2. KYC — Know Your Customer

Try to get an understanding of your customer base, the regions from which they will generate traffic, and how important this application is to the customer: for example, what happens if a request is dropped, and what are the implications and consequences? This helps you focus on a few regions to begin with and scale to other regions at a later point in time, saving on capex and opex.

With this information in hand, let's see what our stack would look like at a very high level and break it into pieces to solve the larger problem.

At a high level, the architecture will have the components below.

  1. Route 53 — to provide the address of the load balancer
  2. Load balancers
  3. Application Process

As we can see, the customer accesses the API from their trusted network. The traffic enters the public cloud, where Route 53 resolves the API's name to the load balancer's address, and the load balancer forwards the request to the application process. Let me now delve into each of these components in depth.

Options for Load Balancers/Entry Point

  1. AWS API Gateway (APIGW)
  2. AWS Network Load Balancer (NLB)
  3. AWS Application Load Balancer (ALB)

A comparison of the relevant features of these services is tabulated below.

Feature comparison of APIGW, NLB, and ALB (table image in the original post)

As we can see from the table, all of these services are available in multiple regions and can be deployed across multiple Availability Zones within a region. So, if you deploy them in multiple regions and reference them under Route 53, you can make your application/API highly available. If one region fails, Route 53 health checks can be used to point traffic to another region.
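As a sketch of what that failover wiring looks like with boto3 — the hosted-zone ID, domain names, and load balancer DNS names below are placeholders, not values from this article:

```python
import boto3

route53 = boto3.client("route53")

# Health check that probes the primary region's endpoint.
hc = route53.create_health_check(
    CallerReference="api-primary-hc-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-use1.example.com",  # placeholder
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# PRIMARY record answers while healthy; SECONDARY takes over on failure.
route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",  # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "HealthCheckId": hc["HealthCheck"]["Id"],
                    "ResourceRecords": [{"Value": "nlb-use1.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary-eu-west-1",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "nlb-euw1.example.com"}],
                },
            },
        ]
    },
)
```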

Even if we write a highly efficient, multi-threaded, asynchronous API, factors like TLS handling and custom authorisation will add to the latency of the API. For example, if we choose to terminate TLS connections on the application container, the TLS negotiation for every connection request has to happen inside the application container. If a computationally expensive algorithm is used during the TLS handshake, it also takes CPU away from the application container. All of this adds to the API response time.
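One common way to take that work off the container is to terminate TLS at the entry point instead. A minimal boto3 sketch for adding a TLS listener to an NLB, with placeholder ARNs:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Terminate TLS at the NLB so containers receive plain traffic and spend
# their CPU on application work, not handshakes. ARNs are placeholders.
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/net/api-nlb/abc",
    Protocol="TLS",
    Port=443,
    SslPolicy="ELBSecurityPolicy-TLS13-1-2-2021-06",
    Certificates=[{"CertificateArn": "arn:aws:acm:...:certificate/xyz"}],
    DefaultActions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/api-tasks/def",
    }],
)
```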

Along similar lines, if we need to validate the token/key passed by the API caller in the application on every request, that adds to the response time as well. So if there is an option to cache the validated token, which is usually a JWT or a similar structure, it would certainly save CPU on the application container. Hence, the underlying infrastructure should be chosen such that the load on the application container is minimised and the container is used mostly for doing what the API is supposed to do.
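API Gateway supports exactly this pattern via a Lambda authorizer whose result it caches for a configurable TTL, so the same token is not re-validated on every call. A minimal sketch; `validate_jwt` is a hypothetical stand-in for a real JWT library such as PyJWT:

```python
def validate_jwt(token: str) -> str:
    """Hypothetical helper: verify signature/expiry with your real JWT
    library and return the caller's identity."""
    if not token.startswith("Bearer "):
        raise ValueError("malformed token")
    return "customer-123"

def handler(event, context):
    """TOKEN-type Lambda authorizer. API Gateway caches the returned
    policy per token for the authorizer's configured TTL, so repeated
    calls with the same token skip this function entirely."""
    try:
        principal = validate_jwt(event["authorizationToken"])
        effect = "Allow"
    except Exception:
        principal, effect = "anonymous", "Deny"

    return {
        "principalId": principal,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }
```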

Another important aspect is how easily the service can scale out and scale in. Under sustained load, the chosen service should scale out with little or no intervention. In the case of APIGW, if we want to support more than 10K RPS, we need to raise the limit with the AWS support team and get it activated. NLB and ALB handle such scale-out automatically, so their manageability is also simpler.
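The APIGW limit raise can be filed as a support case or, where the quota is adjustable, requested programmatically via the Service Quotas API. A sketch of the programmatic route; the quota is looked up by name rather than hardcoding a quota code, and the desired value is just an example:

```python
import boto3

quotas = boto3.client("service-quotas")

# Find API Gateway's account-level throttle quota, then ask for more.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="apigateway"):
    for quota in page["Quotas"]:
        if "throttle" in quota["QuotaName"].lower() and quota["Adjustable"]:
            resp = quotas.request_service_quota_increase(
                ServiceCode="apigateway",
                QuotaCode=quota["QuotaCode"],
                DesiredValue=20_000.0,  # example target RPS
            )
            print(resp["RequestedQuota"]["Status"])
```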

Last but most important is cost. Just because we get everything managed by AWS, and overall operations and deployments are easy, we should not lose sight of the cost factor: over a period of time, it should not cost the business a fortune. One should look for a good balance between manageability and cost. The cost calculators for the above-mentioned services can be found via the References section.

Options for running the application process

For running the application process, there are again plenty of options available.

  1. EC2-based containers (ECS on EC2)
  2. ECS Fargate
  3. EKS Fargate
  4. Lambda

A comparison of these services can be found in the table below.

Feature comparison of ECS on EC2, ECS Fargate, EKS Fargate, and Lambda (table image in the original post)

As we can see, each of these choices has pros and cons. For a long time we have all been familiar with EC2-based servers hosting applications. However, plain EC2 might not suit applications that need to scale out and scale in seamlessly. Also, to be fault tolerant across multiple AZs, one needs multiple EC2 instances up and running across AZs, so the cost of supporting that is significantly higher.

The improved flavour of ECS, Fargate, is very simple to use and easy to manage because it is almost serverless: AWS applies security fixes and patches to the container infrastructure automatically. To scale out, it only takes pulling the image from ECR and starting the application, so scale-out is fast. Cost-wise it might be a bit more expensive than EC2, but with easy manageability and scalability on offer, it can suit many use cases better.
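A sketch of what that scaling looks like with Application Auto Scaling against a Fargate service; the cluster and service names are placeholders, and the 4-to-72 task range reuses the earlier back-of-the-envelope estimate:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/api-cluster/api-service"  # placeholder names

# Let the Fargate service float between 4 and 72 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,
    MaxCapacity=72,
)

# Target tracking: add/remove tasks to hold average CPU near 60%.
autoscaling.put_scaling_policy(
    PolicyName="api-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```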

EKS on Fargate is similar to ECS on Fargate. The difference is that ECS suits a homogeneous workload: an ECS service runs a single image. With EKS, you can run different types of images side by side, so heterogeneous services can be supported.
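With EKS on Fargate, that heterogeneity can be expressed as separate Fargate profiles, one per workload type, each selecting its own pods. A boto3 sketch with placeholder cluster, role, and subnet identifiers (note that EKS processes one profile creation at a time):

```python
import boto3

eks = boto3.client("eks")

# Two profiles on one cluster, one per workload type. Each Kubernetes
# deployment then runs its own image, side by side.
for name, namespace in [("api-profile", "api"), ("batch-profile", "batch")]:
    eks.create_fargate_profile(
        fargateProfileName=name,
        clusterName="api-cluster",
        podExecutionRoleArn="arn:aws:iam::111122223333:role/eks-fargate-pods",
        subnets=["subnet-0abc", "subnet-0def"],
        selectors=[{"namespace": namespace}],
    )
```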

Lambda, meanwhile, is serverless and pretty easy to manage. However, it is mostly suitable for smaller workloads. We have often seen Lambdas time out when resources are not allocated properly or when they are used for heavyweight operations, such as managing a large company's employee database. One more factor to be cautious about with Lambda is its memory and its /tmp scratch file system. To avoid cold-start time, Lambda sometimes reuses the execution environment of a previous invocation. So, if the application does any stateful operation using Lambda memory or the file system, one should take the precaution of clearing these up before relying on them in a new invocation.
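A defensive handler pattern for that environment reuse might look like this; the scratch-file handling is illustrative:

```python
import os
import shutil

SCRATCH_DIR = "/tmp/scratch"  # /tmp persists across warm invocations

def handler(event, context):
    # The execution environment may be recycled from a previous call,
    # so wipe any leftover scratch state before doing stateful work.
    shutil.rmtree(SCRATCH_DIR, ignore_errors=True)
    os.makedirs(SCRATCH_DIR, exist_ok=True)

    work_file = os.path.join(SCRATCH_DIR, "payload.tmp")
    with open(work_file, "w") as f:
        f.write(str(event))  # illustrative stateful operation

    # ... process work_file ...
    return {"statusCode": 200}
```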

Cost is also an important factor, and links to the cost calculators can be found at the end.

Some Best Practices

  1. As mentioned at the beginning, gather as much information as possible about the traffic pattern.
  2. To start with, use only the resources that are absolutely needed to meet the requirements.
  3. As you learn from the traffic pattern, tune your systems/services to adapt to it. For example, if you clearly know you have a periodic traffic peak of a specific duration, you can scale your services up just before the peak, catering to the need while keeping costs optimal (a scheduled-scaling sketch follows this list).
  4. To start with, you need not deploy across the world. Deploy close to your customers' regions so their traffic is served within the region, and scale out as customer adoption increases.
  5. Do not make choices just because something is the hot or latest topic in technology. Choose what benefits the business most.
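For point 3, Application Auto Scaling supports scheduled actions. A sketch that raises the capacity floor of the placeholder ECS service from earlier just before a known weekday peak and relaxes it afterwards; the times and task counts are examples:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/api-cluster/api-service"  # placeholder names

common = {
    "ServiceNamespace": "ecs",
    "ResourceId": resource_id,
    "ScalableDimension": "ecs:service:DesiredCount",
}

# Pre-warm capacity 30 minutes before the known weekday peak...
autoscaling.put_scheduled_action(
    ScheduledActionName="pre-peak-scale-up",
    Schedule="cron(30 8 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 40, "MaxCapacity": 72},
    **common,
)

# ...and relax the floor once the peak window is over.
autoscaling.put_scheduled_action(
    ScheduledActionName="post-peak-scale-down",
    Schedule="cron(0 12 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 72},
    **common,
)
```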

Summary

In this article, we discussed the various factors that can yield different results for the same code we write, ranging from availability, scalability, and manageability to cost. I hope this article helps you take these factors into account while building your application in the cloud and make informed decisions and choices for your application.

References

https://docs.aws.amazon.com/lambda/index.html

https://docs.aws.amazon.com/elasticloadbalancing/

https://docs.aws.amazon.com/apigateway/index.html
