Yes, here we go again: a listing of ten characteristics, advantages or attributes, yadda, yadda, yadda. Even I am not a fan of these types of articles. I recently moved from one company to another, and as a result I had to switch almost all of my cloud work from AWS to GCP. It has been an exciting journey so far, so I decided to point out some places where these two platforms differ, in the hope that it might be useful to others.
Please keep in mind that I am writing this in the middle of 2020, so things may be a little different by the time you read this piece.
I would also like to stress that these are my findings in my current situation, so yours may be totally different, or you may find mine invalid. Feel free to correct me in the comments below. I love being wrong!
What GCP is missing ...and so do I
I would like to start with one of the most apparent things I miss in GCP: the astonishing number of services you can use in AWS! Let me explain why.
Let's say you would like to use a basic Elasticsearch service. Nothing fancy, no plugins. The customer is willing to pay for the service because they know your time is better spent on something more substantial. Well, in GCP, tough luck.
The correct way would be to analyze your application and use other GCP services. But if you insist on Elasticsearch, you will need to deploy your own GCE instances and maintain them. To show that we are struggling with precisely that problem, here is a Terraform repository implementing just that, and it is only one of many examples: https://github.com/AckeeCZ/terraform-gcp-elasticsearch
I am aware that Elasticsearch from AWS is not the same thing as an Elasticsearch you manage yourself. The service lacks plenty of plugins, and it is relatively expensive for a small project. I guess the main reason the service is not that open to user settings is the need to keep up with the very demanding SLAs that AWS provides. But, let's face it, who wouldn't give up some plugins in exchange for overall stability of the application? Moreover, Elasticsearch is not the only limited service in AWS. Plenty of functions in AWS Aurora MySQL are also disabled, or mapped to other functions that non-root users can call. We can only imagine why. My guess would again be a very strict SLA.
Missing service? It is up to you to decide
As an architect, you should be the one who decides whether a missing service is a problem for your application. If it is, either switch to almighty AWS with all of its issues or spend some more time analyzing your app. You might have made some wrong architectural decisions, and once you adjust your application, another GCP service may fit. To sum up this first issue: when the service you would like to work with is missing in GCP, there is a fair chance you are doing it wrong. If your application can't be bent to fit GCP services, then it's the right time to use AWS.
The second issue I discovered during the migration from one cloud to the other is the difference in how each cloud names the entity your resources live in. In AWS, your resources live in an account, whereas in GCP, they live in a project. IMHO a project makes much more sense than an account. But it all depends on the angle from which you are looking at your infrastructure. If you are lifting and shifting into AWS, your servers are probably pets financed by the company that owns them. Then an account makes perfect sense: it is cherished and well maintained. But once you develop an app that should be deployed fast and discarded even faster, you shouldn't need an account with all its billing info and its uniqueness; you want a project.
Let's consider this wild scenario: you have one account. The account has its own billing. It's not part of an organization. Your Terraform code is terrible. It names resources without a random suffix, so every attempt to replicate the setup fails on resources that already exist under the same name. So you cleverly decide to make a (regional) workaround and deploy the development environment to Frankfurt, staging to London, and production to Ireland. Great! You would be surprised how many times this has happened. Instead of creating multiple accounts (or projects), people just keep switching between regions. In GCP, the terminology for this case is much cleaner. You have various projects, and that's it. Would you like to have a production environment? No problem; just deploy it into a different project and link it to the correct billing account.
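To make the "random suffix" idea concrete, here is a minimal sketch in Python. In Terraform itself you would more likely reach for something like the random provider; the function name and the naming scheme below are hypothetical and only illustrate the principle.

```python
import secrets


def resource_name(base: str, env: str, suffix_bytes: int = 2) -> str:
    """Build a name such as 'api-dev-9f3a' (hypothetical naming scheme).

    The random hex suffix makes it very unlikely that two deployments of
    the same stack collide on a resource name.
    """
    suffix = secrets.token_hex(suffix_bytes)  # 2 bytes -> 4 hex characters
    return f"{base}-{env}-{suffix}"


if __name__ == "__main__":
    # Two deployments of the same environment get different names
    # (barring a rare suffix collision).
    print(resource_name("api", "dev"))
    print(resource_name("api", "dev"))
```

With names like these, a second copy of the environment can live next to the first one, so there is no need to escape into another region.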
Different identities, different flow
Let's address the differences in the cloud's most essential service: AWS IAM vs. GCP IAM. First, I would like to clarify something: GCP IAM is not really IAM, and there is a simple reason for this. GCP IAM does not provide identities. It uses identities from G Suite or any other email provider. There is something else I would like to point out about GCP: it has something called service accounts. You have probably guessed that those accounts are used to identify services (which makes total sense, BTW). In AWS, you have something similar, of course; it's called a role. Be aware, though, that the flow is a bit different.
For example: in AWS, you assign a role to an EC2 instance. The instance is identified by its instance ID. The role contains permissions, which the instance assumes. In GCP GCE, you assign a service account to the instance. The account identifies the instance, and permissions are granted to the service account. I am sure you have noticed that there are clearly some problems with the terminology. Service accounts make much more sense. The problem gets worse every time the clouds collide in some common service:
- In AWS, you can have a K8s cluster, which also contains service accounts.
- In GCP's K8s, this gets a bit simpler, and usually the K8s service accounts are mapped to IAM service accounts.
And that's not all. Let's not forget that GCP instances almost always have a default service account, whereas EC2 instances do not have any default role. Piece of cake, right? Even while writing this paragraph, I am sure my terminology is only partially correct. All of this can easily confuse new users.
Once you need more pods, you are probably using it wrong
And now, let's not forget about the most hyped service of the last few years. I sure won't... AWS K8s is called EKS, and GCP K8s is called GKE. To be honest, K8s was developed mostly by Google, so it is unfair to compare GKE to the somewhat younger EKS. Let me also say that the situation with EKS is getting better every day. I was the unlucky one who had to start his K8s journey with EKS, which had only just launched back then.
The first problem I encountered was the network overlay, or rather the lack of one. We deployed four small nodes just to test the whole thing. To see how EKS operates, we also deployed about a hundred pods. Surprise, surprise... almost every pod was pending and unschedulable. Of course, our scenario was stupid. Nobody should do it that way. Your node count and node capacity should reflect the needs of the microservices running in the pods. Therefore the number of IP addresses should also reflect the capacity and count of the nodes.
Let's not overlook the fact that every pod needs its own IP address, and AWS does not enable any network overlay by default. You have only as many IP addresses as your instance type allows. In our case, it was around ten IP addresses.
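A rough back-of-the-envelope sketch of that limit, in Python. It uses the commonly quoted formula for the default AWS VPC CNI as I understand it; the per-instance ENI and IP limits are recalled from the EC2 documentation and should be verified, and the t3.small node type is purely an assumption, since I don't name the instance type above.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Approximate pod capacity of one node under the default AWS VPC CNI.

    Each ENI keeps one address for itself; the "+ 2" covers host-network
    pods (such as aws-node and kube-proxy) that need no pod IP.
    """
    return enis * (ips_per_eni - 1) + 2


# (max ENIs, IPv4 addresses per ENI); assumed values, check the EC2 docs.
INSTANCE_LIMITS = {
    "t3.small": (3, 4),
    "t3.medium": (3, 6),
    "m5.large": (3, 10),
}


def pending_pods(requested: int, nodes: int, instance_type: str) -> int:
    """How many of the requested pods cannot get an IP address at all."""
    capacity = nodes * max_pods(*INSTANCE_LIMITS[instance_type])
    return max(0, requested - capacity)


if __name__ == "__main__":
    # Four small nodes and about a hundred pods, as in the story above.
    print(max_pods(*INSTANCE_LIMITS["t3.small"]))
    print(pending_pods(100, nodes=4, instance_type="t3.small"))
```

The sketch ignores the system pods that also take slots, so the real number of pending pods would be somewhat higher.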
The worst case is hitting a limit on horizontal scaling. Once your instances run out of IP addresses, your scaling stops. Try explaining this to an on-call support technician who cannot scale up in the middle of the night because some other namespace took all the available IP addresses in the cluster. My point, and my advice to you, is that the native AWS pod networking can be a pain. It has its moments, but for testing and development, just use Calico.
Not a healthy way to handle health checks
Another not-so-fun feature of GCP is the way it handles health checks. I know that Google offers an absolutely unique solution with its Cloud VPC, and let's face it, we take it for granted. Just imagine: the VPC covers the whole world. So why do the health checks have to come only from particular subnets? I do not know. It would be more practical to send them from the load balancer, IMHO.
Furthermore, the subnets are different for L4 and L7 load balancers. Everyone creates one firewall rule with all of the subnets just to get it done. It's really not ideal, and it raises a lot of questions from junior engineers. In an ideal world, I would suggest sending health checks from the load balancer itself, but I am aware that GCP load balancers are quite unusual. It took me days to figure out a simple internal managed load balancer. Not that I can't create one in the cloud console, but writing working Terraform code for any GCP load balancer is a job for competent engineers.
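To illustrate the "one rule with all of the subnets" habit, here is a small Python sketch. The CIDR ranges are the health check source ranges as I remember them from Google's documentation around the time of writing; treat them as assumptions and check the current docs before putting them into a firewall rule. The function names are made up for this example.

```python
import ipaddress
from typing import List

# Health check source ranges per load balancer family (assumed values).
HEALTH_CHECK_RANGES = {
    "l7": ["35.191.0.0/16", "130.211.0.0/22"],
    "l4": ["35.191.0.0/16", "209.85.152.0/22", "209.85.204.0/22"],
}


def combined_source_ranges() -> List[str]:
    """Union of all ranges, deduplicated and sorted: the lazy single rule."""
    networks = {
        ipaddress.ip_network(cidr)
        for cidrs in HEALTH_CHECK_RANGES.values()
        for cidr in cidrs
    }
    return [str(network) for network in sorted(networks)]


def is_health_check_source(ip: str) -> bool:
    """True if the address falls inside any of the ranges above."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in combined_source_ranges()
    )


if __name__ == "__main__":
    print(combined_source_ranges())
    print(is_health_check_source("130.211.1.5"))
```

The list returned by `combined_source_ranges()` is what typically ends up as the source ranges of that single catch-all ingress rule.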
We have already discussed how IAM differs a bit between AWS and GCP. What if I told you that to increase the quota for any GCP resource, you have to go to the IAM settings? Weird, isn't it? Here I would say that AWS is much more straightforward. I still think quotas should be a part of billing, not IAM. But I guess it's just a matter of perspective.
The game of value
Instead of throwing dirt on both AWS and GCP, I would rather finish this small article with what is essential for you to know. The use case drives the infrastructure. The use case itself defines plenty of conditions you should keep in mind. Write them on the board, assign the services that could fulfill those conditions, and then choose wisely which way you want to go. Of course, you can make your architecture heterogeneous, and for some use cases, that would make perfect sense. For example, you would like to use the mighty machine learning services from GCP, but you are pretty much OK with keeping your React frontend in an AWS S3 bucket.
You always have to ask: is the cost of operating such a heterogeneous design worth the overall value? And I am not talking only about day-to-day labor like updating the nodes in K8s. You should also consider how wide the knowledge base of the operations team has to be. Without a doubt, we can say that it is cheaper to have just one cloud provider.