It wouldn’t be a year in DevOps if it wasn’t filled with thrilling support issues, infrastructure blunders and unexpected situations caused by the current state of the world. In the old days, I had a feeling that the Internet was always running. Ever since I joined the workforce, it has been proven to me time and time again that the Internet is fragile and has problems. Just this year, GCP reported 121 issues.
But let’s be honest. If you are reading this, you are most probably someone who works as an SRE or a DevOps engineer. All of us are doing this job not because it is easy but because it is hard. We like the challenge, the feeling that we are able to fix something. Who else can say that about their job?
This year we faced a lot of challenges. I am happy to say my team passed all of them with flying colours. Even though our team lost almost half of its members to the new Ackee branch, we were still able to manage. Two out of five of our colleagues chose to switch to Ackee Blockchain. We tried our best to support their professional decision even though it sometimes meant a lot of sacrifices. But we managed! For that, I am grateful. I guess it’s Ackee’s priority to let all employees work in the field they find most important.
But how many issues did the DevOps department handle, and how well? Let’s see:
Ackee DevOps 2021 in numbers
DevOps does not only provide support. We should spend at least 50 % of our time on something creative. It is vital for our mental health to create new things. Those could be related to automation, programming itself or just improving documentation.
There is a lot of stuff we like to work on, but one thing I would like to mention is our Terraform modules. In the year 2021, we worked on 14 Terraform modules, which were downloaded ~62k times. Our most starred module is terraform-gcp-elasticsearch. Terraform is the code we maintain and use the most, so those numbers are not a surprise. We do appreciate that others are also interested in our code. What worries me is that there are not that many open issues in the modules we develop. We are sure that the modules are not perfect, so we would be happy to fix something once in a while.
Let’s talk about the support we give to our beloved developers. We are responsible for a lot of things, to name just a few: pipelines, IAM setup, cloud infrastructure and updates. Once a developer has a problem with anything we are responsible for, they write to our #support Slack channel and we see if there is something we can do about it. In the year 2021, we exchanged 2266 messages in the #support channel. Considering we currently have 70+ members in our company, that doesn’t seem to be that many. The number is similar to the count we had in the year 2020.
The hours we spend on support are something I found particularly interesting. During the year 2021, we spent around 1000 hours on support. If you consider that the DevOps team spent around 6k hours on other work, it doesn’t sound too shabby. Well over 50 % of our time was given to something other than support.
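For anyone who wants to check the arithmetic behind that ratio, here is a minimal sketch. It assumes the two rounded figures above (about 1000 support hours and about 6000 hours of other work) are the whole of the tracked time; the function name `support_share` is made up for this illustration.

```python
def support_share(support_hours: float, other_hours: float) -> float:
    """Return the fraction of total tracked time that went to support."""
    total = support_hours + other_hours
    if total <= 0:
        raise ValueError("total tracked hours must be positive")
    return support_hours / total


# Rounded figures from the text: ~1000 h of support, ~6000 h of other work.
share = support_share(1000, 6000)
print(f"support: {share:.1%}, other work: {1 - share:.1%}")
```

With these rounded inputs, support comes out at roughly one seventh of the total, comfortably under the 50 % ceiling.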
We also provide 24/7 support to two other projects. Quite frankly, it was the most demanding part of our job, but that’s what we are here for. We do not colour buttons on the front end, we do not create backends for saying who has a birthday either, we make things reliable for eleven nines in a row! (For non-SRE people, I mean 99.999999999% reliability.) Overall, we managed to solve 546 alerts, which is not too bad! That means roughly three alerts every two days, day and night. I am proud of my team for that. We managed! Nobody wanted to give up, everybody wanted to suffer the same as the others did. Thank you! Thank you all for that.
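To put those two numbers in perspective, here is a small sketch that turns a count of nines into a yearly downtime budget and the alert count into a daily average. It assumes a 365-day year and treats "N nines" as an unavailability of 10^-N; the helper names are invented for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Downtime budget for an availability target made of `nines` nines.

    Two nines means 99 %, three nines means 99.9 %, and so on, so the
    allowed unavailability is 10 ** -nines of the period.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return period_seconds / 10 ** nines


def alerts_per_day(alerts: int, days: int = 365) -> float:
    """Average number of alerts per day over the given number of days."""
    return alerts / days


budget_ms = allowed_downtime_seconds(11) * 1000
print(f"eleven nines: {budget_ms:.3f} ms of downtime per year")
print(f"546 alerts: {alerts_per_day(546):.2f} per day on average")
```

Eleven nines leaves a budget of well under a millisecond per year, which is why the phrase is more of a battle cry than a contractual target.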
DevOps goals we had last year
Last year, I talked about new features we would like to put into our DevOps stack. Those were:
- Service mesh,
- Improvement of the Terraform code for the latest versions.
Surprisingly, service mesh didn’t live up to the hype we expected. We chose the widely supported and widely discussed Istio. It lived up to its reputation of being very hard to understand and manage. It was just a bit too much in multiple ways:
First of all, deploying Istio raised the overall cost of each of our GKE clusters by 10 %. That might be an issue if the customer watches their bills carefully.
Secondly, as I mentioned last year: “We would deploy anything else, but ISTIO is the only supported platform in the GKE cluster.” Back then, Istio support was in beta; it was later deprecated. GKE now offers Anthos Service Mesh. Quite frankly, Google, who the hell gives a rat’s ass unless you offer a reasonable SLA?
The whole Anthos idea is getting ridiculous. Can you imagine how you are going to explain service mesh to your customers? “There are Linkerd, Kong and Istio, which are widely established, but the only thing I can offer you on GKE is Anthos Service Mesh.” Thank you, Google, once again for totally missing the target customers.
Anthos is advertised as a hybrid platform allowing on-premise workloads to interact with a cloud environment. If there is still a company using on-premise in the year 2022, do you think they would care about service mesh? I hardly think so.
Overall, we decided that almost none of our workloads would benefit from a service mesh (with one or two exceptions). For frontend apps hidden behind a CDN, a simple Knative platform would work fine; others can benefit from an Argo CD workflow. GCP supports Knative in Cloud Run.
I also mentioned Terraform. For plenty of reasons, adopting new features in Terraform is like opening candy. Somehow, I can’t comment on it nearly as much as I did on service mesh. The reason might be that HashiCorp is doing a good job of maintaining Terraform. Thank you, HashiCorp.
Reasonable stack for mid-range projects
Last year I wrote about deploying your React applications to Firebase Hosting. This is a fairly simple way to deploy a web application without any hassle. This year I am going to add one more hint: you can also map your Cloud Run services to the same domain as the Firebase Hosting site. This is great for mid-range applications. You no longer need to wait for CORS preflight (HTTP OPTIONS) requests, because everything is served under the same domain (but under a different path).
Furthermore, you don’t need to manage any instances under Cloud Run as you would for GKE (if you are not using Autopilot). Let’s endorse NoOps and not provision heavy platforms for small apps. If I were to recommend any managed Knative platform, Cloud Run would be my first choice. The only thing I currently miss is Prometheus support. Otherwise, if you don’t need a full SLA (a lot of parts are still in beta), Cloud Run should be your best option.
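The same-domain mapping mentioned above is configured through rewrites in firebase.json. Here is a minimal sketch that generates such a file; the service name `backend-api`, the region `europe-west1`, the `/api` prefix and the `build` directory are all placeholders I made up, not values from our projects, so swap in your own.

```python
import json


def build_firebase_config(
    service_id: str,
    region: str,
    api_prefix: str = "/api",
    public_dir: str = "build",
) -> dict:
    """Build a firebase.json structure that serves a static app and
    forwards one path prefix to a Cloud Run service on the same domain."""
    return {
        "hosting": {
            "public": public_dir,
            "rewrites": [
                # The first matching rule wins, so the API rule goes first.
                {
                    "source": f"{api_prefix}/**",
                    "run": {"serviceId": service_id, "region": region},
                },
                # Everything else falls through to the single-page app.
                {"source": "**", "destination": "/index.html"},
            ],
        }
    }


# Hypothetical service name and region, for illustration only.
config = build_firebase_config("backend-api", "europe-west1")
print(json.dumps(config, indent=2))
```

Because the browser talks to the frontend and the API on one origin, the calls are same-origin and no preflight is triggered. Keep the catch-all rule last, otherwise it would swallow the API path.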
Ackee DevOps in 2022
Let’s face it: there is hardly anything suggesting the year is not going to suck. COVID is still in full swing. We are still going to face issues never seen before. Buckle up! But there are plenty of reasons why I am not worried: I have a wonderful team full of dedicated, hard-working people who are on the same page as I am. Also, I am a member of a wonderful company that understands its own DevOps team. That is very important, because as netmeister says:
2022, we are not scared, we are ready and we will manage! And the same goes for Ackee as well.
If you found this post interesting and you would like to be part of our DevOps adventures, don’t hesitate to reach out.