DevOps Days Floripa
This past weekend I had the opportunity to attend DevOpsDays Florianópolis, the local edition of a worldwide series of DevOps events. I thought it might be helpful for me and others to highlight some of the talks here.
TL;DR
- DevOps is not a team or a job position or a career, it’s a culture
- Because it is a culture, your team and your company should be ready for it before trying to adopt it
- You shouldn’t be applying DevOps, or any of its tools, just because of the trend, but because they solve a real problem within your team
- What you want for your team are generalist software engineers, who can code but also reason about Ops and SRE
The talks are listed in the order they were given.
What are Production Engineers? And how do they help Facebook to keep its systems in production.
Talk presented by Pedro Marques da Luz, Production Engineer at Facebook
Around 2009, Facebook was facing the same problem that every other smaller company is now trying to solve with DevOps: communication between the Dev and Ops teams. Their first approach to this problem was to split the Ops team into two teams: SRE and AppOps.
The SRE team was responsible for following SRE practices to keep the systems running (some of which we’ve already seen in The Site Reliability Journey), while the AppOps team was responsible for automating everything possible when dealing with incidents, alerts and production systems.
In one such example of automation, the AppOps team deployed a system that was triggered by production alerts. When triggered, it would run a pre-registered “possible solution”, like cleaning the /tmp folder or restarting specific processes. After the system was deployed, 97% of the alerts that previously required human intervention were resolved automatically!
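To make the idea concrete, here is a minimal sketch of what an alert-driven remediation system could look like. It is not Facebook's implementation: the registry, the alert names (`disk_full`, `process_down`) and the alert fields are all hypothetical, and the handlers only describe the action instead of touching the machine.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: alert name -> pre-registered "possible solution".
REMEDIATIONS: Dict[str, Callable[[dict], str]] = {}


def remediation(alert_name: str):
    """Decorator that registers a handler for a given alert name."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("disk_full")
def clean_tmp(alert: dict) -> str:
    # A real handler would delete old files under /tmp; this one only reports.
    return f"cleaned /tmp on {alert['host']}"


@remediation("process_down")
def restart_process(alert: dict) -> str:
    # A real handler would call the service manager; this one only reports.
    return f"restarted {alert['process']} on {alert['host']}"


def handle_alert(alert: dict) -> Tuple[str, str]:
    """Run the registered remediation, or escalate to a human.

    Escalation happens when no remediation is registered for the alert
    or when the remediation itself raises an exception.
    """
    action = REMEDIATIONS.get(alert["name"])
    if action is None:
        return ("escalated", "no remediation registered")
    try:
        return ("auto_resolved", action(alert))
    except Exception as exc:
        return ("escalated", f"remediation failed: {exc!r}")


if __name__ == "__main__":
    print(handle_alert({"name": "disk_full", "host": "web-1"}))
    print(handle_alert({"name": "cpu_high", "host": "web-1"}))
```

The important design choice is the fallback: anything the system does not know how to fix, or fails to fix, still reaches a human.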
A little further down the road, Facebook dissolved the SRE team completely and renamed the AppOps team to Production Engineering, which is now composed of people with hybrid skills in software engineering and software operations. These Production Engineers are the ones responsible for keeping the production systems running, deploying new features and aiding developers while doing so.
Besides some general information about how Facebook runs its systems, my main takeaway from this talk was the feasibility of building an automated system to handle alerts and errors in production. It made me think about some opportunities to do the same here at SAJ ADV.
The Hidden Costs of Chasing the Mythical “Five Nines”
Talk presented by Steve Fox, founder and CEO of AutoScalr
Some software companies strive to reach the mythical five nines of availability, that is, a system up and running 99.999% of the time. But this quest for availability is, in most cases, a waste of time, money and effort. Even giants like AWS, Azure and GCP guarantee SLAs between 99.9% and 99.99%, at most. ISPs around the world have even lower numbers, between 98% and 99%. So a user trying to reach your system, which holds the incredible five nines of availability your team so tirelessly chased, might not even get to it, because the ISP is down, or the cloud provider is down!
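A quick back-of-the-envelope calculation shows why. The sketch below assumes the user's request has to pass through the ISP, the cloud provider and your system in series, and that their failures are independent (both are simplifying assumptions of mine, not something stated in the talk). Under those assumptions, the end-to-end availability is the product of the individual ones.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Your five-nines system, a 99.99% cloud provider and a 99% ISP.
end_to_end = serial_availability(0.99999, 0.9999, 0.99)

if __name__ == "__main__":
    print(f"end-to-end availability: {end_to_end:.4%}")
    print(f"your system alone: {downtime_minutes_per_year(0.99999):.1f} min/year")
    print(f"what the user sees: {downtime_minutes_per_year(end_to_end):.0f} min/year")
```

Five nines allows roughly five minutes of downtime per year, yet the end-to-end figure stays below 99%, dominated by the weakest link.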
The point here is, you and your company shouldn’t be blindly pursuing such high availability, because beyond some point, let’s say 99.9%, it’s simply not worth it. Spending time and effort to reach a higher availability comes at the cost of shipping fewer features and bug fixes. Instead, it’d be more productive for your team to establish SLOs for your system and set an Error Budget based on them. With the budget in place, you can “spend” it to ship new features, which will inevitably bring new errors to the system and consume part of the budget. When the budget approaches zero, it’s time to slow down on new features and stabilize the system again.
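As a rough illustration of the arithmetic, here is a minimal sketch of a request-based Error Budget. The function names, the 10% freeze threshold and the example numbers are assumptions made for this post; real setups usually compute this from monitoring data over a rolling window.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent (negative when overspent)."""
    budget = error_budget(slo, total_requests)
    return (budget - failed_requests) / budget


def can_ship(slo: float, total_requests: int, failed_requests: int,
             threshold: float = 0.1) -> bool:
    """Keep shipping features while more than `threshold` of the budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > threshold


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
    print(error_budget(0.999, 1_000_000))
    print(can_ship(0.999, 1_000_000, failed_requests=250))  # plenty of budget left
    print(can_ship(0.999, 1_000_000, failed_requests=950))  # time to stabilize
```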
Another use case for the Error Budget would be to lower your cloud hosting and provisioning costs. If your system runs on n instances on AWS and always reaches the end of the month with 100% of the Error Budget left, with no errors or slow requests, then it could quite possibly still meet its SLOs running on n - m instances, and your team would save the monthly cost of those m instances.
Monitoring: logs as a first-order element
Talk presented by Izael Effemberg and Thaisa Mirely, from ThoughtWorks
I liked this talk because it mirrored what I had studied and done at SAJ ADV with our logs (A Mild Log Situation). An interesting point I forgot to make in my post, when discussing information you shouldn’t log, was to cite GDPR and its Brazilian equivalent, Lei Geral de Proteção de Dados, as further reasons not to log private information about you, your company and your customers.
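One way to enforce this in code, rather than relying on everyone remembering it, is to redact sensitive values before they reach the log output. The sketch below uses a filter from Python's standard `logging` module. The `RedactEmails` class and the deliberately simplistic e-mail pattern are my own illustration, not something shown in the talk, and a real system would need to cover many more kinds of personal data.

```python
import logging
import re

# Deliberately simple pattern; an assumption for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class RedactEmails(logging.Filter):
    """Replace anything that looks like an e-mail address in log messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so addresses passed as arguments are covered too.
        record.msg = EMAIL_RE.sub("[REDACTED]", record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactEmails())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.warning("login failed for %s", "ana@example.com")
```

Attaching the filter to the handler means every record written through that handler is rewritten, regardless of which part of the code logged it.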
Big Data Platform for Bioelectrochemical Systems
Talk presented by José Pedro de Santana Neto, Data Engineer at Dynamox, and Simone Perazzoli, researcher at the Federal University of Santa Catarina
An interesting peek into the field of Bioelectrochemical Systems: research on how to make electrical batteries out of bacteria from waste.
The talk only loosely fit into DevOpsDays, through the way they deal with the data generated by the tests. The data is collected from Arduino sensors and goes to a Kafka cluster, which feeds a Spark processing pipeline that runs some calculations. The results are then sent to an Elasticsearch database, where they are consumed by a Kibana dashboard and presented to the researchers.
This way, the data is consumed and analyzed faster by the researchers, which is a huge advantage, considering that the battery-bacteria setup takes at least 6 months to be ready to produce relevant data.
A day in the life of a Performance Architect at Netflix
Talk presented by Martin Spier, Performance Architect at Netflix
Given the absurd scale of Netflix’s systems, with hundreds of microservices, no single person holds a mental representation of the whole architecture anymore. A Performance Architect debugs a problem by leveraging visual tools (some of which were developed by Martin) that represent the flow of requests and information through the microservices and systems, as well as the total input/output of such requests in each microservice.
What drew my attention the most in this talk was the usefulness of visual tools when debugging a complex system or a difficult bug. Yes, these kinds of tools are as old as debugging itself, but seeing them in action (Martin ran a “live debugging session”) opened my eyes!
How to start the DevOps transformation at your company
Talk presented by Henrique Bueno, CTO at Estabilis
In short, DevOps is not a job position, and it’s not a team, but a culture that has to be adapted to your company.
DevOps Behind The Scenes
Talk presented by Mateus Prado, Cloud Solutions Architect at C6 Bank
DevOps, being a culture, is not for every team or company. But because it’s trendy right now, and has been for the past few years, everybody’s trying to jump on the bandwagon. On top of that, a new DevOps tool that promises to supersede everything that came before seems to appear every day, and suddenly everybody’s attempting to replace the old tools with the new ones. You know, because it’s trendy. Just look at the DevOps Periodic Table to see how many tools are out there.
Bottom line is, before attempting to implement DevOps or any of its tools, your team and company should check whether that’s a valid course of action for your specific case. Not every tool is for every team. The only thing that matters is to get new features and bug fixes to the customer as fast as possible:
- if the code is not tested, it’s not ready
- if the code is tested but it’s not in the main repository branch, it’s not ready
- if the code has not been deployed and it’s not in production, it’s not ready
- if the code is in production but it’s not being used by the customer, it’s not ready
Final considerations from the talk:
- Solve real problems
- Measure everything
- Do not wait for the next problem
- Green != production ready
- Sense, not luck