Practical Dev+Ops for Enterprise IT
Ops <> Support
When I ask my colleagues (especially in Enterprise IT at large organizations) how their DevOps journey is going, a recurring theme emerges: many have put in tools to automate their Dev, but only a minority have solved the challenge of combining Dev & Ops (the original definition in Wikipedia, BTW). It doesn’t help that many tool vendors push DevOps strategies that focus only on the tools aspect of DevOps. The tools/automation aspect is also more tangible and perhaps has more visible benefits.
A common refrain I hear is that organizational barriers prevent combining Dev & Ops:
“My developers don’t want to be oncall”,
“But Ops is a different organization. It’s too hard politically to combine that.”
“We still need a separate team for L1, L2 support. Right?”.
All probably good reasons but all of them missing the point — Ops is NOT just “Support”. Huh? If that doesn’t make complete sense, read on.
I am in no way an expert in this field but the key intent of combining Dev & Ops is to shorten time to market with high software quality (I shamelessly plagiarized that from Wikipedia).
The “wisdom” of Production
Indulge me for a couple of minutes and let me walk you through my early career. Within a year of starting out, I was lucky enough to be supporting a complex ecosystem of merchandising applications for a Fortune 500 company. This was years ago, and there were some great 3-tier applications with a myriad of UI, middleware, database and batch processing technologies. You name it.
After monitoring those apps through a web of Excel files that collected various metrics, I sat down one evening and burnt a lot of midnight oil to create a VB6 application called “PS Buddy” (Production Support Buddy. Don’t judge me.) So what did the app do? Well, it made a bunch of ssh calls on startup, ran scripts that aggregated logs and parsed metrics, and sent the results back to refresh a flat-file database. When you selected an application in the VB app, you would get a dashboard of uptimes, system status, batch run times, error counts, causes of errors, trending, transaction count vs. run time, etc. Well, it was my version of a typical Splunk Ops dashboard (probably a few months before Splunk was founded! Oops, just dated myself.)
But why do I tell that story? Well, I cannot describe in words how much I learned about software development during my tenure in that role.
It would have been very hard for me to build that dashboard and, more importantly, to incrementally add appropriate telemetry to the application logs without experiencing the pains of supporting the applications in production.
I believe Ed Murphy (of Murphy’s Law fame) was an Ops guy (he was NOT). It wouldn’t have been possible to understand the need for quick troubleshooting, error handling, telemetry and alerting without taking those 2 am calls or being paged in the middle of dinner. It wouldn’t have been possible without explaining to the CFO of the company that they could not close the books because the new guy ran the month-close job twice without setting the right restart parameters (the week-close job is fine to run twice, not the month-end job!). And we only caught it when the business called us the next morning, since there was no error handling or audit for that scenario. We worked 72 hours straight to recreate the whole ledger for the month by combing through every transaction.
I didn’t share the above to just indulge you in my personal nostalgia trip (though if nostalgia was a human being, she just slapped me in the face). This is what Google refers to as “the wisdom of production” in their SRE book. Here is what they say
“By this phrase, we mean the wisdom you get from something running in production - the messy details of how it actually behaves, and how software should actually be designed, rather than a whiteboarded view of a service isolated from the facts on the ground.”
So where am I going with this? Here is the point I am trying to make:
The hypothesis is that if we embed that wisdom into development, we will engineer our systems to work more reliably in production and achieve operational excellence. Not by chance, this also aligns with a key tenet of Lean and with the “Second Way” of DevOps.
The challenge of Enterprise IT
So back to “Ops” <> “Support”. If your company scaled up in, or was born into, the internet era, chances are that Dev and Support are done by the same team. That is the easiest way to embed the “wisdom of production” into the Development people/process/tool/mindset.
But if you are an Enterprise that grew up in the 80s, 90s and early 2000s, chances are that you are a functionally oriented IT organization and probably implemented a version of the tiered support model for your applications (the dreaded L1, L2, L3 etc.). You have structures in place (people, process, tools and in some cases organizational politics) that make it much harder to have the same team do Dev and Support. So what do you do? How do you achieve the desired “DevOps”? How do you embed that wisdom of production into Development?
Well, you can follow the “2nd Way” of DevOps (as explained in The DevOps Handbook, the seminal work of Gene Kim and his co-authors): implementing the technical practices of feedback. The book also has a chapter on overcoming this challenge in functionally oriented organizations.
Unfortunately, this is easier said than done for a lot of organizations (at least in my experience and from what I have heard from my colleagues and industry analyst friends). I think one of the key reasons this is hard in practice is that it makes the following assumptions:
- There is a unified vision and alignment across the senior leadership in IT
- The different functions of IT are at a very similar level of maturity
- There is an incentive for everyone to work together to make “Dev” + “Ops” happen since it’s a priority for everyone
The reality is that one or more of the above assumptions are always false. And the real objectives of DevOps cannot be achieved within the real-world silos of the different teams that exist in these organizations: Cloud, Security, Architecture, ITSM, Infrastructure, DevOps, App Dev, App Support, Service Desk, PMO etc. In the real world, all of these teams have their own frameworks, processes, objectives, goals, funding and, unfortunately, politics.
In the ideal world you would think that they are all aligned and dancing to the same tune. But when you get on the balcony, you see that it’s actually a “Silent Disco” with many channels being subscribed.
To really achieve the ideal “DevOps” goals in this organizational context, the CIO, CTO, CISO, Infrastructure leader, App Dev leader etc. have to come together, align and work together. So while you are waiting for that to happen (some day), I will be bold enough to suggest a few practical ideas that could help if you are still finding it challenging to overcome years of organizational complexity.
The Practical “Dev+Ops” for Enterprise IT:
Each of these ideas, implemented on its own, could act as a forcing function and bring you closer to the desired state for Dev+Ops. And I will be honest, none of this is groundbreaking. What I am proposing is to combine the key pieces described in the “3 Ways of DevOps”, SRE Principles & Practices, the Software Engineering Mindset, Principles from ITIL v4 and Organizational Context/Constructs to achieve that goal.
Here is the summary of my high-level “framework” (more of my mental model) of my proposal:
Define Production, your goals & success criteria
The very first step is to define the behavior of your key services and applications. Here are the key components of this:
- Define your key services and the corresponding non-functional requirements: The first step of course is to define your key services and applications. This is typically what you do as part of Service Design in ITIL, and for some reason it is what the SRE/DevOps methodology assumes you already have without referring to it explicitly 😊 Once you do that, you have to define your non-functional requirements (NFRs) for each of these services. What are non-functional requirements? Well, things like usability, performance, security, availability, resiliency etc. See: NFRs.
- Ensure that you define & manage Service Levels (SLIs, SLOs and SLAs) for your application/service: The next step is to convert your NFRs into measurable service level indicators, define their objectives and any business agreements. The SRE book has the best reference material on how to do this. See: Service Level Objectives.
- Create a corresponding monitoring & telemetry strategy: Once you have defined your service levels, you need to determine how you will monitor them and what action you will take when monitoring flags a problem. Once you do that, you can design your corresponding telemetry strategy (i.e. how you will consistently generate relevant telemetry from your services). Again, the best reference for this comes from the SRE book (Monitoring Distributed Systems). Another great source is The DevOps Handbook (Part IV: The Second Way).
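To make the SLI/SLO step above a bit more concrete, here is a minimal Python sketch of how an availability SLI could be measured against an SLO, and how much "error budget" is left. Everything in it is an assumption for illustration: the `Slo` class, the `checkout-availability` name, the 99.9% target and the event counts are all made up, and a real setup would pull these numbers from your monitoring system rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective for a ratio-style SLI (hypothetical shape)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; no traffic counts as fully good."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no room for any bad event.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: one million requests, 500 of them failed.
availability = Slo(name="checkout-availability", target=0.999)
measured = sli(good_events=999_500, total_events=1_000_000)
remaining = error_budget_remaining(availability, 999_500, 1_000_000)
print(f"SLI={measured:.4f} met={measured >= availability.target} budget_left={remaining:.0%}")
```

The useful part for a Dev+Ops conversation is the remaining budget: it gives both sides one shared number for deciding whether to ship the next feature or spend the sprint on reliability.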
Adopt Key Development Practices to simulate Production
The next step is to codify “the wisdom of production” into your development process. This is perhaps the most important aspect: if you don’t do this upfront, it will always be a losing battle to fund it after the fact, unless your whole service is a disaster.
- Create an Operational Architecture for Security & Reliability: As very aptly put in the book “Building Secure and Reliable Systems”, both Security & Reliability are emergent properties of the design of your system, and indeed the design of your entire development, deployment, and operations workflow. Thus your operational architecture & design needs to factor in your security, reliability and availability goals upfront. Again, the book offers great guidance on this (especially chapter 4).
- Have your developers create threat models: This is something I have seen done really effectively only once in my career, but it is also called out as one of the things that developers at Amazon do. This is perhaps one of the best forcing functions for getting your developers and engineers to own the security of their applications and services.
- Encourage your developers to do Design FMEA when applicable: How will your application/service fail? Well, let me count the ways. You cannot anticipate every single scenario for every single service, and this may seem counterintuitive and feel like the path to gold plating (Anti Pattern alert). However, FMEA (which stands for Failure Mode and Effects Analysis, a Lean Six Sigma tool) is a great technique for your business critical applications where failure translates into millions of dollars of impact. FMEA is an inductive reasoning method for anticipating failures in production and mitigating them through design.
- Practice Chaos Engineering: Another great way to build for failures is through the relatively new area called chaos engineering. With chaos engineering, failure is not a matter of “if”, it’s a matter of “when” (which BTW was always the case, chaos engineering just makes it a certainty). This is something you should definitely evaluate for complex ecosystems. I am actually halfway through the “Chaos Engineering” book. I would recommend looking into it for guidance.
- Test for Reliability: Another SRE tenet. Zero MTTR Bug? Read all about it here.
- Automate Everything: Automation or as SRE calls it “Eliminating Toil”.
- Hire Engineering Coaches/Mentors/Reviewers: Again, this is the most important aspect of this section. It’s very rare to get a team of very experienced engineers and developers who can do all of the above practices consistently and correctly. So what do you do? The best answer I have found is to hire experienced engineers who act as coaches, mentors and reviewers and help the team implement and mature these practices. One of the common ways that Enterprises are doing this today is through “DevOps” Dojos. Another practical way is to have your senior engineers and architects focus primarily on mentoring, coaching and reviewing.
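To make the Design FMEA bullet above a little more tangible: a common way to prioritize failure modes in an FMEA is the Risk Priority Number (RPN), which multiplies severity, occurrence and detection scores, each typically rated from 1 to 10. Below is a minimal Python sketch of that ranking. The `FailureMode` class and `rank_by_risk` helper are hypothetical names, and the failure modes and their scores are invented for illustration (the first one borrows from my month-close story); your own team would supply the real list and ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA worksheet (hypothetical shape, scores from 1 to 10)."""
    description: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = always caught before impact, 10 = almost never caught

    def __post_init__(self) -> None:
        # Reject scores outside the 1-10 scale so the RPN stays comparable.
        for field_name in ("severity", "occurrence", "detection"):
            score = getattr(self, field_name)
            if not 1 <= score <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative entries and scores only.
worksheet = [
    FailureMode("Dashboard shows stale metrics", severity=3, occurrence=5, detection=2),
    FailureMode("Month-end close job is run twice", severity=9, occurrence=3, detection=9),
    FailureMode("Batch restart uses stale parameters", severity=6, occurrence=4, detection=3),
]

ranked = rank_by_risk(worksheet)
for mode in ranked:
    print(f"{mode.rpn:>4}  {mode.description}")
```

Notice how a high detection score (nobody would notice until the business calls) pushes a failure mode to the top even when it does not happen often. That is exactly the kind of insight you want to design for upfront.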
Explicitly Enable feedback
Embedding the wisdom of production into “Dev” is all about feedback. So systemically enabling feedback from Production to Development goes a long way toward achieving this goal.
- Psychological Safety: Building a psychologically safe culture is foundational for 2 reasons in this context: it enables blameless postmortems and an objective feedback loop from production to dev.
- Blameless Postmortems: The postmortem is a key practice of the incident management process in the ITIL world. Blameless postmortems are also a key tenet of SRE. There is a lot of content on this, but the one thing I would stress is this: “A mistake is seen as an opportunity to strengthen the system”. Thus an incident or failure in production has to be used to strengthen the resilience of the system, and we cannot do that unless we recognize the true reason for the failure.
- Build a systemic feedback loop from production to Dev: It’s great to know why systems fail in production, but if you have not built a systemic feedback loop from Production to your Dev process, it’s an exercise in futility. Here are a couple of suggestions on building that systemic feedback:
  - If you have a robust “problem management” process, make sure the “problems” are created & reviewed collaboratively and frequently with your dev team.
  - Ensure that the follow-up / resolution of the problem is fed back into the common backlog for the application/service.
Build Shared Accountability
All of the above is essential, but building shared accountability across “Dev” and “Ops” is perhaps the aspect that matters most. Why? Because it’s all about the People. And people are at the heart of achieving “Dev” + “Ops”.
- Create shared incentives for Dev & Ops: Though the “incentive theory of motivation” may have received flak recently, particularly as a driver of employee engagement, I think it’s still a key theory when applied to collective organizational behavior. It’s also something that Clayton Christensen called out as the “Profit Formula” for an organization. The hypothesis is that incentives drive behavior. So, if you want both the Dev and Ops leaders & teams to care about overall “Dev + Ops”, incentivize both of them equally for delivering new features and for overall service operations. This is a forcing function to create shared accountability for “Dev” + “Ops”.
- “Tour of Duty”: This is not a new concept and is proposed in both SRE and The DevOps Handbook. The idea is to temporarily have a few team members from the Ops team or the Dev team do a “Tour of Duty” in the other team. This helps with the cross pollination of the wisdom (practices and learnings).
- Continuous Care vs. HyperCare: This is almost a given. With most companies moving to Agile and 2-week release cycles, the concepts of Warranty and HyperCare are irrelevant. An Ops person has to be embedded in the Dev team throughout for continuous care.
- Build a Modern Shared “Support” Model: And I will end this by contradicting myself :) I was tempted to put together a modern support view of Ops to act as a common language of support across Dev and Ops. I took the traditional tiered support model and worked backwards to hypothesize what it would look like if you shift left and take a DevOps view of the end-to-end flow.
Originally published at https://www.linkedin.com.