So Who Should Be In Charge of Deployments, Anyway?
Mark Imbriaco said something provocative about software deployments the other day:
"Deployment and ops are orthogonal. Once people understand this, the world will be a much better place."
I honestly could not decide whether I agreed with that sentiment or not. After mulling it over for a few days I realized I would have to write a blog post to sort this all out.
Why Do I Care?
I'm the manager of the Release Management group at a very large internet company. Our mission is all about deployments. We help developers write well-formed packages. We help QA set up their test environment. We design and run tests on software releases ourselves to verify correctness. Finally, we push the code to production. We push releases to more than 10,000 servers every week.
Thus, I think a lot about how software deployments should be done. Here I want to describe our current model a little bit, and then tell you about where we are going and what it means.
The Current Process
First, I should tell you that my release management group is functionally part of the operations organization (Service Engineering). We come from a system administration background, and we're a bunch of Perl scripters.
Our existing process reflects that heritage. We use a push-based deployment mechanism (sorry @jtimberman, that's a topic for another post). The release assembly process is complicated and time-consuming, largely due to the age of our systems. I don't want to leave the impression that we are sloppy or that our process doesn't work. We are a very dedicated group and we do a heck of a job minimizing outages and making things go smoothly. I'm immensely proud of the smart, talented, and dedicated people I work with.
The downside is that this process is slow. Right now we are on a three-week development cycle for our major component. The actual software push to all those servers is done over two of those three weeks. Basically we are continuously pushing code, but not in the way you mean when you say continuous deployment. Making this schedule work takes a dozen people (including folks from QA, development, and service engineering), mainly working on building the software, testing it, and assembling the releases. That's a lot of man-hours. Also, there's not really any way we could make the process much faster. I think realistically we could get it down to a two-week release cycle, but that leaves very little room for error.
Sidebar: Ops Needs vs. Deployment Needs
There is a fundamental conflict between operations and deployments, and I think that's what Mark was getting at in his comment. Operations is about site up. There are other goals, but those are all secondary to the primary concern of keeping the lights on and keeping the business running.
Deployments are production changes and that means a risk of outages. That also means tension between the various groups involved. It's easy to blame the people doing the deployments, especially when you have a change management system which assigns a 'responsible person' for every change.
When an outage does occur during or after a software deployment, the operations approach has to be focused on getting the site back up as soon as possible. That usually means manual changes and not following the defined deployment process. That's the correct thing to do, but it's not something the people doing deployments should be involved in directly. The Release Management team acts as a gatekeeper to monitor those manual production changes and make sure they get folded back into the official deployment process. That's the most important task my team performs. It's also a task that is outside the scope of day-to-day site up.
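To make that gatekeeping step a little more concrete, here is a minimal sketch of how manual production changes might be spotted so they can be folded back into the official process. It simply compares what the release manifest says should be on a host with what is actually installed. The function name `find_drift`, the package names, and the version strings are all hypothetical and chosen for illustration; this is not the tooling described in this post.

```python
def find_drift(expected, actual):
    """Return packages whose installed version differs from the release manifest.

    Both arguments map package name -> version string. A missing package
    is reported with None on the side where it is absent.
    """
    drift = {}
    for package in sorted(set(expected) | set(actual)):
        want = expected.get(package)
        have = actual.get(package)
        if want != have:
            drift[package] = (want, have)
    return drift


# Hypothetical data: what the last official release shipped...
expected = {"frontend": "4.2.0", "search": "1.9.3"}
# ...versus what is on the host after an emergency fix.
actual = {"frontend": "4.2.0-hotfix1", "search": "1.9.3", "debug-tools": "0.1"}

for package, (want, have) in find_drift(expected, actual).items():
    print(f"{package}: release says {want}, host has {have}")
```

Each reported difference is a candidate for follow-up: either the hotfix goes into the next official release, or the host gets brought back in line with the manifest.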
Like I said, our existing process works - it's just slow and resource-intensive. Everyone knows that is the case, and we're constantly trying to improve the system. On the development side there's been a very heavy push towards continuous integration. A Build Engineering group in the development organization is responsible for CI and they have been able to convert large chunks of our software to that process. Now the code gets built at least once a day and run through various tests via Jenkins.
The obvious next step after continuous integration is continuous deployment and we are beginning to do that as well. We now have pipelines which take some of our software from source code through build and test and then on to deployment to test servers. That's not continuous deployment to production but it's a big step towards that goal. We've removed a lot of the manual interactions which slow down the release process, and that allows us to find faults more quickly. One command in Jenkins gets us to the point where lots of automated tests have been run and the test servers are ready for the full QA treatment.
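The shape of such a pipeline can be sketched in a few lines. The example below is only an illustration of the idea that each stage runs in order and the first failure stops everything, which is what lets faults surface quickly. The stage names and the `run_pipeline` function are hypothetical stand-ins, not actual Jenkins jobs or any Jenkins API.

```python
def run_pipeline(stages):
    """Run stages in order and stop at the first one that fails.

    `stages` is a list of (name, step) pairs, where each step is a
    callable returning True on success and False on failure.
    """
    completed = []
    for name, step in stages:
        if not step():
            return {"ok": False, "failed_at": name, "completed": completed}
        completed.append(name)
    return {"ok": True, "failed_at": None, "completed": completed}


# Hypothetical stand-ins for real build, test, and deploy jobs.
stages = [
    ("build", lambda: True),
    ("automated_tests", lambda: True),
    ("deploy_to_test_servers", lambda: True),
]

result = run_pipeline(stages)
print(result)
```

Note that the last stage here is a deploy to test servers, not to production. Extending the list with a production stage is the easy part; trusting the stages before it is where test coverage comes in.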
It's unlikely we will ever get much further than this with continuous deployment on the existing historical software components. The biggest issue is test coverage - there's just not enough of it and it's not complete enough. Unless you are starting from the ground up it's extremely hard to fully automate build and release.
Recognizing this limitation, we've chosen to focus most of our efforts on making sure that new components are fully integrated into a CI/CD pipeline. The theory is that the old components will eventually be retired so that attrition will take care of the problem of modernizing old components.
So Who Owns What?
My group owns the release and deployment process, and we're in Operations. As I outlined, we aren't a traditional operations team. We are rarely involved with on-the-ground firefighting. However, outages tend to get assigned to my group, as we're the ones pushing the code to production. We end up being a very horizontal team - our time is split fairly evenly between working with Ops and working with Devs. Ultimately I think we end up living in both worlds.
It's important to understand that you need a lot of operational knowledge to do deployments well. That's one reason why it makes sense for our team to be located in the larger service engineering organization. When software deployments have problems, those problems are largely traditional ops issues such as unexpected memory exhaustion or network load. You have to be able to understand the whole system, and the only way to do that is to have a solid system administration background. The people with those skills are found in Ops. Also, when outages do occur, the people doing the deployments need to communicate directly with the people doing monitoring and keeping the servers running. That's another good reason to put Release Management in Service Engineering.
However, the future is clearly all about system administrators acting more and more like developers. We spend all our time thinking about continuous integration and deployment. That means we have to know how to plug our existing scripts into Jenkins. We have to be able to do much more than just react to outages - we have to proactively improve our systems. When I expand my team I'm not going to hire someone unless they have solid scripting experience and are interested in thinking about the system as a whole.
Our biggest challenge in Release Management is how we connect with the Build Engineering team that's over in the development organization. I think it's not unreasonable to expect that the two teams will somehow be merged in the future. We work together very closely already. An integrated pipeline all the way from check-in to deployment probably means we need an integrated team as well. However, right now we're doing just fine with two separate teams - there's so much work to do on both sides on various projects that we all stay extremely busy.
I see large-scale software deployments as the perfect place to apply the principles of DevOps. As I said, my group is already a very horizontal team. Communication between ops and devs is absolutely critical and that's how I spend a lot of my time right now. The expectation is that we are in the initial planning meetings when developers are just starting to think about new projects. We are right there helping to make decisions about how software is designed. That makes a huge difference when it comes to overall operability and reducing production 'surprises'.
Ultimately, deployments and operations are very different things. To deploy software successfully at scale you've got to look at the entire system architecture. You can't focus very much on the one-off problems and firefighting. To be sure, those are absolutely critical tasks as well. Release Management wants to do everything we can operationally to reduce outages and maintain quality. That means we need to be very closely aligned with the operations team.
So Mark, at the end of the day I halfway agree with you. Deployments and Operations are two very distinct things. You need very different perspectives for each. However, deployments are about making production changes and I think that's a really important reason that my group needs to be aligned with Operations. We need to directly interact with the people keeping things running. We also need a very strong system administration background so that we can effectively communicate with the rest of the operations team.
I think the very best way to deal with this is to embrace DevOps. My team worries about things like metrics and agility. We focus on communicating between all the different groups and making sure that we are pushing quality code out to production. At the same time we're very focused on pushing forward with the next level of automation via continuous integration and deployment.
Finally, I leave you with these closing thoughts about what happens when you don't do proper release management: