When I started working at CA Veracode in 2006, we were developing software the way I had for over 15 years – we were using Waterfall. It would be six years before we moved away from Waterfall and took the Agile plunge, and even longer before we got to DevOps.
Looking back, I wonder how much farther along we’d be today if we had adopted the Agile methodology, which at that time was cutting edge. Typically, Waterfall projects are measured in months – we released every three to four months in the early days. There were lots of documents and review meetings to make sure everyone understood what needed to be delivered. Then the work started, one silo at a time. Development got the first action. Everyone worked in their own branch and then merged changes back to trunk. The longer you work on your own, the more painful it is to merge. When development declared that we were “done,” quality would get their turn at bat.
We had a few advantages over other Waterfall projects. The biggest was that CA Veracode is a SaaS (Software-as-a-Service) offering, which meant we only needed to test one upgrade – our production system. We were able to release more quickly as a result. It was in those days that we began the custom of push nights.
Push nights started after a full day’s work. The team would go out to dinner and then return to upgrade our production system. In the early days, there were typically three to five people involved. (That number would balloon over time to 25 people when we adopted Agile.) The push would start at 10 PM with the goal of finishing before 2 AM. For a young company with only a few customers, that was fine.
The worst push night occurred in about 2011. We were doing the biggest database schema change in our history. During push nights, we had a procedure to follow. We were starting to grow beyond the ability of any one person to remember everything, so we had started to write down the steps needed. What we hadn’t accounted for was an important pre-step to our upgrades: we were supposed to ask our administrators to turn off the database backup for the window of the upgrade. That night, it never happened. It took a while for us to figure out why the schema changes were taking so long, and by then it was too late. We blew through our window and didn’t complete the tasks until 5 AM.
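A missed pre-step like that one is the kind of thing a scripted checklist can catch. What follows is only a minimal sketch of the idea, not what we ran at the time: the names `Step`, `PreStepFailed`, and `run_push` are hypothetical, and the backup check is a stand-in for whatever confirmation your administrators can actually give you.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One entry in the push-night procedure (hypothetical structure)."""
    name: str
    action: Callable[[], bool]  # returns True when the step succeeds


class PreStepFailed(Exception):
    """Raised when a pre-step fails, before any upgrade step has run."""


def run_push(pre_steps: List[Step], upgrade_steps: List[Step]) -> List[str]:
    """Run every pre-step first; start the upgrade only if all of them pass."""
    log: List[str] = []
    for step in pre_steps:
        if not step.action():
            # Stop here: nothing in production has been touched yet.
            raise PreStepFailed(step.name)
        log.append(f"pre-step ok: {step.name}")
    for step in upgrade_steps:
        if not step.action():
            log.append(f"upgrade step FAILED: {step.name}")
            break
        log.append(f"upgrade step ok: {step.name}")
    return log


# Assumption: a simple flag stands in for the real state of the database backup.
backup_paused = {"value": False}

pre_steps = [Step("confirm database backup is paused", lambda: backup_paused["value"])]
upgrade_steps = [Step("apply schema changes", lambda: True)]

try:
    run_push(pre_steps, upgrade_steps)
except PreStepFailed as err:
    print(f"push blocked, pre-step not satisfied: {err}")
```

The point of the design is simply that the upgrade cannot begin until every pre-step has been explicitly confirmed, so a forgotten conversation shows up at 10 PM instead of at 2 AM.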
Another big challenge we faced was the product behaving differently in production than in our test environments. This was of course because our test data didn’t accurately mimic what we had in production. It never does. (I know that’s a big surprise to the reader.) Anyway, after our deployment we would bring the application online, but invisible to our customers, so that we could run some basic tests to ensure that the system was operating properly.
When a problem was identified, we really only had one choice – create a hot fix. We didn’t build and test downgrades in the service, so we had to roll forward. These push night heroics usually entailed a developer disappearing to code a change, running a quick build, and deploying to production for a re-test. It’s not the best way to deliver quality software, but as a startup you do what you need to do to survive, and we did. But the larger and more complex the product got, the more build and test problems we had. This problem was compounded as we grew the team. While there were some smoke tests that were automated, they were inadequate and not consistently maintained or executed.
Compared to today, our security process was very immature and manual. Security was our number one job and we didn’t take shortcuts there. There was a lot of pen testing and secure code reviews. However, our static analysis engine didn’t yet support Java, which our platform was built on, so slow and manual testing was the name of the game.
Among our biggest problems in those Waterfall days were:
- Lack of automated functional tests
- Quality and staging environments that were poor copies of our production system
- Large batches of changes in every release, making it hard to determine root cause
- Quality and security defects found late in the release process, creating lots of unplanned work and hurting predictability
In 2012 we started to attack these problems when we transitioned to Agile. That’ll be the next installment in this blog series exploring CA Veracode’s journey from Waterfall to Agile and finally to DevOps.
I’ll be sharing stories of some of the difficulties we’ve had, how they were helped or hindered by the methodology we were using, and what we did to tackle those challenges. I’ll explain the changes we made to people, process, and technology, and describe how we built security into our development process along the way. I hope our experiences can help you in your own journey.