Monthly Archives: April 2013

Continuous Integration

I’m having lunch with Paul Grenyer today to discuss Continuous Integration, or CI. In a nutshell, CI is an automated process that performs a build on a regular basis, be that every hour, overnight, or on every commit to a major branch. Ideally your build will also run your unit tests and any other tests or analysis you use, meaning that at any given moment in time you can be confident that your build is sound.

CI has been a part of my coding life for so long that I can’t even remember when I was first introduced to it. What I do remember is that, initially at least, it was set up and handled by others. I simply had to check in my code and hope I didn’t get the dreaded “Build Broken” email from Cruise Control.

CI was so ingrained into me that it was a bit of a shock when I moved to a company that didn’t use it. We couldn’t even use the joke phrase “It compiles, let’s ship it!” because we didn’t know if it actually did compile from commit to commit. A quick Google, much swearing and half a day later, we had Cruise up and running. Now, not only was there the “Build Broken” email to fear, there was also editing the Cruise config file after each release to point at the new release branches. Cruise is (or at least was) not the easiest thing in the world to configure.

My tenure as a Cruise admin was mercifully short-lived as I discovered Hudson, which is much, much easier to configure. My fun with release branches continued until we moved to Git. By this time Hudson had forked and we had gone down the Jenkins route. Jenkins now runs CI builds, overnight builds and release builds, and has been pressed into service as a handy way to kick off a few scripts, either periodically or on request.

Our Builds

Much as I’d love to use Maven, our legacy code makes that difficult. Instead we have a build project that handles all our builds using a set of quite complex ant scripts (there’s a rough sketch of how they hang together after the list below). Locally the developers have the option of:

  • clean: Delete all build artefacts. Not sure this is ever used, but it’s there, just in case.
  • compile: For our legacy code this does a local build and puts the build output in the directories required to run everything locally. Thanks to the magic of our system, running things locally is different to running them in any other environment. For the newer code base this just compiles the code locally, allowing you to run it. Given Eclipse does the same anyway, it’s a target that is rarely used in the newer projects.
  • deploy: Perform a full build of the project, including Checkstyle checks, JUnit tests, Cobertura code coverage and packaging the code into its final zip, jar, war or ear (depending on the project). If this completes for all the projects and dependents you have altered, you can be reasonably sure that Jenkins will not fail your build. In the rare case that it does, you are exempt from shame and punishment as it’s invariably something you couldn’t have known about.
  • sonar: Perform a deploy, then run Sonar over the code which performs an enhanced set of checks configured in Sonar. Keeping Sonar green keeps me happy, but unlike build failures, chasing a clean Sonar result should not be done at the expense of actually getting work done. Sometimes good enough is fine.
  • verify: The newer code base is split over a number of projects. Verify runs deploy for each project, checking that you’ve not broken anything in another project that may depend on your code.
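
To make that a little more concrete, here is a heavily simplified sketch of how a shared ant build like ours might be structured. This is illustrative only: the file name, property names and task wiring are invented, and the real scripts are considerably more involved.

    <!-- shared-build.xml: hypothetical, heavily simplified shared build script -->
    <project name="shared-build" default="deploy">

      <target name="clean" description="Delete all build artefacts">
        <delete dir="${build.dir}"/>
      </target>

      <target name="compile" description="Local compile only">
        <mkdir dir="${build.dir}/classes"/>
        <javac srcdir="${src.dir}" destdir="${build.dir}/classes" includeantruntime="false"/>
      </target>

      <target name="deploy" depends="compile"
              description="Full build: checks, tests, coverage and packaging">
        <!-- Checkstyle, JUnit and Cobertura would be wired in here, then the
             output packaged into its final zip, jar, war or ear. -->
        <jar destfile="${build.dir}/${project.name}.jar" basedir="${build.dir}/classes"/>
      </target>

      <target name="sonar" depends="deploy" description="Deploy plus Sonar analysis">
        <!-- Sonar analysis would run here. -->
      </target>

      <target name="verify" description="Run deploy for every project">
        <subant target="deploy">
          <fileset dir=".." includes="*/build.xml"/>
        </subant>
      </target>

    </project>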

Sat on top of this is the set of CI build targets run by Jenkins:

  • ci.build: Run on master and the release branches after each commit (currently Jenkins polls every 60 seconds; I’d like to change this to a commit hook one day), this calls deploy on each project. Unlike verify, which is a single ant build that calls deploy on each project, Jenkins runs a new ant build for each project. This has caused issues where verify builds cleanly and Jenkins fails, and vice versa. There’s a rough sketch of this kind of job configuration after the list.

  • push.build: This is a manually run parameterised build that takes the given version number and creates a production release with a unique build number. This calls deploy but overrides a number of parameters so the version details are configured correctly. It also pushes the resultant zip, jar, ear or war into a staging area.

  • promote.build: Another manually run parameterised build that takes the build number generated by push and promotes it to the specified environment (development, one of the QA environments or our pre-production environment). This simply copies the staged files from the previous push, guaranteeing that the same release is tested in each environment.

  • release.build: Identical to promote.build except there is a checkbox that must be ticked agreeing to the warning that this is going to production. The destination becomes the production staging area.

  • overnight.build: Run overnight by Jenkins, this calls sonar and provides a nightly snapshot of the overall quality of our builds.
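
For what it’s worth, the ci.build jobs are just ordinary Jenkins jobs that poll the repository and invoke the ant deploy target. A rough, hypothetical fragment of such a job’s config.xml might look something like the following (the layout is simplified and the repository URL is invented; the cron-style spec of “* * * * *” gives the once-a-minute polling mentioned above):

    <project>
      <scm class="hudson.plugins.git.GitSCM">
        <!-- Invented repository URL and branch, for illustration only -->
        <userRemoteConfigs>
          <hudson.plugins.git.UserRemoteConfig>
            <url>git@git.example.com:ourapp.git</url>
          </hudson.plugins.git.UserRemoteConfig>
        </userRemoteConfigs>
        <branches>
          <hudson.plugins.git.BranchSpec>
            <name>master</name>
          </hudson.plugins.git.BranchSpec>
        </branches>
      </scm>
      <triggers>
        <!-- Poll the SCM every minute -->
        <hudson.triggers.SCMTrigger>
          <spec>* * * * *</spec>
        </hudson.triggers.SCMTrigger>
      </triggers>
      <builders>
        <!-- Run the ant deploy target against the project's build file -->
        <hudson.tasks.Ant>
          <targets>deploy</targets>
        </hudson.tasks.Ant>
      </builders>
    </project>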

New projects just need a simple ant file pointing at our build project with a few variables set to gain all of these targets. It’s then just a question of cloning the Jenkins jobs from another project, making them specific to the new project, and you’re away. Maybe not the most elegant of systems, but it’s reliable and adaptable.
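
As a rough illustration (again, the property names, paths and file names here are invented rather than our actual conventions), a new project’s build file can be little more than a handful of properties and an import of the shared script:

    <!-- build.xml for a new project: hypothetical minimal wrapper -->
    <project name="new-project" default="deploy">
      <property name="project.name" value="new-project"/>
      <property name="src.dir" location="src"/>
      <property name="build.dir" location="build"/>

      <!-- Pulls in clean, compile, deploy, sonar and verify from the build project -->
      <import file="../build/shared-build.xml"/>
    </project>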

Agile In The Real World

“No plan of operations extends with any certainty beyond the first contact with the main hostile force.” – Helmuth Karl Bernhard Graf von Moltke

I’ve been doing eXtreme Programming (XP) and Agile in one guise or another since the early 2000s. During that time I’ve been in big teams, small teams, bureaucratic organisations, lean organisations and chaotic organisations. I have never worked in a top-down Agile organisation and probably never will. Also, no two teams I have worked in have done Agile the same way. I suspect this is partly to do with the organisations the teams were part of, and partly to do with the teams themselves. This is not a bad thing.

Agile is a toolkit, not a rigid set of structures. As with all toolkits, some tools fit certain circumstances better than others. A good team will adopt an Agile process that fits them and the business they work in, and then adapt that process as and when things change (and they will). If you’re looking for a post about “How to do Agile” then this is the wrong place. I can’t tell you; I don’t know your team, or your organisation. Instead, this explains how I’ve implemented Agile for our team and our organisation in order to get the maximum benefit.

PDD

Most (all?) discussions on Agile seem to use a sliding scale of Agileness, with a pure Waterfall process on the left and a pure Agile process on the right, and then place teams somewhere along this axis, with very few teams being truly pure Waterfall or pure Agile. I don’t buy this. I think it’s a triangle with Waterfall at one point, Agile at the second and Panic Driven Development at the third. Teams live somewhere within this triangle.

So what is Panic Driven Development? Panic Driven Development, or PDD, is the knee-jerk reaction from the business to various external stimuli. There’s no planning process and detailed spec as per Waterfall, there are no discrete chunks and costings as per Agile, there is just “Do It Now!” because “The sky is falling!”, or “All our competitors are doing it!”, or “It’ll make the company £1,000,000” [1], or purely “Because I’m the boss and I said so”. Teams living near the PDD corner will often lurch from disaster to disaster, never really finishing anything as the Next Big Thing trumps everything else, but even the most Agile team will have some PDD in their lives; it happens every time there is a major production outage.

When I first joined my current company it was almost pure PDD. Worse still, timescales were being determined by people who didn’t have the first clue about how long things would really take. Projects were late (often by many months) and issue tracking was managed by simply ditching any issues over a certain age. In short, it was chaos. Chuck in a legacy codebase with some interesting “patterns”, a whole bunch of anti-patterns and a serious number of WTFs and you have the perfect storm: low output and poor quality.

Working on the edge of chaos

One thing I realised very early on was that I was not going to be able to change The Business. The onslaught of new things based on half-formed ideas was never going to stop and the rapid changes of direction were part of the company’s DNA. Rather than fight this we embraced it, with some caveats.

Things change for us, fast. Ideas get discarded, updated and changed in days and the development team needs to keep up. To achieve this we use Scrum… except where we don’t, and use Kanban instead. Don’t worry though, it’s not that complex. 🙂

Scheduled work is done using Scrum. Sprints are a week long and start on a Wednesday [2]. Short, rapid sprints mean we can change direction fast without knocking the sprint planning for six. If the business want to change direction they only have to wait a few days. Releases generally (but not always) consist of two sprints of work. A release undergoes two weeks of QA after leaving development, so will generally be in production four weeks after the sprint started. If need be we can do a one-sprint release with as little as one week of QA and have a change out within three weeks of it being requested.

Sat on top of that we have a Kanban queue which should remain empty at all times. It is populated with QA failures and critical issues that are either blocking the release, or require patching. Every column on the Kanban board has a constraint of 0 items. Put something in it and it goes red, making it pretty obvious that someone needs to fix something sharpish.

The sprint planning meeting, retrospective and costing are all handled in the same Wednesday morning meeting, which lasts an hour. First up we look at the state of the outgoing sprint. We look at what got added to the sprint after it started, and why; what was removed from the sprint, and why; and what wasn’t completed within the timeframe of the sprint, and why. We run a system whereby it’s OK for things to span sprints. Things overrun, things get stalled, and sometimes it’s simply that you had half an hour left in the sprint, added a new issue to work on, but never had enough time to finish it. Any concerns are raised and handled, then the sprint is closed. The next sprint is then planned using a moving average of velocity as guidance for how much work to add. Any time remaining in the meeting is used for costing and curating the backlog. Sadly the business rarely attend these meetings, meaning we need to be creative when it comes to business sponsors.

Finding Business Sponsors

Unlike traditional Scrum we have two backlogs. With over a decade of technical debt and more new development than we can possibly hope to achieve, we have hundreds of issues. Clearly this is unworkable. The majority of these live in the un-prioritised backlog. We know about them, we’ve documented them, but they’re not getting done, and they may not even get costed unless someone champions them and gets them pushed into the scrum backlog. The scrum backlog is the realistic backlog. We aim to keep no more than four times our average velocity worth of work in this backlog, which means that at any given time it provides a roadmap for the next month. We also make sure everything in the scrum backlog is properly costed, meaning sprint planning is incredibly easy: just put the top 25% of the backlog into the sprint, adjusted for holidays and various other factors.
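
To put some rough numbers on that (following directly from the weekly sprints described above): a scrum backlog capped at four times the average velocity holds about four one-week sprints’ worth of work, hence the one-month roadmap, and the top 25% of it is roughly one sprint’s worth, which is why planning can be as simple as skimming it off the top.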

Using this method you very quickly find sponsors coming out of the woodwork. When work is not done, people start asking where it is; you can then explain to them that it’s not been prioritised, or that it’s being trumped by other work. If they care about the issue then they need to champion it, become the business sponsor and take responsibility for it. They can argue the case for it being moved up the backlog with the business. If they don’t want to do that then clearly the work is not important, so it goes into the un-prioritised backlog to eventually die through lack of interest. Stuff that is already in the un-prioritised backlog can be fished out when a sponsor is found and costing can start.

Bugs generally follow a slightly different process insofar as they will always have a sponsor, even if it’s the testing team. Bugs are never closed unless they are fixed, or cease to be an issue due to other changes. The QA team will regularly revisit all open bugs and re-prioritise or close them as necessary.

Costing

New features are costed using planning poker and we use very small stories. Valid costings are 1 (a one-line change), 2, 3, 5, 8, 13 and 20. Our target velocity is between 8 and 13 points per developer per day. Any slower and we’re being too optimistic with our costing, any faster and we’re being too pessimistic. Bearing that in mind, a developer should easily handle two 20-point stories in a single sprint with room to spare. Anything larger than 20 points needs to be carved up into multiple stories, or turned into an Epic. We do this because estimates get rapidly poorer once you go past a couple of days’ work.
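
As a rough worked example, assuming a five-day working week: 8 to 13 points per developer per day works out at roughly 40 to 65 points per developer per one-week sprint, so two 20-point stories (40 points) fit with a little headroom, and a single 20-point story equates to somewhere between one and a half and two and a half days of work, which is about where estimates start to go fuzzy.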

Stories are only costed if the team fully understand the issue. If there are questions, the issue is noted and the questions taken to the Business Sponsor. Yes, it would be great if they were in the costing meeting and could answer the questions there and then, but it can be a little like herding cats sometimes. The cost to the business sponsor is that the issue isn’t costed and can’t go into a sprint until it is, and it’s a cost they’re incurring by not attending, not one we’re imposing on them.

Stories that exceed 20 points are either quickly split into a couple of stories, or converted to an epic and a costing task raised. This allows time in a sprint for one or more members of the team to find the full set of requirements from the business sponsor and generate the full set of required stories for later costing.

Scope creep can either be added to a story, or a new story created for the creep. If it’s added to an existing story, its old costing is discarded, and if that story is in the current sprint it’s ejected from the sprint until it’s been re-costed and space made for it. The costing may happen there and then with the team having a quick huddle, or it may need to wait for the next planning meeting.

It’s not a silver bullet

Nothing is written in stone except the maximum velocity of the team. Sprints can start late, end late or end early. Releases can be held back, or brought forward. Issues can be removed from the sprint and replaced with others. We can react to the business, but it’s not a silver bullet. The more the business change their minds, the slower throughput gets due to the inertia of changing direction; however, they are now better informed and can see and measure the effects of this, which has resulted in a lot less chopping and changing.

Projects are now being delivered on time; however, the timescales are also now realistic, and easily tracked. Projects are becoming better defined as their true cost is realised by those proposing them. The output is similar to what it used to be, but is now more focused. Rather than over-promise, under-deliver and spend months cleaning up the mess, certain projects just aren’t even attempted.

The process is continually evolving. We’ve done pure Scrum and pure Kanban before. The model we use took the most useful aspects of both of those systems. As we try new things we’ll take the best bits and adapt them to suit us. No doubt there are Agile Evangelists out there who will balk at one or more aspects of what we do as being wrong. Maybe they are; all I can say is that it works for us and the team is happy with how we work. If they’re not, we change it.


[1] I have worked on quite a few “million pound” projects or deals. The common denominator is that all of them failed to produce the promised million, often by many orders of magnitude.

[2] Why Wednesday? People are more likely to be in. There isn’t that last-minute panic on Friday to get everything finished, and the sprint doesn’t start on a day when people are catching up after the weekend.

Overlooking Social Channels

We recently suffered an 8-hour outage from our payment provider. The most frustrating thing about this outage was the complete lack of information from the payment provider about the problem, or indeed the lack of any communication whatsoever. Yesterday we got reports from our front office staff that they were having problems with payments again. A quick check of the logs confirmed that, yes, there was a problem somewhere. Given the nature of the issue it was likely to be a problem with our payment provider, but we needed to be sure. We approached getting this information in two ways.

My boss took the traditional approach, contacting the account manager to see what light they could shed on the problem. Net result: there might be a problem, and further information would be forthcoming in 30 minutes, after a meeting on their side.

Given the informational black hole from the last outage I took a slightly tangential approach: Twitter. In seconds I was able to confirm that others were seeing the same problem and that it had started at least 3 minutes earlier. Two minutes after that, and only 5 minutes after the outage started, I had the entire company either on, or preparing to enter, a BCP stance. Part of this involved speaking to our social media team because they needed to be poised to inform customers and handle customer queries and complaints.

25 minutes later the outage ended. Again I was able to confirm that there were no lingering problems through a combination of our logs, talking to our staff and responses from people on Twitter. We still hadn’t been called back by the account manager, and there was still no official communication about the outage. As far as I’m aware, some 24 hours on, there is still no official acknowledgement.

These days companies, especially large ones, need to understand that they have a social media presence, even if it’s not an official one. Search for our payment provider during an outage and you’ll find a torrent of negative opinion and pleas for information. In this case the presence, and silence, of the official Twitter account only fuelled the frustration. People expect frequent and honest updates, especially when it’s something as important as a payment provider. BCP should include informing customers of the outage, its extent, the estimated duration and any other pertinent information. Even if it is just “We are aware of issues with the payment gateway. Engineers are looking into it, update to follow in 10 minutes”. Not wanting to say anything for fear of a negative reaction is pointless: the negative reaction is already out there.

How you present the information is also critical. Use of the word “intermittent” for a problem that is affecting 99 out of 100 transactions, while technically accurate, is clouding the situation. “We are suffering from intermittent problems” in this case sounds like spin, which sticks out like a sore thumb in a sea of negative statements.

Effective management of the various social media channels is something that is overlooked by ‘traditional’ companies far too often.