Developer Drain Brain

November 18, 2010

A Journey Through Source Control. Part 3. Thinking the Wrong Way Round

Filed under: Development — rcomian @ 1:32 pm

It’s time for a short rant. Something I’ve seen a lot when working with fellow developers who haven’t been formally introduced to a tool from the 3rd generation is a set of practices that just leave a bad taste in my mouth.

First off, a 2nd generation tool would generally make mapping the repository to the hard-disk very difficult, so people would check out the entire repository onto their disks to work on. This was fine at the time, but led to a lot of problems. One of them was that no-one was ever isolated from each other’s changes – if you ‘got latest’ there was a very good chance that some of the shared code would be broken in some subtle way. Do that a few times and you quickly get wary of the practice, and end up just getting the minimum you can get away with at any point in time.

But if we look at the 3rd generation, things are a bit different. We’ve got specific, well defined versions that cut across all our files. Now, I don’t know about you, but to me, it feels much better to pull out a single card from a deck (hopefully the top one), work on it, then put the changes back as a new card on the top of the deck again.

If someone takes that card and builds it, and the build works, we *know* that that version is good … and it will be forever. If I want to make another change, I can take that card, in its entirety, with impunity, and *know*, with 100% certainty, that it will work. This means that when I make changes, if they work, then the card I put back on the pile will work as well. And we can guarantee that by having an automated system take that card and test it, giving it a green tick if it’s ok.

Unfortunately, most 3rd generation tools let you mix and match cards, so the practice of getting the least you can get away with still kind of works with them. But look – I’ve got a stack of shiny cards here, and each card has been carefully crafted to guarantee that it’s good and has been verified by automatic tools that this is the case. But if you have half of your files from one card, a third from another and the rest from half a dozen other cards – exactly what are you working with? How can you guarantee anything at all? How can you be sure that the card you’re about to put on the top of the pile will work with everyone else, when you’ve never tried it yourself? (You can’t have tried it – you don’t have it). It feels about as messy as taking a dozen playing cards and making a new card out of them by cutting them up and taping them together.
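To make the "cut up and taped together" problem concrete, here is a minimal sketch in Python. It assumes a working copy can be described as a mapping from file path to the revision number each file was taken from. The function names and the sample paths are made up for illustration and don't correspond to any real tool's API.

```python
from typing import Dict, Set


def revisions_in_use(working_copy: Dict[str, int]) -> Set[int]:
    """Return the distinct revisions (cards) the files were taken from."""
    return set(working_copy.values())


def is_single_card(working_copy: Dict[str, int]) -> bool:
    """True only when every file comes from the same revision."""
    return len(revisions_in_use(working_copy)) <= 1


# Hypothetical working copies: path -> revision the file was fetched at.
clean = {"app/main.c": 42, "lib/util.c": 42, "lib/util.h": 42}
mixed = {"app/main.c": 42, "lib/util.c": 37, "lib/util.h": 40}

print(is_single_card(clean))            # True
print(is_single_card(mixed))            # False
print(sorted(revisions_in_use(mixed)))  # [37, 40, 42]
```

The `mixed` working copy matches no card that was ever built or tested, which is exactly why nothing can be guaranteed about it.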

Of course, no-one’s perfect. Mistakes will get into the system. But if I take a bad card, and make my changes to it, the card I put back will also be bad – and I won’t know if it’s my fault or something that was already there. This means that we need to identify bad cards as soon as possible, and fix them immediately, so that we can be sure if we find a bad card with our name against it, the problem really was ours. The fix is just a new card (since we can’t change the old cards), but the fact that we’ve spotted a bad card and fixed it quickly means that the next cards that get added to the system can be sure of a good stable base to work from.

One of the things I like most about this system is that if the worst comes to the worst, we can always scrap what we’ve got, get the latest known good card, and work with that. Instant cleanup is always available. We can also pull any particular card out and see if it had a particular problem, and if it did (and the previous card didn’t) we have a good start at working out exactly what went wrong and a shortlist of what files might be involved.
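Pulling out individual cards to see where a problem first appeared can be sketched as a binary search over revision numbers. This is only an illustration. The `passes` callback is a stand-in for "check out that card, build it and run the tests", and the function name and revision numbers are invented for the example.

```python
from typing import Callable


def first_bad_card(good: int, bad: int, passes: Callable[[int], bool]) -> int:
    """Find a revision that fails while the revision just before it passes.

    Assumes `good` is a revision known to pass and `bad` is a later
    revision known to fail.
    """
    if good >= bad:
        raise ValueError("good must be an earlier revision than bad")
    if not passes(good) or passes(bad):
        raise ValueError("need a passing 'good' and a failing 'bad' revision")

    lo, hi = good, bad  # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical history: everything before revision 57 works.
print(first_bad_card(50, 64, lambda rev: rev < 57))  # 57
```

Once the first bad card is known, the changes between it and the card before it give the shortlist of files to look at.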

So we’ve now got a mental model of what our 3rd generation system is doing. We’re working on a single card at a time and building up our stack of changes, verifying each one as we go. Next we’ll look in a bit more depth at how branches work, what they mean and how to work with them.


November 17, 2010

A Journey Through Source Control. Part 2. Jumping to the 3rd Generation

Filed under: Development — rcomian @ 1:27 pm

Ok, I admit it, most people just don’t care about source control, and why should they? It’s a tool that they use to get things done, and so long as they don’t lose their work, who cares about anything else?

I think it’s because of this disinterested mindset that a lot of people completely missed what the 3rd generation of tools was about. And it’s not too surprising: advertising like “ACID compliant transactions – never partially commit a set of changes again!” isn’t exactly hinting at the revolutionary upheaval that had happened on the server side, it just sounds like CVS in a database.

But what happened was that the history in the repositories was completely pivoted around. CVS, VSS, et al would keep a single tree of files, and each file would have its own history. SVN, TFS, etc, don’t keep a set of files, they keep a set of trees. Each revision in the repository represents the state of every file in the entire source control tree at that point in time. Think of a stack of playing cards, each with the entire source control tree drawn on it. Each time you commit a change to any file in the repository, a new card is drawn up with the previous tree + your changes and put on the top of the pile. This means that the entire source tree in the repository can be reliably retrieved by giving just one number – just tell the system what card you want and you can get it.
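The stack of cards can be modelled in a few lines of Python. This is a toy sketch, not how SVN or TFS actually store data (a real tool avoids keeping a full copy of every tree). The `Repository` class and its method names are invented for the illustration.

```python
from typing import Dict, List

Tree = Dict[str, str]  # path -> file contents


class Repository:
    """A toy repository: a stack of complete trees, one per revision."""

    def __init__(self) -> None:
        self._cards: List[Tree] = [{}]  # revision 0 is the empty tree

    def commit(self, changes: Dict[str, str]) -> int:
        """Draw a new card: the previous tree plus the changes."""
        new_tree = dict(self._cards[-1])
        new_tree.update(changes)
        self._cards.append(new_tree)
        return len(self._cards) - 1  # the new revision number

    def checkout(self, revision: int) -> Tree:
        """Retrieve the whole tree from a single number."""
        return dict(self._cards[revision])


repo = Repository()
r1 = repo.commit({"main.c": "v1", "util.c": "v1"})
r2 = repo.commit({"util.c": "v2"})  # only one file changed

print(repo.checkout(r1))  # {'main.c': 'v1', 'util.c': 'v1'}
print(repo.checkout(r2))  # {'main.c': 'v1', 'util.c': 'v2'}
```

Note that committing a change to `util.c` alone still produces a card that describes `main.c` as well, and that older cards are never modified.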

No longer do we need hacks like ‘labels’ to make sure that everything is kosher and retrievable, everything is labelled with each and every change as an automatic by-product of how the repository works. Of course, we don’t lose the ability to view the changes that were made to a particular file: given a path, we can diff the contents of the file in one revision with the contents in the next and get the changes just like we could before. But we also get a brand new ability – by getting the difference between one tree and the previous tree, we can see all the files that were changed at the same time – and it’s this set of diffs that makes up a ‘changeset’.
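The idea of a changeset as "the difference between one tree and the previous tree" might be sketched like this. Trees are assumed to be simple path-to-contents mappings, and the changeset records whole-file before and after values rather than line-level diffs, purely to keep the example short. The function name is hypothetical.

```python
from typing import Dict, Optional, Tuple

Tree = Dict[str, str]  # path -> file contents
# path -> (contents before, contents after); None means "file absent"
Changeset = Dict[str, Tuple[Optional[str], Optional[str]]]


def changeset(old: Tree, new: Tree) -> Changeset:
    """Collect every path whose contents differ between two trees."""
    changed: Changeset = {}
    for path in sorted(set(old) | set(new)):
        before, after = old.get(path), new.get(path)
        if before != after:
            changed[path] = (before, after)
    return changed


previous = {"main.c": "v1", "util.c": "v1"}
current = {"main.c": "v2", "util.c": "v1", "README": "hi"}

print(changeset(previous, current))
# {'README': (None, 'hi'), 'main.c': ('v1', 'v2')}
```

Everything that changed together shows up together, and `util.c`, which was untouched, is left out.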

It’s this natural creation and handling of changesets that gives us a much better ability to migrate changes around the repository. Say I have two identical copies of a file in two places in the repository. If I make a change to file A, I can retrieve that changeset at any time I like. This results in a diff, and I can apply this diff to file B whenever I feel like it. Expand this concept to directories and work with changesets instead of individual diffs, and we’ve got the basis for a very flexible and simple branching and merging system. Branching is simply making a copy of a directory in another location, merging is simply getting the changesets that were applied to one directory, and applying them to the other. This branching and merging can all be tracked within the repository itself and voila, we have a modern 3rd generation system with full branching support.
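Branching as "copy a directory" and merging as "apply the changesets from one directory to the other" can be sketched as follows. This is a deliberately simplified model: it replaces whole files instead of applying line-level diffs, and it does no conflict detection or merge tracking. All function names and paths are invented for the example.

```python
from typing import Dict, Optional

Tree = Dict[str, str]  # path -> file contents


def branch(tree: Tree, src: str, dst: str) -> Tree:
    """New tree in which every file under `src` is also copied under `dst`."""
    new_tree = dict(tree)
    for path, content in tree.items():
        if path.startswith(src + "/"):
            new_tree[dst + path[len(src):]] = content
    return new_tree


def directory_changeset(old: Tree, new: Tree, directory: str) -> Dict[str, Optional[str]]:
    """Relative path -> new contents (None = deleted) for changes under `directory`."""
    prefix = directory + "/"
    paths = {p for p in set(old) | set(new) if p.startswith(prefix)}
    return {p[len(prefix):]: new.get(p) for p in paths if old.get(p) != new.get(p)}


def merge(tree: Tree, changes: Dict[str, Optional[str]], dst: str) -> Tree:
    """New tree with the changeset applied to the directory `dst`."""
    new_tree = dict(tree)
    for relative_path, content in changes.items():
        target = dst + "/" + relative_path
        if content is None:
            new_tree.pop(target, None)
        else:
            new_tree[target] = content
    return new_tree


r1 = {"trunk/a.c": "a1", "trunk/b.c": "b1"}
r2 = branch(r1, "trunk", "branches/release")  # make the branch
r3 = {**r2, "trunk/a.c": "a2"}                # a fix lands on trunk
fix = directory_changeset(r2, r3, "trunk")    # {'a.c': 'a2'}
r4 = merge(r3, fix, "branches/release")       # carry the fix to the branch

print(r4["branches/release/a.c"])  # a2
print(r4["branches/release/b.c"])  # b1
```

Each step produces a new tree, so every branch and merge is itself just another card on the stack.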

So now we know what the differences between the 2nd & 3rd generation are, we’ll address in the next post some of the pitfalls that we fall into when we try to think of a 3rd generation system in terms of a 2nd generation system.

And don’t worry, I’ve not forgotten about the 4th generation, that’s coming, but since it builds on what the 3rd generation gives us, I think it’s important to square that away first.

November 16, 2010

A Journey Through Source Control. Part 1. Generations of Source Control

Filed under: Development — rcomian @ 1:27 pm

In this series I’m going to be looking at some of the key concepts available in source control, and how they can be applied to gain full control over your source code.

Now, in all fairness, I don’t make my living by telling people how to use source control, but it’s one of the things that fascinate me about software development, purely because it’s so underused and misunderstood by a large percentage of my fellow developers.

The first thing that I think is important to grasp is that we’ve already churned through several generations of source control tools. I can easily identify 4 generations based on just the surface architectures involved, and each one represents a significant advance over the previous generations.

Source Control Generations

Generation   Feature                       Example
1            History of individual files   SCCS/RCS
2            Networked server              CVS/VSS
3            History of whole trees        SVN/TFS
4            Distributed repositories      Git/Hg

So with the first generation of tools such as RCS, we had the ability to keep the history of the files we were working on. That was pretty much it, really. Working with multiple files was tricky at best, and sharing wasn’t exactly safe, as everyone would work in the same shared folder.

CVS & VSS came along and rescued us from this situation by wrapping RCS and keeping all its files in a repository, which could be accessed through a server on the network. This meant that all developers could have their own copy of their files on their machines, isolated from everyone else. Then when work was complete, the changes could be sent up to the server and new changes brought down.

This CVS style working is all very well, but it still has problems – namely that files are still versioned individually and that it can be very hard to reliably isolate changes that happened together across a range of files after the fact. SVN and others came on to the scene to rectify this by keeping all the revisions in a single stack of changes, numbered sequentially. Changes to multiple files could be kept within a single revision, or changeset, and you could see not only what changes were made to a particular file, but drill down into each change to find out what the full story was each time. This also makes branching and merging a far more tenable idea since a complete ‘changeset’ could be reliably migrated in full across different branches.

And most recently, after having 3 generations of centralised repositories, Git, Hg and others threw away the need for an enforced central repository. The effect of this is that it no longer matters exactly where source code is located – what matters is the actual content itself, which could come from anywhere. Whether I pulled revision ‘24g4g2’ out of the main company server, a junior developer’s laptop, a pirate website or the smoking remains of a tape in the server room, I would be guaranteed that the code was exactly what I was after, including all the history up to that point. Central servers are now used at your team’s convenience, not just because the tool demands one.
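The reason the source of the content stops mattering is that the revision's name is derived from the content itself. Here is a rough sketch of that idea using SHA-1 from Python's standard `hashlib`. It is not Git's or Mercurial's actual object format. The way the parent ID and files are combined here, and the function names, are assumptions made for illustration.

```python
import hashlib
from typing import Dict

Tree = Dict[str, str]  # path -> file contents


def revision_id(parent_id: str, tree: Tree) -> str:
    """Name a revision by hashing its parent's ID together with its whole tree."""
    digest = hashlib.sha1()
    digest.update(parent_id.encode("utf-8"))
    for path in sorted(tree):
        digest.update(b"\0" + path.encode("utf-8"))
        digest.update(b"\0" + tree[path].encode("utf-8"))
    return digest.hexdigest()


def verify(expected_id: str, parent_id: str, tree: Tree) -> bool:
    """Check that content received from anywhere really is the revision asked for."""
    return revision_id(parent_id, tree) == expected_id


first = revision_id("", {"main.c": "v1"})
second = revision_id(first, {"main.c": "v2"})

# A copy from any source checks out only if it is byte-for-byte the same.
print(verify(second, first, {"main.c": "v2"}))        # True
print(verify(second, first, {"main.c": "v2 + hack"})) # False
```

Because each ID includes the parent's ID, a matching revision also vouches for all the history leading up to it.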

So there we go, a very short, and incomplete, potted history of source control tools.

Next we’ll look in more detail at the jump from 2nd generation to 3rd generation – and how so many people are thinking of it the wrong way round.
