Developer Drain Brain

November 18, 2010

A Journey Through Source Control. Part 3. Thinking the Wrong Way Round

Filed under: Development — rcomian @ 1:32 pm

It’s time for a short rant. Something I’ve seen a lot when working with fellow developers who haven’t been formally introduced to a tool from the 3rd generation is a set of practices that just leave a bad taste in my mouth.

First off, a 2nd generation tool would generally make mapping the repository to the hard-disk very difficult, so people would check out the entire repository onto their disks to work on. This was fine at the time, but led to a lot of problems. One of them was that no-one was ever isolated from each other’s changes – if you ‘got latest’ there was a very good chance that some of the shared code would be broken in some subtle way. Do that a few times and you quickly get wary of the practice, and end up just getting the minimum you can get away with at any point in time.

But if we look at the 3rd generation, things are a bit different. We’ve got specific, well defined versions that cut across all our files. Now, I don’t know about you, but to me, it feels much better to pull out a single card from a deck (hopefully the top one), work on it, then put the changes back as a new card on the top of the deck again.

If someone were to take that card and build it, then assuming the build works we *know* that version is good … and it will be forever. If I want to make another change, I can take that card, in its entirety, with impunity, and *know* – with 100% certainty – that it will work. This means that when I make changes, if they work, then the card I put back on the pile will work as well. And we can guarantee that by having an automated system take that card and test it, giving it a green tick if it’s ok.

Unfortunately, most 3rd generation tools let you mix and match cards, so the practice of getting the least you can get away with still kind of works with them. But look – I’ve got a stack of shiny cards here, and each card has been carefully crafted to guarantee that it’s good and has been verified by automatic tools that this is the case. But if you have half of your files from one card, a third from another and the rest from a half a dozen other cards – exactly what are you working with? How can you guarantee anything at all? How can you be sure that the card you’re about to put on the top of the pile will work with everyone else, when you’ve never tried it yourself? (You can’t have tried it – you don’t have it). It feels about as messy as taking a dozen playing cards and making a new card out of them by cutting them up and taping them together.

Of course, no-one’s perfect. Mistakes will get into the system. But if I take a bad card, and make my changes to it, the card I put back will also be bad – and I won’t know if it’s my fault or something that was already there. This means that we need to identify bad cards as soon as possible, and fix them immediately, so that we can be sure that if we find a bad card with our name against it, the problem really was ours. The fix is just a new card (since we can’t change the old cards), but the fact that we’ve spotted a bad card and fixed it quickly means that the next cards that get added to the system can be sure of a good stable base to work from.

One of the things I like most about this system is that if the worst comes to the worst, we can always scrap what we’ve got, get the latest known good card, and work with that. Instant cleanup is always available. We can also pull any particular card out and see if it had a particular problem, and if it did (and the previous card didn’t) we have a good start at working out exactly what went wrong and a shortlist of what files might be involved.

So now we’ve got a mental model of what our 3rd generation system is doing. We’re working on a single card at a time and building up our stack of changes, verifying each one as we go. Next we’ll look in a bit more depth at how branches work, what they mean and how to work with them.


November 17, 2010

A Journey Through Source Control. Part 2. Jumping to the 3rd Generation

Filed under: Development — rcomian @ 1:27 pm

Ok, I admit it, most people just don’t care about source control, and why should they? It’s a tool that they use to get things done, and so long as they don’t lose their work, who cares about anything else?

I think it’s because of this uninterested mindset that a lot of people completely missed what the 3rd generation of tools was about. And it’s not too surprising: advertising like “ACID compliant transactions – never partially commit a set of changes again!” isn’t exactly hinting at the revolutionary upheaval that had happened on the server side, it just sounds like CVS in a database.

But what happened was that the history in the repositories was completely pivoted around. CVS, VSS, et al would keep a single tree of files, and each file would have its own history. SVN, TFS, etc, don’t keep a set of files, they keep a set of trees. Each revision in the repository represents the state of every file in the entire source control tree at that point in time. Think of a stack of playing cards, each with the entire source control tree drawn on it. Each time you commit a change to any file in the repository, a new card is drawn up with the previous tree + your changes and put on the top of the pile. This means that the entire source tree in the repository can be reliably retrieved by giving just one number – just tell the system what card you want and you can get it.

No longer do we need hacks like ‘labels’ to make sure that everything is kosher and retrievable, everything is labelled with each and every change as an automatic by-product of how the repository works. Of course, we don’t lose the ability to view the changes that were made to a particular file: given a path, we can diff the contents of the file in one revision with the contents in the next and get the changes just like we could before. But we also get a brand new ability – by getting the difference between one tree and the previous tree, we can see all the files that were changed at the same time – and it’s this set of diffs that makes up a ‘changeset’.

It’s this natural creation and handling of changesets that gives us a much better ability to migrate changes around the repository. Say I have two identical copies of a file in two places in the repository. If I make a change to file A, I can retrieve the changeset at any time I like; this gives me a diff, which I can apply to file B whenever I feel like it. Expand this concept to directories and work with changesets instead of individual diffs, and we’ve got the basis for a very flexible and simple branching and merging system. Branching is simply making a copy of a directory in another location; merging is simply getting the changesets that were applied to one directory and applying them to the other. This branching and merging can all be tracked within the repository itself and voila, we have a modern 3rd generation system with full branching support.
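
To make the pivot concrete, here’s a minimal sketch in Python – purely a toy model with invented names, not any real tool’s API – that treats the repository as a stack of whole trees and derives changesets by diffing neighbouring revisions:

# Toy repository: a list of whole trees (revision 0, 1, 2, ...),
# where each tree maps a path to that file's content.

def commit(history, changes):
    """Add a new 'card': the previous tree plus our changes."""
    new_tree = dict(history[-1]) if history else {}
    new_tree.update(changes)
    history.append(new_tree)
    return len(history) - 1                    # the new revision number

def changeset(history, rev):
    """Diff one tree against the previous tree: everything committed together."""
    before = history[rev - 1] if rev > 0 else {}
    after = history[rev]
    return {path: content for path, content in after.items()
            if before.get(path) != content}

def merge(history, rev, source_dir, target_dir):
    """Apply a changeset made under one directory to a copy of it elsewhere."""
    changes = {path.replace(source_dir, target_dir, 1): content
               for path, content in changeset(history, rev).items()
               if path.startswith(source_dir)}
    return commit(history, changes)

history = []
commit(history, {"trunk/a.txt": "v1", "trunk/b.txt": "v1"})    # revision 0
commit(history, {"branches/rel/a.txt": "v1",
                 "branches/rel/b.txt": "v1"})                  # revision 1: branch = copy
fix = commit(history, {"trunk/a.txt": "v2"})                   # revision 2: a fix on trunk
merge(history, fix, "trunk", "branches/rel")                   # revision 3: merge the fix
print(changeset(history, 3))                                   # {'branches/rel/a.txt': 'v2'}

Real tools store things far more cleverly than whole copies of every tree, of course, but the mental model – whole trees, with changesets derived by comparing them – is exactly this.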

So now we know what the differences between the 2nd & 3rd generations are. In the next post we’ll address some of the pitfalls that we fall into when we try to think of a 3rd generation system in terms of a 2nd generation one.

And don’t worry, I’ve not forgotten about the 4th generation, that’s coming, but since it builds on what the 3rd generation gives us, I think it’s important to square that away first.

November 16, 2010

A Journey Through Source Control. Part 1. Generations of Source Control

Filed under: Development — rcomian @ 1:27 pm

In this series I’m going to be looking at some of the key concepts available in source control, and how they can be applied to gain full control over your source code.

Now, in all fairness, I don’t make my living by telling people how to use source control, but it’s one of the things that fascinate me about software development, purely because it’s so underused and misunderstood by a large percentage of my fellow developers.

The first thing that I think is important to grasp is that we’ve already churned through several generations of source control tools. I can easily identify 4 generations based on just the surface architectures involved, and each one represents a significant advantage over the previous generations.

Source Control Generations

Generation   Feature                        Example
1            History of individual files    SCCS/RCS
2            Networked server               CVS/VSS
3            History of whole trees         SVN/TFS
4            Distributed repositories       Git/Hg

So with the first generation of tools such as RCS, we had the ability to keep the history of the files we were working on. That was pretty much it, really. Working with multiple files was tricky at best, and sharing wasn’t exactly safe, as everyone would work in the same shared folder.

CVS & VSS came along and rescued us from this situation by wrapping RCS and keeping all its files in a repository, which could be accessed through a server on the network. This meant that all developers could have their own copy of their files on their machines, isolated from everyone else. Then when work was complete, the changes could be sent up to the server and new changes brought down.

This CVS-style way of working is all very well, but it still has problems – namely that files are still versioned individually and that it can be very hard to reliably isolate changes that happened together across a range of files after the fact. SVN and others came onto the scene to rectify this by keeping all the revisions in a single stack of changes, numbered sequentially. Changes to multiple files could be kept within a single revision, or changeset, and you could see not only what changes were made to a particular file, but drill down into each change to find out what the full story was each time. This also makes branching and merging a far more tenable idea since a complete ‘changeset’ could be reliably migrated in full across different branches.

And most recently, after having 3 generations of centralised repositories, Git, Hg and others threw away the need for an enforced central repository. The effect of this is that it no longer matters exactly where source code is located – what matters is the actual content itself, which could come from anywhere. Whether I pulled revision ‘24g4g2’ out of the main company server, a junior developer’s laptop, a pirate website or the smoking remains of a tape in the server room, I would be guaranteed that the code was exactly what I was after, including all the history up to that point. Central servers are now used at your team’s convenience, not just because the tool demands one.

So there we go, a very short, and incomplete, potted history of source control tools.

Next we’ll look in more detail at the jump from 2nd generation to 3rd generation – and how so many people are thinking of it the wrong way round.

April 7, 2010

Straightening out the development cycle

Filed under: Development — Tags: , , — rcomian @ 12:04 pm

This is me trying to get things sorted in my head.

Know your features

Features are your lifeblood. Organise them, manage them, combine them into stories. Process them with great care. Tuck them in and sing them to sleep at night.

All features are great. It’s the organisation of them that gives you the edge. A small set of coherent stories that each stand on their own will be far more useful than a complex network of micro-abilities. A well managed, complex set of micro-abilities will be far more useful than a vague idea that something might be good.

Don’t go mapless

It’s usually a good idea to have a high level idea of what the overall solution can look like. This lets you communicate the overall idea, discuss how things can be done and ensure that the whole idea is indeed possible. This map may include all the normal good things, asynchronous message manipulation manifolds, distributed data denegrators, partial parallel processors et al.

Keep the map light

Rather than follow the waterfall model of expanding this map until it’s a thousand-page document, use a lightweight method to convey the idea. Whether that’s a block diagram on the back of an envelope scrawled in a coffee shop, or a PowerPoint presentation, that’s fine, but convey the whole idea.

Implement features, not the map

Ignore the map. Pluck a single feature from your carefully tended garden. Build the feature as if it was the only thing that existed in the world. Don’t look to the other features to see what they need, look just at what you’re working on right now. Don’t look at the map and decide that you need the main message manipulation manifold if a simple web service will be more than sufficient for this feature. If you’re working with a single, simple document, save it in a file. Don’t add that database just because it’s on the map, follow what the feature needs.

Finish the feature

When a feature is complete, it looks like you knew what you were doing when you started writing it. It looks like you had the same thing in mind all the way through, and weren’t flip flopping from one way of doing things to another. Of course, you didn’t know what you were doing when you started, and you did flip flop all over the place. But don’t let that show on the finished product.

Lock in the feature

Have a way to automatically confirm that the feature works. Confirm this on many levels, from the outside (feature tests) to the actual code (unit tests) and in between (integration tests). Once a feature is locked in place, you can rest in the comfort of knowing it’s not going anywhere, it’s not going to be broken and you can leave it alone and work on other things. It’s very hard to properly lock in a feature after it’s finished, so start locking it in before you start the feature. Put a lock in place for each part of the feature, then put that part in place.

The locks are part of your feature too, shoddy locks do a shoddy job of locking your feature in place.

Check your locks

Make sure you know when locks are broken. Run continuous builds on all branches of your code. Keep stats: build time, unit test code coverage, static analysis, build warnings, anything that lets you see trends in how the software is progressing. Check them.

Rearrange from the ground up

After you’ve added a new feature, the whole solution should look finished. As you add new features, don’t just bolt new code on top of the old, but make it look like the whole solution was written from the start with all of its current features in mind. No code is sacred, no code is untouchable.

Lean on your locks to ensure that you’re not losing functionality or breaking a feature. Lose any locks that aren’t needed any more.

Keep learning

Use what works, find what doesn’t. This doesn’t need any special discipline, it’s what you’re bitching about as you’re writing your code. Your product has a lot to teach you. Learn it. Change things.

March 9, 2010

Normalised XML

Filed under: Development — Tags: , — rcomian @ 9:09 am

Compound data in Xml is a problem that’s been highlighted many times before by others, but it’s something that still occasionally trips developers up.

Every now and then you’ll come across an Xml document that stores multiple data items in a single location. A good example is the settings* or vsprops** file for Visual Studio 9, which stores C++ search paths inside a single element – go look, it’s fun***. You usually find these when you’re trying to take advantage of one of the main strengths of Xml – i.e. tool support like XPath or Xslt. In general, things that mean you don’t have to write or import specialised parsing libraries to manipulate and use the data.

Coming across one of these compound formats when using a generic tool means that work inevitably grinds to a halt. As if XPath and Xslt aren’t hard enough, now you’re thinking of using regex, or getting the values into variables and writing some procedural code, which, incidentally, doesn’t fit at all inside your xslt, so now you have to completely change your approach to the problem. A simple tool that was going to be scripted from a command line has become a real development effort.

And it’s not just scripting languages; even making use of these values in normal code can be odd. There you are with your document object, just extracting and using the values in a single operation as and when you need them, then all of a sudden you’re calling ‘split’ and iterating over the results. Of course you’re also throwing away the empty result at the end of the split – because whatever formats the data for that value always puts a delimiter on the end of every string. But you’re not throwing away all empty results, are you, because an empty value in the middle of the sequence is fine, it’s just the last one that gets ignored.
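
As a small illustration, here’s the kind of code you end up writing (a Python sketch; the element names and values are invented, not the real Visual Studio schema), next to what the normalised equivalent looks like:

import xml.etree.ElementTree as ET

# Compound: every path stuffed into one attribute, trailing delimiter and all.
compound = ET.fromstring('<SearchPaths Value="C:/lib;C:/include;;C:/src;"/>')

parts = compound.get("Value").split(";")
# Throw away only the *trailing* empty entry - an empty value in the middle
# is legitimate, so filtering out every empty string would be wrong.
if parts and parts[-1] == "":
    parts.pop()
print(parts)                                   # ['C:/lib', 'C:/include', '', 'C:/src']

# Normalised: one element per value, no home-made parsing rules to remember.
normalised = ET.fromstring("""
<SearchPaths>
  <Path>C:/lib</Path>
  <Path>C:/include</Path>
  <Path></Path>
  <Path>C:/src</Path>
</SearchPaths>
""")

print([p.text or "" for p in normalised.findall("Path")])
print(normalised.find("Path").text)            # the first path, one trivial lookup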

And the Xml gets very hard to read very quickly when we’re using compound data. Do your individual values need quoting? Ok, now your document contains a load of unnecessary &quot; values as well. And exactly when is that ; the end of an entity, and when is it a delimiter?

There are many really subtle complexities like this that can come up when you’re storing data in your own format. The more you’re stuffing into that value, the more likely you are to come across the awkward edge cases.

If you’re anything like me, when you’re writing this complex split logic, or trying to visually parse out the quotes from the semi-colons, you’re thinking: But formatting data like this is what Xml is for – why am I having to write this code?

There are many possible causes of this stuffing. Xml does look very inefficient; a single character delimiter takes a lot less space than the bracketing of element names. But if you’re using Xml at all, is that really a concern? And if it is, would zipping the results give you similar gains? Xml compresses extremely well, and the various zips are all fairly ubiquitous tools that won’t trip up too many situations.

The other main reason is, of course, the one-size-fits-all schema that doesn’t extend. But if that’s your schema, shouldn’t you be designing something that’s flexible enough to at least store multiple values? Name-value pairs are great, but an Xml document that’s a glorified name-value pair system is a bit weak.

Most people are pretty good at designing reasonably normalised databases, so why are we still stuffing so many values into a single column when it comes to Xml?

* In the case of the settings file, it’s not just an array, but an array of arrays. The outer array consists of the pattern: Platform1|Paths1|Platform2|Paths2. The paths themselves are stored as semi colon delimited paths: path1;path2. Good luck pulling the first search path for Win32 out of that with XPath, or even regex.

** VSprops is a single array of semi-colon separated paths, but often stores them quoted, for some reason. So you literally get the construct: "path1";"path2";.

*** For some definition of fun.

February 26, 2010

Commit Comments

Filed under: Development — Tags: , , — rcomian @ 9:06 am

There have been a few blog posts recently about commit comments.

James Lorenzen: http://jlorenzen.blogspot.com/2010/02/commit-comments-conversation-with-your.html

A followup from Kohsuke: http://weblogs.java.net/blog/kohsuke/archive/2010/02/25/what-do-you-try-leave-your-commit-messages

I know that comments can often be treated as one of those horrible tick box enforcements from on-high. However, I’ve eventually come around to a massive appreciation of them, and try to include as much detail as is sensible in my comments. For one thing, it confirms to me that I’ve made a specific change for a specific reason, and not just ‘done some stuff’.

Like everyone else, I’m curious how other people view commit comments. Are they unnecessary (after all, we have diffs)? Is just a bugId enough? Should they be short one-liners or mini release notes?

Is there scope for bringing commit comments into other parts of the documentation, i.e. into bug reports, release notes, etc.?

What resistance, if any, do you have to writing a comment when it comes time to check in?

January 4, 2010

Unit Testing Primer

Filed under: Development — Tags: , , , , — rcomian @ 11:24 am

At the TechEd conference I was able to listen to Roy Osherove give his talk on unit testing. Osherove is, as far as I can tell, one of the leading figures in unit testing and test driven development and his talks give the impression of someone who’s come to a very deep understanding of the subject, born through a great deal of experience and insight.

I’m going to try to cover the main points that Osherove gave in his talk, and the video will be available to those who wish to see it. In this post I’ll lay out the primers that are needed to understand how to write good unit tests. I don’t count myself as a unit testing guru, but I have made enough mistakes in this area to feel that I get where Osherove is coming from.

The first thing to understand is what a unit test is. A unit test does not rely on any environmental factor of any kind. This means no filesystem, database, network, specific user accounts or other tests. Anything that is relied on by the test is explicitly set up within the test. This also includes any functions that don’t return the same values each time they’re called – think DateTime.Now or Rand(). Any environmental dependence such as this means that the test must be classified as an integration test. Integration tests are a different kind of beast, and although still vital to the health of your software, are not the unit tests that we’re talking about here. It turns out that I’ve only ever written very few unit tests; the vast majority have been inadvertent integration tests.

The second thing to understand is the three pillars of unit testing: “Readability”, “Maintainability”, “Trustworthiness”. Each pillar holds up the other two: you cannot have a maintainable test that isn’t readable, and if you cannot understand a test, you’re not likely to trust it. So if you find yourself criticizing a test based on any of these three pillars, then it’s likely that the test is not a good test and isn’t helping your software.

This leads on to the third fundamental point. A test suite filled with tests that aren’t good will not help us develop good software. In fact a bad test suite can be detrimental, forcing us to pay ever-increasing testing taxes without paying any dividends.

It turns out that there is a very clear pattern to good tests. They’re not really like normal code so much as recipes. A template might be:

void <method under test>_<condition of test>_<result of action>()
{
    <initialise test items and condition>

    <perform tested actions>

    <assert on result>
}

In addition to this template, there are a few other fairly unexpected guidelines:

  1. Do not use conditional logic in the test (no IF’s or SWITCHes). Each branch of the logic is a separate test condition – split out to separate tests.
  2. Don’t assert in loops. Each loop is a separate test condition, split them out or recognise that they’re duplicates and remove them.
  3. Assert only 1 thing per test. Multiple asserts are generally testing for different results or conditions, so they’re separate tests.
  4. Give context-relevant names to the values being used. For example, calling doSplit(INVALID_LIST, UNUSED_FLAG) says more than doSplit(“here there”, false)
  5. Separate out each step. Whilst you might find that you can initialise, act and assert all on one line, it doesn’t mean that it’s readable.
  6. Do use standard refactoring techniques to re-use common code parts, but all the above advice still holds for these refactored methods.
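
To see how these guidelines come together, here’s a rough sketch of a couple of tests following the template – written in Python rather than .NET, since the advice is language-agnostic, and using a made-up parse_age function purely as the code under test:

import unittest

# Hypothetical code under test.
def parse_age(text):
    """Return the age as an int, or None if the input isn't a valid age."""
    try:
        age = int(text)
    except (TypeError, ValueError):
        return None
    return age if 0 <= age <= 150 else None

# Context-relevant names instead of magic values (guideline 4).
UNPARSEABLE_INPUT = "not a number"
VALID_INPUT = "42"

class ParseAgeTests(unittest.TestCase):

    # <method under test>_<condition of test>_<result of action>
    def test_parse_age_non_numeric_input_returns_none(self):
        result = parse_age(UNPARSEABLE_INPUT)   # perform the tested action
        self.assertIsNone(result)               # exactly one assert per test

    def test_parse_age_valid_input_returns_parsed_value(self):
        result = parse_age(VALID_INPUT)
        self.assertEqual(42, result)

if __name__ == "__main__":
    unittest.main()

No conditionals, no loops, one assert each, and nothing outside the test is relied on – which is what keeps them unit tests rather than integration tests.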

He also gives some suggestions for the lifecycle of tests:

  1. Add new tests when new methods, conditions or results become relevant.
  2. Remove tests when they become irrelevant.
  3. Only modify tests if the API changes or they need refactoring.

This is a very quick distillation of the advice given. Osherove himself provides a lot more detail and justification for each point here.

See his site and other links at http://osherove.com/

December 21, 2009

How to reference shared code

Filed under: Development — Tags: — rcomian @ 12:42 pm

One of the things that isn’t entirely obvious with source control is how to go about managing shared code and libraries.

The point is to keep your build stable and clean throughout its lifecycle whilst making life as easy as possible when releasing, branching and merging.

Keep in mind that the main tenet of good source control is:

The ability to get any version of a project from a single location and build it with one command.

If we can do this, it’s an indication that we’ve got good control over our source. It’s kind of a litmus test – not a goal in itself, but if we can do this, then we will have automatically enabled a whole host of strong abilities.

In some respects, this goal will never be achievable. We’ll always be dependent on some aspects of the developer environment. At the very minimum, we’ll have to have the source control client and the base OS available through some other mechanism before we can do this. We’ll probably also need the compiler and some of the more common libraries and header files that go with them. But if this environment is small, well documented, stable and backwards compatible, then it really won’t be too difficult to manage.

But once the environment is sorted, we then need to find a way to manage the softer dependencies of our projects. The main ways of handling these are:

  1. Have everything in 1 huge project.
  2. Branch shared projects into each project as required.
  3. Reference shared projects through some mechanism like svn:externals.
  4. Install dependencies separately into the environment.

Each method has its pros and cons:

Pros:

  1. Simple to understand. Everything is always available.
  2. Have strong control over the version of the dependencies being used. Can make controlled local customizations to the dependencies.
  3. Lightweight. Strong documentation in all cases over what uses what. Can switch to unreleased versions to make library changes in parallel to project changes.
  4. Easy to understand. No reliance on source control mechanisms. Integrates well with platform abilities (ie, registered COM objects, plug-ins, registry settings, etc).

Cons:

  1. Breaking changes immediately affects all users of a library. History for a project does not include useful references to the changes in its libraries. Tendency for projects to become tightly coupled. Hard to know what uses what. Stability means getting and building everything with one checkout. Difficult to reliably branch just a single project. No real definition of project boundaries.
  2. Updating to new versions of libraries can become laborious. Can get to rely on customizations when not appropriate. Usually reliant on a single location to merge changes from (ie, a “latest version” branch) which can make having parallel versions interesting. It can get confusing to tell exactly what library version is being used when merges come from different places.
  3. Difficult to customize libraries. Needs source control support. Need to be aware of the stability of what you’re referencing. Tendency to recurse dependencies and cause conflicts and circular references.
  4. Installing deps is a manual process. Need to be told when dependencies have changed. Installation/Upgrade/Removal is a heavy weight activity. Complicates release of dependencies. Installations are machine wide which makes working on parallel projects/versions difficult. Locating source code can be awkward.

As with most things, you’ll probably use a mixture of all the techniques in real life. For example, having a compiler or IDE installed is an example of using option 4. Merging two or more projects together into a larger release unit is an example of option 1. Use of vendor branches as defined in the subversion book is option 2. Almost every use of externals is option 3. Of course, the trick is to find the appropriate times to use the techniques and to know the technical difficulties involved with each.

When faced with the issue of shared libraries, I tend to favour option 2 for its tight control over exactly what gets released. However, it’s a fairly heavy-weight solution and only really comes into its own when customizations have to be made on a per-project basis. Option 3 provides a lighter weight solution that provides the same guarantees for the stability of builds. The underlying idea with option 3 is that there is some setting that’s kept alongside the source for a project which indicates what dependencies are required and where to put them.

The biggest problem is the requirement for source control support to make use of this setting. Subversion has this built in with the svn:externals property. TFS has the ability to check out multiple locations into a defined tree structure, but provides no built-in mechanism to store this information in a relative form alongside the source code. It isn’t too hard to provide tooling to do this, however, and a well defined XML file will serve exactly the same purpose as the svn:externals property. How to do this reliably with distributed systems like Mercurial and Git is one of the things that I just don’t understand about those systems; possibly option 2 works better in those environments.

With option 2, it’s pretty obvious that when you tag, branch or merge a project, the required changes to the dependencies are also tagged, branched and merged. There’s no need to branch and then stabilise the references, the code is already the right version. Option 3 is slightly more confusing, but if you think of option 3 as a shortcut of option 2, it becomes pretty clear. To get control over your builds, the references must always point to a stable target. Either a tag, a label, a specific changeset, or whatever stable reference the source control provides. If you do this, then you can be sure that whenever a version of a project is retrieved, that the correct version of the dependencies will be retrieved as well and that the right things will get built. This should be something you can guarantee for every branch you care about from a release perspective, which at a minimum will be trunk and all maintenance branches. If you do this, when you label up a branch as a release, you can be sure, with no further work, that you’ve labelled everything that’s required for that build.
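
For instance, with Subversion the svn:externals property on a project’s root might look something like this (the repository layout and URLs here are invented):

libs/logging   http://svn.example.com/repos/shared/logging/tags/1.2.1
libs/parsing   http://svn.example.com/repos/shared/parsing/tags/2.0.0

Each line maps a stable tag into a subdirectory of the working copy, so checking out any version of the project pulls down exactly the right version of its libraries; Subversion can also pin an external to a specific revision with -r if you’d rather reference a branch than a tag.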

One advantage option 3 has over option 2 is that you can relax the stable version requirement when it’s appropriate. For example, you can tell source control to get the trunk for a particular dependency. For your working copy, this is no more taxing than using your source control to get the appropriate version – you don’t need to change and checkin your dependency file/property. For unstable or early development branches, this unstable reference can be checked in, with the understanding that you are intrinsically breaking the build guarantee for that period on that branch.

Conflicts are also highlighted in a more useful way than with option 2. Say a branch is taken to upgrade from version 1.1 to 1.2 of a library, but in the meantime trunk starts making use of version 1.1.1; this will be highlighted as a conflict in your dependencies when you come to merge the branch in. At that point you can determine the correct version of the library that should be used (maybe 1.2.1) to include all features and bug-fixes that are relied on. With option 2, you get a potential mish-mash of versions which may or may not work, you may have to re-do conflicts which were already resolved when making version 1.2.1, and you no longer have the ability to definitively say what version of the library you’re using (it’s 1.1.1 with some 1.2 thrown in). It may not even be obvious to the person doing the merging that the conflicts are happening in a shared library, depending on the structure of your build environment.

All in all, option 3 appears to be by far the most practical method of sharing code whilst keeping a sane track of your development environment. I strongly advocate this method wherever I can. But do be aware of the other options; they are very good and valid responses to other kinds of problems that we all face when doing development. But default to option 3.

December 16, 2009

The significance of “Get Latest”

Filed under: Development — Tags: — rcomian @ 10:15 am

It’s a sad fact that many of us still think of VSS as a form of source control. For a large number of developers, it was indeed our first encounter with that kind of technology, and this experience has massively coloured our view of the landscape in a hundred subtle ways.

The model used by SourceSafe for sharing changes is the lock-edit-commit sequence. When we want to change a file, we lock it, edit it, then send the changes back to the server. On the face of it, it’s a pretty good working model: we’re saying ‘this is mine’, doing what we like, and then when we’re ready to give up control of it, we send it back to the server.

Sharing aside, when you’re working this way, it has an impact on how you view a particular action – “Get Latest”. When your whole work area has been ring-fenced from everyone else, getting latest is a pretty cheap operation. After all, you get all the features and bug-fixes that other people have added, and you know they don’t impact your current work because your relevant files are locked and you know they haven’t been changed. So in this world, get latest basically means: get the stuff that everyone else has done to the rest of the system.

In the real world, we all work with an edit-merge-commit process (or possibly edit-commit-merge if you’re in git). In this model, the significance of getting latest is very different.

When you start work, you get a version of the project onto your machine. You then edit that version and at some point, you’re ready to put the changes back into the project. However, the project has been worked on by other people and is likely not the same as the version you originally coded against. If you punted your changes up to the server, there’s no guarantee that your changes are relevant and don’t conflict with work that others have done.

This is where a ‘get latest’ operation comes in. You’re not really saying ‘get the work that everyone else has done’; what you’re saying is ‘migrate my changes to the latest version of the project’. Only once you’ve done this can you commit your work. Think of a project as sequential layers of changes applied on top of each other. To make the next layer, you need to apply your changes to the top layer. Realise that each and every layer goes across the entire project: a layer may only change one file, but it’s a whole new version of the code base.

Of course, there’s always a risk in migration. Distributed systems like Git are the only source control systems that handle this risk properly, but TFS and SVN have their own methods. The risk is that the migration may go wrong and you’ll lose your changes. This risk is highest when there are a lot of complex conflicting changes. This is a risk that you’re saying you’re willing to take whenever you say ‘Get latest’.

Whilst this is a risk, it’s one we all take every day, just like crossing the road. It’s one of the reasons we say “commit often”, it’s so that the resulting merge isn’t too painful. But if that risk sounds too scary for a particular set of changes, you have 2 options. In TFS you can shelve your changes, in SVN you can branch from your current working copy. Your changes are then safe, and if merging your changes into the latest version fails horribly, you can always discard the changes and go back to the shelveset or branch. Git et al work differently, in that you commit your changes as they are and then merge them with the latest version of the project as a separate, distinct activity.

But the key thing to take away from this is to reverse the way we often think of “Get latest”. We’re not really getting other people’s work; we’re migrating our changes to the latest version of the project. The difference may seem subtle, but it’s important. For example, it highlights the fact that it’s not really safe to “Get latest” on just the files we’ve edited – it means we’ll have a horrible Frankenstein project that’s got some changes from here, some changes from there. It may work for us, and any source control system will let you commit, but it doesn’t reflect the project that we’re actually making our changes to.

December 2, 2009

How to fragment your database

Filed under: Development — Tags: — rcomian @ 12:19 pm

One thing that can affect an index’s performance is fragmentation.

Fragmented indexes are just inefficient indexes where the storage for them has been scattered about the storage area, meaning that the engine can’t use any read-ahead optimisations to yank in the data quickly, and resorts to flitting the read heads around the disk like a hummingbird on speed.

Clustered indexes are the indexes that determine the order that data is stored on disk. If your clustered index is fragmented, not only is the read head flitting around to load in the index, it’s also flitting around to load in the data.

In this way, fragmented indexes can kill performance on any media that has a seek time. If the data and the index can be kept in RAM, then it’s not too much of a problem, but before the data’s cached, or once your workload becomes larger and hits a rotating disk, you’re out of luck. Fragmented indexes are also larger than non-fragmented indexes as they contain a lot of wasted space, which increases the chance that you’ll need to hit the disk.

There’s a very efficient way to fragment your clustered index. When you create a clustered index, what you’re saying is “This is the order that data is going to be stored in”. If you insert data in that order, then it just gets continuously appended to the data store in a very fast and efficient operation. If you don’t insert data in that order, then you’re going to fragment your index.

Mis-ordered data will force the engine to go back into the nice clean store and split one of the nice storage pages into 2 to make room for the new data. The new page goes at the end of the storage, meaning that the index now has 1 page out of order. If you’re inserting a lot of mis-ordered data, then rinse and repeat. A single page could get filled and split hundreds or thousands of times, littering pages all over the store.

You can give yourself some headroom for these events by using a fill factor with your indexes. This tells the engine to only fill pages so full (say 80%), before starting a new page, in this way giving you space to insert mis-ordered data at a later date without creating a new page.
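
For example (SQL Server syntax; the index and table names here are just placeholders), leaving 20% headroom in each leaf page looks like this:

-- Build the index with pages only filled to 80%
CREATE CLUSTERED INDEX IX_Sales_PurchaseTime ON dbo.Sales (PurchaseTime) WITH (FILLFACTOR = 80)

-- Or re-apply the fill factor when rebuilding an existing index
ALTER INDEX IX_Sales_PurchaseTime ON dbo.Sales REBUILD WITH (FILLFACTOR = 80)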

This is fine if data is mainly ordered and sometimes comes in disordered, but it won’t particularly help if you’re inserting too much mis-ordered data, as it will just spill over the page regardless of how much headroom you leave.

One example is Guids. Using a random guid as an identifier is ok, but as a clustered index column it’s a nightmare. Sequential Guids generated by the db are ok, but if they’re assigned by the client, or with the NEWID() function, you’re basically inserting random data into your ordered index. This can’t end well. You may do this without realising it, since Management Studio tends to make your primary key into a clustered index by default, so if you’ve got a Guid as a primary key, chances are it’s also a clustered index.

This is a SQL script to demonstrate fragmentation on clustered guid indexes:

-- Table with a random guid as its clustered primary key
CREATE TABLE dbo.GuidTable ( GuidColumn uniqueidentifier NOT NULL, SomeData int NULL )
GO
ALTER TABLE dbo.GuidTable ADD CONSTRAINT PK_GuidTable PRIMARY KEY CLUSTERED ( GuidColumn )
GO

-- Report 1: fragmentation of the freshly created, empty index
SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')

SET NOCOUNT ON

-- Load a million rows keyed on random guids, i.e. thoroughly mis-ordered data
DECLARE @Counter INT
SET @Counter = 0
WHILE @Counter < 1000000
BEGIN

 SET @Counter = @Counter + 1

 INSERT INTO GuidTable (GuidColumn, SomeData) VALUES (NEWID(), @Counter)
END

-- Report 2: fragmentation after the data load
SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')
GO

-- Rebuild the index to remove the fragmentation
ALTER INDEX PK_GuidTable ON GuidTable REBUILD WITH (ONLINE = ON)

-- Report 3: fragmentation after the rebuild
SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')
GO

DROP TABLE GuidTable
GO

This script should give 3 resultsets as output. These are reports into the fragmentation at 3 points – after table creation, after data loading and after defragmentation.

You should see that after loading the fragmentation is up over 98%. After defragmentation, it’s down to around 1%.

In general, it’s good housekeeping to keep your indexes at low fragmentation. Obviously, if the fragmentation doesn’t affect the performance of the database you’re working with, then it’s a waste of time worrying about it. But do be aware of it, because it’s possible that if your workload increases, you could reach a tipping point where suddenly the fragmentation does matter, and matters big time.

We can always defrag indexes; the script above shows how to do this in SQL 2005+. But prevention is usually better than cure, so think about what you’re saying when you create a clustered index. Think about the order that data actually gets inserted in and whether that could have an effect on the performance as a whole, not just the time to insert the data.
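
For the Guid case specifically, one preventative option (SQL Server 2005 and later; the table here is just a variation on the script above) is to let the server assign sequential guids by default instead of calling NEWID():

CREATE TABLE dbo.SequentialGuidTable
(
    GuidColumn uniqueidentifier NOT NULL
        CONSTRAINT DF_SequentialGuidTable_GuidColumn DEFAULT NEWSEQUENTIALID(),
    SomeData int NULL,
    CONSTRAINT PK_SequentialGuidTable PRIMARY KEY CLUSTERED ( GuidColumn )
)
GO

-- Let the default supply the key; the values arrive in roughly ascending order,
-- so new rows append to the end of the clustered index instead of splitting pages.
INSERT INTO dbo.SequentialGuidTable (SomeData) VALUES (1)
GO

This only helps where the database assigns the key, of course – guids generated by clients are still effectively random.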

There are many other scenarios where fragmentation can occur. For example, if you’re loading in sales data from remote shops on a nightly job, you may think it sensible to order that data by the purchase time. But if you load in one shop’s data, then the next, then the next, even if you order each shop’s transactions, the times may look like this as they get loaded in: 9:01, 10:30, 14:35, 8:45, 10:52, 18:16, 9:11, 9:16, 14:30 – data that is very effectively mis-ordered. This may not be as big a problem as Guids, as you might be able to determine a fill factor that actually gives you enough space to load in the shops’ data without fragmenting too much, but again, it’s something to be aware of when designing your databases.

