Developer Drain Brain

December 21, 2009

How to reference shared code

Filed under: Development — Tags: — rcomian @ 12:42 pm

One of the things that isn’t entirely obvious with source control is how to go about managing shared code and libraries.

The point is to keep your build stable and clean throughout its lifecycle whilst making life as easy as possible when releasing, branching and merging.

Keep in mind that the main tenet of good source control is:

The ability to get any version of a project from a single location and build it with one command.

If we can do this, it’s an indication that we’ve got good control over our source. It’s kind of a litmus test – not a goal in itself, but if we can do this, then we will have automatically enabled a whole host of strong abilities.

In some respects, this goal will never be achievable. We’ll always be dependent on some aspects of the developer environment. At the very minimum, we’ll have to have the source control client and the base OS available through some other mechanism before we can do this. We’ll probably also need the compiler and some of the more common libraries and header files that go with them. But if this environment is small, well documented, stable and backwards compatible, then it really won’t be too difficult to manage.

But once the environment is sorted, we then need to find a way to manage the softer dependencies of our projects. The main ways of handling these are:

  1. Have everything in 1 huge project.
  2. Branch shared projects into each project as required.
  3. Reference shared projects through some mechanism like svn:externals.
  4. Install dependencies separately into the environment.

Each method has its pros and cons:

Pros:

  1. Simple to understand. Everything is always available.
  2. Have strong control over the version of the dependencies being used. Can make controlled local customizations to the dependencies.
  3. Lightweight. Strong documentation in all cases over what uses what. Can switch to unreleased versions to make library changes in parallel to project changes.
  4. Easy to understand. No reliance on source control mechanisms. Integrates well with platform abilities (i.e. registered COM objects, plug-ins, registry settings, etc.).

Cons:

  1. Breaking changes immediately affect all users of a library. History for a project does not include useful references to the changes in its libraries. Tendency for projects to become tightly coupled. Hard to know what uses what. Stability means getting and building everything with one checkout. Difficult to reliably branch just a single project. No real definition of project boundaries.
  2. Updating to new versions of libraries can become laborious. Easy to come to rely on customizations when they aren’t appropriate. Usually reliant on a single location to merge changes from (i.e. a “latest version” branch), which can make having parallel versions interesting. It can get confusing to tell exactly what library version is being used when merges come from different places.
  3. Difficult to customize libraries. Needs source control support. Need to be aware of the stability of what you’re referencing. Tendency for dependencies to recurse, causing conflicts and circular references.
  4. Installing dependencies is a manual process. Need to be told when dependencies have changed. Installation/upgrade/removal is a heavyweight activity. Complicates release of dependencies. Installations are machine-wide, which makes working on parallel projects/versions difficult. Locating source code can be awkward.

As with most things, you’ll probably use a mixture of all the techniques in real life. For example, having a compiler or IDE installed is an example of using option 4. Merging two or more projects together into a larger release unit is an example of option 1. Use of vendor branches as defined in the Subversion book is option 2. Almost every use of externals is option 3. Of course, the trick is to find the appropriate times to use the techniques and to know the technical difficulties involved with each.

When faced with the issue of shared libraries, I tend to favour option 2 for its tight control over exactly what gets released. However, it’s a fairly heavyweight solution and only really comes into its own when customizations have to be made on a per-project basis. Option 3 provides a lighter-weight solution that provides the same guarantees for the stability of builds. The underlying idea with option 3 is that there is some setting that’s kept alongside the source for a project which indicates what dependencies are required and where to put them.

The biggest problem is the requirement for source control support to make use of this setting. Subversion has this built in with the svn:externals property. TFS has the ability to check out multiple locations into a defined tree structure, but provides no built-in mechanism to store this information in a relative form alongside the source code. It isn’t too hard to provide tooling to do this, however, and a well-defined XML file will serve exactly the same purpose as the svn:externals property. How to do this reliably with distributed systems like Mercurial and Git is one of the things that I just don’t understand about those systems; possibly option 2 works better in those environments.
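As a concrete sketch, a pinned svn:externals definition might look something like this (the URLs, local paths and revision number are all hypothetical):

Contents of externals.txt, one dependency per line, each pointing at a stable target (a released tag, or an explicit revision of a branch):

http://svn.example.com/repos/libs/logging/tags/1.2.0      libs/logging
-r 4711 http://svn.example.com/repos/libs/parser/trunk    libs/parser

Attach the definition to the project root and commit it like any other change:

svn propset svn:externals -F externals.txt .
svn commit -m "Pin shared library references"

A subsequent update then pulls the pinned library versions into libs/logging and libs/parser alongside the project’s own source.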

With option 2, it’s pretty obvious that when you tag, branch or merge a project, the required changes to the dependencies are also tagged, branched and merged. There’s no need to branch and then stabilise the references; the code is already the right version. Option 3 is slightly more confusing, but if you think of option 3 as a shortcut for option 2, it becomes pretty clear. To get control over your builds, the references must always point to a stable target: a tag, a label, a specific changeset, or whatever stable reference the source control provides. If you do this, then you can be sure that whenever a version of a project is retrieved, the correct version of the dependencies will be retrieved as well and the right things will get built. This should be something you can guarantee for every branch you care about from a release perspective, which at a minimum will be trunk and all maintenance branches. If you do this, when you label up a branch as a release, you can be sure, with no further work, that you’ve labelled everything that’s required for that build.

One advantage option 3 has over option 2 is that you can relax the stable version requirement when it’s appropriate. For example, you can tell source control to get the trunk for a particular dependency. For your working copy, this is no more taxing than using your source control to get the appropriate version – you don’t need to change and check in your dependency file/property. For unstable or early development branches, this unstable reference can be checked in, with the understanding that you are intrinsically breaking the build guarantee for that period on that branch.
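For instance, on an early development branch the same hypothetical externals entry might temporarily point at trunk with no revision pinned, accepting that builds from that branch are no longer exactly reproducible:

http://svn.example.com/repos/libs/parser/trunk    libs/parser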

Conflicts are also highlighted in a more useful way than with option 2. Say a branch is taken to upgrade from version 1.1 to 1.2 of a library, but in the meantime trunk starts making use of version 1.1.1; this will be highlighted as a conflict in your dependencies when you come to merge the branch in. At that point you can determine the correct version of the library that should be used (maybe 1.2.1) to include all the features and bug-fixes that are relied on. With option 2, you get a potential mish-mash of versions which may or may not work, you may have to re-do conflicts which were already resolved when making version 1.2.1, and you no longer have the ability to definitively say what version of the library you’re using (it’s 1.1.1 with some 1.2 thrown in). It may not even be obvious to the person doing the merging that the conflicts are happening in a shared library, depending on the structure of your build environment.

All in all, option 3 appears to be by far the most practical method of sharing code whilst keeping sane track of your development environment. I strongly advocate this method wherever I can. But do be aware of the other options; they are very good and valid responses to other kinds of problems that we all face when doing development. But default to option 3.


December 16, 2009

The significance of “Get Latest”

Filed under: Development — Tags: — rcomian @ 10:15 am

It’s a sad fact that many of us still think of VSS as a form of source control. For a large number of developers, it was indeed our first encounter with that kind of technology, and this experience has massively coloured our view of the landscape in a hundred subtle ways.

The model used by SourceSafe for sharing changes is the lock-edit-commit sequence. When we want to change a file, we lock it, edit it, then send the changes back to the server. On the face of it, it’s a pretty good working model: we’re saying ‘this is mine’, doing what we like, and then, when we’re ready to give up control of it, sending it back to the server.

Sharing aside, when you’re working this way, it has an impact on how you view a particular action – “Get Latest”. When your whole work area has been ring-fenced from everyone else, getting latest is a pretty cheap operation. After all, you get all the features and bug-fixes that other people have added, and you know they don’t impact your current work because your relevant files are locked and you know they haven’t been changed. So in this world, get latest basically means ‘get the stuff that everyone else has done to the rest of the system’.

In the real world, we all work with an edit-merge-commit process (or possibly edit-commit-merge if you’re in Git). In this model, the significance of getting latest is very different.

When you start work, you get a version of the project onto your machine. You then edit that version and at some point, you’re ready to put the changes back into the project. However, the project has been worked on by other people and is likely not the same as the version you originally coded against. If you punted your changes up to the server, there’s no guarantee that your changes are relevant and don’t conflict with work that others have done.

This is where a ‘get latest’ operation comes in. You’re not really saying ‘get the work that everyone else has done’; what you’re saying is ‘migrate my changes to the latest version of the project’. Only once you’ve done this can you commit your work. Think of a project as sequential layers of changes applied on top of each other. To make the next layer, you need to apply your changes to the top layer. Realise that each and every layer spans the entire project: a layer may only change one file, but it’s a whole new version of the code base.

Of course, there’s always a risk in migration. Distributed systems like Git are the only source control systems that handle this risk properly, but TFS and SVN have their own methods. The risk is that the migration may go wrong and you’ll lose your changes. This risk is highest when there are a lot of complex conflicting changes. This is a risk that you’re saying you’re willing to take whenever you say ‘Get latest’.

Whilst this is a risk, it’s one we all take every day, just like crossing the road. It’s one of the reasons we say “commit often” – so that the resulting merge isn’t too painful. But if that risk sounds too scary for a particular set of changes, you have two options: in TFS you can shelve your changes, in SVN you can branch from your current working copy. Your changes are then safe, and if merging your changes into the latest version fails horribly, you can always discard the changes and go back to the shelveset or branch. Git et al work differently, in that you commit your changes as they are and then merge them with the latest version of the project as a separate, distinct activity.
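As a rough sketch of those escape hatches in SVN and Git (the repository URL and branch names are hypothetical):

# Subversion: copy the working copy, local edits included, to a safety branch
# before migrating to the latest version.
svn copy . http://svn.example.com/repos/project/branches/safety-copy -m "Safety copy before taking latest"
svn update

# Git: commit the changes as they are, then merge in the latest version as a separate step.
git commit -a -m "My changes in progress"
git pull origin master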

But the key thing to take away from this is to reverse the way we often think of “Get latest”. We’re not really getting other people’s work; we’re migrating our changes to the latest version of the project. The difference may seem subtle, but it’s important. For example, it highlights the fact that it’s not really safe to “Get latest” on just the files we’ve edited – it means we’ll have a horrible Frankenstein project that’s got some changes from here, some changes from there. It may work for us, and any source control system will let you commit, but it doesn’t reflect the project that we’re actually making our changes to.

December 2, 2009

How to fragment your database

Filed under: Development — Tags: — rcomian @ 12:19 pm

One thing that can affect an index’s performance is fragmentation.

Fragmented indexes are just inefficient indexes where the storage for them has been scattered about the storage area, meaning that the engine can’t use any read-ahead optimisations to yank in the data quickly, and resorts to flitting the read heads around the disk like a hummingbird on speed.

Clustered indexes are the indexes that determine the order that data is stored on disk. If your clustered index is fragmented, not only is the read head flitting around to load in the index, it’s also flitting around to load in the data.

In this way, fragmented indexes can kill performance on any media that has a seek time. If the data and the index can be kept in RAM, then it’s not too much of a problem, but before the data’s cached, or once your workload becomes larger and hits a rotating disk, you’re out of luck. Fragmented indexes are also larger than non-fragmented indexes, as they contain a lot of wasted space, which increases the chance that you’ll need to hit the disk.

There’s a very efficient way to fragment your clustered index. When you create a clustered index, what you’re saying is “This is the order that data is going to be stored in”. If you insert data in that order, then it just gets continuously appended to the data store in a very fast and efficient operation. If you don’t insert data in that order, then you’re going to fragment your index.

Mis-ordered data will force the engine to go back into the nice clean store and split one of the nice storage pages into two to make room for the new data. The new page goes at the end of the storage, meaning that the index now has one page out of order. If you’re inserting a lot of mis-ordered data, then rinse and repeat. A single page could get filled and split hundreds or thousands of times, littering pages all over the store.

You can give yourself some headroom for these events by using a fill factor with your indexes. This tells the engine to fill pages only so full (say 80%) before starting a new page, giving you space to insert mis-ordered data at a later date without causing a page split.

This is fine if data is mainly ordered and sometimes comes in disordered, but it won’t particularly help if you’re inserting too much mis-ordered data, as it will just spill over the page regardless of how much headroom you leave.
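As an example of the syntax, a fill factor can be specified when an index is created or rebuilt. Something along these lines, with hypothetical table, column and index names:

-- Leave 20% free space in each leaf page when the index is (re)built.
ALTER INDEX PK_MyTable ON dbo.MyTable REBUILD WITH (FILLFACTOR = 80)

-- Or specify it up front when creating an index.
CREATE NONCLUSTERED INDEX IX_MyTable_SomeColumn ON dbo.MyTable (SomeColumn) WITH (FILLFACTOR = 80)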

One example is GUIDs. Using a random GUID as an identifier is OK, but as a clustered index column it’s a nightmare. Sequential GUIDs generated by the database are OK, but if they’re assigned by the client, or with the NEWID() function, you’re basically inserting random data into your ordered index. This can’t end well. You may do this without realising it, since Management Studio tends to make your primary key into a clustered index by default, so if you’ve got a GUID as a primary key, chances are it’s also a clustered index.
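If GUID keys really do have to live in a clustered index, one option (sketched here with a hypothetical table) is to let the database generate them with NEWSEQUENTIALID(), which can only be used as a column default:

-- GUIDs from NEWSEQUENTIALID() are generated in (mostly) ascending order, so new rows
-- are appended to the end of the clustered index instead of splitting pages.
CREATE TABLE dbo.SequentialGuidTable
(
    GuidColumn uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID(),
    SomeData int NULL,
    CONSTRAINT PK_SequentialGuidTable PRIMARY KEY CLUSTERED ( GuidColumn )
)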

This is a SQL script to demonstrate fragmentation on clustered GUID indexes:

CREATE TABLE dbo.GuidTable ( GuidColumn uniqueidentifier NOT NULL, SomeData int NULL )
GO
ALTER TABLE dbo.GuidTable ADD CONSTRAINT PK_GuidTable PRIMARY KEY CLUSTERED ( GuidColumn )
GO

-- Fragmentation report 1: immediately after the table is created.
SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')

SET NOCOUNT ON

-- Insert a million rows keyed on random GUIDs; each insert lands at an
-- arbitrary point in the clustered index, forcing page splits.
DECLARE @Counter INT
SET @Counter = 0
WHILE @Counter < 1000000
BEGIN

 SET @Counter = @Counter + 1

 INSERT INTO GuidTable (GuidColumn, SomeData) VALUES (NEWID(), @Counter)
END

-- Fragmentation report 2: after loading the mis-ordered data.
SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')
GO

-- Rebuild the clustered index to defragment it (ONLINE = ON needs an edition that supports online index rebuilds).
ALTER INDEX PK_GuidTable ON GuidTable REBUILD WITH (ONLINE = ON)

-- Fragmentation report 3: after the rebuild.
SELECT * FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')
GO

DROP TABLE GuidTable
GO

This script should give three result sets as output. These report the fragmentation at three points – after table creation, after data loading and after defragmentation.

You should see that after loading the fragmentation is up over 98%. After defragmentation, it’s down to around 1%.
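The figure to watch in those result sets is avg_fragmentation_in_percent. A narrower query over the same DMV makes the three snapshots easier to compare:

-- Same DMV as above, just the columns of interest.
SELECT index_level, page_count, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (db_id(), object_id('GuidTable'), NULL, NULL, 'DETAILED')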

In general, it’s good housekeeping to keep your indexes at low fragmentation. Obviously, if the fragmentation doesn’t affect the performance of the database you’re working with, then it’s a waste of time worrying about it. But do be aware of it, because it’s possible that if your workload increases, you could reach a tipping point where suddenly fragmentation does matter, and matters big time.

We can always defrag indexes; the script above shows how to do this in SQL 2005+. But prevention is usually better than cure, so think about what you’re saying when you create a clustered index. Think about the order that data actually gets inserted in and whether that could affect the performance of the system as a whole, not just the time to insert the data.
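For routine housekeeping there is also a lighter-weight alternative to a full rebuild: reorganising the index in place. A sketch, using the index from the script above (the often-quoted 30% threshold is a rule of thumb, not a hard rule):

-- Lightweight, always-online defragmentation, often used for moderate fragmentation.
ALTER INDEX PK_GuidTable ON dbo.GuidTable REORGANIZE

-- A full rebuild, as in the script above, is usually reserved for heavier fragmentation (commonly quoted as above roughly 30%).
ALTER INDEX PK_GuidTable ON dbo.GuidTable REBUILD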

There are many other scenarios where fragmentation can occur. For example, if you’re loading in sales data from remote shops on a nightly job, you may think it sensible to order that data by the purchase time. But if you load in one shop’s data, then the next, then the next, even if you order each shop’s transactions, the times may look like this as they get loaded in: 9:01, 10:30, 14:35, 8:45, 10:52, 18:16, 9:11, 9:16, 14:30 – data that is very effectively mis-ordered. This may not be as big a problem as GUIDs, as you might be able to determine a fill factor that actually gives you enough space to load in the shops’ data without fragmenting too much, but again, it’s something to be aware of when designing your databases.
