Right, let me warn you in advance that what follows is going to be deeply geeky and while I’ll try my best to put this in layman’s terms it may only make full sense if you possess at least a fair understanding of how databases work.

This evening, in common with a number of other bloggers (judging by the comments) I acquired a copy of the internal emails relating to the two CDs of child benefit data that have been lost in transit between HMRC and NAO from Nick Robinson’s blog on the BBC website. To save time I’ve uploaded the emails (pdf) to MoT, where they can be downloaded from this link:

informationrelatingtochildbenefitdata.pdf

I’ve since read the emails (naturally) with an eye to trying to understand the technical aspects of this whole farrago, and what I’ve found raises even more questions as to what’s going on HMRC.

For our starting point, lets take these two emails:

hmrc4.gif

hmrc3.gif

We’re close to the start of the whole mess, with HMRC sending some sample data between sections and then on to NAO for analysis in advance of providing a full set of data, to allow NAO to figure out how the information is recorded and what information they actually want.

Now let’s get a bit technical.

The live system, from which the data on the two lost CDs was extracted, runs on a mainframe computer and this, therefore, limits considerably the range of possible systems that could be in use to either a completely bespoke system or – much more likely – a bespoke front end system linked to a corporate-grade relational database management system, something of the order of IBM’s Informix or DB2 platforms, Oracle, Sybase or CA-IDMS.

We’re talking big league systems here and that, in turn, places definable limitations on the technical aspects of moving that amount of information (25 million records) from A to B (or rather from HMRC to NAO).

In practice, you have two possible options: the information could be sent in a proprietary binary database format, in which case there has to be a compatible database platform at the NAO’s end of the transaction in order for them to get at the data, or you have to send the information over in a common text-based format, such a comma separated variable (CSV) file, an SQL ‘dump’ (a text file containing a string of SQL commands required to recreate the database on another compatible database server) or, if you’re really up to date, an XML document.

The reference to the NAO having to ‘decipher’ the information militates very strongly against the possibility of the transfer of information being carried out via a proprietary binary database format – if that was the method in use then the NAO would simply import the data files into its database server and set its own techs to work on extracting the relevant information via the database platform. ‘Decipher’ in this context, makes sense only if a text-based format is being used, and particular one that does not contain information about the structure of the data, which almost certainly rules out XML and suggests than an SQL dump is unlikely to have been used, as the NAO would not have to decipher the data, only the SQL commands relating to creating the database structure necessary to house the data.

(The comments also strongly suggest that the level of technical understanding amongst those making the key decisions here is nowhere near what it should be.)

Moving on, HMRC received a reply on the same day that sheds rather more light on the technical situation.

hmrc2.gif

The key point is (c) where the NAO asks whether the data file will ‘have initial headings’ – that clearly rules out the possibility of the data being in the form of an SQL dump (and XML as well) or the use of a binary format – what is being referred to here is a text file, most likely a comma separated variable file.

I suppose I should, at this point, explain what a comma separated variable file (CSV) is and how it ‘works’.

A CSV file is a text file containing a number of ‘records’, each of which is made up of a set number of ‘fields’, each of which is a single piece of information. Each field, and each record is separated from the next using a specific character to indicate the end of a field and the end of a record. Let’s use a simple example of what a record looks like.

First Name, Last Name, House Number, Street, Town, Post Code

Each item of data (field, i.e. First Name, Last Name) is separated from the next using a comma and there’s a carriage return/line feed (the character(s) you get when you press the enter key while typing, at the end of each record.

When you import a CSV file into a programme that handles this format, a spreadsheet like Excel or a database programme like Access, these two special characters (commas and carriage returns/line feeds) are used by the programmes import routines  to split the data up into discrete records, and split the data in each record up into its constituent fields.

By long established convention, the very first record in data file in this format can be used to send information which identifies the nature of the contents of each field, so if you have simple address file that looks like this:

Record Number, First Name, Last Name, House Number, Street, Town, Post Code

1, Joe, Bloggs, 10, Any Road, Any Town, AA1 1AA

2, Fred, Smith, 11, Another Road, Another Town, XX1 1XX

The first line (record) gives you the headings that tell you what type of information should be in each field of the records that follow – and if you don’t send that information then the end user will have to add it themselves in order to make sense of the data, hence the question about initial headings; and you would only ask this question if what you were being sent was a text file in CSV or similar format – in the case of the other possible formats the information about what each field contains is ‘built’ into the file and need only be read by the relevant software.

Why is this relevant?

Well, because if CSV file or a similar text based file was being used then, provided one knows which special characters are being used to delimit fields and records, the various tasks that the NAO is talking out in this e-mail, i.e. segregating certain information and dropping other items (e.g. bank information) could be undertaken on a reasonably well specified desktop PC by anyone who is fairly competent in working with common spreadsheet or database applications, like Microsoft Excel or Access, both of which can import this format.

The biggest obstacle facing anyone handed this task in no more than the sheer amount of data than would need to be imported/processed – a single file containing 25 million records would be too big for Excel or Access (using its standard database engine) to swallow in a single sitting.

However, if we assume the data filled the two CD’s completely (1.3Gb) and a typical zip compression rate of around 80-85% for text, there would be around 6.5-8Gb of data in total to be processed, then using Access as a front end to a Microsoft SQL database, or MySQL (which is free) would be an option, as would using a relatively cheap piece of text processing software to split the file into ‘bitesize’ chunks of a size that could then be imported to Excel or Access. And that’s assuming you stick with Microsoft, as this is one of those jobs that the various limitations of Windows and M$s’ other applications makes tricky but which would actually be relatively easy to pull off on Linux box.

This is a job that could be done on a decent desktop PC by a competent junior IT/Application support technician in a couple of days (allowing for processing time) at very little cost in real terms, and – coming from the other direction – unless we’re dealing with a completely bespoke system, it also a job that EDS could have done as an overnight batch run for very little cost either. If the live data is stored using a mainstream corporate database platform, then I’d estimate no more than a half day’s coding/testing time to write the batch job and query(ies) necessary to extract the precise (and sanitised) data that the NAO indicated that it required.

All of which brings me to the critical email.

hmrc1.gif

This is the now infamous email in which HMRC bleats about the additional cost of providing the precise information that the NAO requires (never mind the cost to the NAO of it doing its own data extraction on such a large dump file) and I’m just not convinced.

There is something serious fucked-up in an IT services contact in which the provision of audit information is a chargeable extra – the correct response from HMRC in EDS responds to such a request with ‘it’ll cost you £x if we do this’ is ‘it’ll cost you the contract if you don’t’ and based on what I can glean from these emails this should not be an expensive task unless you’re dealing with a completely bespoke system and need a custom programme written specifically to do the scan and extract.

From what I can see, part of the investigation into the circumstances in which this data was lost needs to look very carefully at HMRC’s contract with EDS – audit information should be an optional extra – and at whether HMRC actually enquired as to the cost of sanitising this data – the Telegraph has gone on record as suggesting that the actual cost to HMRC would have been in the region of £5,000:

The Daily Telegraph – which has revealed how the government had ignored repeated warnings over the safety of personal data – has established that HM Revenue and Customs officials could have avoided the scandal if they had paid £5,000 for computer data supplier EDS to edit the data on the discs.

Five grand still seems a bit on the high side, but compared to the potential costs if the CDs did get into the wild it’s a pittance.

Fuck up from start to finish doesn’t come close.