How to blow a ‘breakthrough’

Well, today’s the day that Dizzy joins the mainstream according to Iain Dale:

Eleven days ago I asked a question about Phil Hendren, aka Dizzy Thinks. I said “Why hasn’t a newspaper signed him up yet?” Less than a fortnight later, Phil makes his debut in The Times comment section tomorrow with an excellent piece on the government’s latest big brother plans. Dear oh dear, another blogger signed up by the MSM. Will it ever end?

I think the more pertinent question would be to ask why it even started.

Dizzy’s article is, not to put too fine a point on it, an embarrassment from start to finish; and I say that not just as a political blogger but as an inveterate techie who’s worked, in the past, as a system administrator for a multinational corporation.

It could, conceivably, have been a good, informative piece on the proposed Communications Data Bill, which Gordon Brown announced last week as part of the government’s draft legislative programme for 2008/9, and perhaps it would have been had Dizzy managed to do even the most basic research into the background to the bill. But, as often seems to be the case with Dale and his little coterie of party hack bloggers, concepts like doing research and backing up your arguments with evidence are of little consequence when there’s a seeming opportunity to get in a cheap shot at the government. As a result, Dizzy’s big break turns out to amount to nothing more than a by-the-numbers exercise in overblown rhetoric, tendentious speculation and cod science fiction which describes an ‘Orwellian’ database system that exists only in his own febrile and increasingly erratic imagination.

Let me explain by way of a fisk:

Big Brother is watching you…

…but luckily he’s overstretched and has underestimated the job of keeping track of us all

Why waste time framing a coherent argument when you can go straight for the reductio ad Orwellium:

As any on-line discussion of a government database system grows longer, the probability of a reference to Big Brother approaches one.

The Government is planning to introduce a giant database that will hold the details of every phone call we have made, every e-mail we have sent and every webpage we have visited in the past 12 months. This is needed to fight crime and terrorism, the Government claims.

Seemingly, the creation of a central database may be under consideration, according to the Register, which, unlike Dizzy, knows what it’s talking about and can be bothered to make a phone call and get a comment:

The draft bill is still being considered by ministers and a Home Office spokeswoman told us no decision had yet been reached.

The spokeswoman told The Register: “Ministers have made no decision on whether a central database will be included in that draft bill.”


So it’s early doors at the moment: a central database is a possibility, but there’s some figuring out to do yet before any real decisions are taken, much as you’d expect from a draft proposal.

The Orwellian nature of this proposal cannot be overstated.

Nevertheless, Dizzy’s giving his best shot…

However, there is one saving grace for people who fear for their civil liberties. The probability of the project ever seeing the light of day is close to zero. This proposal – like so many grandiose government IT schemes before it – is technologically unfeasible.

No it isn’t, and this is where, if Dizzy had bothered to do a bit of simple background reading he might have saved himself a considerable amount of embarrassment.

Where is this proposal coming from? Well, unsurprisingly, it’s from a European Union directive (2006/24/EC) on data retention, as the government’s own webpage on the draft legislation explains:

The purpose of the Bill is to: allow communications data capabilities for the prevention and detection of crime and protection of national security to keep up with changing technology through providing for the collection and retention of such data, including data not required for the business purposes of communications service providers; and to ensure strict safeguards continue to strike the proper balance between privacy and protecting the public.

The main elements of the Bill are:

• Modify the procedures for acquiring communications data and allow this data to be retained;

• Transpose EU Directive 2006/24/EC on the retention of communications data into UK law.

All pretty straightforward, then. In fact, the directive provides a very clear and matter-of-fact account of the precise information that will have to be retained. I’ll come to the detail in a moment, but for the time being it’s worth clarifying that the ‘details’ the government is seeking to collate amount to no more than the data necessary to trace and identify the source of a communication, its destination, its date, time and duration, and the type of communication. A ‘communication’, in this case, could be a telephone call, email, text message, access to a webpage, FTP server or Peer-to-Peer service, etc.

As the directive makes perfectly clear:

No data revealing the content of the communication may be retained pursuant to this Directive.

So, when we start to look at the technical feasibility of such a project, the first thing we need to understand is that we will be dealing with only a limited subset of all the data generated and transferred digitally across telephone networks and the internet and not the whole banana.

The current levels of traffic on the internet alone (including e-mail) would require storage volumes of astronomical proportions – and internet use by the public is still growing rapidly. Meanwhile, the necessary processing capabilities to handle such a relentless torrent of information do not bear thinking about. Modern computer processors are fast, but writing data to disks will always be a serious bottleneck.

At this point, it’s necessary to explain exactly what information we’re dealing with here.

To trace the source of a communication, the data required would be:

For a telephone call – the number from which the call was made and the name and address of the subscriber.

For internet access, email and internet telephony (Skype, etc.) – the user ID of the source, the user ID and telephone number used to access the public telephone system and the name and address of the subscriber to whom the phone number or IP address of the communication belonged at the time the contact was made.

Now, that’s all pretty mundane stuff – no more than the kind of information you’d expect your phone company or internet service provider to have anyway…

…and of course they do retain this information for at least a short period of time, both for their own business purposes and because they are already required, by law, to retain it for a set period under the Regulation of Investigatory Powers Act 2000 (RIPA).

Moving on to tracing the destination of a communication, as you might well imagine the data required is equally mundane: the phone number dialled, the numbers of any other destinations if calls are forwarded or re-routed, IP addresses and subscriber/owner information…

It’s the same through the whole section of the directive dealing with the specifications for the data that has to be retained, which includes dates and times, call durations, IMEI numbers and cell locations if mobile phones are used.
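To make that concrete, here is a rough sketch of what one retained record might look like as a data structure. The field names and sample values are my own invention for illustration (the directive specifies categories of data, not a schema), and the point to notice is that there is no field for the content of the communication.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RetainedRecord:
    """One retained communication record: who contacted whom, when, and how.

    Field names are hypothetical; note there is no field for content.
    """
    kind: str                      # e.g. "phone", "email", "web"
    source: str                    # calling number, user ID or IP address
    destination: str               # dialled number, recipient ID or IP address
    started_at: datetime           # date and time the communication began
    duration_seconds: int          # 0 where duration has no meaning
    imei: Optional[str] = None     # handset identifier, mobiles only
    cell_id: Optional[str] = None  # cell location, mobiles only


# Made-up sample values, purely for illustration.
call = RetainedRecord(
    kind="phone",
    source="+44 20 7946 0001",
    destination="+44 20 7946 0002",
    started_at=datetime(2008, 5, 20, 14, 30),
    duration_seconds=95,
    imei="490154203237518",
    cell_id="cell-0042",
)
```

A record like this is a handful of short strings and numbers, whatever the size of the call or download it describes.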

In the wrong hands this is information that could be open to misuse and abuse. The article in the Register I linked earlier briefly notes some of the data security issues, though it’s a techie’s news service, so don’t expect detailed explanations: it expects its readers to understand what terms like data mining mean.

The upshot of all this is that while there is a sizeable amount of information to deal with here, very little of it, if any, is not already captured, stored and retained for short periods by telephone companies and internet service providers as a matter of routine, if not for their own business purposes such as billing and system maintenance, then because they’re already required to by RIPA. And it’s at this point, where Dizzy starts trying to back up his claim that we’re dealing with a ‘technologically unfeasible’ proposal, that he really does drift into flights of fancy.

Take a quick sample from the London Internet Exchange, the UK’s hub and one of world’s largest points at which each ISP exchanges traffic. Yearly LINX carries at the very least 365 petabytes of data – that is the equivalent of the contents of about 26 million iPod Nanos that have the capacity to hold nearly 2,000 songs each. There is no commercial technology that is capable of writing at those kinds of speeds.

It’s not just writing that would be problematic, but the reading of the data too. It would be immensely difficult to pinpoint in such a massive database an e-mail sent by a particular person at a particular time.

Putting up the traffic figures across LINX is complete and utter nonsense, with or without his infinite iPods analogy. The figures given are for all traffic across the exchange: all the webpages, emails, Skype calls, downloads, uploads, peer-to-peer connections, everything – an apples and oranges example which massively exaggerates the actual amounts of data we’re talking about.

For example, if I were to nip over to the BBC’s iPlayer service, right now, and watch the latest episode of Doctor Who, that would result in a data transfer between my PC and the BBC’s servers of around 450-600MB, but the actual amount of information that would need to be recorded, stored and retained to log that communication for the purposes of this bill would amount to little more than the amount of information contained in this sentence, and maybe less.
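For anyone who wants to see the gap in numbers, here is a quick back-of-envelope sketch. The log line is a made-up format of my own (no ISP necessarily logs in this shape) and the content size is simply the lower end of the range I gave above, so treat the result as an order of magnitude and nothing more.

```python
# Hypothetical log line for one streaming session; the layout is made up.
record = "2008-05-20T19:00:00 web 192.0.2.10 -> 198.51.100.7 duration=2700s"
record_bytes = len(record.encode("utf-8"))

# Lower end of the 450-600 megabyte transfer mentioned above (decimal units).
content_bytes = 450 * 1000 * 1000

# How many times bigger the content is than the record that describes it.
ratio = content_bytes // record_bytes

print(f"record: {record_bytes} bytes, content: {content_bytes} bytes")
print(f"content is roughly {ratio:,} times larger than its record")
```

Even allowing for a far more verbose real-world record, the traffic figure and the retention figure are millions of times apart, which is why quoting the former tells you nothing about the latter.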

The storage requirements we’re talking about here are large and will cost a fair amount of money, but they’re not an insurmountable barrier to the creation of such a system, merely a matter of spending enough money on storage which, these days, costs a fraction of what it used to only a few years ago.

Scale is not a problem, merely an expense.

Talk of write speeds and access times is, equally, a complete nonsense.

The fantasy system that Dizzy is describing is one that would operate in real time, with live connections back to the government’s fantasy central database – nowhere in anything that’s been made public about this system, so far, is there any suggestion that that’s what the government are proposing nor would any competent techie assume that that’s what’s being suggested here. Such a system would be impossible to deliver using existing technology, but such a system would also be entirely unnecessary.

ISPs and telephone companies already routinely store this information, for the most part in standard log files which are automatically generated by their servers and telephone exchanges. My own web hosting provider supplies me with server logs, if I wish to use them, which record every connection to this blog: the IP address used, the time and date of access, and what pages and files were viewed. As I’ve left this facility on its standard settings, the system automatically generates and retains a week’s worth of log information on a rolling basis, with a fresh log file generated every 24 hours.

So, right now, in a private folder on the server on which this blog, and the article you’re reading, is hosted, I have six complete text files with a record of all the traffic to this site for the last six days and a seventh live file recording today’s traffic.

And if, for any reason, I wanted to set up my own data retention system – my own central database – all I would have to do is download the latest complete daily log file at the end of the day (and I could set up an automatic job to do it) and load it into a pre-configured database using a pre-written import routine to put the right data in the right place in the database…

…all of which could also be fully automated.
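By way of illustration, here is a minimal sketch of such an import routine. It assumes the logs are in the widely used Common Log Format and loads them into an SQLite database; the table name, column names and sample lines are all made up for the example, and a real ISP-scale loader would obviously need rather more than this.

```python
import re
import sqlite3

# Assumed Common Log Format: ip ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def import_log(lines, db):
    """Load parsed log lines into the hits table; return the number stored."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(ip TEXT, accessed_at TEXT, request TEXT, status INTEGER)"
    )
    rows = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        rows.append(
            (match["ip"], match["when"], match["request"], int(match["status"]))
        )
    db.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


# Made-up sample lines standing in for yesterday's completed log file.
sample = [
    '192.0.2.10 - - [20/May/2008:19:00:00 +0100] "GET /index.html HTTP/1.1" 200 5120',
    '198.51.100.7 - - [20/May/2008:19:00:05 +0100] "GET /feed.xml HTTP/1.1" 304 -',
    'not a log line',
]
db = sqlite3.connect(":memory:")
stored = import_log(sample, db)
```

Pointed at a real file and run from a scheduled job once a day, something along these lines is the whole of the "central database" for a site like this one.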

There is no reason whatsoever why the database that the government may be considering needs to be a real-time system. If the police and/or security services need to monitor someone’s communications in real time, they’ll do that under a RIPA warrant using facilities that they already have in place for carrying out live investigations. Everything else (and the main policing/security purposes of such a database would be collating evidence of past communications, identifying contacts and mining the data for patterns of activity that may help trace or identify suspects in criminal investigations) can be carried out offline, and would be time-sensitive only in the sense that investigators may be ‘on the clock’ in terms of how long they have to pin down usable evidence before a suspect has to be either charged or released.
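The offline work described above is ordinary batch analysis. As a toy sketch of the 'identifying contacts' part, using made-up names in place of phone numbers or IP addresses and a plain list in place of a database:

```python
from collections import Counter

# Made-up (source, destination) pairs standing in for a day's retained records.
records = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("alice", "bob"),
    ("dave", "erin"),
]


def contacts_of(subject, records):
    """Count how often the subject communicated with each other party."""
    counts = Counter()
    for source, destination in records:
        if source == subject:
            counts[destination] += 1
        elif destination == subject:
            counts[source] += 1
    return counts


alice_contacts = contacts_of("alice", records)
```

Nothing here needs a live feed; it runs just as well against yesterday’s data as against today’s.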

So what we’re talking about here is a data warehouse, and a pretty big one, but one that, depending on the retention period specified by government, could weigh in at around the size of, say, Google, with batch-processed updates and search requirements measured in days (although I’m sure the police would much prefer hours), all of which comes down to the quality and efficiency of the system’s search/analysis algorithms.

Technologically unfeasible? Sounds more like bread and butter stuff to me – not cheap, by any means, but not beyond the bounds of possibility either.

It’s all too familiar in large-scale government projects that the technological expectations of civil servants gallop far ahead of reality. The Ministry of Defence’s requirements for the Nimrod radar project was a classic example of overspecification. The result was a system that was unable to process data because the technology Whitehall assumed would exist in the future, when the planes would finally take to the skies, simply never materialised. The planes, after hundreds of millions were spent, had to revert to the traditional Awacs system instead. The men who gave us the new NHS database, likewise, severely underestimated operational realities.

All of which is true, although when it comes to the NHS database the problems it has faced are more a function of the civil service’s inadequacies when it comes to commissioning and project managing large scale data systems than they are of technological over-optimism.

The good news is that we will not be robbed of our privacy by this latest database because it will remain just a pipedream. We taxpayers will, however, be robbed of billions of pounds as the IT consultancies draw up their bids to design and deliver the undeliverable.

If only any of this were true.

Yes, this system will be expensive and yes, it’s a questionable investment, although such things are difficult to assess and put a cash value on.

How do you price up the value of a search which turns up a useful lead in a criminal investigation or one that pinpoints a fresh suspect in a terrorism case? I don’t know and nor, really, does anyone else. Such things are difficult to quantify and open to interpretation. For some the risks of intrusion into personal privacy and civil liberties are too high a price to pay no matter how good or bad such a system might be in practice. For others, such costs are but a pittance compared to the value they place on human life, and any system that might bring a criminal to justice or foil a planned terrorist attack is worth having no matter the scale of the material or other costs attached to it.

What it most certainly isn’t, is a pipedream.

It may turn out to be an expensive white elephant in the long run – the public sector does have a long and ignoble track record of incompetence in dealing with large scale data systems – but it is a system that could be delivered using existing technology, and easily delivered at that. That makes it a cause for vigilance and careful scrutiny, not a basis for complacency and fifth-rate political point scoring.

Phil Hendren is a Unix systems administrator. He blogs at dizzythinks.net

If only Phil/Dizzy had actually done a bit of thinking before he wrote this article then he may not have produced such an embarrassingly poor effort to mark the occasion of his (short lived ???) ‘breakthrough’ into the ‘mainstream’.

If there’s a lesson in this at all for Danny Finkelstein and The Times, then it’s simply that next time you want to take a shot at a government IT project, try hiring yourself someone from the established technical press. The guys over at El Reg and The Inquirer are damn good, which is why us techies rely on them for our main fix of IT news and opinion, and while it might cost you a little more cash to secure their services, at least they won’t embarrass you by trying to pass off a bit of substandard party-hackery as technically competent commentary on a proposed government policy…

…which is what you get when you start hiring ‘writers’ on the back of a recommendation from Iain Dale.

  • Honestly, this seems about par for the course for the MSM.

  • “The guys over at El Reg and The Inquirer are damn good, which is why us techies rely on them for our main fix of IT news and opinion and while it might cost you a little more cash to secure their services,”

    As I write (occasionally to be sure) for both outlets, might I just point out that The Times pays substantially better? Like 2x?

    The rest of it, well I have to admit that when I read it I was thinking along the same basic lines: it’s the comms records, not the comms themselves.

  • You missed out data warehousing and online backup, which are the greatest consumers of bandwidth on the hosting system I operate. The LINX numbers were inappropriate for his argument.

    So yes, you’re right. Not that this makes Dizzy wrong in his broad point, though: there’s little reason to have confidence in large scale public sector IT projects. Moreover, feeling disquiet at the government’s enthusiasm for surveillance is a matter of perspective and opinion, not fact. The analogy with Orwell is frequently overused but arguable in this case.

    The ad hominem stuff is boring, but par for the course.

  • A couple of points of error in your article Unity I’m afraid.

    You’re actually misrepresenting the point _entirely_ on the LINX figures. The LINX figures have been massively underestimated by myself. I have taken the figures I could for a year sample at LINX. I then took the average traffic bps rate, I then lopped off 50% arbitrarily to chuck out a lot of noise. The purpose was to build a picture of what the whole UK Internet traffic actually looks like because it bears a massive significance to the suggestion of a centralised system.

    Next up, let us deal with the “real time”. This is a neat bit of misrepresentation of my article on your part because at no time did I use such a phrase. What I was referring to, however, was that if you take the figure of traffic at LINX, and you have a centralised system, that data has to get into that system somehow, and it _will_ be on a daily basis, when you add each provider together, along the scale of petabyte raw data.

    So next you have to ask yourself the question, how does that data get into a centralised system. It is all very good, as someone mentioned on my blog, thinking about data warehousing. However, on a small scale that is not going to be a problem. On a scale being proposed it is, period. You are going to get massive I/O throughput problems trying to feed that level of data into any system, however good the software you use is to compress it on the fly.

    Now let us get on to retention policy. If, as the proposals, however solid, have suggested, you have a central database with a 12 month retention policy, that would mean each day you purge off the day that was 365 days previously. At which point you feed in another day of data. The length of time it would take to feed such massive amounts of data in for a start becomes a problem (see I/O) and then you have the problem of actually getting it from service provider to central database ready for feeding in.

    This is exactly what happened with NHS database incidentally, which was edited out. The NHS database ended up being so big it took longer to backup than the period that it was backing up for. Why? I/O. Now, on the point about your log files, with the greatest of respect they’re nothing compared to the log files of an ISP which will record far more information and would also need to go through data cleansing before they could be fed into a data warehouse. More time delay. Let’s not even start to wonder upon the I/O bottleneck of searching such a database for something.

    I’m not denying you’re technical Unity, but you’re more of a developer than an operational engineer, so I’m not surprised that you would simply brush aside something as “oh it’s just a simple import routine”. It isn’t anything of the sort. The operational reality of this scheme, with a centralised database, is technological nonsense. It might look great on a whiteboard when an architect draws it up, but as an engineer, and someone who works with nothing but engineers, when they read the news on Monday there was much frivolity about how it’s crap.

    I can go on later but I am leaving the office. I will say this though, what is this rubbish about party political stuff? Where was I party political?

  • Dizzy, how up to date are you on data warehousing? I’m not being critical, but commercial petabyte databases based on sophisticated partitioning and clustered commodity servers are already here. CERN’s LHC is going to bang out somewhere in the region of 15 Petabytes of data a year. The Wayback Machine must be well over the 2 Petabytes it was approaching a couple of years back. Take a look at what Sybase are up to these days as well,

    http://www.sybase.com/detail?id=1056945

    Solutions to balancing reporting availability against data load/purge cycle times have been around since before the days when 1Gb was a big database and a VAX 11/780 was leading the herd. Remember, the database you’re reporting from doesn’t have to be the one you’re currently loading. After all, the proposed hypothetical (deliberate choice of words) centralised database wouldn’t be used by DC Plod firing up a copy of Hyperion, Business Objects or even (God forbid) Crystal Reports and hacking a quick and dirty query to fire off against yesterday’s data.

    And your quoting the LINX throughput is misleading, because it isn’t message content that’s being collected. For starters, what the bloody hell would they do with all the pr0n images and other noise that’s rushing around the net? I’d estimate (back of fag packet style) at somewhere between 15 and 20 Petabytes for a year, based on header-only information storage.

    And the objective isn’t to provide evidence sufficient to convict either, because most of the data just isn’t that detailed. For example, the guys at Global Crossing could pin the use of a mobile to a given cell. But that only gives a general indication of where a specific mobile was at a specific point in time. Depending on the cell, it could be pinned down to less than a square kilometre or more than 100 km2? And if it is an unregistered SIM then game over. However, if you already have the phone in your possession along with other evidence obtained from a suspect, then that cellphone info isn’t quite so useless. Now you can begin to use movement information to help identify new leads, new places to look for further evidence. Remember, none of the data would conclusively identify an individual to meet the burden of proof in a court of law. “My phone your Honour? Yes, but it wasn’t me using it.” No amount of text message content or call log info is going to pin the suspect.

    Now, do I think consolidation of RIPA-style data is doable from a technological standpoint alone, then the answer would be yes. It wouldn’t be cheap or easy though as an academic exercise with an unlimited budget it would be fun, fun, fun. Doable by any of this Governments favoured IT suppliers? No way. I wouldn’t trust most of them to sit the right way round on a lavatory and not nick the loo roll, let alone try to build and deploy something like this.

    It isn’t cost effective and there are plenty of other solutions kicking around now which would do the job without the glory of expensive failure. Tools like Syncsort can be thrown against flat files to great effect, narrowing data sets prior to further cleansing and reporting. It doesn’t need to be done, it’s only a proposal and in the current climate I doubt it’ll ever make it on to paper as a requirement, let alone on to the Statute. And sure it’s dumb. But it isn’t a bazillion iPods humming in the dark either (now that is a bizarre image).

  • I don’t understand half of this, ‘cos I’m just some brainless dimwit in marketing.

    But I do know nepotism when I see it.

    [Psst! Phil Hendren is a Unix systems administrator… who publishes ex-directory phone numbers of people he doesn’t like on his website when he’s having a bad day at work. Not that I’m saying he’s unprofessional or anything like that. No, the strong wind in the background made it pretty clear that ‘Dizzy’ left his desk and walked out of the building before calling my home to scream at me and tell me that I should “get a grip”.]

  • Oli

    Having seen first hand several attempts at government IT systems being centralised, even those with good intentions, I can pretty much guarantee that this turns into another billions of pounds over budget fiasco.

  • The guys over at El Reg and The Inquirer are damn good, which is why us techies rely on them for our main fix of IT news and opinion

    You obviously don’t talk to the same techies as me. Most of the people I’ve dealt with have a lower opinion of El Reg than DK does of the Home Secretary.

  • Dizzy,

    “This is exactly what happened with NHS database incidentally, which was edited out. The NHS database ended up being so big it took longer to backup than the period that it was backing up for. Why? I/O.”

    You got a link to that, because that seems really strange.

  • What stops me running VOIP over a VPN? Or using Tor? Or finding an open 802.11 connection with a NATting firewall?