Purnell’s Lie Detector – How it actually ‘works’

Okay, it’s time for the third and final instalment of my investigation into the scientific evidence behind the DWP’s trial of Nemesysco’s ‘Layered Voice Analysis’ technology…

…and in this episode I intend to blow the lid off the system completely and explain exactly how this scientifically invalid piece of junk software has managed to gull several insurance companies and, more recently, the DWP into thinking that it might actually work as its inventor claims.

THE STORY SO FAR

Based on the analysis of the original patent for this system conducted by Anders Eriksson and Francisco Lacerda, what we’ve already established is:

1. There is no known scientific theory that explains exactly how humans detect emotional content in speech and, certainly, no evidence to support the idea that a piece of computer software can perform this highly complex task with any kind of accuracy or reliability.

2. The claim, made in the patent, that this software analyses ‘intonation’ in speech is, at best, founded on a wholly superficial understanding of acoustic phonetics and, at worst, is simply inaccurate.

3. The signal processing carried out by the software on the low quality samples it uses effectively strips the speech signals captured by the system for analysis of almost all their useful information, and

4. The analysis itself is mathematically unsophisticated and consists of nothing more than a few very basic statistical calculations on two arbitrarily defined types of digitisation artefact present in the speech signal after it has been sampled and filtered to the point that much of the signal, were it to be played back to the operator, would be barely recognisable as human speech.

In short, the method on which the software is based is entirely meaningless, as is the system’s output which, to all intents and purposes, can be taken to be as near random as makes no difference within certain limits.

So, that being the case, just how have several insurance companies and the DWP been sold on the idea that the system might actually work?

SURFING THE BOGUS PIPELINE

A major part of the answer lies in an independent field trial of the system which was conducted by researchers from the University of Oklahoma, with funding from the US Department of Justice, and published in June 2007.

The study, by Damphousse et al., tested the system on a random sample of recently arrested prisoners at an Oklahoma county jail. These prisoners – 319 in total – were questioned about their recent drug use (i.e. in the days immediately before they were arrested) using either the LVA system or a rival ‘voice stress analyser’, after which they were subjected to standard toxicology tests, the results of which were used to identify exactly which of the prisoners had used which drugs before being arrested and, of course, which of them had lied to the researchers about their drug use during the interviews.

If you’ve read the previous article, which covered the laboratory trial of this system by Hollien and Harnsberger, you’ll have already realised that this is precisely the kind of trial that meets and overrides the objections raised by Nemesysco’s then-US distributors to the test protocols used in Hollien and Harnsberger’s study.

And, if you’ve read and understood that article, then you won’t be the slightest bit surprised to find that, yet again, ‘layered voice analysis’ proved to be a dismal failure when it came to identifying deception; of the 15% of prisoners who were found to have lied to the researchers about their recent drug use, Nemesysco’s LVA system successfully identified only 20% as having tried to deceive the researchers and produced the now familiar pattern of high error rates seen in Hollien and Harnsberger.

However, Damphousse et al did manage to identify and, somewhat unusually, quantify a very interesting ‘side effect’ of using this technology when they compared the data from this study with data from a near identical study, conducted three years earlier, a study in which ‘lie detectors’ were not used during the interview stage of the study.

In this earlier study, the research for which was conducted in 2003, it was found that around 40% of the prisoners whose toxicology report indicated that they had recently used one or more narcotic drugs, had lied to the researchers when questioned about their recent drug use.

In the study conducted by Damphousse et al, only 15% of the prisoners who had recently used drugs lied to the researchers about their recent drug use. Prisoners at this particular county jail are, it seems, around 60% less likely to lie about their recent drug use if they believe that such a deception may be detected by the use of a ‘lie detector’ during the interview.
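To pin down the ‘around 60%’ figure, the short Python sketch below works through the arithmetic using only the two lying rates quoted above (40% without a ‘lie detector’, 15% with one). The function and variable names are my own, and the comparison assumes that the prisoners in the two studies are broadly comparable.

```python
# Lying rates quoted in the text: about 40% in the 2003 study (no 'lie detector'),
# 15% in Damphousse et al. (device present during the interview).
rate_without_device = 0.40
rate_with_device = 0.15


def relative_change(before: float, after: float) -> float:
    """Proportional change from `before` to `after` (negative means a fall)."""
    return (after - before) / before


# Fall in the rate of lying, expressed as a positive proportion.
lying_drop = -relative_change(rate_without_device, rate_with_device)

# Rise in the rate of honest answers (honest = 1 - lying).
honesty_rise = relative_change(1 - rate_without_device, 1 - rate_with_device)

print(f"Lying falls by {lying_drop:.1%}")
print(f"Honest answers rise by {honesty_rise:.1%}")
```

On these figures the share of honest answers rises from 60% to 85%, so the ‘around 60%’ describes the fall in lying rather than the rise in honesty.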

This is not new information; in fact this effect, which was first identified as long ago as 1971, is a well known and understood feature of ‘lie detection’ technologies, so much so that it has a name all of its own: the bogus pipeline effect.

Put simply, people do not like being ‘second guessed’ by a machine and are much less likely to lie or give false information under questioning if they believe that there is a high likelihood of those lies being accurately detected.

Although researchers have known about the bogus pipeline effect for coming up to forty years, rarely have any studies been published which have attempted to quantify the effect and ascertain just how much more likely people are to give honest answers if told that they are being assessed using a ‘lie detector’, and the answer that Damphousse found is that they’re a hell of a lot more likely to answer questions honestly under such conditions: the rate of lying fell by about 60%.

At this point, it is vital that you understand one very important fact – the bogus pipeline effect has absolutely nothing whatsoever to do with whether or not a ‘lie detector’ actually works. This effect is all about belief – if the subject believes that any lie they tell stands a very high chance of being detected then they’re much more likely to give honest answers to questions, but this only works for as long as they believe that the system is capable of detecting lies. If that belief is shown to be wrong, then the ‘benefits’ of the bogus pipeline effect are lost, entirely.

It’s very much like stage magic, where the trick is only entertaining if you don’t know how it’s done.

Whoops-a-daisy – I’ve just realised that I’ve blown the trick. What a shame.

The important point to note about the bogus pipeline effect is that the higher the stakes riding on a lie, the less likely it is that the effect will induce an individual to provide honest answers. In comparing the two studies, what Damphousse et al. found is that while introducing what the subject believed to be a ‘lie detector’ into the interview had a significant impact on prisoners who had recently used only cannabis (and who were much more likely to tell the truth when they believed a lie detector was being used), it had little or no impact on prisoners who were found to have used ‘hard’ drugs like heroin, methamphetamine and crack cocaine.

What this means for the insurance companies using this system in the UK (and the DWP, of course) is that, while the system does scare off a sizable amount of the kinds of small scale and opportunistic frauds committed by the kind of people who fail to report a notifiable change in circumstances or add a few quid on to the value of an item that they’re claiming for on their insurance, it has little or no impact on high value, deliberate frauds. In fact, due to the overall unreliability of the system, if an insurance company or local authority comes to rely heavily on this technology to triage claimants, fast-tracking through the claims system any claimants who ‘pass’ the ‘lie detector’ test, then it is likely that the system will passport through, without investigation, as many fraudulent claims as it actually appears to detect, significantly reducing the likelihood of those fraudulent claims being detected when compared to the checks that would have been carried out were the system not in use.

Based on the published data relating to the Harrow Council pilot, a claimant contacting the council in an effort to commit a deliberate housing benefit fraud has a less than 1 in 20 chance of being detected by this system, which is pretty good odds if there are significant sums of money at stake.

To ‘counter’ this issue, the companies and councils using this system will undoubtedly point to the fact that the system is not used on its own to assess claimants; rather it is used by a trained operator who makes use of both this technology and a collection of interview practices and techniques devised by DigiLog UK to try and catch out fraudulent claimants by looking for inconsistencies in their responses to questions. Putting in the human operator to second guess the technology is likely to ‘improve’ its apparent performance to a small extent at the outset, but even here the technology has a well known psychological ‘trick’ to play on its operator.

As with the introduction of any new and unfamiliar technology, the individuals operating the technology will initially be somewhat sceptical of its capabilities and are, therefore, highly likely to actively second guess the system and override its assessment if it fails to match their own subjective impressions of a particular claim/claimant. However, over time, and as long as the system appears to confirm the operator’s judgement more often than it disagrees with it, the operator will come to rely more and more heavily on the system and become less likely to second guess its assessment until, eventually, the situation that existed at the outset is turned on its head and the operator begins to use the system to second guess their own judgments, deferring to the system’s output even if they have their own, subjective, misgivings about a particular claimant.

The bogus pipeline effect accounts for a sizable proportion of Harrow Council’s claimed cost savings arising from this trial, although the exact amount seems to be something of a ‘moveable’ feast.

In May 2008, The Inquirer reported that Harrow were claiming an estimated £420,000 in savings, although this figure appears to have been reached in a rather dubious fashion:

The figure was calculated on the assumption that 132 people who refused to complete the voice-risk analysis assessment would otherwise have tried to cheat the system; and that 500 people who, though they had been flagged as low risk, had declared their personal circumstances had changed and no longer needed benefits would also have otherwise attempted to cheat the system.

Such assumptions are, of course, invalid without additional supporting evidence.

By February of this year, The Times were giving Harrow’s cost savings as ‘over £336,000’ but gave no detail as to how this money was allegedly saved and, as I revealed only yesterday, the pilot actually cost the DWP £125,000, not the £63,000 cited by The Times in this same article.

By last week, when the Guardian got on the case, Harrow were claiming that the system saved them £110,000 in benefits payments, £15,000 less than it cost to run the first year of the trial, according to the DWP.

Oops, it looks like the numbers just don’t add up, but that’s beside the point because, in addition to scaring off a bit of small scale fraud, the other route to saving money is that of making ‘efficiency savings’ on the actual processing of claims, savings that can only be made in any significant quantity by relying heavily on the technology to triage claimants and by then fast-tracking those claimants flagged by the system as ‘low risk’ through the rest of the claims process with little or no further investigation – and as we’ve already seen, the high rates of false positives and false negatives the technology generates make it highly likely that the system passes as many fraudulent claims through to payment without further checks as it does (accidentally) identify.

The bogus pipeline effect goes a long way towards explaining how and why the DWP and several insurance companies have been taken in by this system.

The belief, however mistaken, that ‘lie detection’ technology may actually work is enough to prompt some would-be claimants to drop their claims entirely when asked to undergo ‘the test’ and will persuade others not to try it on and inflate the value of their claim in the hope of screwing a bit of extra cash out of their insurer – and from the evidence in Damphousse et al, the effect can be pretty dramatic. At current estimates, fraudulent claims amount to about 5-5.5% of both the total number of housing benefit claims submitted annually and the total amount of benefits paid to claimants. In real money that adds up to an estimated £480 million a year in overpayments, slightly less than the amount the DWP ends up incorrectly spending due to its, and local authorities’, own screw-ups (£525 million), but that’s still a lot of cash and a reduction in that sum at the level found by Damphousse (60%) weighs in at a whopping £288 million a year in potential savings…

…but only as long as the public remain blissfully unaware of the fact that the system doesn’t actually work as its inventor claims and that ‘voice risk analysis’ is nothing more than a fancy (and, at £2.4 million for two years of the DWP trial, expensive) pretext for conducting random ‘fishing expeditions’.

So, if all you’re concerned about is the ‘bottom line’, which is precisely what the insurance companies using this system will almost exclusively be concerned with, it becomes fairly easy to overlook the system’s little ‘foibles’ – like the 60%+ error rates it routinely generates – and I think that there’s little doubt that this has played a big part in the DWP’s thinking as well, especially as it’s under heavy pressure from the Treasury to cut both fraud rates and administration costs.

It is also this last point that makes local authorities an easy target for these trials, as the DWP is currently in the process of scaling back the subsidies it provides to local authorities to cover the costs of administering housing benefit, just at the time that we’ve run into a recession, which will necessarily lead to more housing benefit claims and increased administrative costs. According to its 2008 annual report, the DWP were looking to cut the annual cost of administrative subsidies to the local authorities from £680 million (for 2007/8) to £565 million a year by 2010/11. Given the choice that creates – cut costs or raise council tax – it’s hardly a surprise to see councils jumping on a system which, it’s suggested, will generate significant reductions in fraud and efficiency savings.

CALIBRATION, CALIBRATION, CALIBRATION

The Bogus Pipeline Effect is only one part of the story – a major part, admittedly, but not enough on its own to cover the obvious failings of this technology when it’s put to the test under conditions in which Nemesysco cannot exert any kind of influence over the results.

Selling this technology to insurance companies (and the DWP and local councils) takes rather more than simply pointing them towards the bottom line and hoping they don’t notice the error rates. In particular, both the insurance sector and the DWP have their own estimates for the levels of fraud within their respective systems, and for any technology which claims to identify ‘high risk’ claims/claimants this presents an important challenge. The technology, when put to use, must generate results – in terms of the number of claimants flagged up as ‘high risk’ – which broadly meet the client’s own estimates and expectations for the level of fraud and the number of dishonest and fraudulent claims they’ll typically receive in any given period of time.

If, when it’s tested, the technology flags up either too few or too many ‘high risk’ claimants then the illusion that the technology might actually work will fall apart very, very quickly.

One way to do this, of course, is by producing a system that actually works, and we already know that that’s not the case here. All is not lost, however, because the basic characteristics of this particular technology actually make it possible to ‘calibrate’ the system for use in such a way that it appears to generate results in the same ballpark as the client’s expectations, even though it doesn’t actually detect stress, deception or, indeed, anything else…

…and here’s how it can be done.

To begin with, you need to understand a couple of important things about exactly how this technology produces any kind of results at all.

We’ve established, with the help of Eriksson and Lacerda’s work, that the method of analysis used by the software is arbitrary and entirely meaningless and, as a result, the scores the system generates for each of its four primary and anything up to 18-20 secondary parameters when carrying out its ‘analysis’ are equally meaningless.

However, meaninglessness, in this case, has a very useful property because what this means is that, for any of the four primary parameters and, particularly, the parameter which allegedly measures stress (and therefore, deception), the output score generated from analysing a speech signal, when broken up into segments of about half a second per segment, is, within certain identifiable limits, as near random as makes no difference.

So what are these limits?

Actually they’re nothing more than maxima and minima which define a range of scores into which a majority of the scores output by the system will fall – ideally what you’re aiming for is a 95% plus range. To find this range you simply feed the system a number of test samples of different people speaking, and it doesn’t matter in the slightest what the speech is or whether what the individual in each sample is saying is the truth or a lie. Run enough test samples through the software (and 100-200 will probably do nicely) while collecting the output scores for analysis and you have everything you need to identify both the range of possible scores to 95% reliability and the average score, and when you have that information you have everything you need to ‘calibrate’ the system.
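As a rough illustration of that step, the Python sketch below takes a batch of collected scores and reads off the central 95% range and the average. It is only a sketch under stated assumptions: I don’t have the software’s real output, so the 200 ‘scores’ are random stand-ins drawn from an arbitrary range, and the function name `calibration_range` is my own invention, not anything from Nemesysco’s system.

```python
import random
import statistics


def calibration_range(scores, coverage=0.95):
    """Return (low, high, mean): the central `coverage` share of scores and their average."""
    ordered = sorted(scores)
    n = len(ordered)
    # Number of scores to trim from each tail (2.5% per tail for 95% coverage).
    trim = int(((1 - coverage) / 2) * n)
    low = ordered[trim]
    high = ordered[n - 1 - trim]
    return low, high, statistics.mean(ordered)


# Stand-in data: 200 random 'stress scores'. The 10-30 range is purely illustrative
# and is NOT taken from the real software.
rng = random.Random(0)
test_scores = [rng.uniform(10, 30) for _ in range(200)]

low, high, mean = calibration_range(test_scores)
print(f"95% of scores fall between {low:.1f} and {high:.1f}; average {mean:.1f}")
```

Nothing in this procedure looks at whether any speaker was telling the truth, which is rather the point.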

So what does ‘calibrating the system’ actually entail?

Well, for each of the output parameters the system has an ‘output threshold’, the exact level of which can be set by the operator. This too is utterly meaningless, in scientific terms, but of critical importance when trying to generate a convincing looking set of results because this threshold level determines which scores the system flags up as indicating that an individual is stressed, and which scores it ignores. Any half second sample which generates a score for the stress parameter higher than the operator-determined threshold will be flagged up as indicating stress, generating an on-screen message – and that, together with a polygraph style graphical ‘trace’, is all the information that the operator has to go on when deciding whether a particular claimant is lying in response to a specific question and, therefore, either ‘high risk’ or ‘low risk’.

Setting the output threshold for the system’s ‘stress parameter’ doesn’t tell you anything whatsoever about whether the claimant is either stressed or lying at any given point in the ‘analysis’ but, because the output scores are, in practice, as near random as makes no real difference, setting the output threshold level to a particular score does have one very useful effect – it determines, over a number of tests, roughly what percentage of claimants will be flagged up by the system as being either high risk or low risk.

Let’s explain this with a hypothetical example. Imagine that the calibration tests indicate that 95% of all scores for the stress parameter will fall somewhere between a minimum score of 10 and a maximum of 30, with an average score for all tests of 20.

From that information we can create a simple calibration scale in which setting the output threshold to the minimum score (10) will result in the system flagging up all claimants as being stressed and/or deceptive and setting it to the maximum (30) will result in no one being flagged by the system (except for the fact that our range is only 95% reliable, so a few outlier scores are likely to crop up which still generate a positive or negative result when the system is set to the minimum or maximum point on the scale). Because the average score is 20, we also know that setting the output threshold to 20 will flag up half our claimants as high risk and the other half as low risk, and that, assuming the scores are spread evenly across the range, if we move the threshold level by a value of 2 in either direction within the range we will be increasing (or decreasing) the probability of an individual claimant being flagged as high risk by 10%.

The upshot of all this is that, if we know that the client expects that 20% of all claims submitted to them will be fraudulent, then (using the hypothetical figures above) we need to set the output threshold level to a score of 26, as this is the level that simple probability predicts will give us around a 20% level of ‘high risk’ clients.
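The hypothetical calibration scale can be written down in a few lines of Python. This is a sketch of the arithmetic only: it assumes the scores are spread evenly (uniformly) between the hypothetical minimum of 10 and maximum of 30, ignores the 5% of outliers, and uses function names of my own choosing.

```python
def threshold_for_flag_rate(target_rate, score_min=10.0, score_max=30.0):
    """Threshold above which roughly `target_rate` of scores fall, if scores are uniform."""
    if not 0.0 <= target_rate <= 1.0:
        raise ValueError("target_rate must be between 0 and 1")
    return score_max - target_rate * (score_max - score_min)


def flag_rate_for_threshold(threshold, score_min=10.0, score_max=30.0):
    """Share of uniformly distributed scores that exceed `threshold`."""
    clipped = min(max(threshold, score_min), score_max)
    return (score_max - clipped) / (score_max - score_min)


# Thresholds needed to flag 50%, 20% and 10% of claimants as 'high risk'.
for expected_fraud in (0.50, 0.20, 0.10):
    threshold = threshold_for_flag_rate(expected_fraud)
    print(f"client expects {expected_fraud:.0%} fraud -> set threshold to {threshold:.0f}")
```

On this scale every 2 points of threshold is worth 10 percentage points of claimants flagged, and a client expecting 20% fraud gets a threshold of 26.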

Flagging the requisite number of claimants up as high risk is not enough, on its own, to create a plausible looking set of results. The system gets it wrong at least as often as it hits on a genuine fraud by chance and, when used in a live setting, high risk claimants will be investigated, revealing which (and how many) of the calls made by the system were correct and incorrect.

To illustrate this point, imagine that we have a group of claimants in which it is estimated that half are attempting to defraud the system and that the system is calibrated to reflect this information, i.e. it will identify 50% of the claimants as being high risk. With two options to choose from, high risk or low risk, the system will make the wrong call as often as it makes a right call and score a hit on only half the actual fraudulent claimants. If, on the other hand, our estimate for the level of fraudulent claims is only 10% and we ‘calibrate’ the system accordingly, our chance of scoring a successful hit falls dramatically in line with the lower number of actual fraudulent claims in the overall pool and the reduced calibration setting – at 10%, only one in ten of the high risk claimants will be found to be a genuinely fraudulent claim, giving the system a 90% error rate, a rate which will inevitably raise suspicions in the client that the system isn’t all it’s being cracked up to be.
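A quick simulation makes the same point. The Python sketch below flags claimants purely at random, with no connection at all to whether a claim is fraudulent, which is the assumption this whole argument rests on. The function name, the pool of 100,000 claimants and the fixed random seed are my own choices for illustration.

```python
import random


def simulate_random_flagging(n_claimants, fraud_rate, flag_rate, seed=0):
    """Flag claimants at random, independently of whether they are fraudulent.

    Returns (share of flagged claimants who are fraudulent,
             share of fraudulent claimants who were flagged).
    """
    rng = random.Random(seed)
    frauds = flagged = hits = 0
    for _ in range(n_claimants):
        is_fraud = rng.random() < fraud_rate
        is_flagged = rng.random() < flag_rate  # independent of is_fraud: no real detection
        frauds += is_fraud
        flagged += is_flagged
        hits += is_fraud and is_flagged
    hit_share = hits / flagged if flagged else 0.0
    caught_share = hits / frauds if frauds else 0.0
    return hit_share, caught_share


# The two scenarios from the text: system 'calibrated' to match the estimated fraud rate.
for rate in (0.50, 0.10):
    hit_share, caught_share = simulate_random_flagging(100_000, fraud_rate=rate, flag_rate=rate)
    print(f"fraud rate {rate:.0%}: {hit_share:.1%} of flagged claims are fraudulent, "
          f"{caught_share:.1%} of fraudulent claims are flagged")
```

With random flagging, the share of ‘high risk’ claimants who turn out to be fraudulent simply tracks the underlying fraud rate, and the share of fraudsters caught tracks the calibration setting.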

To produce plausible looking results, it is necessary to find a plausible method of increasing the apparent number of direct hits to a level that makes it appear, to the client, that the system is actually working with a sufficient degree of accuracy not to cause the alarm bells to start ringing.

Fortunately, we have a couple of things going in our favour here.

One is the bogus pipeline effect, which can be ‘sold’ to the client as a deterrent that will automatically result in the system identifying fewer actual frauds than might otherwise be expected, due to its ‘scaring off’ a proportion of those claimants who might otherwise have tried to submit a fraudulent claim were the system not being used. That, alone, gives you some ‘play’ with the client when, after investigations have been carried out, it’s found that the system is scoring fewer direct hits than they might have expected from their own estimates of the level of fraud.

Lowering the client’s expectations will help things along, though not by that much, but what you also have in your favour is the fact that the system doesn’t operate in isolation; it works with a human operator who can listen in on the calls and fine tune the results using their own subjective impressions of the claimant.

And that’s exactly what happens in practice; in fact the full voice risk analysis package marketed to the DWP and to insurance companies includes training the operators in what DigiLog UK call ‘Narrative Integrity Analysis Techniques’, which the DWP also refer to on their website as ‘behavioural analysis and conversation management’.

If we ignore the buzzwords and hit the research literature it doesn’t take long to figure out that DigiLog’s Narrative Integrity Analysis Techniques are not much more than a collection of very well understood bits of applied cognitive psychology, the core research for which is between 15 and 25 years old. This offers nothing particularly new or innovative. It’s really just a bunch of interview practices that were developed by psychologists back in the late 1980s and early 90s to help the law enforcement community improve their general approach to conducting witness interviews, but it does have the virtue of having some solid scientific foundations behind it, for all that it has relatively limited commercial value due to the fact that all the core research information was published in scientific journals and is openly available to anyone who might like to put together their own competing training package.

Vocal lie detection by humans, rather than using a machine or computer software, has been fairly well researched since interest in the subject first picked up during the 1940s, and while the research evidence does suggest that putting a well-trained human operator in the loop will help somewhat, the actual gains in accuracy are not particularly spectacular. What the available evidence shows, overall, is that humans can successfully manage to navigate the rather haphazard business of identifying if someone is telling the truth with a moderate degree of success – 10-20% above chance in some of the most successful research studies – but when it comes to sniffing out lies, human performance is no better than 50:50 either.

Nevertheless, using a human operator to second guess the technology is going to help improve the system’s apparent success rate because, even if operating at no better than chance, the human operator will inevitably pick out some of the likely fraudulent claimants that the technology misses, increasing the number of true positives in the results, and overrule the system on a proportion of the false positives it generates. Together this should bring the number of direct hits up rather more towards what the client is expecting and improve the system’s success/failure ratio by stripping out some of the more obvious examples of the system fingering an entirely legitimate claimant as a potential fraudster.

With a human operator in the loop it is possible to get a bit closer to the target level we need to hit to give the clients a plausible looking spread of results, but even so it is unlikely that we’ll be able to improve the apparent success rate above 50% accuracy and we’re much more likely to come in a fair bit under that level, maybe 10-20% under on average.

So, while we can lower expectations a bit and improve the technology’s apparent accuracy by relying on a human operator to second guess its output, we’re still looking at a best case scenario of getting the call on an individual claimant wrong at least as often as getting it right – the ‘coin flip’ scenario that Hollien and Harnsberger identified…

…and that will leave us a considerable way short of generating a plausible set of results, in fact so far short that it’s likely to raise significant questions about the validity of the system and whether it actually works in the manner that its developer and distributors suggest.

We still need to find a way to increase the number of fraudulent claimants that the system appears to correctly identify and the only option we now have left is that of increasing the size of the pool of claimants that the system flags as being ‘high risk’ to at least double the estimated/expected level of fraudulent claimants in the claims system we’re dealing with (i.e. Insurance, Housing Benefit, etc.).

What we’ve arrived at is a testable hypothesis, one that makes a quite specific prediction about the numbers of claimants that the system will flag up as high risk, i.e. that it will be at least double the estimated rate of fraud for the type of claims being assessed.

(In a real world situation we should, in fact, expect the rate at which high risk claimants are identified to be slightly more than double, as the individual who calibrates the system is very likely to tack an extra couple of percent on top, if they can get away with it, in order to maximise the number of actual fraudulent claims that the system hits on by chance.)

So, for Housing Benefit, the government’s estimated fraud rate is around 5-5.5%, giving a baseline prediction of 10-11% and likely actual setting of 12-13% when we allow for the system being calibrated with a little extra wiggle room.

To date, we only have one instance where sufficient data has been released into the public domain to allow us to identify the actual rate at which claimants were being flagged up as high risk, the Harrow Council pilot about which The Times reported:

Early results from Harrow Council in northwest London show that, of 998 people assessed using the technology, 119 – 12 per cent – were identified as “high risk”.
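The prediction and the Harrow figures can be set side by side with a few lines of Python. This is a sketch using only numbers already quoted in this article (a 5-5.5% estimated fraud rate, doubled, plus a couple of percentage points of ‘wiggle room’, against 119 flagged out of 998 assessed); the function name and the choice of exactly 2 points of wiggle room are my own assumptions.

```python
def predicted_flag_range(fraud_low_pct, fraud_high_pct, wiggle_pct=0.0):
    """Double the estimated fraud rate, then add optional 'wiggle room' (all in percent)."""
    return 2 * fraud_low_pct + wiggle_pct, 2 * fraud_high_pct + wiggle_pct


# Housing Benefit: estimated fraud rate of 5-5.5%.
baseline = predicted_flag_range(5.0, 5.5)          # (10.0, 11.0)
with_wiggle = predicted_flag_range(5.0, 5.5, 2.0)  # (12.0, 13.0)

# Harrow Council pilot, as reported by The Times.
assessed, flagged = 998, 119
observed_pct = 100 * flagged / assessed

print(f"baseline prediction: {baseline[0]:.0f}-{baseline[1]:.0f}%")
print(f"with wiggle room:    {with_wiggle[0]:.0f}-{with_wiggle[1]:.0f}%")
print(f"Harrow observed:     {observed_pct:.1f}%")
```

The observed rate comes out a shade under 12%, above the bare doubling of the fraud estimate and at the bottom edge of the range with wiggle room added.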

As for the other pilot sites, we’ll have to wait for the FOIA request I put in earlier this week to test whether this prediction holds across the other sites that took part in the first year of the trial.

However, there is one other piece of information in the public domain which does allow us to put this prediction to the test, a press release issued by Highway Insurance in 2005 in which the company disclosed the rate for ‘low risk’ claimants while talking up the cost savings they’d made, from which we can easily calculate the high risk rate.

Before revealing Highway’s figure, let’s run the prediction.

Overall, the estimated level of fraud in insurance claims is around 10-12% – the numbers here can be a bit variable depending on the type of fraud you’re dealing with (some types of insurance lend themselves more readily to fraud than others) but having looked at a fair range of sources, 10-12% is the figure that’s most commonly quoted for the type of claim we’re dealing with here, which is motor theft.

So, using our very simple formula, we get a baseline prediction for high risk claimants of 20-24% and, because we’re starting from a higher base estimate than when dealing with Housing Benefit, we also have a bit more scope for getting away with tacking a bit of wiggle room on top – 3-4% will not look unreasonable and we may even be able to stretch to a full 5-6% and go for a nice round 30% as a top of range prediction.

CONCLUSIONS

The validity of the analysis in this article hinges on one thing: the validity of Eriksson and Lacerda’s analysis of the methods disclosed in Liberman’s 1997 patent, on which Nemesysco’s LVA technology is based.

As such, if Nemesysco wish to dispute the validity of this article, and particularly the section on the ‘calibration’ of the system to generate plausible looking results, they need only produce the scientific evidence which demonstrates that Eriksson and Lacerda have got it wrong and that the current LVA system that the DWP have under trial does not use the same methods set out in the patent.

That, alone, creates an interesting situation because unless Nemesysco can show that there is a serious, and even fatal, flaw in Eriksson and Lacerda’s work then the evidence that the company would have to put up to invalidate their claim that the method used by this system is both arbitrary and scientifically meaningless would also invalidate the patent, itself, by demonstrating that the current system is not actually based on the methods described in the patent.

In the field of automated vocal ‘lie detection’ this would not be without precedent. Haddad et al (2002) report that, late on in their trial, one of the Voice Stress Analysers they were testing, the Diogenes Lantern, was found to be measuring energy changes in the vocal spectrum envelope in the 20-40Hz frequency range rather than the Lippold Tremor, which is found at 8-12Hz in the large muscles of the human body (and which has never been found in the muscles of the larynx as per the theory put forward to explain how voice stress analysers ‘work’).

That said, Nemesysco have yet to provide any kind of scientific evidence of the kind that would invalidate the scientific content of Eriksson and Lacerda’s paper and have relied, instead, on threats of litigation and allegations of libel to try to suppress its contents and keep their analysis out of the public domain.

To date, Eriksson and Lacerda have focussed primarily on the signal processing carried out by this technology and its internal ‘analysis’ methods, which they conclude are scientifically meaningless. What I’ve attempted to do here is supplement their work by considering the question of whether it is possible for a piece of software which generates meaningless information to generate results which would look plausible enough to convince an insurance company, a local authority or, indeed, the DWP, that it actually does detect fraudulent claimants. Yes, in my view, it is possible, and the analysis in this article explains both how and why, and outlines the method by which it’s possible to pull off that particular feat.

Crucially, the method I’ve outlined in the section on calibration relies entirely on one thing: that the individual calibrating the system for use, and setting up the output threshold levels used when the system is deployed, must possess a clear understanding of the system’s actual output characteristics in order to get the settings right and produce results which appear to match the client’s expectations and estimates for the level of fraud within their claims system. They must know not only how the system actually ‘works’ and produces its results – and therefore that it doesn’t operate in the manner claimed by Nemesysco – but also, in general terms, why it generates the outputs in the manner that it does.

If this technology is the ‘hoax’ that Eriksson and Lacerda claim that it is, then the individual(s) who calibrate the system for use, or who, at least, supply the calibration settings used to configure the system, must also be fully aware that the system is a ‘hoax’.

Of course, there is the possibility that I may be wrong here, and that there is some other – as yet unknown – method of calibrating the system which would account for the observable characteristics of the results that it generates. With a system of this kind one can never entirely rule out the possibility that its calibration settings are actually generated using simulations and a process of trial and error. However, this too would indicate that the individual(s) generating the calibration settings have some awareness that the system does not perform exactly as Nemesysco claims, although it doesn’t entirely rule out the possibility of them operating under the genuine, but mistaken, belief that it can actually detect stress and/or deception, and merely not understanding how the system does what it supposedly does. As regards this last possibility, there are, in the exchange of emails regarding the test protocols in Hollien and Harnsberger (2006), fairly clear indications that the running of simulations is a part of the actual calibration process:

As we move forward on Phase 1 of the study, we will attempt to identify what LVA components can be used to best measure the subjects’ states under the conditions you created in your laboratory. Based upon the little information we now have, given our inability to preview any data segments, the following is what we can tell you at this time.

What Hollien and Harnsberger were interested in was the output threshold level for only one of the system’s parameters, the ‘JQ’ parameter, which they identified from the software’s manual as the output which, ostensibly, indicated the level of stress/deception. What they got in response to their inquiry was an extensive list of ‘suggested’ output characteristics that the company they were dealing with, Nemesysco’s former US distributor V LLC, thought they might observe when conducting the test phase of their study, the majority of which offered the researchers no useful information whatsoever, given the very specific nature of their study, e.g.

7. We suspect that some persons (about 40%) will show abnormal scores in the imagination readings.

Why? Particularly when a table of parameters, and their meanings, taken from one of the studies promoted by Nemesysco on its website provides this description of the ‘imagination’ parameter.

Represents levels of memory and imagination. This increases in dementia or schizophrenia.

So what is meant by ‘abnormal scores’ – is this suggesting that 40% of the subjects in the study are likely to be in the early stages of Alzheimer’s or undiagnosed schizophrenics?

In response to a single enquiry about just one parameter, the company that Hollien and Harnsberger were dealing with offered a list of 13 ‘suggestions’ for the possible behaviour of different parameters, or for other aspects of the study, the vast majority of which had nothing whatsoever to do with the output that the researchers were looking at, all prefaced by this statement:

Based upon the limited information we have, our training faculty has identified the following possible outcomes for the Phase 1 protocol:

And how did they come up with these possible outcomes?

Did they derive them from their detailed theoretical and technical knowledge of the system as bona fide predictions?

Or is it much more plausible to suggest that they tried, as best they could, to run a set of simulations of Hollien and Harnsberger’s then-proposed study in order to generate data which they could then put forward to the researchers with their own detailed ‘suggestions’ as to how, exactly, the researchers should interpret their results?

I cannot, of course, say with 100% certainty that the calibration method set out above is the actual method used to set up this system for use in the DWP’s trials, but what I can say is that:

It fits in with and is, indeed, derived from the analysis given in Eriksson and Lacerda’s paper.

It does account for the high rates of false positives evident in the independent studies by Hollien and Harnsberger and Damphousse et al. AND for the data we have, so far, from the DWP trials.

It makes a clear prediction about the numbers of claimants that are likely to have been flagged as high risk at sites other than Harrow Council, one that can (and will) be tested when the detailed information I’ve requested under FOIA is delivered.

AND

Its validity can be tested quite easily, as can Eriksson and Lacerda’s claims about the system.

TESTING THE CALIBRATION HYPOTHESIS

How can we put this to the test? Quite easily, as it happens.

Well, obviously, the first prerequisite is an actual copy of Nemesysco’s LVA software – one of their main complaints about Eriksson and Lacerda’s paper was that they used a simulation of the system derived from the information in the patent and not the actual software itself, so we need to overcome that objection by testing the real thing.

Next, we need a collection of speech samples. Anything upwards of 500 will do, but the more we have to work with the better, so let’s say we take 1,000 samples to give us a good range of different voices, i.e. male, female, different age groups, different natural pitches, etc. As for the content of these samples, that’s completely irrelevant. In fact, to standardise the test, what we can use is samples of people reading randomly selected passages from a well known novel, which could be anything you like – Harry Potter, Pride and Prejudice or the latest Jeffrey Archer, it doesn’t actually matter.

After that, the test is easy – what we do is randomly select 200 of the speech samples to serve as our calibration data, run them through the software and analyse the results to find the effective range of scores and the average score for the JQ ‘stress’ parameter. That, as I explained earlier, gives us the information we need to calibrate the system to give – if I’m correct – a set of samples that the system will identify as ‘high risk’, the proportion of which, relative to the total number of test samples, will be determined by the calibration setting we assign to the output threshold level for the JQ parameter.

So, we choose a target percentage, calculate the threshold setting needed to deliver that figure, and run the data through the system, recording the results.

After that, we change the threshold level to target a different percentage of high risk results and run the test again using the same set of test data – and if I’m correct, the number of ‘high risk’ results will pretty closely match the target percentages, give or take a few random fluctuations arising from the fact that we’re still using a relatively small set of samples.
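
For anyone who wants to see the logic of that test before getting hold of the real software, here is a minimal sketch in Python. It is not the LVA software and makes no claim about how Nemesysco actually calibrate anything. The function names are hypothetical, and the ‘JQ’ scores are simply random numbers drawn from an assumed bell-shaped distribution (mean 50, spread 10, both figures invented purely for illustration) to stand in for meaningless output. It also assumes the 200 calibration samples are held back and the remaining 800 are used as the test data.

```python
import random


def simulate_scores(n, seed):
    """Stand-in for running n speech samples through the software.

    ASSUMPTION: the 'JQ' output is modelled as meaningless random noise
    (normal, mean 50, spread 10). The real output distribution is unknown.
    """
    rng = random.Random(seed)
    return [rng.gauss(50.0, 10.0) for _ in range(n)]


def threshold_for_target(calibration_scores, target_fraction):
    """Return the score at or above which roughly target_fraction
    of the calibration samples fall."""
    ordered = sorted(calibration_scores)
    cut = int(round(len(ordered) * (1.0 - target_fraction)))
    cut = min(max(cut, 0), len(ordered) - 1)
    return ordered[cut]


def flagged_fraction(scores, threshold):
    """Proportion of samples the system would flag as 'high risk'."""
    return sum(s >= threshold for s in scores) / len(scores)


def run_trial(target_fraction, seed=0):
    """Calibrate on 200 randomly chosen samples, then test on the other 800."""
    scores = simulate_scores(1000, seed)
    random.Random(seed + 1).shuffle(scores)
    calibration, test = scores[:200], scores[200:]
    threshold = threshold_for_target(calibration, target_fraction)
    return threshold, flagged_fraction(test, threshold)


if __name__ == "__main__":
    # Same data each time (same seed); only the threshold setting changes.
    for target in (0.10, 0.20, 0.30):
        threshold, observed = run_trial(target)
        print(f"target {target:.0%}: threshold {threshold:.1f}, "
              f"flagged {observed:.1%}")
```

If the flagged proportion tracks whatever target is dialled in, even though the scores carry no information at all, that is the behaviour the calibration hypothesis predicts. The real test would replace `simulate_scores` with the scores produced by the actual software on the recorded speech samples.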

That is how you actually test the validity of this technology and Eriksson and Lacerda’s claim that it produces essentially meaningless information – you feed the system meaningless information and attempt to replicate the ‘results’ generated in the DWP trial, or by Hollien and Harnsberger, simply by manipulating the system’s calibration settings in line with the method I set out in this article.

  • Carl Eve
  • john gibson

    top notch stuff.

  • First off: I found you through your articles on LibCon and wanted to say thank you for both the work and the time you spend explaining it to people in comment debate.

    Secondly; I sincerely hope you can lay hands on your test materials; I’d love to see this demonstrated empirically as well as hypothetically.

    And thirdly, to Carl Eve: they only got seven responses to their consultation. Doesn’t that tell one a lot. On the other hand, the reference to polygraph testing presumably means they’re using the system which does have some evidence of being based on working science?

  • This is excellent stuff. However do you find the time ?