One thing that those of us who’ve been working on the DWP Trial investigation discovered, only this week, is that we’re not alone in asking hard questions about the DWP’s trial. If you’ve read this week’s technology supplement then you’ll have already noticed that the Guardian has run an article on the technology behind the DWP’s ‘lie detector’ trial, the same technology that I’ve been investigating for the last few weeks.
You may also have picked up, from Sunny’s note at Lib Con, that we’ve had to jump into telling this story a little sooner than we’d hoped to, and the Guardian’s article is the reason why. That said, I’d recommend you read it. Given the space that its author, Charles Allen, had to work with, it provides a pretty good overview of some of the main issues, even though it doesn’t really get into the nub of the story or cover anything like the amount of ground we’re covering here.
It also provides some useful information, some of which helps explain the problems we’ve faced in putting this story together and getting it ready for publication.
For example, the Guardian references a paper by two Swedish linguistics professors, Anders Eriksson and Francisco Lacerda, entitled “Charlatanry in forensic speech science: a problem to be taken seriously”, which was published, in 2007, in the International Journal of Speech, Language and the Law. This is the same paper that triggered our own investigation. Eriksson and Lacerda’s analysis of the patent on which this technology is based, and their criticisms of its claims and methods, are of critical importance in understanding why the system is, basically, a hoax or, as they put it to the Guardian, why ‘the scientific provability of the Nemesysco code is akin to astrology’.
Unfortunately, the paper was withdrawn from on-line publication by the journal’s publisher, Equinox, after Nemesysco threatened to sue them for libel over the content of the paper – which is one of the complications we’ve faced in getting things ready for publication, as the arguments in this paper are particularly important to understanding how and why the system doesn’t do what its developers claim.
According to the Guardian, the reason that paper was ‘pulled’ was that it contains ‘personal attacks’ on Nemesysco’s founder, Amir Liberman, and that’s the story that Liberman has been telling anyone who’ll listen ever since his threat to sue the journal’s publisher became public knowledge.
That’s not, however, what the first letter that Liberman’s lawyers sent to Equinox actually says. In fact, the first thing that this letter claims is defamatory is actually the claim that…
Our Clients’ [Nemesysco/Liberman] technology does not work and cannot work and is therefore arbitrary and consequently worthless, contrary to our Clients’ claims with regard to it. This allegation is presented in various ways and pervades the Article.
This claim has nothing to do with the personal character of Amir Liberman or the business practices of his company, Nemesysco. It is based solely on Eriksson and Lacerda’s scientific evaluation of the patent on which the system is based and on how that patent claims the system processes speech signals and generates its results. It’s the method by which the patent indicates that the technology does this that Eriksson and Lacerda claim has no scientific validity and, as a result, makes the entire system arbitrary and worthless. Because the method is scientifically invalid, according to the two professors, not only does this technology fail to work in the manner claimed by Nemesysco, but nothing they could do to it by way of updating the software will make the slightest bit of difference.
It doesn’t work, according to Eriksson and Lacerda, because everything it does in processing and analysing speech signals is completely and utterly meaningless.
If Nemesysco are unhappy with the scientific content of Eriksson and Lacerda’s paper then the only legitimate response is for the company to produce the scientific evidence which refutes their criticisms, which is precisely what the company hasn’t done. In fact, in siccing their lawyers on Equinox, the company failed to provide a single specific example of any content in Eriksson and Lacerda’s paper which they considered to be defamatory – what the lawyers actually wrote was:
While it might be usual to provide examples from the Article demonstrating the said defamations, we refrain from doing so on the basis that no person acting in good faith could seriously argue that the article is not openly defamatory in the above ways – starting from its title.
To which one could only ever adequately respond by referring the company to the response given to the plaintiff in Arkell vs Pressdram.
Looking for Mr Koskas
The Guardian’s coverage, so far, is a long way from covering even a tenth of the full story, but it does provide us with some fresh ‘meat’ to work with and an opportunity to tell you a few things that the Guardian hasn’t mentioned and to introduce a fairly important character in the overall story, DigiLog UK’s Business Development Manager, Lior Koskas.
But Lior Koskas, the business development manager of DigiLog, says the VRA system cannot be separated from its user, because the system only picks up stress. He does not claim it spots “lies” on its own. “Only when the technology and an operator trained by us spots it, then can we say there’s a risk someone is lying.” Has there been a scientific “blind test” of the system? “No,” Koskas says, “you can’t say you’re using something if you aren’t.”
He adds that the technology “hasn’t been scientifically validated”, but he rejects Lacerda and Eriksson’s criticisms. “With any technology you will have opinions,” he says. “But how many of these scientists have tested it properly? They talk about the technology in isolation, as though you don’t need anything from the operator except turning it on or off. But the majority of the training course is about linguistic training analysis, learning to listen. Anybody using this [technology] in the UK doesn’t use it in isolation.”
Koskas and Liberman have an interesting common history which the Guardian article doesn’t disclose.
Before going out on his own and founding Nemesysco, Liberman worked for/with a small Israeli software company called Makh Shevet Ltd, which had been founded in 1991 on the Glil Yam kibbutz. From 1991 to 1997, Makh Shevet’s main (and seemingly only) line of business was that of producing educational software and computer games – the company actually signed a distribution deal with Sony in 1994.
Liberman appeared on the scene in 1997 and in November of that year, about a month before Liberman filed the main patent for his system, Makh Shevet announced that it was changing direction and moving into the lie detection software business with the upcoming release of a program called ‘Truster’. This was the first piece of software to be released which was based on Liberman’s patent.
About a year earlier (1996), Koskas went to work for Makh Shevet as a project manager and then, in 1997, he moved (on paper) from Makh Shevet to a sister company, Trustech Ltd, which was basically set up by the people behind Makh Shevet in order to market a ‘professional’ version of the Truster system to commercial businesses. As for why this new company was set up, we can only speculate, but it seems likely that the people behind both companies felt that Makh Shevet’s history in publishing educational software and computer games was hardly the kind of image that would prove helpful when trying to sell Liberman’s system to the commercial/business sector.
So, in 1997, Koskas joined Trustech and became its training manager, a role which included (according to his personal CV) working closely with the software developers and (allegedly) a psychologist who were engaged in developing the software, and going out to business clients and training them to use the system. So closely were these two companies related that Trustech was actually handed the trademark rights to the Truster name, a move which caused Liberman some minor difficulties when he went on to found Nemesysco in 2000. Although he owned the patent on the system, he didn’t own the name under which it had been marketed for the previous three years and therefore had to rename his own version of the software, TIPI.
Liberman split with Makh Shevet/Trustech in 2000 and founded his own company, Nemesysco, and in the same year (July/August to be precise) Koskas also left Trustech, and Israel, and moved to the UK to join DigiLog UK, which was then a new and largely unknown entrant into the fraud risk management sector, having only been incorporated in 1999. Koskas’s first major piece of work for DigiLog appears to have been that of managing the installation of Highway Insurance’s ‘Voice Risk Analysis’ system, the first to be piloted (and then adopted) by a company in the UK’s insurance sector. This introduces another somewhat interesting connection in the story because, at the time, Highway’s in-house fraud risk management team was headed up by Kerry Furber, a former fraud squad officer who had moved from the police to the insurance sector some years previously. Highway’s system was launched in July 2002 and, within nine months, Furber jumped ship and moved to DigiLog to become its Chief Executive.
The ‘Impossible’ Scientific Test
As for Koskas’s comments to the Guardian, which amount to a claim that the system cannot be tested under rigorous scientific conditions…
…well let’s just say that I’ve heard it all before – in fact I can actually show very similar arguments being made to Harry Hollien and Jerry Harnsberger, two researchers from the University of Florida, in 2006 during an email discussion with C David Watson, who was, at the time (2005), the Chief Operating Officer of a company called V LLC, the then US distributor of Nemesysco’s system:
Thank you for providing us the opportunity to comment on the study protocols for Phase 1. We welcome a rigorous evaluation of Layered Voice Analysis (“LVA”) technology, and are committed to working with you and your team to develop a full and complete understanding of the capabilities of LVA.
As we have discussed with you, we have concerns that the Phase 1 protocol may not provide the necessary sampling to measure deception. This is based on our continued skepticism about the methods and protocols used to collect the sample statements.
LVA is designed to detect deception based upon identifying an individual’s intent to deceive. Based upon our understanding of the protocol, and after discussion with the developer and other scientists familiar with LVA, we question whether the voice samples to be used in this Phase reflect a true intent to deceive as measured by LVA.
As you know, the results of a previous study have been subject to extensive criticism because of the use of artificial attempts to create an equivalent to real-life deception. As we have stressed from the first meeting with DOD-CIFA, we believe that the analysis of voice samples of individuals in real life situations will provide the most accurate test of LVA’s ability to detect deception and other emotional/psychological states of the speaker.
The context to this email is that Hollien was about to conduct a rigorous and carefully designed scientific evaluation of Nemesysco’s system using strict test protocols. The full details of the methodology used are in the paper, which I’ll provide at the end of this article, but the short version is that Hollien and Harnsberger intended to use the system to analyse a set of pre-recorded speech samples produced by volunteers under the following conditions:
• Baseline calibration: The subject read a standardized phonetically-based (unstressed) truthful passage, namely the Rainbow Passage [Nemesysco’s own procedure for generating a baseline reference sample]
• Procedure 1: The subject read a neutral (unstressed) passage which was truthful.
• Procedure 2: A passage was used wherein the speaker produced a lie while not experiencing significant stress.
• Procedure 3: The subject uttered untruths under jeopardy (see below).
• Procedure 4: Truthful speech was uttered at a relatively high stress level (i.e., stress induced by mild electric shock).
• Procedure 5: Untruths were uttered both under high jeopardy (as in Procedure 3) along with fear induced by the administration of electric shock (see below). It was by this procedure wherein jeopardy was created by two stimuli applied simultaneously.
• Procedure 6: Truthful utterances were produced but where the subject simulated speaking under stress while not actually stressed.
The primary purpose of the study was to test the system’s ability to detect stress, both of the kind associated with lying and stress caused by administering, or threatening to administer, an electric shock to the volunteer. To ensure that there was no chance of operator bias influencing the result, the researchers decided, based on information contained in the system’s own manuals, that a score of 30 or more on the system’s stress parameter would be treated as an indication of stress and a score below that would be treated as unstressed.
The samples were, of course, fed to the system randomly such that the system operators had no way of knowing which of the six different types of speech sample the system was being asked to analyse at any given time.
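To make the decision rule and the blinding concrete, here is a minimal Python sketch. It is only an illustration: the sample IDs, the scores and the function names are invented, and the only detail taken from the study is the fixed cut-off of 30 on the stress parameter.

```python
import random

STRESS_THRESHOLD = 30  # cut-off used in the study: 30 or more counts as "stressed"


def classify(stress_score):
    """Apply the fixed rule, leaving no room for operator judgement."""
    return "stressed" if stress_score >= STRESS_THRESHOLD else "unstressed"


def blind_order(sample_ids, seed=None):
    """Return the sample IDs in random order, so that the presentation
    sequence gives no clue about which procedure produced each sample."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical sample IDs and scores, invented purely for illustration.
scores = {"s01": 12, "s02": 30, "s03": 47, "s04": 29}
for sample_id in blind_order(scores, seed=1):
    print(sample_id, classify(scores[sample_id]))
```

The point of a fixed threshold is that the classification depends on the system’s number alone, which is exactly what a test of the system, rather than of its operator, needs.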
In other words, this is precisely the kind of randomised, blind, scientific trial that Koskas tries to suggest it is impossible to perform on the system, although his excuse differs from that given by Watson. Koskas claims that the system is entirely operator dependent – which is not that far short of an open admission that the output of the system actually means only what the operator chooses it to mean – while Watson tries to suggest that the test protocol won’t work because the volunteers who recorded the samples would have lacked the necessary intent to deceive which the system needs in order to work properly.
Here’s what Hollien had to say by way of a response to Watson’s concerns:
Thank you for providing an initial draft of protocols you feel useful for the testing of Layered Voice Analysis. In your comments preceding these protocols, you raised an issue about the methods we used to elicit samples for testing “voice stress analysis” software, including LVA. Specifically, you suggested that we did not verify that the samples were produced with a “true intent to deceive.” We respectfully respond that the “intent” of speakers is information unavailable to anyone attempting to evaluate LVA — or, for that matter, by anyone for any purpose whatsoever. “Intent” refers only to the speaker’s motivations to produce the speech sample and the thoughts/emotions/cognitive state of speaker during an utterance. Currently, no technology exists which is capable of “reading people’s minds” during any motor speech — or any other — activity. For example, even brain imaging technologies cannot be used to classify blood flow patterns into such specific “intents” as LVA purports to detect. And even to the limited extent that brain imaging technology can be used to observe cognitive states, it can only do so under extremely constrained laboratory conditions — and not at all in the “real world” situations you cite as being the “most accurate test of LVA’s ability to detect deception.”
And this leads Hollien to what, if you’re used to dealing with pseudoscience, is a very familiar situation:
Given the conflicting constraints you have suggested for a “fair” test of LVA (i.e., knowing the speaker’s intent while that individual produces lies in a real-world situation), it appears impossible to develop any procedure at all that could “test” LVA. In fact, by your own admissions, it would appear impossible to determine the validity of your system on any level.
Despite this disagreement over the test protocols, the research did go ahead and the experiment was carried out by two separate research teams: one from the University of Florida, which stuck religiously to Hollien and Harnsberger’s test protocols, and another, provided by V LLC, which consisted of their own trained operators working to their own protocols.
So the study tested specifically for stress AND, because it was carried out by two different research teams, one working to strict test protocols while the other did things the Nemesysco way, it also tests Koskas’s claim that it’s actually the operator, and the training they’ve received from Nemesysco, that makes all the difference.
In short, if Koskas’s argument is correct then the team of trained operators should prove to be much more successful in correctly identifying stress and deception in the test speech samples than the University team, which is relying on the system’s output reaching a fixed and predetermined score in order to record a stressed response.
So, how did the two teams do?
I’ll just give you the aggregate results for all conditions here – the full results tables are in the paper.
The University team, working to the strict test protocols under which the system, alone, decided whether a sample was stressed or unstressed, correctly identified 43.5% of the samples (giving either a true positive or true negative result) and, obviously, incorrectly identified 56.5% of the samples – and 61% of the system’s errors were false positives.
As for the team of trained operators, with all their experience, knowledge and training, they successfully identified 48.5% of the samples. That’s slightly better than the University team, but not significantly so and, in fact, a little lower than chance, which, of course, means that they got 51.5% of their answers wrong. In four of the six sample categories, this team’s false positive rate was higher than its true positive rate, in one case by 16%.
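For anyone who wants to check the arithmetic, the short Python sketch below works through the aggregate figures quoted above. The function and variable names are my own; the only inputs are the two accuracy figures and the 61% false-positive share of errors reported for the University team (no equivalent aggregate split is quoted here for the operators’ team).

```python
def breakdown(accuracy_pct, false_positive_share_of_errors_pct=None):
    """Turn a reported accuracy into an error rate and, where the share of
    errors that were false positives is known, split the errors by type.
    All values are percentages of the total number of samples."""
    error_pct = 100.0 - accuracy_pct
    result = {"correct": accuracy_pct, "wrong": error_pct}
    if false_positive_share_of_errors_pct is not None:
        fp = error_pct * false_positive_share_of_errors_pct / 100.0
        result["false_positive"] = fp
        result["false_negative"] = error_pct - fp
    return result


university = breakdown(43.5, 61.0)  # figures quoted above
operators = breakdown(48.5)         # no aggregate error split quoted

for name, figures in (("University", university), ("Operators", operators)):
    print(name, {key: round(value, 1) for key, value in figures.items()})
```

On those figures, roughly a third of all the samples the University team ran through the system came back as false positives.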
Naturally enough, these results led Hollien and Harnsberger to the obvious, and only possible, conclusion.
The performance of LVA on the VSA database by both the IASCP and V teams was similar to that observed with CVSA. That is, neither device showed significant sensitivity to the presence of stress or deception in the speech samples tested. The true positive and false positive rates were parallel to a great extent.
[This study also tested a different ‘voice stress analysis’ system, ‘CVSA’, which ‘works’ on a very different – and no more valid – principle to Nemesysco’s system]
As regards Nemesysco’s system and the objections that C David Watson raised to the test protocols, Hollien and Harnsberger have this to say:
For LVA to discriminate among a large set of cognitive states, it must be highly sensitive to whatever acoustic attributes of the speech signal cue those states. Presumed sensitivity at such levels suggests that LVA should be able to perform well with our laboratory samples as they contain both deception and documented levels of significant stress. However, LVA’s false positive rates were consistently higher than their corresponding true positive rates. When both of these rates were converted to a single d’, no actual sensitivity to stress and deception could be observed.
However, if it still is argued that the present laboratory protocols failed to elicit stress and deception that is sufficiently similar to stress and deception in a natural settings, the inclusion of unstressed speech samples and truthful speech samples in the database addresses this concern. That is, if measurable stress/deception are not present in these samples, LVA should not have detected stress/deception in any portion of them. In fact, roughly half of the unstressed and truthful samples were classified by LVA as stress and deceptive, respectively. A device that is, in fact, sensitive to these states should not falsely detect them if we actually failed to elicit these qualities when using our protocol.
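For readers unfamiliar with d’, it is a standard measure of sensitivity from signal detection theory which combines the true positive and false positive rates into a single number. The sketch below uses the usual textbook definition, d’ = z(true positive rate) − z(false positive rate); the paper itself should be consulted for how the authors actually computed it, and the rates used here are hypothetical, not figures from the study.

```python
from statistics import NormalDist


def d_prime(true_positive_rate, false_positive_rate):
    """Textbook sensitivity index: z(TPR) - z(FPR), where z is the inverse
    of the standard normal CDF. Both rates must lie strictly between 0 and 1."""
    z = NormalDist().inv_cdf
    return z(true_positive_rate) - z(false_positive_rate)


# Hypothetical rates, NOT figures from the study.
print(round(d_prime(0.50, 0.50), 2))  # equal rates: no sensitivity at all
print(round(d_prime(0.90, 0.10), 2))  # a detector that really discriminates
print(d_prime(0.45, 0.55) < 0)        # more false alarms than hits
```

A device whose false positive rate matches or exceeds its true positive rate produces a d’ of zero or below, which is what “no actual sensitivity” means in practice.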
Koskas would have the Guardian’s readers believe that Nemesysco’s system can’t be subjected to rigorous scientific testing. Hollien and Harnsberger have proved exactly the opposite and also produced results which show that the system is no more effective in detecting EITHER stress or deception than flipping a coin.
Of course, it doesn’t take a genius to figure out Koskas’s most likely response to this study – if it were put to him, he’d run with the same obvious line C David Watson put forward and suggest that the study is invalid because lab tests are not like real life.
So, in the next gripping instalment, we’ll be looking at how the system performed in real life, in a field test conducted on recently arrested prisoners at a US jail.
Hollien, H. and Harnsberger, J. D. (2006). Voice Stress Analyzer Instrumentation Evaluation.