Earlier this week, over at Liberal Conspiracy, I started to publish a comprehensive exposé of the DWP’s trial of a ‘voice risk analysis’ system on benefits claimants.
To follow the story so far, you’ll need to read:
Due to legal issues relating to a key piece of evidence, a 2007 journal article by two Swedish academics, Anders Eriksson and Francisco Lacerda, I’ve decided to publish the briefings that deal primarily with the scientific issues and evidence relating to the DWP trial here at the Ministry, where I can take a few more risks and generally hang it all out there without fear of causing major problems for everyone at Lib Con.
The key aim for this article, and for the articles that follow, is simply that of exposing the system that the DWP is currently testing on benefits claimants for what it actually is: a crude foray into the use of pseudoscience that does nothing more than trick some claimants out of their benefits and expose others to unwarranted investigations which, in the worst cases, will result in them unnecessarily experiencing serious hardship.
So, let’s get on with the job by showing how this technology allegedly works and what it actually does, which is anything but what the DWP and its developers claim.
Voice Risk Analysis – How it is claimed the technology works.
We need to start somewhere and the best place to begin is with what the DWP is telling visitors to its website about this technology:
Voice risk analysis is a real time system that combines the measurement of physiological levels of voice stress with behavioural analysis and conversation management techniques to enable the detection of truthful statements. It is a proven technique used in the private sector, for example, in better assessing the risk associated with insurance claims.
Let’s decode that for you.
The system comes in two parts.
One is a software program which claims to use a process called ‘Layered Voice Analysis’ to identify the presence of stress in the voice of benefits claimants, when those claimants are contacting a local authority about their claim by telephone. The technology is owned by an Israeli company called Nemesysco and is based on a patent filed in 1997 by the company’s owner, an Israeli national named Amir Liberman. It is marketed and distributed in the UK and the Irish Republic by a British company, Digilog UK Ltd, under an exclusive licence with Nemesysco.
According to Nemesysco:
The technology detects minute, involuntary changes in the voice reflective of various types of brain activity. By utilizing a wide range spectrum analysis to detect minute changes in the speech waveform, LVA detects anomalies in brain activity and classifies them in terms of stress, excitement, deception, and varying emotional states, accordingly. This way, LVA detects what we call “brain activity traces,” using the voice as a medium. The information that is gathered is then processed and analyzed to reveal the speaker’s current state of mind.
So, the system claims to be able to accurately identify stress and a variety of other mental and emotional states purely by analysing the acoustic characteristics of an individual’s speech and it allegedly does this regardless of the language in which the individual is speaking.
When used to assess benefits and insurance claimants, the basic idea, like that of the much better known polygraph, is that people experience psychological and physiological stress when they lie or provide false information and that this can be detected by identifying involuntary changes in their speech that the individual is, themselves, unaware of.
The second element of the system, which the DWP calls ‘behavioural analysis and conversation management’, is also called, by Digilog UK, who developed this side of the system, ‘Narrative Integrity Analysis Techniques’. This, like so many other supposedly new business processes with a fancy-looking acronym (NIAT), is a classic case of ‘The Emperor’s New Clothes’, albeit one with some scientific validity.
‘Narrative Integrity Analysis Techniques’ is no more than a potentially trademarkable name for a loose collection of well understood witness interview techniques developed in the mid-1980s by psychologists in the United States for use in law enforcement. How this all works, in the context of a witness interview, is that the witness is prompted to go over their account of an incident or crime several times, during which the interviewer prompts them to tell their story in a slightly different way in order to jog their memory and, hopefully, shake loose any minor details that might prove helpful to the investigator and that the witness might not have recalled if they were just asked, ‘tell me what you saw…’
In the context of assessing an insurance or benefits claim, the same basic techniques are used but what the ‘investigator’ is looking for are any minor inconsistencies in the story, as it’s retold, and any ums, ahs and pauses that might indicate the claimant may be having to think carefully about what they say next in order to get their story straight. In practice this can be quite effective both for pulling out more detail and for detecting inconsistencies that may indicate that someone is, perhaps, being less than completely honest, but it’s also nothing new, special or innovative as it is almost entirely based on research that is now 20-25 years old, at least, and such practices have been commonly taught to and used by the police for most of that period.
So, that’s what the DWP is getting for its money, an expensive piece of software and a bunch of well known/understood police interview techniques dressed up with a fancy new ‘buzzword bingo’ name – and what we’re specifically interested in here is the software program and the claim that it can reliably detect stress associated with lying in speech sampled from a telephone call.
The Science of Vocal Lie Detection
If science is not your personal thing then you may want to skip this next bit as there’s, unfortunately, no other way of debunking the claims made for this technology without covering a bit of theoretical background. That said, before you skip forward to the hard evidence we’ve uncovered from independent scientific evaluations of the technology, you need to know one important fact, and this is it:
There is no known scientific theory which accounts for or supports the claim that this system, or any other existing technology, can identify stress or any other kind of mental or emotional state, by analysing the acoustic characteristics of an individual’s speech. None whatsoever.
So, that’s the humanities crowd safely moved on to the next bit, so let’s get down to reviewing the science of detecting emotions, stress and ‘states of mind’ in speech, a topic which has interested quite a few scientists, especially phoneticists and linguists, over the years but which has yet to produce any comprehensive and coherent theories that would explain how this is done, by us humans, let alone how it could be done by a machine or a computer programme.
We’ll start by quickly reviewing what we do know.
We know that both the physiological and psychological processes involved in human speech are extremely complex and that speaking itself, the physical mechanics of producing intelligible noises that other humans can recognise and attach a meaning to, is highly complicated and requires a considerable amount of physical and psychological control and coordination.
And that is about as much as the developers of this technology have been able to put forward as a theoretical basis for the science that their system is allegedly based on. Quite literally, if you strip the information back to basics, the claim that the developers make amounts to:
a) We know that humans can detect emotional characteristics in speech.
b) We know that speech is a complex physiological process and that this process is closely controlled by the brain in order to make it work.
c) Things going on in the brain must have some sort of detectable effect on how we speak, which is what our technology picks up.
This ignores almost all of the actual scientific evidence relating to the human ability to detect emotions in speech, which shows that this is an extremely complex process which takes into account not only what we say but how we say things in terms of changes in pitch and intonation, the rhythm of speech, pauses, hesitations, ums and ahhs, and the nature and characteristics of language itself; its phonological and prosodic characteristics and, when we’re dealing with others, our familiarity and understanding of the language that’s being spoken.
There is no complete or comprehensive scientific theory which explains how humans detect emotions in speech, but there are a range of partial theories and other pieces of solid research that have fairly successfully identified many of the key elements and factors that play an important role in this process, all of which need to ‘work’ together in order to make this possible. You could say, using a cooking analogy, that we know pretty well what the ingredients are and that the recipe must be extremely complicated, but as yet, no one has successfully figured out what the recipe actually is, how the ingredients have to be combined and in what order and proportions, in order to bake the cake.
One thing we know very well, however, is that this process does rely heavily on the phonological and prosodic characteristics of language and this is particularly relevant in regards to the claim that this technology is ‘language independent’ and will work no matter the language being spoken or whether the individual being ‘tested’ is a native speaker of that language or not, a claim that fundamentally contradicts what we know about how humans detect emotions in speech, a process that is heavily language dependent.
It is actually quite easy to illustrate the fact that detecting emotional content in speech relies heavily on an understanding of a specific language and how this is spoken, particularly by a native speaker, as we live in a culture that is rich in artefacts that rely entirely on this premise – and nowhere more so, perhaps, than in comedy.
Just think, for a second, about the role that vocal stereotypes play in comedy, particularly in jokes that play on common national stereotypes. A stereotypical comedy German is, in British eyes, rather aggressive, authoritarian and humourless, qualities that comedians put over by speaking in a ‘comedy’ German accent in a manner which conveys the impression of aggression and humourlessness. In fact British comedy has an entire library of common vocal stereotypes to call on; the Chinese are always depicted as highly excitable, the Japanese as being excitable and aggressive, Americans as brash and overbearing, the French as, at best, laconic and, at worst, more than a little bit sleazy and, of course, the Irish are given a singsong lilt that suggests childlike simplemindedness in keeping with their role as the butt of British ‘idiot’ gags.
These vocal stereotypes did not just appear from nowhere; no one has ever sat down and consciously tried to work out how to put over a particular comedy stereotype from first principles. Where they actually come from is our own inability to accurately and reliably detect the subtleties of emotional expression in the speech of individuals who don’t speak our own language or, in the case of Americans and some other English-speaking countries, individuals who do speak the same language as us but in a somewhat different manner based on a different set of prosodic characteristics. Comedy Germans are portrayed as aggressive because that’s basically how Germans ‘sound’ to us due to our own very limited ability to detect subtle emotional characteristics in speech when spoken by someone of a different nationality, and it sounds that way to us no matter what the [German] speaker is actually saying. A Chinese newsreader will still sound excitable to us, even if they’re reading out a sombre obituary, and these misreadings of emotional content cut both ways, hence the legendary inability of many Americans to ‘get’ British ironic humour, particularly of the deadpan variety.
If you’ve followed all that, you should have realised that detecting emotional content in speech is a phenomenally complex matter and one that, in scientific terms, we really do not understand with anything close to the kind of detail or clarity necessary to devise a machine or computer program that would be capable of accurately detecting even very basic and obvious emotions in speech, let alone track the kind of supposedly ‘minute’ changes in speech which the developers of this technology claim that their system can not only pick up but also relate directly to stress, emotions and ‘anomalies in brain activity’. Right now, we cannot even reliably relate ‘anomalies’ in brain activity to speech using the best available neural imaging systems because we simply do not know anything like enough about the relationship between the brain, how we think and how that relates to electrical activity in the brain and the physiological processes which produce human speech to make any kind of accurate assessments of an individual’s mental or emotional state of mind from anything we can detect using modern technology.
As far as ‘Layered Voice Analysis’, the technology used in the DWP trial, is concerned, the very small amount of theoretical background information given by the system’s developers indicates that almost everything that scientists have been able to establish, to date, about the complexity of the process by which humans detect emotional content in speech has simply been ignored to the point that the owner of the company, Amir Liberman, who developed this system and who owns the patent on it, recently claimed, publicly, that the system has nothing whatsoever to do with either linguistics or phonetics. This, were it true, would mean that the system is based on what would amount to a completely new and previously unknown branch of science, and put his ‘invention’, and his discovery of this new branch of science, somewhere on a par with Einstein’s special and general theories of relativity, quantum mechanics and chaos theory as a ‘scientific’ innovation.
Before we leave the realm of scientific theory, it’s worth noting that this is not the first mechanical system to have been marketed on the premise that it can detect the presence of stress related to lying and dishonesty by analysing the acoustic characteristics of speech.
Before ‘layered voice analysis’ there was ‘voice stress analysis’, a system developed during the late 1960s by three US Army officers who, on leaving the army, marketed their system to law enforcement agencies and commercial businesses as a voice-based ‘lie detector’.
For our purposes we need not delve too deeply into the theory behind this system, which claims to detect stress by analysing changes in a low frequency subsonic physiological tremor which is supposedly present in human speech. What you need to know is simply that:
a) the tremor itself is real enough, inasmuch as its existence was detected in the large muscles of the human body involved in movement back in the 1950s by a British scientist, Olaf Lippold, whose name the tremor now bears (i.e. The Lippold Tremor). This tremor, which is caused by the periodic contraction of muscle cells in, for example, the arm at a rate of 8-12 times a second, is thought to be part of a feedback mechanism involved in fine motor control;
b) that said, a system based on detecting this tremor in speech could not be used in this or any other vocal lie detection system which takes its input from telephone calls as the tremor, if it were present in speech, would be found at a frequency so low that only a specialist microphone could pick it up and, also, at a frequency that is automatically filtered out of the signal transmitted by a standard telephone system; and
c) most importantly of all, in the last forty years only one research study has ever attempted to find this supposed tremor by seeking to detect its presence in the muscles of the larynx (this being how it supposedly affects the way we speak). This study, by Shipp and Izdebski (1981), not only failed to find the tremor at the predicted frequency range but also found that, when speaking normally, the laryngeal muscles move in such a complex and rapid manner that it would be effectively impossible to actually detect such a tremor either directly, using electromyography, or in a speech waveform, even if it did exist.
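To put rough numbers on point (b), here is a minimal sketch. It assumes the conventional narrowband telephone voice band of roughly 300 to 3,400 Hz, a figure that does not come from this article, and it treats the line as an ideal band-pass filter, which a real network is not. The function name and constants are made up for illustration.

```python
# Assumed figures, NOT from the article: the conventional narrowband
# telephone voice band, modelled as an ideal (brick-wall) band-pass filter.
PHONE_BAND_HZ = (300.0, 3400.0)

# Tremor rate quoted in the article: 8-12 contractions per second.
TREMOR_BAND_HZ = (8.0, 12.0)


def survives_phone_line(freq_hz, band=PHONE_BAND_HZ):
    """Return True if a pure tone at freq_hz lies inside the pass band."""
    low, high = band
    return low <= freq_hz <= high


# Check both edges of the tremor band plus an ordinary speech frequency.
for freq in (TREMOR_BAND_HZ[0], TREMOR_BAND_HZ[1], 1000.0):
    print(f"{freq:7.1f} Hz passes the line: {survives_phone_line(freq)}")
```

Under those assumptions, anything in the 8-12 Hz range sits well below the bottom of the pass band, while ordinary speech frequencies get through.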
The company behind the system being trialled by the DWP does state, very clearly, that its system is not based on detecting the Lippold Tremor, which is just as well as this would be impossible over a telephone and because it’s also been found, by researchers, that at least one currently available voice stress analyser which does, very clearly, claim to use this type of detection, was in fact tuned to a completely different frequency to the one at which this tremor had been found by Lippold.
That’s the theory done with, now for the hard evidence.
‘Layered Voice Analysis’ On Trial – The ‘Prosecution Case’.
Let’s talk about the scientific evidence that’s available relating to this technology, starting, in time-honoured legal fashion, with the case for the ‘prosecution’, which is based primarily on three peer-reviewed scientific studies:
a) a paper by Professors Anders Eriksson (University of Gothenburg) and Francisco Lacerda (Stockholm University) which was published in a small, specialist, scientific journal in December 2007 and then withdrawn from the journal’s website in December 2008 after the publisher was threatened with a libel action by the company behind the system, Nemesysco, and its owner, Amir Liberman. Fortunately the paper found its way onto Wikileaks, for anyone who wants to work through the detail for themselves.
b) a detailed laboratory study by Hollien and Harnsberger (2006) which set out to evaluate whether the system could reliably detect stress levels associated with either truthfulness or dishonesty. This study is particularly interesting because, unlike all of the papers provided on the company’s website, it used a clear and empirically defined test protocol and because it includes, as an appendix, emails which clearly show the company which supplied the test system to the researchers, a now defunct US company which served as the licensed distributor of the system in the US at the time the research was conducted, attempting to influence the test protocols in their favour and raising objections to the research which effectively suggest, as is common in many other types of pseudoscience, that it is effectively impossible to subject the system to genuine scientific testing.
c) a field study, conducted with funding from the US Department of Justice by Damphousse et al. (2007), which put the system to the test in a very real and very live setting – evaluating the truthfulness of newly arrested prisoners when questioned about their recent drug use.
That’s what we’ve got, other than an as yet unpublished technical paper, which has been kindly supplied by Professor Francisco Lacerda (without whose kind assistance this investigation would not have been possible), which adds a very useful layer of technical information to the overall picture but which, unfortunately, I cannot supply for the time being as it’s currently under review for publication in a very large and well known scientific journal/magazine.
That’s the background, and having already covered the evidence which shows that there is no known scientific basis for any of the claims made for this system’s ability to detect stress or any other emotional or mental state, what we can add to the evidentiary picture from these three peer-reviewed papers is:
1. The main patent, which describes this technology (filed in Israel in 1997), claims that the system detects the ‘emotional status of an individual based on the intonation information’ in the speech waveform it analyses. Francisco Lacerda describes this claim, in his unpublished technical paper, as being based on ‘a superficial knowledge of acoustic phonics’, backing up the assessment given in his joint paper with Anders Eriksson in which they point out that what the patent claims is ‘intonation information’ bears little or no relationship to the accepted definition of intonation used in the study of linguistics and phonetics.
2. Responding to the claim that the system detects ‘minute changes in the speech waveform’, Eriksson and Lacerda note that, in reality, the system actually works by capturing an already poor quality speech signal (i.e. the compressed signal from a telephone system), sampling it at a very low rate (11 kHz, 8 bits per sample, compared to a standard CD, which is sampled at 44.1 kHz and 16 bits per sample) and then filtering it, effectively removing two thirds of the information contained in the sampled waveform. The effect of this processing, before the system even attempts to analyse the content of the signal, would, in many cases, be to degrade the signal that the system processes to the point where it would be almost unrecognisable as speech.
3. Staying with the signal processing side of the system, Eriksson and Lacerda also note that there is no means by which the system could accurately distinguish human speech from any other sounds it might pick up, such as background noise, line noises, signal drop outs, etc, the presence of which could easily affect the scores output by the system at the time it analyses the signal, distorting the results. As such, the system would just as readily evaluate the sound of a dog barking or a lorry driving past for its ‘honesty’ as it would human speech.
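For a sense of scale on the sampling figures in point 2, the arithmetic can be sketched as below. The comparison is per channel (a CD is stereo, which the article’s figures don’t address), and the variable names are just for illustration.

```python
def bits_per_second(sample_rate_hz, bits_per_sample):
    """Raw data rate of one uncompressed audio channel."""
    return sample_rate_hz * bits_per_sample


# Figures quoted in the article.
LVA_RATE_HZ, LVA_BITS = 11_000, 8
CD_RATE_HZ, CD_BITS = 44_100, 16

lva_bps = bits_per_second(LVA_RATE_HZ, LVA_BITS)
cd_bps = bits_per_second(CD_RATE_HZ, CD_BITS)

# Number of distinct amplitude values each sample can take.
lva_levels = 2 ** LVA_BITS
cd_levels = 2 ** CD_BITS

# Fraction of the CD's per-channel data rate that the system starts with,
# before any further filtering is applied.
ratio = lva_bps / cd_bps

print(f"System input: {lva_bps} bits/s, {lva_levels} amplitude levels")
print(f"CD channel:   {cd_bps} bits/s, {cd_levels} amplitude levels")
print(f"Ratio:        {ratio:.1%}")
```

So, before the filtering step described above, the sampled signal already carries about an eighth of the raw data of a single CD channel, with 256 possible amplitude values per sample rather than 65,536.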
So, if the system is not actually analysing intonation in speech and is, in fact, degrading the quality of the speech signal it is analysing out of almost all recognition before it does any kind of processing for stress or anything else, then what exactly does this system analyse?
The answer, according to the two professors, is nothing more than ‘digitisation artefacts’ – spikes and lulls in the digitised speech waveform which, in effect, exist only because an analogue signal – the sound we actually hear – has been converted into a very low quality digital representation of that sound; and while the digital version of the sound recorded by the system will, obviously, bear some resemblance to the original analogue sound, it will also, due to the low quality of the sampling method used, contain spikes and lulls (the terms actually used by the developers are ‘thorns’ and ‘plateaus’) that have nothing whatsoever to do with any of the speech content in the original signal.
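A toy illustration of how coarse digitisation can manufacture flat stretches that were never in the original sound is sketched below. To be clear about the assumptions: the ‘flat run’ counted here is a simplified stand-in of my own, not the patent’s definition of a ‘plateau’, and the test tone, its amplitude and the run length threshold are all arbitrary choices.

```python
import math


def quantise(x, bits):
    """Map x in [-1.0, 1.0] to a signed integer code at the given bit depth."""
    max_code = 2 ** (bits - 1) - 1
    return round(x * max_code)


def count_flat_runs(samples, min_len=3):
    """Count runs of at least min_len consecutive identical values.

    Hypothetical stand-in for a 'plateau'; NOT the patent's definition.
    """
    runs, run_len = 0, 1
    for prev, cur in zip(samples, samples[1:]):
        if cur == prev:
            run_len += 1
        else:
            if run_len >= min_len:
                runs += 1
            run_len = 1
    if run_len >= min_len:
        runs += 1
    return runs


SAMPLE_RATE_HZ = 11_000   # sampling rate quoted in the article
TONE_HZ = 100.0           # arbitrary smooth test tone
AMPLITUDE = 0.05          # arbitrary: a quiet signal, 5% of full scale

# A tenth of a second of a perfectly smooth sine wave.
tone = [
    AMPLITUDE * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_RATE_HZ // 10)
]

smooth_runs = count_flat_runs(tone)
coarse_runs = count_flat_runs([quantise(x, 8) for x in tone])
fine_runs = count_flat_runs([quantise(x, 16) for x in tone])

print("flat runs in the smooth tone:", smooth_runs)
print("flat runs at 8 bits:         ", coarse_runs)
print("flat runs at 16 bits:        ", fine_runs)
```

The smooth tone has no flat stretches at all, yet the 8-bit version is full of them, purely because a quiet signal has only a handful of amplitude values to land on. That is the general sense in which a feature can belong to the digitisation rather than to the speech.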
So, to quickly recap, before the system even starts to do any kind of analysis, it has already stripped the samples it analyses of most of their useful and useable information and a sizeable proportion of what’s left has little or nothing to do with the actual speech content of the signal – so how exactly do you analyse that?
I’ve already mentioned two ‘features’ in the sample data, which the patent refers to as ‘thorns’ and ‘plateaus’, and it’s the presence of these that the system analyses – but what are they?
Unless you’re really interested in the technical aspects of all this (in which case, read the paper on Wikileaks) then it doesn’t matter whether you understand what they are or how, exactly, they’re defined in the patent and by the system, and that’s because they are nothing more than arbitrary and, therefore, entirely meaningless ‘features’ in the sample data used by the system – and it’s because of that that Eriksson and Lacerda have confidently asserted that not only does this technology not work but it actually cannot work no matter how much effort its developer might put into ‘updating’ the software.
That is how the system analyses speech: it scans samples for the presence of arbitrarily defined and utterly meaningless digitisation artefacts, records some fairly basic information about them, subjects this information to a few basic statistical calculations, none of which would trouble a halfway decent GCSE student, and then spits out a set of meaningless numbers which, the developer claims, will tell you whether the individual is feeling stressed or experiencing any of a range of different emotional and mental states.
In practice, the system actually generates these numbers (four primary parameters and anything up to 18 secondary parameters) by, first, taking and analysing a reference sample at the start of the telephone call that the system will be used to analyse – and which it treats as an indication of the caller’s unstressed state of mind – and then comparing its analysis of samples taken later in the call with the reference sample. This method is unrelated to any known scientific law, theory or principle, but it does observe a well known law of computing – Garbage In, Garbage Out.
I could go on to give more technical detail on the inner workings of the system but, in reality, all you need to know and commit to memory in order to follow the rest of this series of articles is that, despite the fact that the system does process and analyse speech in a deterministic fashion, the essential meaninglessness of the processing it does and the method by which it produces its analyses means that, within certain experimentally definable limits, everything the system outputs by way of its parameter scores is effectively random, as Francisco Lacerda notes here:
“The overall conclusion from my study is that from the perspectives of acoustic phonetics and speech signal processing, the LVA‐technology stands out as a crude and absurd processing technique. It not only lacks a theoretical model to link its measurements of the waveform with the speaker’s emotional status but the measurements themselves are so imprecise that they cannot possibly convey useful information. And it will not make any difference if Nemesysco “updates” its LVA‐technology. The problem is in the concept’s lack of validity. Without validity, “success stories” of “percent detection rates” are simply meaningless. Indeed, these “hit‐rates” will not even be statistically significant/different from associated “false‐alarms”, given the method’s lack of validity.”
It is that last statement, on hit rates and false alarms, which brings us to the two independent evaluations of the system in use – Hollien & Harnsberger (2006) and Damphousse et al. (2007) – both of which are well worth looking over if you’re interested in the science, and definitely worth a look for the discussions of test protocols and the emails to and from Nemesysco’s then-US distributors.
However, for now we can concentrate simply on just three things that these two studies have in common:
1. Both studies were conducted independently of the company that supplied the system, under test conditions which precluded that company influencing or introducing biases into the results.
2. Both studies found that regardless of whether the system was given a truthful or dishonest statement to analyse, the system was pretty much as likely to produce a wrong assessment (a false positive or negative) as a correct one.
3. Both studies concluded that the probability of the system accurately identifying a statement as either truthful or a lie was no better than chance – and specifically, no better than flipping a coin.
The experimental evidence from both studies, and these are the only credible independent experimental studies of the system to date, backs up Lacerda’s assertion that the method used by the system lacks any scientific validity.
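For anyone wondering what ‘no better than chance’ looks like in numbers, here is a small simulated baseline. It is not data from either study: it is a literal coin flip that ignores the statement entirely, scored the way a detector would be, and the sample sizes and seed are arbitrary.

```python
import random


def coin_flip_detector(rng):
    """Flag a statement as 'deceptive' half the time, ignoring its content."""
    return rng.random() < 0.5


def evaluate(n_truthful, n_deceptive, seed=0):
    """Score the coin flip on known truthful and known deceptive statements."""
    rng = random.Random(seed)
    # Truthful statements wrongly flagged as deceptive.
    false_alarms = sum(coin_flip_detector(rng) for _ in range(n_truthful))
    # Deceptive statements correctly flagged as deceptive.
    hits = sum(coin_flip_detector(rng) for _ in range(n_deceptive))
    hit_rate = hits / n_deceptive
    false_alarm_rate = false_alarms / n_truthful
    correct = hits + (n_truthful - false_alarms)
    accuracy = correct / (n_truthful + n_deceptive)
    return hit_rate, false_alarm_rate, accuracy


# Arbitrary sample sizes, chosen only to make the rates settle down.
hit_rate, false_alarm_rate, accuracy = evaluate(10_000, 10_000)

print(f"hit rate:         {hit_rate:.3f}")
print(f"false alarm rate: {false_alarm_rate:.3f}")
print(f"accuracy:         {accuracy:.3f}")
```

The thing to notice is that the hit rate and the false alarm rate come out at about the same value. A detector that flags liars no more often than it flags honest people is telling you nothing, which is the point Lacerda makes about ‘hit-rates’ and ‘false-alarms’ in the quote above.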
The Case for the Defence
So, to begin the case for the defence, we should start with the nine studies which are linked to on the website of the company that developed this system, Nemesysco, each of which, in different ways, supposedly validates this system in scientific terms.
Nine studies might sound pretty good, given that we have only three, plus an unpublished paper, to support the prosecution case, but what Nemesysco doesn’t disclose to visitors to its site, except in the case of one study which is clearly identified as a piece of ‘internal’ research, is that every single one of these studies was either:
a) conducted by someone with a close, and often financial, relationship with the company or one of its immediate predecessors (before Nemesysco was set up in 2000, the system was owned and developed by another Israeli software company, Makh Shevet Ltd and marketed under the brand name ‘Truster’ by a related company, Trustech);
b) based solely on data provided by the company or one of its close associates; or,
c) conducted by researchers who were, during the study, wholly reliant on information provided by Nemesysco and/or its local distributor/agent for their understanding not only of how to use the system but of exactly how to interpret the results produced by the system.
And because of how the system actually works, in practice, this effectively means that the results output by the system can and do mean pretty much anything that the company behind it wants them to mean, as the email exchange in Hollien and Harnsberger’s study clearly demonstrates.
Since the publication of Eriksson and Lacerda’s 2007 paper, Nemesysco’s owner, Amir Liberman, has put forward three main objections to the two professors’ analysis of his system…
…in addition to threatening to sue the publisher of the journal in which the paper appeared in order to get it taken offline (hence the need to get it from Wikileaks).
These objections are:
1. Eriksson and Lacerda did not evaluate a genuine copy of the system/software when writing their paper. This is certainly true, as their analysis is based on the information contained in the patent and they used another specialist software program (Mathematica) to model the processing and analysis methods used in ‘Layered Voice Analysis’ from the 500 or so lines of Visual Basic program code given in the patent as a description of the software’s analysis methods.
2. The real software has been updated several times since the patent was published, which means that Eriksson and Lacerda’s analysis and criticisms are completely out of date, and…
3. Eriksson and Lacerda only looked at one of the three patents Liberman currently holds, which implies that there is additional information in these patents that the two professors have failed to take into account.
In rebuttal, we can take objections 1 & 2 together because, as long as the technology is based on the basic method described in the patent, it makes no difference whether a simulation was used to analyse its workings or whether the software has been updated, because the reason it doesn’t work and cannot work is not down to programming errors or a lack of mathematical and statistical information or even because of the poor quality of the samples it analyses. It doesn’t work simply because the method by which it carries out its analysis is scientifically invalid and completely meaningless. The square root of nonsense is still nonsense no matter how you look at it.
As for objection three, yes there are two other patents.
One is for Liberman’s slightly notorious ‘love detector’, which uses the same meaningless analytical methods as his ‘layered voice analysis’ system but comes, in product form, at a much cheaper price than the £10,000 each system is costing the DWP. If you’re a Skype user, you can buy your very own ‘love detector’ for £29.99 and, all things being equal, you’re just as likely to find out that your long distance girlfriend/boyfriend is a benefit cheat as their local council and/or the DWP.
The other patent is, funnily enough, for an ‘internet matchmaking’ service which, as I’m sure you’ve already guessed, is nothing more than a bog standard online dating website with added ‘love detector’ software bolted on to it.
Neither of these two patents adds any meaningful technical information to the original patent nor suggests that any of the methods used in that patent had changed in the two years between the filing of the main patent (1997) and the filing of these two, later patents, in 1999.
So, the DWP claims on its website that this is a ‘proven technique’.
In reality, half of the system, the software which flags up claimants for further investigation, is, in practice, nothing more than a very expensive bingo machine and if you’re unfortunate enough to be making a claim for housing benefit or council tax benefit to any of the councils involved in the trial then, in normal circumstances, the chance your number will come up is no different to that of any other claimant, whether you are as honest as the day is long or trying to pull off a deliberate and calculated fraud.
That’s what the scientific evidence says.
In the next article in this series, we’ll pick up on the Guardian’s coverage of this issue, introduce you to one of the lesser known key players in the trial, Lior Koskas, and tell you a few things about his background that the Guardian didn’t have space to cover and we’ll also talk you through something that Koskas told the Guardian cannot be done…
…a rigorous scientific trial of the software used in the DWP trial.