Oooh – via the ever vigilant Google Alerts system and the Daily Mail it would appear that an old ‘friend’ has resurfaced wearing a new, and rather cheap-looking, suit…
Could voice analysis software give away lying CEOs?
Ah yes, we’re in questions to which the answer is ‘No’ country right from the outset.
Scandals such as Enron have cost investors billions – and revealed that current methods of detecting boardroom fraud are sorely lacking.
Chief executives can’t simply be subjected to lie detector tests during investor calls.
Often, the only method for looking into their affairs is auditors manually looking through accounts.
Well, yes, that’s what auditors are supposed to do…
More often still, their dishonesty only comes to light afterwards.
Ah yes, ’tis true…
But researchers at Duke University and the University of Illinois believe there is another way – using a technique called Layered Voice Analysis to listen for the telltale sounds of CEOs lying.
The researchers had an advantage: they could listen back to earnings calls where it was revealed afterwards that the CEO had been lying, and ‘tune’ their technique using convicted fraudsters.
Guess what’s coming next…?
The academics turned to an Israeli company called Nemesysco, and claim that they find ‘vocal dissonance’ in the CEOs’ speech at precisely the point when they were lying.
If true, the discovery could revolutionise the world of finance, and have ramifications in many spheres of life.
Yep… it’s the return of Nemesysco’s pseudoscientific ‘Layered Voice Analysis’ lie/stress/emotion/whatever detector in a brand new guise: the same system that the Department of Work and Pensions pissed away around £2 million of taxpayers’ money on before finally cancelling the entire shebang in November 2010.
Now, regular visitors may recall that I ran into a few hosting problems at the beginning of 2011 and, as a result, lost a small number of important articles including, unfortunately, the article which covered the DWP’s announcement that it was pulling the plug on Nemesysco and its UK partners, Digilog UK and Capita, so this presents me with a welcome opportunity to revisit that story in addition to tackling this latest development. Further background information can also be found by following this link although, again due to the crash, many of the links to pdf documents in these older posts will no longer work. However, because I’m something of a packrat, I still have all these documents on my own PC and I will be reintroducing several of them as we go.
The sudden re-emergence of this system stems from a new paper by academics at Duke University’s Fuqua School of Business and the University of Illinois which sports the following abstract:
We examine whether vocal markers of cognitive dissonance are useful for detecting financial misreporting, using both laboratory generated data and archival data. In the laboratory, we incentivize misreporting for personal gain, thereby generating an endogenous distribution of truth tellers and misreporters. All subjects are interviewed about their reported performance of a private task, much like managers are interviewed by analysts and auditors following an earnings report. Recorded responses to a series of automated and pre-scripted questions are analyzed using a vocal emotion analysis software that purports to capture negative emotions stemming from cognitive dissonance. We find the cognitive dissonance scores generated by the software discriminate between truth tellers and misreporters at the rate of 17% above chance levels. For the archival data, we use speech samples of CEOs during earnings conference calls and find that vocal dissonance markers are positively associated with the likelihood of adverse accounting restatements, even after controlling for financial accounting based predictors. The diagnostic accuracy levels are 8% better than chance and of similar magnitude to models based solely on financial accounting information. Our results from using both lab generated data and archival data provide some of the first evidence on the role of vocal cues in detecting financial misreporting.
Hobson, JL., Mayew, WJ., & Venkatachalam, M., “Analyzing Speech to Detect Financial Misreporting”, Journal of Accounting Research. doi: 10.1111/j.1475-679X.2011.00433.x
This paper is pretty much a sequel to an earlier (2009) paper by Mayew and Venkatachalam, “The Power of Voice: Managerial Affective States and Future Firm Performance”, which will finally appear in the February 2012 edition of the American Finance Association’s Journal of Finance.
Now, to be scrupulously fair to any media organisations that have been taken in by this paper, there is much to be taken in by unless you know exactly what you should be looking for.
The paper’s authors are all bona fide academics from credible universities. Venkatachalam is a professor, and Mayew an assistant professor, at the Fuqua School of Business at Duke University, a private research university in North Carolina that is regularly ranked in the top 20-25 universities in the world, while Hobson is an assistant professor at the University of Illinois, a university that has been dubbed a ‘Public Ivy’, i.e. a university which provides an Ivy League experience at a public school price, and also a university that has been assessed as being amongst the top 20 major research universities in the United States.
As for the journals in which these papers have secured publication: the Journal of Finance has an impact factor of 4.151, putting it second out of 76 journals in the business/finance category and 5th out of 305 in economics, with the Journal of Accounting Research weighing in at 4th place on the business/finance list with an impact factor of 3.346. Both journals make the list of 45 titles used by the Financial Times to put together its business school research rankings and both are included in Bloomberg Businessweek’s top 20 journals.
So, both papers originated with academics at prestigious universities and both are published in prestigious and influential peer reviewed journals and yet both – if you examine them closely – amount to little more than exercises in utterly vacuous gibber.
If you’ve been paying attention, you may already have spotted the problem but if you haven’t then consider carefully just exactly what kind of journals these papers are published in and just exactly what kind of academics are listed as authors. Prestigious as these journals may be, they are financial journals and our three academics are all professors of accountancy, for all that they hold down positions at prestigious universities. None of them has any substantive background in linguistics or any other branch of speech science, nor in any relevant field such as cognitive psychology; in short, they are entirely ill-equipped to assess or understand the most basic question that underpins their research: does Nemesysco’s system actually work, and can it actually do any of the things that its developers claim it can do?
This is the $64,000 question and one that the paper does acknowledge…
We focus on measuring emotions stemming from cognitive dissonance through the vocal channel for three reasons. First, Javers (2010) notes that former CIA agents hired by equity research firms search earnings conference calls specifically for markers of cognitive dissonance. Second, we have access to a commercial software product that purports to capture emotions related to cognitive dissonance. This software has recently been used in archival research to study the information content of executive emotion profiles during earnings conference calls (Mayew and Venkatachalam, 2011). Third, recent experimental research by Mazar, et al. (2008) directly links misreporting and cognitive dissonance. Mazar, et al. (2008) discuss the aversive feeling experienced by an individual during or after a dishonest action and argue that this aversion results because individuals generally view themselves as being honest and value this self-concept. They find that, in a setting where subjects are given incentives to misreport performance for personal gain, simple reminders of the emotional costs of deviating from the self-concept of honesty (i.e. cognitive dissonance costs) substantially dampen individuals’ propensities to misreport.
If you can manage to follow Mayew and Venkatachalam’s paper then you’ll find, from page 24 onwards, a quite fascinating exercise in overlooking the obvious for the sake of pursuing an unproven hypothesis to the bitter end. The fun starts with the following observation:
As expected, the coefficient on unexpected earnings (UEt ) is positive and statistically significant suggesting that the market responds significantly to the extent of news contained in the earnings announcement. Consistent with Davis et al.’s (2007) analysis of earnings press releases, we find a statistically significant positive (negative) relation between the POSWORDS (NEGWORDS) and contemporaneous returns.
Translating that into English, what it means is that business analysts tend to pay particular attention to the kind of language used by CEOs when discussing company performance with investors at quarterly conference calls. If a CEO delivers lots of positive words and statements then this breeds investor confidence, and vice versa, all of which falls into the general category of ‘stating the bleeding obvious’.
However, when they looked at the readouts from Nemesysco’s box of tricks, what they found is that when it said a CEO was feeling upbeat during the conference call, the company went on to do well in the eyes of investors, but when it said that the CEO was in fact on a bit of a downer, regardless of what they actually said during the call, this didn’t appear to have much effect on investors at all. From this observation they constructed the following hypotheses to explain why their results weren’t quite what they expected.
There are two possible explanations for the weak result for NAFF (Negative Affect). Investors may be optimistic, on average, and fail to incorporate the negative affective state in comparison to the positive affective state. An alternative explanation is that analysts are not scrutinizing enough in their exchange with management during earnings conference calls that might invoke the negative affective state. To test these competing explanations we identify situations where the analysts are most likely to scrutinize and interrogate managers during conference calls.
And sure enough, when they went back and looked at the conference calls where the software said that the CEO was unwittingly giving off bad vibes, they found that it was only when investors gave the CEO a hard time, typically because their numbers didn’t match up to previous earnings forecasts, that this generated a negative impact on market confidence in the company.
Now, the obvious conclusion from all this is that whatever the hell these researchers might think the software was telling them about the CEO’s state of mind in these calls, it had absolutely no bearing whatsoever on the views that business analysts formed from the content of the calls – content is king and the emotional cues that the software is supposedly picking up are irrelevant, meaningless or even entirely non-existent.
That, however, is not the conclusion that our intrepid researchers arrived at via some fairly tortured ad hoc hypothesising, which runs from…
Recent survey evidence by Graham, Harvey and Rajgopal (2005, p. 42) points to such a situation:
“CFOs dislike the prospect of coming up short on their numbers, particularly if they are guided numbers, in part because the firm has to deal with extensive interrogations from analysts about the reasons for the forecast error, which limits their opportunity to talk about long-run strategic issues”. Therefore, we posit that managers of firms who miss analysts’ earnings benchmarks are most likely to be extensively interrogated, in turn invoking affective states. [No Shit, Sherlock – U.]
to…
While there are no observed differences in the market perceptions of positive affective state across scrutiny conditions (an F-Test for the equality of the coefficients on PAFFHS and PAFFLS cannot be rejected), the market does not react to negative affective state for firms in low scrutiny conditions (NAFFLS = 0.0432, p-value > 0.10) and an F-Test for the equality of the coefficients on NAFFHS and NAFFLS is rejected (p-value < 0.01). These results imply that the market reaction to negative affect is statistically greater, and only exists, when firms are in high scrutiny conditions.
So, it’s actually the fact that CEOs tend to get a bit pissed off when they’re given a hard time by investors, even if they do their level best to conceal their feelings, that knocks market confidence in their company, and not the fact that investors might have good reason to give the CEO a hard time because the company has not delivered on its quarterly forecasts or otherwise lived up to market expectations.
Yeah, right… and these guys teach accountancy?
This is nothing more than bad science. The data didn’t support the hypothesis but, rather than reject it, the researchers chose to try to retrieve the situation by tacking on additional ad hoc hypotheses to shore up the one that failed. And when we return to the paper by Hobson, Mayew and Venkatachalam we find that it makes the following meretricious claim:
Archival work by Mayew and Venkatachalam (2011) finds that vocal emotion cues exhibited by managers during earnings conference calls have information content.
No, that’s not what their paper finds at all, for the simple reason that it fails entirely to control for the known, and in this context confounding, effects of the positive and/or negative language used by CEOs/CFOs in these calls, i.e. to separate out these influences from those that the researchers presume to be attributable to vocal emotional cues.
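By way of illustration only, here’s a minimal sketch in Python of what that missing control would look like, using entirely synthetic data and variable names that merely echo the paper’s labels (UE, POSWORDS, NEGWORDS, PAFF): put the language measures and the vocal ‘affect’ score into the same regression and see whether the affect score explains anything once the words are accounted for.

```python
# Synthetic illustration of controlling for language content; nothing here is the
# papers' actual data or model, the names just echo their variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
UE = rng.normal(size=n)                          # unexpected earnings (the 'news')
POSWORDS = 0.5 * UE + rng.normal(size=n)         # positive language tracks the news
NEGWORDS = -0.5 * UE + rng.normal(size=n)        # negative language does the opposite
PAFF = 0.8 * POSWORDS + rng.normal(size=n)       # an 'affect' score that merely shadows the words
returns = 0.6 * UE + 0.3 * POSWORDS - 0.3 * NEGWORDS + rng.normal(size=n)

X = sm.add_constant(np.column_stack([UE, POSWORDS, NEGWORDS, PAFF]))
results = sm.OLS(returns, X).fit()
for name, coef, p in zip(["const", "UE", "POSWORDS", "NEGWORDS", "PAFF"],
                         results.params, results.pvalues):
    print(f"{name:9s} coef={coef:+.3f}  p={p:.3f}")
# With the words in the model, PAFF's coefficient sits at or near zero: whatever
# 'information content' it appeared to have was just the language content.
```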
Sadly, in addition to that very basic error, these papers also contain notes which show fairly clear signs of softpedalling and cherrypicking in relation to other research undertaken using Nemesysco’s system. On page 9 of Hobson, Mayew and Venkatachalam (2011), footnote 3 states that:
Gamer, Rill, Vossel and Godert (2006) experimentally investigate LVA based cognitive dissonance levels as part of an overall assessment of all LVA metrics provided in an early version of the LVA software and find them to be higher for participants in the guilty condition than for those in the innocent condition, but not statistically different.
By way of comparison, the abstract for Gamer et al. (2006) provides a much starker assessment of their findings:
The Guilty Knowledge Test (GKT) and its variant, the Guilty Actions Test (GAT), are both psychophysiological questioning techniques aiming to detect guilty knowledge of suspects or witnesses in criminal and forensic cases. Using a GAT, this study examined the validity of various physiological and vocal measures for the identification of guilty and innocent participants in a mock crime paradigm. Electrodermal, respiratory, and cardiovascular measures successfully differentiated between the two groups. A logistic regression model based on these variables achieved hit rates of above 90%. In contrast to these results, the vocal measures provided by the computerized voice stress analysis system TrusterPro were shown to be invalid for the detection of guilty knowledge.
TrusterPro is an older version of Nemesysco’s LVA software but one that is nevertheless based on the same patent and underlying principles.
Elsewhere an internet appendix to Mayew and Venkatachalam (2011) includes the following description of a 2009 paper by Francisco Lacerda, Professor of Phonetics at Stockholm University:
Lacerda (2009) reviews the available public patent information on the LVA technology and concludes that LVA technology cannot work because it does not extract relevant information from the speech signal. The paper shows real world hit rate data from LVA usage by the UK’s Department of Work and Pensions. Overall areas under the ROC curve are 0.65 overall and reach values as high as 0.73.
This is what Lacerda actually had to say on the subject of the DWP trial:
The UK’s Department of Work and Pensions has recently published statistics on the results of a large and systematic evaluation of the LVA-technology assessing 2785 subjects and costing £2.4 million. The results indicate that the areas under the ROC curves for seven districts vary from 0.51 to 0.73. The best of these results corresponds to a d’ of about 0.9, which is a rather poor performance. But the numbers reported in the table reflect probably the judgements of the “Nemesysco-certified” personnel in which case the meaningless results generated by the LVA-technology may have been over-ridden by personnel’s uncontrolled “interpretations” of the direct outcomes after listening to recordings of the interviews.
An AUC (area under the ROC curve) of 0.5 indicates random chance, and it’s entirely noticeable that the appendix cherrypicks the best result from the seven stage 1 trial districts while omitting to mention that four of the seven areas returned AUC scores of between 0.51 and 0.51, including three of the four largest trials (Jobcentre Plus, Lambeth and Harrow), which accounted for 77% of the data generated by this stage of the evaluation.
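For anyone who’s a bit rusty on ROC analysis, here’s a quick and purely illustrative sketch, in Python, of why an AUC in the region of 0.51 amounts to guessing; none of the numbers below are the DWP’s, only the sample size is borrowed from Lacerda’s description of the trial.

```python
# Illustration only: an uninformative score gives an AUC of about 0.5, a genuinely
# informative one pushes the AUC towards 1.0. The labels and scores here are random.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2785)        # hypothetical honest/dishonest labels
noise_scores = rng.random(size=2785)          # 'risk scores' carrying no information

print(f"AUC of an uninformative score: {roc_auc_score(labels, noise_scores):.2f}")

informative_scores = labels + rng.normal(0.0, 1.0, size=2785)   # scores that track the labels
print(f"AUC of an informative score:   {roc_auc_score(labels, informative_scores):.2f}")
```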
I’ll be returning to Lacerda’s paper and to the DWP trials in due course but should, at this point, declare an interest inasmuch as I did correspond with Prof. Lacerda at a couple of points during my investigations of the DWP trials and found him most helpful throughout.
So, in all we have a number of red flags to attach to these new papers, from the evidence of ad hoc hypothesising and failure to control for what should be an obvious source of confounding to the softpedalling of the findings of Gamer et al. (2006) and the cherrypicking of the DWP data in Lacerda (2009), and to this we must now add Lacerda’s withering assessment of Nemesysco’s system from that same paper:
The essential problem of this LVA-technology is that it does not extract relevant information from the speech signal. It lacks validity. Strictly, the only procedure that might make sense is the calibration phase, where variables are initialized with values derived from the four variables above. This is formally correct but rather meaningless because the waveform measurements lack validity and their reliability is low because of the huge information loss in the representation of the speech signal used by the LVA-technology. The association of ad hoc waveform measurements with the speaker’s emotional state is extremely naive and ungrounded wishful thinking that makes the whole calibration procedure simply void.
This last statement, in regards to calibration, is not entirely correct for reasons that will become clear shortly. However, Lacerda continues:
In terms of “lie-detection”, the algorithm relies strongly on the variables associated with the plateaus. Given the phonetic structure of the speech signals, this predicts that, in principle, lowering the fundamental frequency and changing the phonation mode towards a more creaky voice type will tend to count as an indication of lie, in relation to a calibration made under modal phonation. Of course this does not have anything to do with lying. It is just the consequence of a common phonetic change in speaking style, in association with the arbitrary construction of the “lie”-variable that happens to give more weight to plateaus, which in turn are associated with the lower waveform amplitudes towards the end of the glottal periods in particular when the fundamental frequency is low.
The overall conclusion from this study is that from the perspectives of acoustic phonetics and speech signal processing, the LVA-technology stands out as a crude and absurd processing technique. Not only it lacks a theoretical model linking its measurements of the waveform with the speaker’s emotional status but the measurements themselves are so imprecise that they cannot possibly convey useful information. And it will not make any difference if Nemesysco “updates” in its LVA-technology. The problem is in the concept’s lack of validity. Without validity, “success stories” of “percent detection rates” are simply void. Indeed, these “hit-rates” will not even be statistically significant different from associated “false-alarms”, given the method’s lack of validity. Until proof of the contrary, the LVA-technology should be simply regarded as a hoax and should not be used for any serious purposes (Eriksson & Lacerda, 2007).
The final reference is to this paper – Eriksson, A. and Lacerda, F. (2007). Charlatanry in forensic speech science: A problem to be taken seriously. International Journal of Speech, Language and the Law, 14, 169-193 – which you won’t find in the online edition of the International Journal of Speech, Language and the Law as this is the paper that was somewhat infamously removed from the site after Nemesysco threatened the journal’s publisher with a defamation suit.
Now, according to Lacerda, Nemesysco’s software ‘works’ as follows. It takes a low quality audio signal, processes it in a manner which degrades the signal even further, stripping it of any potentially meaningful information content, and then subjects this degraded waveform to a bunch of meaningless mathematical processes, generating a bunch of numbers that are, in effect, nothing more than statistical noise. This noise isn’t entirely random, as such, but it nevertheless contains no useful information content.
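To make that a little more concrete, the sketch below is my own rough paraphrase, in Python, of the kind of waveform bean-counting Lacerda describes; it is emphatically not Nemesysco’s actual algorithm, just an illustration of how coarsely quantising a signal and counting ‘thorns’ and ‘plateaus’ yields numbers that describe the shape of a degraded waveform and nothing else.

```python
# Rough paraphrase for illustration; not Nemesysco's code.
import numpy as np

def thorns_and_plateaus(waveform, levels=16):
    coarse = np.round(waveform * levels).astype(int)           # throw away amplitude detail
    diffs = np.diff(coarse)
    thorns = int(np.sum((diffs[:-1] > 0) & (diffs[1:] < 0)))   # tiny local peaks
    plateaus = int(np.sum(diffs == 0))                          # flat runs of identical samples
    return thorns, plateaus

# Any low-grade signal will do; a second of noise stands in for 11 kHz speech here.
fake_speech = np.random.default_rng(3).normal(0.0, 0.2, size=11025)
print(thorns_and_plateaus(fake_speech))
```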
So, if this is indeed the case, how the hell have these and other researchers managed to produce results which appear – if taken at face value – to show that the software can detect certain cognitive states at levels better than chance?
Well, to understand how this happens, the first thing you need to know is that, overall, there is one obvious and consistent pattern that emerges when you look at all the available research in which this system has been used. It only ever appears to generate positive, i.e. better than chance, results in studies in which Nemesysco or its vendors/licensees have had some direct involvement; and, sure enough, the paper by Mayew and Venkatachalam includes the following statement amongst its acknowledgements:
We appreciate the assistance of Amir Liberman and Albert De Vries of Nemesysco for helpful discussions and for providing the LVA software for our academic use.
BTW, if you do happen to spot any references to De Vries which contain the letters ‘PhD’, ignore them – De Vries’ ‘doctorate’ is a Gillian McKeith correspondence degree from an unaccredited institution – Newport University, California (not Wales) – which now no longer offers any degree programmes.
Conversely, as you might expect, studies conducted independently of Nemesysco and its employees, vendors and associates tend to find that the system performs no better than chance, although here I would add one note of caution: some of Nemesysco’s past vendors/associates have been, shall we say, less than forthcoming about their links to the company, e.g. Drs. Guy ILM Van Damme and Prof. Herman Conradie, so it’s always advisable to check the provenance of the authors, particularly in relation to any research promoted on Nemesysco’s own website.
Failure to provide independent replication of results is, of course, another important red flag but, on its own, it doesn’t explain exactly how or why these replication failures occur. To understand that we need to look at three further research papers – the last ones for this article, I promise – papers which provide a useful insight into how this system may actually function, if function is the right word for it.
Paper number one is a recent study by Aaron C Elkins, a postdoctoral researcher at the University of Arizona’s National Centre for Border Security and Immigration, or maybe that should be the US Department of Homeland Security’s National Centre for, etc. as that’s where the funding for this research centre actually comes from.
Elkins’ paper is entitled ‘Evaluating the Credibility Assessment Capability of Vocal Analysis Software‘ – yes, the Department of Homeland Security have been playing with Nemesysco’s system, so any US visitors should be starting to feel just a bit unnerved at this point – and it was presented in January 2010 at a Credibility Assessment and Information Quality in Government and Business Symposium at the 43rd Hawaii International Conference on System Sciences.
In this paper, Elkins reports that Nemesysco’s software performed no better than chance but also added a very interesting observation to his findings:
Mirroring the results of previous studies, the vocal analysis software’s built-in deception classifier performed at the chance level. However, when the vocal measurements were analyzed independent of the software’s interface, the variables FMain, AVJ, and SOS significantly differentiated between truth and deception. This suggests that liars exhibit higher pitch, require more cognitive effort, and during charged questions exhibit more fear or unwillingness to respond than truth tellers.
Never mind what Elkins thinks his results suggest about liars, the key thing to note is that the software’s built-in deception classifier failed the test but that Elkins did manage to generate positive (above chance) results for three of the system’s mathematical parameters by conducting a post-hoc analysis of the system’s raw output.
Now, if you were to ask Nemesysco to explain this they would undoubtedly claim that this shows that their system does actually work and that it only failed the main test because Elkins hadn’t calibrated the software correctly. At face value, that sounds pretty plausible, at least until you read this 2003 study/report by Brown et al. which tested an earlier version of Nemesysco’s system, ‘Vericator’, for its potential use as a sentry system for the US Department of Defence’s Polygraph Division. As with Elkins’ study, the software failed the primary test but, on a post hoc analysis of its raw outputs, generated results that led the researchers to the view that they had miscalibrated the system and that it would have passed the test had it been set up correctly. However, the researchers here also note that:
A major caveat must be placed here. We used logistic regression analyses to fit our data to known and desired outcomes. This heavily biased the outcomes to yield the most favorable [results]. In order to test the derived decision algorithms’ respective accuracies without bias, a new study would have to generate new data to test the algorithms. This was well beyond our original intent and scope. Another point of concern we have with the ability to generalize these results to new data stems from the fact that the derived decision algorithms for Scripted and Field-like questioning were quite different. The Scripted algorithm used eight of the nine Raw-Values parameters while the Field-like algorithm used only four. This raises concerns that these algorithms are based upon highly variable data. As a result, cross-validation with new data is questionable.
Taken together these results are entirely consistent with Lacerda’s view that the only thing that Nemesysco’s software actually generates is meaningless, information free, statistical noise, noise in which patterns which appear to show that the system is capable of performing at levels above chance can be reliably located by means of a post-hoc regression analysis based on known and desired outcomes. If you know the answers that you’re looking for then it’s possible to find patterns in the data that appear to fit the answers but, and this is crucial, this can only be done reliably after the fact because the system does not generate these apparent patterns in any consistent manner.
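If you want to see how easy it is to conjure an above-chance ‘hit rate’ out of information-free numbers when you already know the answers, here’s a small and entirely hypothetical sketch in Python: fit a logistic regression to random ‘parameters’ against known outcomes and the in-sample accuracy comfortably beats chance, then cross-validate on data the model hasn’t already been fitted to and the effect evaporates.

```python
# Toy demonstration; none of this is Nemesysco's output, the 'nine parameters'
# simply mirror the nine raw-value parameters mentioned by Brown et al.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_subjects, n_parameters = 60, 9
noise = rng.normal(size=(n_subjects, n_parameters))    # information-free 'measurements'
truthful = rng.integers(0, 2, size=n_subjects)          # the known and desired outcomes

model = LogisticRegression().fit(noise, truthful)
in_sample = model.score(noise, truthful)                # fitted directly to the answers
out_of_sample = cross_val_score(LogisticRegression(), noise, truthful, cv=5).mean()

print(f"in-sample accuracy:       {in_sample:.0%}")     # comfortably above 50%
print(f"cross-validated accuracy: {out_of_sample:.0%}") # back down to around chance
```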
This brings us to our third and final paper, another failed independent study by Harry Hollien and James Harnsberger of the University of Florida but one which is notable not only for its results but for an appendix (C), starting on page 54, which details the correspondence which passed between the researchers and a company called V LLC who were, at the time, Nemesysco’s main US vendors/partners. In the body of the paper, Hollien and Harnsberger describe their dealings with V LLC as follows:
The LVA analysis itself was conducted differently by the two teams of evaluators, the IASCP team and the Nemesysco, or V, team (e.g., two operators representing the manufacturer). The IASCP team at the University of Florida developed a protocol that did not require judgments by humans. This protocol was based on the training received by the two members of the team who are currently certified to use the device. The protocol varied depending on whether or not LVA was being operated to detect deception or stress…
[I’ve skipped the technical details of how the IASCP team devised its operator-free protocol]
…For both the deception as well as the stress analysis, trained LVA operators collated the results for submission to descriptive and statistical analysis. Given this approach, no interpretation of waveforms or waveform processing was necessary by the IASCP operators. Thus, their LVA analysis was conducted automatically without any operator “bias” or effects.
The V team did not follow the same protocol as that developed by the IASCP team. Over the course of the study, the IASCP group was unable to reach agreement with V, LLC (the distributor of Nemesysco’s LVA software in the United States) on the analysis protocol reported here (see Appendix C for documents related to the relevant discussions with V). Ultimately, the V team conducted its own LVA test of the VSA database while at the University of Florida site. The V team did not use a consistent protocol with all samples and, therefore, no attempt to document their operation of the device can be made. However, these operators were both highly experienced users selected by the manufacturer. Thus, it can reasonably be expected that the V team’s use of the device was within the manufacturer’s guidelines.
Although there were some differences in the results generated by the two teams, the overall results for both teams were sufficiently similar for Hollien and Harnsberger to conclude that:
In summary, their true positive and false positive rates were similar enough to suggest that the LVA was not sensitive to either deception or stress in these speech samples.
That said, and as I’ve already noted, the real paydirt is to be found in the correspondence, particularly from the point at which V LLC’s representative makes the following statement in a letter addressed to Harry Hollien:
As we move forward on Phase 1 of the study, we will attempt to identify what LVA components can be used to best measure the subjects’ states under the conditions you created in your laboratory. Based upon the little information we now have, given our inability to preview any data segments, the following is what we can tell you at this time.
What follows on from this statement is, for the most part, a bunch of technical-looking gibber, much of which is given short shrift by the researchers in their reply, but the important thing to note is the reference to the company’s ‘inability to preview any data segments’, which prompts them to send the researchers a stream of guesstimates for different ‘parameters’, many of which have nothing whatsoever to do with the protocols proposed by the researchers.
What this suggests, contrary to Lacerda’s view, is that the ‘calibration procedure’ isn’t really void. It is meaningless, in the sense that it doesn’t actually generate any meaningful information from the statistical noise generated by the software but it does serve a purpose, that of adjusting the software’s detection systems to locate, or rather attempt to locate, patterns in the noise that produce the appearance of meaningful outputs.
This accounts for the fact that the system always, or almost always, fails to deliver in independent studies where the software’s detection settings are configured using the information contained in Nemesysco’s manuals rather than on the basis of bespoke settings supplied by Nemesysco and/or its vendors, settings which the company are likely, if not certain, to have arrived at by conducting their own post-hoc analyses of reference samples.
It also accounts for the extreme variability of the full results of the DWP’s trials, for which I obtained the full data via the Freedom of Information Act – well, almost the full data, as 11 of the 34 phase 2 pilots failed to produce the management information required by the DWP in time for their results to be validated, although I did obtain partial data for one of these 11 sites via an FOIA request to the relevant council.
Of the 31 trials for which I obtained data, including the 7 phase 1 trials and the one for which partial data was supplied under FOIA, 14 were outright failures, with five producing results that indicated that the system had actually performed worse than chance, and, overall, there was no consistent pattern in performance between sites.
If I’m perfectly honest, the main conclusion I can draw from the data I have is that the whole thing was all a bit of a mess.
Of the 3,000 people that were flagged as ‘high risk’ during the phase 2 trial, almost 1,400 were not subjected to any follow-up checks at all – presumably these were people who were so obviously not telling porkies that the councils involved in the trial decided it wasn’t worth checking their claims.
Of the ones that were checked, almost 60% were found to be entirely kosher and just over 1 in 20 had their benefits increased after their claims were reassessed, roughly the same number as had their claims terminated outright.
Combining all the trials for which I have data, the specificity of the system, i.e. how accurately it identified people whose benefit claims were reduced or terminated following a follow-up review, was only 51%, although it did somewhat better with people who were judged to be low risk claimants, getting 78% of those assessments right, and its overall performance was within the bounds of a chance result. Even these results leave some questions unanswered, such as whether and to what extent claimants who were found to have been paid too much in benefits were overpaid due to their having given councils false information as against having simply made a mistake on their claim. Housing Benefit, in particular, is a horrifically bureaucratic system to deal with for anyone whose income can be even the slightest bit variable from week to week or month to month.
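For what it’s worth, this is the kind of check that ‘within the bounds of a chance result’ refers to. The sketch below uses purely hypothetical counts rather than the actual DWP figures, but it shows how you would test whether the risk flags are associated with claim outcomes any more strongly than chance allows.

```python
# Hypothetical counts, not the DWP data: a 2x2 table of risk flag against claim
# outcome, plus a chi-square test of whether the association beats chance.
from scipy.stats import chi2_contingency

#        claim reduced/terminated, claim unchanged
table = [[180, 220],   # flagged 'high risk'
         [170, 230]]   # flagged 'low risk'

chi2, p_value, dof, expected = chi2_contingency(table)

high_risk_hit_rate = table[0][0] / sum(table[0])   # how often a 'high risk' flag was right
low_risk_correct = table[1][1] / sum(table[1])     # how often a 'low risk' flag was right

print(f"hit rate among 'high risk' flags:  {high_risk_hit_rate:.0%}")
print(f"'low risk' flags that checked out: {low_risk_correct:.0%}")
print(f"chi-square p-value against chance: {p_value:.3f}")
```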
Assuming that I am correct and that Nemesysco’s system ‘works’ on the basis of what amounts to nothing more than a form of statistical pareidolia, i.e. by finding plausible-looking patterns in statistical noise, then the question remains as to how and why our accounting professors have been so readily taken in by the system.
The answer has, I think, much to do with the nature of accounting and business/financial analysis itself.
Markets, like Nemesysco’s LVA software, generate massive amounts of statistical noise. In fact, if you look at something like the FTSE 100 in terms of its daily or even hourly movements as a graph, then you’ll immediately notice just how noisy the graph is, even if it’s often possible to spot obvious trends, i.e. periods spanning weeks or months in which the overall trend is clearly either upward or downward.
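As a small aside, a series built from nothing but noise will still throw up convincing-looking runs and ‘trends’; the sketch below is pure simulation, not market data.

```python
# Pure simulation: a zero-drift random walk still manages to look 'trendy'.
import numpy as np

rng = np.random.default_rng(7)
daily_moves = rng.normal(0.0, 1.0, size=250)   # one 'year' of information-free daily moves
index = 100 + np.cumsum(daily_moves)           # a fake 'price index' built from that noise

print(f"start {index[0]:.1f}, end {index[-1]:.1f}, "
      f"low {index.min():.1f}, high {index.max():.1f}")
# The walk typically wanders well away from its starting point and spends weeks at a
# time heading in one direction, despite containing no information about anything.
```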
Financial analysts and investors who profess to study the markets spend an awful lot of time looking at noisy data and, more than that, looking for patterns in that noise which might help them to make accurate predictions about how a particular market or individual stock may move over the coming days, weeks or months; any kind of pattern that might give them an edge over other analysts and investors, which can be turned into a profit either directly, in terms of making their own investments, or indirectly, by selling their advice, building a bankable reputation or making a commission on trades based on their predictions.
At a very basic level, this type of activity is not so far removed from the manner in which hardcore gamblers look for patterns in the spins of a roulette wheel, in football and racing results, or in the way in which the reels of a slot machine fall in an effort to come up with a foolproof system that will allow them to beat the odds. In reality, the only game in which it’s possible to beat the odds, and the house, is blackjack, which explains why casinos put so much time and effort into detecting the use of card-counting, but even though this is relatively well known, it doesn’t dissuade gamblers from working on their ‘systems’ for other games of chance.
All three of our academics work in a field in which at least some of their work involves looking for meaningful patterns in the noise that companies and markets generate and this, I suspect, gives them something of a blind spot when they’re confronted by a device, Nemesysco’s LVA system, which purports to be capable of doing much the same thing, only in this case it’s processing audio signals and not financial returns, trading movements or share prices. These academics expect to find a pattern and, when that pattern doesn’t materialise, their experience of markets leads them to the mistaken belief that its absence indicates only that they’re not looking hard enough, or in a sophisticated enough way, and maybe need to add more epicycles to their model; anything but the obvious possibility that there is no pattern there in the first place and that the noise they see in front of them is just that – noise.
Oh, and just to correct the Daily Mail on one more point – the DWP did not pull the plug on its trial because of a ‘public outcry’; it pulled the plug because the system just didn’t work reliably, even if they dissembled on that point when they made the announcement.