The Independent’s ‘Lost Girls’: A Truly Epic Fail in Data Journalism

Okay, quick recap:

Yesterday, The Independent Newspaper published an article by its Science Editor, Steve Connor, in which it purports to have uncovered evidence of what it claims to be widespread use of illegal sex-selective abortions by ‘some UK ethnic groups’, all based on a bespoke data set extracted from the 2011 Census data by the Office for National Statistics at the newspaper’s request.

I, of course, had a quick look at these claims and came to the conclusion that there were a number of potential problems with the newspaper’s assumptions and analysis, but couldn’t be absolutely sure at the time because I’d not seen the actual data set on which the article was based. That’s now changed. It turns out that The Independent did, indeed, obtain its data using a Freedom of Information request although that was not readily apparent at the time as the data had been published by the ONS in a slightly out of the way section of its website that deals with ad hoc requests and which is only easily accessible from the page which explains its publication scheme rather than in its main subject-based disclosure log section.

So, thanks to a bit of detective work by Lisa Hallgarten, who spoke to both the journalist responsible for the article and the statistician, Christoforos Anagnostopoulos, that the Indie got to look over its figures, yesterday evening I received a copy of the full data set (.xls) that the newspaper had obtained from the ONS, allowing me to carry out my own analysis of the same data.

However, before getting into what I found I want to show you a graph I generated from a different ONS data set, one which shows the annual sex ratio at birth for the UK between 1938 and 2011 and which will hopefully help to clarify the background context against which we need to assess the Indie’s claims:


I would have liked to have gone back even further than 1938; the ONS used to publish statistical data on birth registrations stretching all the way back to 1837, when official birth registration began, but that historical data set is sadly no longer available online thanks, in no small measure, to the current government’s efforts to ‘improve’ access to official statistics.

Nevertheless we can see in this graph that some rather odd things have happened to the male-female sex ratio at birth* since the late 1930s. If it’s not already obvious, the red line on the graph tracks annual variations in the sex ratio, the solid blue line shows a smoothed long term trend, while the dashed lines show the average sex ratio for the entire period covered by the graph (1938-2011) and for two specific periods within that data set, 1938-1979 and 1980-2011. The long term average male-female sex ratio for the entire period covered by the graph is 1.056, so on average there were 1,056 male children born for every 1,000 female children. Almost at the very beginning, however, we see that the annual sex ratio rose very sharply in 1942-43 to a peak of 1,065 males to 1,000 females, and the annual ratio then remained above the long term average for all but three of the next 37 years until, in 1980, it fell significantly from 1,060 males per 1,000 females in 1979 to just 1,049 males per 1,000 females. Since then the annual sex ratio has only exceeded the long-term average on one occasion (2007).

*This is the secondary sex ratio, one of four different male-female sex ratios that are commonly referred to in scientific literature. There is also a primary sex ratio, which is the ratio of males to females at conception, a tertiary sex ratio, which is also referred to as the adult sex ratio (ASR) and which measures the ratio in sexually active organisms, and a quaternary sex ratio, which is the male-female ratio in post-reproductive organisms, although this is rarely used in human sex ratio studies as the boundary between it and the ASR is difficult to assess accurately.

So what we actually have here are two very distinct trends. Between 1942 and 1979 the average male-female sex ratio at birth in the UK was only a little under 1,060 males per 1,000 females but from 1980 onwards that average has dropped to a touch over 1,052 males per 1,000 females. This is not a result of any gradual change in the male-female sex ratio over time; the change in the long term trend, when it occurs, is extremely sudden and appears to have the basic characteristics of a sudden phase shift or tipping point event in physics.

So what caused these sudden shifts in the male-female sex ratio in the UK?

The truth is that no one really knows, and I should stress that the sudden downshift in the sex ratio at birth we can see in the UK data at the very beginning of the 1980s is not unique to Britain, similar sudden falls have also been observed in the male-female sex ratios at birth across a swathe of other Western industrialised countries at around the same time.

There is, in the UK data, a moderate correlation over time between the sex ratio and other statistical measures of fertility, specifically the total fertility rate (0.53), general fertility rate (0.58) and crude birth rate (0.56), which suggests the possibility of a general link between the sex ratio and overall birth rates and family sizes in a population, and that is statistically very plausible. For humans the natural male-female sex ratio at birth is not, so far as we can tell, 50:50. It’s actually around 51:49 in favour of males and if we think only in terms of homogeneous families, those that produce either only male children or only female children, then for those with only one child we’d expect to see our basic 51:49 male ratio, but when we come to look at those with two children the laws of probability tell us that the ratio there should increase to around 52:48, and with three children it should rise again to 53:47 and again to 54:46 for those with four children. Of course, as the number of children increases the laws of probability also tell us that the number of homogeneous families in the population should fall, simply because there are many more ways in which a family may consist of children of both sexes than there are single sex families, so there is a limit to the extent to which changes in birth rates and family sizes in a population can alter the male-female sex ratio. It could be part of the answer here but it’s not the explanation, even if we allow for the fact that the 1980 shift in the UK ratio occurred at the end of a decade which saw a substantial fall in the UK birth rate and the general and total fertility rates arising, in the main, from women gaining control of their own reproductive capabilities via oral contraceptives, legal access to abortion, etc.
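For anyone who wants to check the arithmetic on single-sex families, here is a minimal sketch in Python. It assumes, purely for illustration, that every birth is independent and has the same 51% chance of being male, which is a simplification and not something the real data guarantees. The function names are my own.

```python
def all_male_share(p_male: float, n_children: int) -> float:
    """Among single-sex families of a given size, the share that are all-male.

    Assumes independent births with a constant probability of a boy.
    """
    all_male = p_male ** n_children
    all_female = (1.0 - p_male) ** n_children
    return all_male / (all_male + all_female)


def single_sex_fraction(p_male: float, n_children: int) -> float:
    """Fraction of all families of a given size whose children are all one sex."""
    return p_male ** n_children + (1.0 - p_male) ** n_children


if __name__ == "__main__":
    for n in range(1, 5):
        share = all_male_share(0.51, n)
        fraction = single_sex_fraction(0.51, n)
        print(
            f"{n} child(ren): {share:.1%} all-male vs {1 - share:.1%} all-female, "
            f"{fraction:.1%} of families single-sex"
        )
```

Under those assumptions the all-male share climbs from 51% to roughly 52%, 53% and 54% as family size goes from one to four, while the proportion of families that are single-sex at all shrinks rapidly, which is why the effect on the overall ratio is limited.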

The most common/popular theories advanced to explain the sudden fall in sex ratios in industrialised countries have tended to focus on the possible impact of a range of environmental pollutants on male fertility, everything from industrial chemicals to an increase in female hormones in the water as a result of women’s use of oral contraceptives. These are all possibilities, and the biological mechanisms proposed by these hypotheses are all generally plausible, but concrete evidence of a definitive cause or causes is, to date, still lacking. Research into trends in human sex ratios in different populations has also turned up claims of associations between variations in sex ratios and paternal, but not maternal, age, gestational age at birth, maternal malnutrition, the population prevalence of the Hepatitis B virus, the social status of the mother (although the evidence suggests that affects pigs but not humans), maternal smoking, trends in marriage, i.e. the prevalence of early or late marriage, whether or not mothers have a partner or other support network, latitude, climate change (perhaps inevitably) and, of course, in some countries the practice of sex-selective abortion.

Some of these proposed associations are, a priori, rather more plausible than others but none of them, to date, seem to offer an adequate explanation not only for observed changes in the overall trends in male-female sex ratio we can see in that graph but, crucially, for the apparent suddenness of those changes. For reasons that we really don’t understand, in 1980 British women started to produce an average of 8 fewer male children per thousand female children than they had during the period from the Second World War to the election of Britain’s first female Prime Minister and for all anyone can be absolutely certain the reason for that could even be that in addition to all her other manifest faults, Thatcher just had bad juju.

There is a lot that we simply do not understand about sex ratios at birth in our species. In fact we cannot even be 100% certain what the ‘natural’ sex ratio for our species is because, although it’s widely thought that anywhere from 1,030 to 1,080 males per 1,000 females is roughly the natural range, scientific studies have found significant variations in the natural, or at least average, ratios both in different populations living in different parts of the world and in the same populations over time, to the extent that ‘natural’ sex ratios at birth have been found, historically, to have ranged anywhere from 940 males per 1,000 females to 1,150 males per 1,000 females for purely natural reasons.

The obvious lesson from all this is that when you start looking at the male-female sex ratio at birth in different populations you need to be extremely circumspect in any assumptions you choose to make about the possible causes of variations in those ratios and even about what may or may not be the ‘natural’ ratio for a given population. This is particularly the case when you start looking at sex ratios in first and even second generation migrant populations because these are invariably very different, demographically, from both the host population of the country to which these migrants have travelled and the general population of the country they’ve migrated from.

A second, absolutely critical point in all this is that the only way in which it is possible to calculate an accurate male-female sex ratio at birth for any given population is by using actual birth data, this being the method used by the Department of Health in compiling its report on sex ratios in different migrant populations in the UK and, with some limitations, by Dubuc and Coleman in what is still, to date, the only credible scientific study to find any evidence of the possible use of sex selective abortion in the UK, albeit only in one very small and specific population, a sub group of women born in India but now living in the UK who were having at least their third child. Moreover, to carry out a thorough analysis one needs detailed birth data covering not only the number of births and sex of any children born in a particular population but also clear and accurate information on the birth order because, as Dubuc and Coleman found, it was only at higher birth orders (3 or more children) that any discrepancies in the male-female sex ratio in this one subgroup of first generation female migrants from India became apparent, and the size of this discrepancy, which appears to amount at most to around 60 missing female births per year on current birth rates, is too small for it to have any discernible impact on the sex ratio in all children born to that population irrespective of birth order.

And, on top of all that, as should also be evident from Dubuc and Coleman’s paper, because the data on male-female sex ratio at birth in any given population is prone to significant year on year variations, we can only accurately identify anomalous ratios that might possibly indicate sex selection if we have longitudinal data to work with and can securely establish that any apparent anomalies are consistent with an atypical trend over time in a particular population and not merely the product of natural/random annual variations in male and female births.
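To give a rough feel for how much a sex ratio can wobble through chance alone, here is a small Python sketch. It is only an illustration under stated assumptions: births are treated as independent coin flips with a fixed underlying ratio, the range is a normal approximation to the binomial, and the population sizes are round numbers I have picked for the example, not figures from any ONS dataset. The function name `chance_range` is my own.

```python
import math


def chance_range(true_ratio: float, n_births: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate range of observed male-to-female ratios expected by chance.

    true_ratio is males per female (e.g. 1.05); z = 1.96 gives a rough 95% range.
    Uses a normal approximation, so it is only sensible for reasonably large n_births.
    """
    p_male = true_ratio / (1.0 + true_ratio)           # convert ratio to proportion male
    std_err = math.sqrt(p_male * (1.0 - p_male) / n_births)
    low_p = p_male - z * std_err
    high_p = p_male + z * std_err
    # Convert the proportions back into males-per-female ratios.
    return low_p / (1.0 - low_p), high_p / (1.0 - high_p)


if __name__ == "__main__":
    # Illustrative population sizes only (assumed, not taken from the data).
    for n in (1_000, 10_000, 100_000):
        low, high = chance_range(1.05, n)
        print(f"{n:>7,} births: {low * 1000:,.0f} to {high * 1000:,.0f} males per 1,000 females")
```

The smaller the population, the wider the range a single year’s ratio can fall in without anything unusual going on, which is one reason a single snapshot tells you so little and a trend over several years is needed.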

This brings me on to the data that The Indie obtained from the Office for National Statistics and the first thing to note is that, as seemed to be the case from the text of Steve Connor’s article, the census data on which the newspaper’s analysis – a term I use here in the loosest sense possible – is based relates solely to families with dependant children, and to be clear the legal definition of a dependant child in the UK is any child for whom a parent or guardian is notionally entitled to claim child benefit, irrespective of the recent changes introduced by the government in relation to high earning families. That includes any children under the age of 16 but also any children aged 16-19 who are still in full time education at a 6th form or Further Education college or who are undertaking approved training – and from what I can tell these appear to be courses intended to prepare young people for apprenticeships – but not those in Higher Education, i.e. attending a university, nor those that actually have a job or are at least notionally able to claim welfare benefits, irrespective of whether they actually receive any money from the state.

There is, therefore, a hell of a lot that we simply do not know about these families and their dependant children.

We don’t know how old any of these dependant children are or when and where they were born; the only data provided in relation to the country of birth is for the male and female parents, not for any of the dependant children.

In the case of the migrant populations for which data is given, we don’t know when any of the adult parents originally came to the UK, or whether they arrived here as a single person or as part of an established or married couple. We don’t know whether they came to the UK as economic migrants, or as refugees, or with the intention of getting married and with a marriage partner waiting for them on their arrival.

We don’t know how many migrant families arrived in the UK with or without existing children, whether they were subsequently joined by any children that they had and who were still living overseas at the time they came to the UK, or whether any of these families have children that are still living overseas, or even how many children may have been born into any of these families since they came to the UK.

We don’t know if there are any non-dependant children that were born into these families but which do not show up in the census figures because they are now non-dependant adults who may either still be living in the family home or have moved out into a home of their own, either permanently or because they are attending a university.

We also don’t know if there are any other children who were born into the family who are not usually resident in the family home because they live with relatives, or have been taken into care or have been adopted, or even whether there are any children who were born into the family but who have since passed away, whether this may have occurred overseas or in the UK.

And, last but by no means least, we know that in the case of at least two of the nationalities for which country specific information was obtained from the ONS, Bangladesh and Pakistan, there is a known problem of uncertain scale arising from the practice of shipping minors back to the old country to take part in arranged marriages, irrespective of whether this occurs with or without the express consent or, in some cases, even knowledge of the young person for whom a marriage has been arranged, an issue which is known to disproportionately impact on female children, although the limited evidence we do have on this practice does suggest that maybe up to 15% of such marriages involve male children leaving the UK.

These are all problems, and potential sources of confounding, which arise purely from The Independent having chosen to obtain a data set which is wholly inappropriate for the purpose for which it was obtained, even before we come to consider the deficiencies in the data itself.

It is not possible, for example, to calculate a male-female sex ratio for all dependant children in all families for a specific population because the data set does not include a breakdown by sex for families in which there is only one dependent child nor for families with four or more dependent children. In this latter group, there is also no breakdown of the exact number of families with any specific number of dependent children above four; four or more is all we get, so we can’t even calculate the exact number of dependent children covered by the entire dataset.

There is a near complete breakdown by sex for families with two dependent children, so we know how many families have two male children, two female children and one of each, with separate figures given for families where the oldest dependent child is either male or female, but there is no male-female breakdown given for families with same-sex twins. As for the data for families with three children, the breakdown provided covers families with either three male or three female children, families where the oldest two children are either male or female, which would presumably mean that the youngest child is of the opposite sex to the other two, and families where just the oldest child is either male or female, from which we can infer the sex of the second oldest child as being the opposite to the oldest but nothing at all about the sex of the youngest, which could be either male or female. So, as with the data for families with just one dependent child or four or more dependent children, we cannot actually calculate exactly how many children in the families with three dependants are either male or female.
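To make the limitation concrete, here is a minimal Python sketch of what can and cannot be computed from breakdowns shaped like the ones just described. The category names and the example counts are hypothetical, chosen by me for illustration; they are not the ONS column names or real figures, and same-sex twins are left out because no breakdown is given for them.

```python
def two_child_ratio(mm: int, ff: int, mf: int, fm: int) -> float:
    """Males per female across two-child families.

    mm / ff: families with two boys / two girls.
    mf / fm: mixed families where the oldest child is male / female.
    """
    males = 2 * mm + mf + fm
    females = 2 * ff + mf + fm
    return males / females


def three_child_ratio_bounds(
    mmm: int, fff: int,
    mm_then_f: int, ff_then_m: int,
    m_f_unknown: int, f_m_unknown: int,
) -> tuple[float, float]:
    """Lowest and highest possible males-per-female ratio for three-child families.

    mm_then_f / ff_then_m: oldest two the same sex, youngest presumed opposite.
    m_f_unknown / f_m_unknown: oldest two of opposite sexes, youngest unknown.
    """
    unknown = m_f_unknown + f_m_unknown   # children whose sex cannot be inferred
    males_known = 3 * mmm + 2 * mm_then_f + ff_then_m + m_f_unknown + f_m_unknown
    females_known = 3 * fff + 2 * ff_then_m + mm_then_f + m_f_unknown + f_m_unknown
    lowest = males_known / (females_known + unknown)    # every unknown child is a girl
    highest = (males_known + unknown) / females_known   # every unknown child is a boy
    return lowest, highest


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"two-child ratio: {two_child_ratio(1000, 900, 1000, 950):.3f}")
    low, high = three_child_ratio_bounds(100, 90, 120, 110, 130, 125)
    print(f"three-child ratio lies somewhere between {low:.3f} and {high:.3f}")
```

With two-child families a single figure drops out; with three-child families the unknown youngest children leave only a range, and with made-up counts like these that range is wide enough to swallow any anomaly you might hope to detect.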

This, of course, severely limits the extent to which any kind of analysis of male-female sex ratios can be carried out using this dataset but unlike the problems of confounding, which are inherent in the data itself, the problems here seem most likely to stem from rank incompetence in framing the questions put to the ONS when requesting the dataset, unless it turns out that the ONS has deliberately chosen to interpret those questions in the most obtuse manner possible in order to screw the Indie over, which has been known to happen with other agencies – one of the councils I contacted for data on the absurd DWP ‘lie detector’ tried that on with me – but is not something I could imagine the ONS getting up to.

So, if what you were expecting from this article is the usual shedload of graphs and figures demonstrating the flaws in the Indie’s analysis of the data it got from the ONS then you’re going to be disappointed. After looking closely at the data and running a few numbers from what little there is that can actually be worked with, it is entirely apparent that it is simply not possible to extrapolate anything that might remotely resemble a valid assessment of the male-female sex ratios at birth for any of the populations or population sub groups for which data can be extracted from the dataset without putting in several weeks, if not months, of detailed research just to try and unpick the numerous potential sources of bias and confounding in the dataset and arrive at any kind of reasonable set of statistical controls that might allow me to correct the data for any of these potential sources of error.

What we have here is nothing more than a classic case of garbage in producing, in the case of the Indie’s article, garbage out, but one thing that I can say with near absolute certainty here is that, based on this dataset, the Indie’s assertion that there are only two plausible explanations for the anomalies it thinks it’s found and that, in several cases, there are anomalies that are only explicable in terms of sex selective abortion is complete and utter rubbish. In fact, easily the most puzzling statement in the entire article is this one, which makes claims about anomalies being found in sex ratios that apparently persisted across all family sizes:

The latter phenomenon might explain most of the gender imbalances we observed in two-child families, said Christoforos Anagnostopoulos, a lecturer in statistics at Imperial College London. However, it could not explain some sex-ratio anomalies that persisted across families of all sizes, notably for mothers who were born in Pakistan, Bangladesh and Afghanistan.

But, as I’ve just pointed out, it is simply not possible, using the Indie’s dataset, to calculate anything approaching an accurate male-female sex ratio for dependent children in families of any size other than those with two children because the breakdowns by sex for other family sizes in the data set are either entirely missing or incomplete and include ambiguous data, which means that it is not possible to calculate the exact number of male and female children covered by those elements of the dataset.

Having said that, and for the avoidance of any doubt, I must stress as a matter of personal opinion that I think it highly unlikely, if not impossible, that the professional statistician named in the first sentence of that paragraph is responsible for the wholly inaccurate assertions made in the second sentence which I strongly suspect can only be either a reflection of the personal opinion of the journalist whose byline is on the article or an editorial line imposed by the newspaper itself.

There is one final matter to pick up here, which I personally find rather troubling and it relates directly to an aspect of the data set obtained by the Indie that I haven’t yet discussed in any depth, this being the breakdowns that the newspaper requested from the ONS by the region or country of birth for the male and female parents resident in the households with dependent children.

The complete list of individual regional and national breakdowns in the dataset covers the following birthplaces: the UK, the Rest of Europe, China, South Korea, the Rest of East Asia, Afghanistan, Pakistan, India, Nepal, Bangladesh, Sri Lanka, the Americas and Caribbean and the Rest of the World.

So from that list it is quite obvious that the Indie has not just gone on a general fishing expedition but has actually gone out of its way to target first generation migrants from a number of specific East and South Asian countries, a list that does indeed include countries where sex selective abortion is known to be practised and where it is clearly having an impact on the male-female sex ratio at birth; both India and China have male-female sex ratios at birth of around 1,120 males per 1,000 females which is, of course, well above the 1,030-1,080 males per 1,000 females range which is generally seen as being the natural range for humans. However, the same issues are not evident in the same ratios for the other countries of birth for which specific data was requested, despite their geographical proximity to either India, China or both. The current male-female sex ratio at birth in Afghanistan and Pakistan is around 1,050 males per 1,000 females, which is pretty much the same as the UK. For Bangladesh, Sri Lanka and Nepal it is currently around 1,040 males per 1,000 females and only South Korea has a higher male-female ratio of 1,070 males to 1,000 females (source: CIA World Factbook).

Now in the Indie’s article India and China do get a mention but in the case of China this is only in the context of the male female sex ratio at birth and the practice of sex selective abortion in that country:

Abortions based solely on gender are illegal in Britain and in many other countries, even those where the practice is widespread. In parts of India and China there are now as many as 120 or 140 boys for every 100 girls despite a ban on sex-selective abortion.

India does get a second mention, alongside Nepal, this time in terms of ‘evidence’ gleaned from the ONS data set but we’re told only that there could be a problem but that there is insufficient data to be sure:

There is also some statistical evidence to suggest that gender-based abortions may also be occurring among women living in England and Wales who were born in India and Nepal – although there is insufficient data to confirm this effect, he said.

That’s a slightly odd way of putting it, given that families with dependant children and at least one parent who is a first generation migrant from India provide either the third or fourth largest set of regional/national data in the whole dataset, behind the UK, Pakistan and, where it is the female parent who is a migrant, the Rest of Europe, but that may just come down to a quirk in the statistical analysis carried out on the data.

However, the three countries that the article specifically points to and which most would assume, from the rather lurid headline claim about anything from 1,500 to 4,700 ‘lost girls’, to account for the majority of that figure are Afghanistan, Pakistan and Bangladesh, none of which are countries where sex selective abortion is thought to be practised to any particular extent. These are after all, as we have already seen, the three countries where the article half claims and half implies that there are sex-ratio anomalies that persist across all family sizes that cannot plausibly be explained by anything other than sex-selective abortion.

Now I think we’ve already established that that particular claim is utter nonsense based on the extensive array of confounding factors to which the dataset is potentially subject and the limitations of the data in terms of the male-female breakdowns it contains, which mean that we cannot make any kind of reliable estimates of the sex ratios across all family sizes but, on top of that, in the specific case of Pakistan and, perhaps to a lesser extent, Bangladesh, we have a very large elephant sitting in the middle of the room in the shape of unknown numbers of young women and men, including minors, leaving the UK each year to take part in arranged, or in some cases forced, marriages.

But what about Afghanistan?

Well, here’s where we run into a very troubling issue. If we look at just the data for families with two dependent children then the data for families with male and/or female parents who are first generation Afghan migrants look very similar in most respects to the data for families with male and/or female parents who are first generation Chinese migrants. The numbers of families with two dependent children are very similar where the male parent is a migrant (4205 for Afghanistan, 4228 for China) although there’s a more substantial difference when it comes to those with migrant female parents (3957 for Afghanistan, 6496 for China) but otherwise, once you get into calculating a few male-female ratios there really isn’t too much to choose between the two. For example, the overall male-female ratio for all families with two dependants and a first generation male Afghan parent (excluding same-sex twins) is 1,135 males to 1,000 females while for those with a first generation male Chinese parent it’s 1,132 males to 1,000 females – and both are much higher than the same ratio for UK born male parents (1,044 male to 1,000 female) and for all families with two dependent children (1,048 male to 1,000 female). These similarities persist across pretty much every male-female ratio you can extract from the data for families with two dependent children, give or take the fact that Afghan families appear to be a little more likely to have two boys than Chinese ones, while the Chinese families in the dataset are more likely to have had a female child followed by a male child.

In fact, the three populations with the highest proportion of families with two children, one male and one female, in which the female child is the oldest of the two are those which originated in Nepal (although the total number of families is too small to provide reliable sex ratio estimates), India and China.

Does that not strike you as odd?

Why is the Indie specifically pointing a finger at first generation Afghan migrant families but not at those with a parent born in China when the ‘evidence’ of sex ratio anomalies in both groups is very similar, and in some cases even stronger for the Chinese population – and I do mean ‘stronger’ only in the sense of how it looks on paper if, like the Indie and the journalist who wrote up this abomination, you don’t realise that the data itself is a complete pile of crap?

One hesitates to point this out explicitly but I dare say that a few of you have probably noticed by now that the three migrant population groups that the Indie is most clearly pointing to in this article, those with at least one parent born in Pakistan, Bangladesh or Afghanistan, all have one obvious thing in common which also differentiates them from the other countries for which specific information was obtained; they are all countries with a predominantly Muslim population.

Perhaps the best thing to say at this juncture is that whatever you think that might or might not indicate I couldn’t possibly comment.

However, even leaving that to one side and to your own counsel, it is pretty obvious here that the Indie’s entire article stems from what has to be easily the most spectacularly misconceived and shoddily executed fishing expedition since Captain Ahab went off chasing Moby Dick – and yes, I know that’s fiction, but then so is the Indie’s entire fucking analysis – so there are, perhaps, quite a few questions here that the Indie and its Science Editor, Steve Connor, probably need to be answering if they’re to scrape up any remaining shreds of credibility they might think they have in the wake of this entire sorry exercise.

To be brutally frank, I am genuinely at a loss to understand exactly what the Independent thinks it’s playing at here as the only halfway plausible explanation I can think of for any of this is that someone at the paper has run across Dubuc and Coleman’s study, failed miserably to understand any of it, least of all the entire methodology section, and then took an utterly half-arsed punt at trying to pull off their own similar analysis without realising that they hadn’t got the first fucking clue what they were doing. I’m really not even sure that the Dunning-Kruger effect is enough to cover the staggering level of incompetence behind the Indie’s article.

And on that bombshell…