Towards a Scientific Study of Sexual Violence

Mar 10, 2026

In Britain, thousands of girls have been sexually abused by organised groups of men. Vulnerable girls were groomed with gifts and affection, isolated from their friends and family, then plied with alcohol and drugs, bullied, raped, and trafficked as part of large criminal networks. Yet when victims, journalists and researchers sought help or pressed charges, they were dismissed and derided.

Baroness Casey’s National Audit on Group-based Child Sexual Exploitation and Abuse (CSE) (2025) documents chronic failures of policing, social services, local councils, and political leadership. Police and local governments repeatedly denied abuse; dismissed victims; suppressed whistleblowers; failed to record basic offender characteristics; cited fear of community tensions and accusations of racism as reasons for inaction; and resisted external scrutiny by obstructing public access to data.

Absent a full inquiry, the true scale of abuse remains unknown. Where suspect ethnicity has been recorded, such as in Greater Manchester, South Asians are over-represented relative to their share in the local population. But for the vast majority of cases, the police do not report ethnicity. The lack of systematic data means that political discourse becomes marred by anecdotes, speculation, and aggressive vitriol.

Below, I present a pathway forward, showing how standard methods can answer the most important questions. Any government committed to gender equality and women’s safety will doubtless provide full support.

BBC’s ‘Three Girls’ - a three-part drama based on real abuse at Rochdale. Nine men were convicted for grooming girls (aged 13-15), plying them with alcohol, and passing them between men who raped them for cash, over several years.

Estimating ethnicity when records are incomplete

The lack of data is not insurmountable. Economists studying immigration in the United States routinely infer ethnic origins from names linked to large population datasets. Ran Abramitzky and Leah Boustan use historical census data and surnames to study how Jewish and other European immigrants assimilated over the twentieth century. Because names carry persistent signals of cultural origin, researchers can identify group-level patterns even when ethnicity itself was never directly recorded.

Muhammad Usama Polani (my collaborator) and I suggest applying a similar approach to British data on group-based child sexual exploitation.

The 2011 Census for England and Wales contains around 56 million records linking names with self-reported ethnicity. Kandt and Longley (2018) developed an ‘Ethnicity Estimator’, which gives the probability that a particular name belongs to each ethnic group. Such models cannot determine an individual’s ethnicity, but given 1,000 names, they can estimate what percentage probably belong to each group.
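A minimal sketch of the aggregation step, assuming a lookup table of name-to-group probabilities is already available. The table `NAME_PROBS`, the placeholder names, the group labels, and every number below are invented for illustration; they are not taken from Kandt and Longley's estimator or from any census data.

```python
from collections import defaultdict

# Hypothetical lookup: P(group | name). All values are made up for illustration.
NAME_PROBS = {
    "name_a": {"group_1": 0.9, "group_2": 0.1},
    "name_b": {"group_1": 0.2, "group_2": 0.8},
    "name_c": {"group_1": 0.5, "group_2": 0.5},
}


def estimate_group_shares(names, name_probs):
    """Average per-name probability vectors to get expected group shares.

    Returns (shares, matched), where matched is the number of names
    found in the lookup. Unmatched names are skipped rather than guessed.
    """
    totals = defaultdict(float)
    matched = 0
    for name in names:
        probs = name_probs.get(name.strip().lower())
        if probs is None:
            continue
        matched += 1
        for group, p in probs.items():
            totals[group] += p
    if matched == 0:
        return {}, 0
    return {group: total / matched for group, total in totals.items()}, matched


shares, matched = estimate_group_shares(
    ["name_a", "name_b", "name_c", "not_in_lookup"], NAME_PROBS
)
print(matched, shares)
```

The output is a statement about the list as a whole, never about any one person on it. Reporting `matched` alongside the shares matters, because a low match rate would itself bias the estimate.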

This approach has already been used to fill gaps for English hospital admissions. Kandt and colleagues successfully used name-based predictions, validating against records where ethnicity was recorded.

Geography further improves accuracy, as shown by Elliott and colleagues (2009). Name-based probabilities are more informative when combined with data on where someone lives and the ethnic composition of that neighbourhood. Since Bradford is home to many British Pakistanis while Tower Hamlets has more British Bangladeshis, an “Ahmed” from Bradford is statistically more likely to be Pakistani than an “Ahmed” from Tower Hamlets. Subsequent work has provided improvements.
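One way to sketch this combination is a simple Bayesian update, assuming that name and neighbourhood are independent once the group is known. Under that assumption, P(group | name, area) is proportional to P(group | name) × P(group | area) / P(group). The function name, the group labels, and all numbers below are hypothetical; this is a simplified illustration of the idea, not a reproduction of Elliott and colleagues' procedure.

```python
def combine_name_and_area(p_given_name, p_given_area, p_prior):
    """Combine name-based and area-based probabilities for one person.

    Assumes name and area are conditionally independent given group,
    and that every prior probability is non-zero.
    """
    unnormalised = {
        group: p_given_name[group] * p_given_area[group] / p_prior[group]
        for group in p_given_name
    }
    total = sum(unnormalised.values())
    return {group: value / total for group, value in unnormalised.items()}


# Illustrative numbers only: the same name, seen in two different areas.
name_probs = {"group_1": 0.6, "group_2": 0.4}
national_prior = {"group_1": 0.5, "group_2": 0.5}
area_x = {"group_1": 0.8, "group_2": 0.2}  # area where group_1 is common
area_y = {"group_1": 0.2, "group_2": 0.8}  # area where group_2 is common

post_x = combine_name_and_area(name_probs, area_x, national_prior)
post_y = combine_name_and_area(name_probs, area_y, national_prior)
print(post_x, post_y)
```

The same name yields a higher probability for `group_1` in `area_x` than in `area_y`, which is the logic of the “Ahmed” example above.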

Suffixes - like ‘-ewski’ and ‘oğlu’ - also provide clues about cultural origins. Sood and Laohaprapanon (2018) trained such a model on millions of names, achieving around 85% accuracy in predicting ethnicity.
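To illustrate what “suffix clues” mean in practice, here is a small sketch of one possible feature-extraction step: pulling the trailing character sequences from a surname so that a classifier could learn from them. The function `suffix_features` and the example surnames are hypothetical, and this is not Sood and Laohaprapanon's model, only an illustration of the kind of signal such a model can pick up.

```python
def suffix_features(surname, min_len=2, max_len=5):
    """Return the trailing character sequences of a surname, shortest first."""
    s = surname.strip().lower()
    longest = min(max_len, len(s))
    return [s[-n:] for n in range(min_len, longest + 1)]


# Hypothetical example surnames.
print(suffix_features("Kowalewski"))
print(suffix_features("Yılmazoğlu"))
```

A trained model would weight thousands of such fragments jointly, rather than relying on any single suffix.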

Analysing UK police data

These methods could be applied to the UK’s national police datasets on child sexual exploitation - including the earlier CPAI database and the more recent COCAD system. By linking suspect names to large population datasets, we can estimate the ethnic distribution of suspects, expressed as the probable share belonging to each group.

Some police forces do record ethnicity. These cases are tremendously valuable because they allow the probabilistic estimates to be validated and refined.
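A rough sketch of how such a validation could work, assuming a subset of cases where the recorded ethnicity and the name-based probabilities are both available. It compares observed and predicted group shares and computes a Brier score (lower is better). The function name, record layout, and all values are invented for illustration.

```python
def validate_predictions(records):
    """Compare name-based predictions with recorded groups.

    records: list of (recorded_group, predicted_probs) pairs.
    Returns (observed_shares, predicted_shares, brier_score).
    """
    n = len(records)
    groups = set()
    for recorded, probs in records:
        groups.add(recorded)
        groups.update(probs)

    observed = {g: sum(1 for r, _ in records if r == g) / n for g in groups}
    predicted = {g: sum(p.get(g, 0.0) for _, p in records) / n for g in groups}

    # Multi-class Brier score: mean squared gap between probability and outcome.
    brier = sum(
        sum((p.get(g, 0.0) - (1.0 if r == g else 0.0)) ** 2 for g in groups)
        for r, p in records
    ) / n
    return observed, predicted, brier


# Made-up validation records.
records = [
    ("group_1", {"group_1": 0.9, "group_2": 0.1}),
    ("group_2", {"group_1": 0.2, "group_2": 0.8}),
    ("group_1", {"group_1": 0.6, "group_2": 0.4}),
    ("group_2", {"group_1": 0.3, "group_2": 0.7}),
]
observed, predicted, brier = validate_predictions(records)
print(observed, predicted, brier)
```

If predicted shares drift away from observed shares in the validated subset, the name model can be recalibrated before it is applied to forces that recorded nothing.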

Historical reconstruction

This approach could also be extended historically. Although COCAD was created in 2023, it draws on police data that goes much further back. If the earlier CPAI records also contain suspect names, the same ethnicity classifiers could be applied to those records.

Unfortunately, earlier cases were not classified consistently. But analysts have already read and labelled COCAD cases, probably adding explanatory notes. AI could learn from these examples and classify older CPAI case narratives in a consistent style. Human analysts could then validate the model’s accuracy before applying it across the full dataset.

Institutional failures and attrition

The 12-week COCAD snapshots suggest substantial attrition. Within this timeframe, only around 5% of suspects are charged or summoned, while roughly 30% of cases are closed because victims do not support prosecution and 23% because of insufficient evidence. Casey’s National Audit documents similar institutional failures.

All this can be studied systematically: which cases moved forward, and which stalled. Did prosecution rates vary across regions or ethnic groups?
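As a sketch of what this could look like, assuming case records with a region and a final outcome, the snippet below tabulates outcome rates for each region. The field names, outcome labels, and example cases are hypothetical and do not reflect the actual structure of COCAD or CPAI.

```python
from collections import Counter, defaultdict


def outcome_rates_by(cases, key):
    """Return {key_value: {outcome: share of cases}} for a list of case dicts."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[key]][case["outcome"]] += 1
    rates = {}
    for value, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        rates[value] = {o: c / total for o, c in outcome_counts.items()}
    return rates


# Invented cases, for illustration only.
cases = [
    {"region": "region_a", "outcome": "charged"},
    {"region": "region_a", "outcome": "victim_withdrew"},
    {"region": "region_a", "outcome": "victim_withdrew"},
    {"region": "region_a", "outcome": "insufficient_evidence"},
    {"region": "region_b", "outcome": "charged"},
    {"region": "region_b", "outcome": "insufficient_evidence"},
]
rates = outcome_rates_by(cases, "region")
print(rates)
```

The same function could group by any other recorded field, such as year, to see whether attrition changed over time.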

If the government were to provide access to older CPAI records that could be reclassified and linked to outcomes, researchers could examine whether patterns changed over time, such as in response to greater scrutiny.

Court records & victims’ testimonies

Court records provide another rich source of evidence. Roughly 400 cases of group-based child sexual exploitation have gone to court in Britain. Their transcripts contain detailed evidence on offenders’ coordination and institutional responses.

Large language models could help extract structured information from these transcripts, identifying recurring patterns. Since the number of cases is manageable, the resulting dataset can be cross-checked by humans.

Survivors’ testimonies also remain paramount, alongside nationally representative surveys measuring attitudes towards gender norms, victim-blaming and whistleblowing.

Evidence and Accountability

Building a transparent national dataset is an essential step - towards evidence-based discussions, greater institutional accountability, and ultimately stronger protection for vulnerable children.

I write this in support of all victims, everywhere.