How Big Data Can Put War Criminals Behind Bars

Ed. note: This post originally appeared in techPresident and is reprinted with permission.

By Federico Guerrini

On September 20, 2013, the former director of the National Police of Guatemala, Col. Héctor Bol de la Cruz, and his subordinate Jorge Alberto Gómez López were convicted of the abduction and presumed murder of student and labor leader Edgar Fernando García, who disappeared in 1984 during the conflict that devastated the Central American country between 1960 and 1996. Three years earlier, two lower-ranking officers had also been convicted of the crime.

The convictions were made possible by the work of the Human Rights Data Analysis Group (HRDAG), a San Francisco-based nonprofit that uses statistical analysis to support the cause of human rights. “In 2010, documents in the national police archive were discovered that linked police officers at the time to his disappearance,” HRDAG’s director of research Megan Price tells techPresident. “The defense said they were fabricated, but we showed, with a statistical analysis, that they shared many attributes with the other documents in the archives,” meaning that they were almost certainly authentic.

HRDAG’s analysis also made clear that the files related to García’s disappearance showed a higher level of communication with higher-ranking officers than the average document in the archive. “There was this real evidence of a pattern: that higher ranking officers within the police structure were aware of García’s disappearance,” Price explains.

That’s why, when the judges came back with their verdict in 2010 and sentenced the two lower-ranking police officers to 40 years in jail, they also asked the attorneys to investigate further. Three years later, Bol de la Cruz and Gómez López were convicted as well.

The murder of García is not the only case in which HRDAG has made a significant impact. The group’s history dates back to at least 1991, when HRDAG’s executive director, Patrick Ball, while working for an NGO in El Salvador, started to build two databases that listed victims and perpetrators of the violence that had devastated the country for the previous 12 years. One database contained witness testimonies of violence and abuse. The other contained the career histories of the 400 most senior Salvadoran military officers: what jobs they held, which units they commanded, and when. Combining the two, together with a third database provided by the Los Angeles-based NGO El Rescate, made it possible to identify the 100 officers involved in the worst human rights violations.

Ball continued in the field of human rights, working in Guatemala for the American Association for the Advancement of Science (AAAS) and using a technique called Multiple Systems Estimation (MSE) to estimate the probable number of killings in the country. He came up with 132,174 killings (the estimate only took into account the murders that took place between 1978 and 1996, as there wasn’t enough data to calculate the earlier years), a much higher figure than the one that could have been obtained by simply enumerating the deaths recorded by official sources. Ball’s analysis also highlighted that the violence was directed especially at a certain ethnic group: Guatemala’s indigenous Mayans. This result, together with other evidence, was the foundation for the May 2013 verdict in which a Guatemalan court found former head of state José Efraín Ríos Montt (under whose presidency, in 1982-1983, the persecution of Mayans was especially fierce) guilty of genocide.

The precondition for using MSE is having multiple databases that document the same thing (for instance, lists of killings provided by different NGOs). Then you identify the duplicates — names that appear in more than one list — and from the number of “overlaps” you are able to provide estimates.

To understand the process, take fishing as an example. If you cast your line several times in a pond and often catch the same fish, it’s probably because there are not too many fish in the water. If, on the contrary, you never catch the same fish twice, you can reasonably assume that the overall number is quite high.

You also have to consider what’s called “selection bias,” which arises because the numbers are not collected randomly. “In this kind of approach, you often rely on lists collected by different groups, but each group is going to collect a limited set of cases, whether it’s by region or by another criterion,” Jay Aronson tells techPresident. He is the founder of Carnegie Mellon’s Center for Human Rights Science, which is currently partnering with HRDAG to improve mass casualty estimation in Syria. He explains, “Let’s say you have an NGO that works in a particular governorate in Syria. They’re not going to collect a lot of cases from a governorate that’s far away. Or in Latin America, for instance, the Church or Catholic NGOs might collect data. They might not explicitly be collecting only cases of Catholic people, but that might be the bias of their data. Many non-Catholics may avoid going to the Catholic Church, or they may just go to a different group because that’s what they’re familiar with.”
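The fishing analogy corresponds to the simplest form of capture-recapture, the two-list Lincoln-Petersen estimator, which can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not HRDAG’s actual method: real MSE work uses more than two lists and statistical models that account for exactly the kind of selection bias Aronson describes, whereas this estimator assumes the two lists are independent samples. The function name and the record IDs below are hypothetical.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list capture-recapture estimate of the total population size.

    Assumes the two lists are independent samples of the same population
    and that duplicates have already been matched to a common ID.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)  # records that appear on both lists
    if overlap == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    return len(a) * len(b) / overlap


# Hypothetical record IDs for illustration only, not real data.
list_a = {"v01", "v02", "v03", "v04", "v05", "v06"}
list_b = {"v04", "v05", "v06", "v07", "v08", "v09", "v10", "v11"}

documented = len(list_a | list_b)             # unique records actually observed
estimate = lincoln_petersen(list_a, list_b)   # estimated total, including unobserved

print(f"documented: {documented}, estimated total: {estimate:.0f}")
```

With these made-up lists, 11 distinct records are documented but the estimate is 16, because only 3 of the records overlap. The smaller the overlap relative to the list sizes, the larger the estimated number of cases that no list captured.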

HRDAG also applied its statistical skills to the conflict in Kosovo, where a policy of ethnic cleansing had been enacted by Yugoslav forces against Kosovar Albanians. The government led by Slobodan Milošević claimed that, on the contrary, the mass migration of refugees from Kosovo to Albania was caused by the NATO bombings and the military actions undertaken by the Kosovo Liberation Army.

After the humanitarian crisis was over, HRDAG filed an expert report that was used in the trial of Milošević on war crimes charges at the ICTY (International Criminal Tribunal for the Former Yugoslavia) in The Hague.

The report examined the claims by the Yugoslav government and found that they were inconsistent with the data collected in the field. Examining Kosovo’s border records, exhumation data and other information, Ball was able to establish that the timing of the bombings and that of the flight of refugees simply didn’t match.

The most likely explanation for the migration, and the only hypothesis supported by all the data, was that Serbian authorities had planned and implemented a centrally organized campaign to ethnically cleanse certain regions of Albanians, and that is what Ball testified in court. Unfortunately for the trial, Milošević died in custody before a verdict was reached.

The group has worked in myriad other places. In Timor-Leste, the organization advised the Commission for Reception, Truth and Reconciliation in East Timor (CAVR), which was tasked with investigating human rights abuses committed during the 24 years (1975-1999) of Indonesian occupation. In Chad, they helped Human Rights Watch and other NGOs uncover the political violence of former president Hissène Habré’s regime. They also worked in Peru, the Congo, and Sierra Leone. Until June 2013, the team worked for the U.N. preparing a study of the number of reported killings in Syria from the beginning of the conflict in March 2011 through April of last year. The analysis was based on records from eight data sources and resulted in an enumeration of at least 92,901 reported killings.

Over the years, the team also acquired a more precise identity. The name Human Rights Data Analysis Group was used for the first time in 2002, in a grant request to the John D. and Catherine T. MacArthur Foundation. “Then, from 2003 to 2013 we were part of Benetech, a non-profit technology company and since last year, we tried to spread our own little nonprofit wings,” says Price.

Working with data is a sensitive task: your conclusions, though scientifically correct, might be used in the wrong way, intentionally or unintentionally.

This helps to explain why last year, while HRDAG was working for the U.N. on an account of the casualties in Syria, Ball turned down a journalist’s request for access to the “raw data” of the investigation in order to create a map.

The raw data, Ball said, could be misleading, as it said nothing about the underlying pattern of deaths, such as how the number of reported deaths related to the number of actual deaths. Suppose you had 100 recorded deaths on Thursday, 120 on Friday, and 80 on Saturday, and you concluded that there was a peak on Friday just because the observed number was the highest. The real explanation could simply be something unrelated to the number of actual deaths: maybe on Friday the team worked particularly well on reporting atrocities, or maybe on Saturday some researchers were ill.
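A small sketch can make this arithmetic concrete. It takes the three observed counts from the example and applies invented daily reporting rates (the fraction of actual deaths that end up recorded). The rates are purely hypothetical; in practice they are unknown and have to be estimated, which is the whole difficulty.

```python
# Observed (reported) deaths per day, taken from the example in the text.
observed = {"Thursday": 100, "Friday": 120, "Saturday": 80}

# Hypothetical reporting rates: the share of actual deaths that got recorded.
# These values are invented for illustration; real rates are unknown.
reporting_rate = {"Thursday": 0.50, "Friday": 0.75, "Saturday": 0.25}

# Actual deaths implied by each day's count under the assumed rates.
implied = {day: observed[day] / reporting_rate[day] for day in observed}

peak_observed = max(observed, key=observed.get)
peak_implied = max(implied, key=implied.get)

for day in observed:
    print(f"{day}: observed {observed[day]}, implied {implied[day]:.0f}")
print(f"peak in raw data: {peak_observed}; peak under assumed rates: {peak_implied}")
```

Under these assumed rates the raw data peaks on Friday, while the implied number of actual deaths peaks on Saturday. A map drawn from the raw counts alone would show the wrong pattern.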

It’s easy to draw the wrong conclusions from seemingly unequivocal data, even if you act in good faith. Imagine what can happen when the figures are falsified on purpose. “Like many tools employed in international and domestic politics,” Kelly M. Greenhill, author of Sex, Drugs and Body Counts, tells techPresident, “statistics and data analyses are double-edged swords. They can be invaluable in helping policymakers make hard choices. They can be critical aids in uncovering truth during both conflict and post-conflict periods, and in defending truth against future revisionism. Unfortunately, these very virtues make them ripe for manipulation and exploitation.”

In her book, Greenhill offers a number of examples. In Kosovo, the number of Kosovar Albanians displaced or missing varied, in politicians’ statements, from 250,000 to 100,000 to 10,000, according to the needs of diplomacy (the more deaths you had, the easier it was to justify NATO’s intervention). Similarly, in 1996, in the aftermath of the Rwandan genocide, the number of refugees that had disappeared (700,000, 500,000 or 200,000) was hotly debated and seemed to sway depending on a particular country’s geopolitical goals. And these are just two of many cases. “It’s not even that data can be misused,” notes Aronson, who also co-authored a book called Counting Civilian Casualties. “It’s that there’s so much uncertainty in most data that you need to be very careful about how you produce the data and how you analyze it and how, as a scientist, you present your findings to the public.”

This is something the people at HRDAG are well aware of. “From our perspective,” Price says, “the solution to that is both to stay very close to the data, to be very conservative in your interpretation of it and to be very clear about where the data came from, how it was collected, what its limitations might be, and to a certain extent to be skeptical about it, to ask yourself questions like, ‘What is missing from this data?’ and ‘How might that missing information change these conclusions that I’m trying to draw?’”

It’s an extraordinarily difficult job. But the rewards are also immeasurable. “One of my favorite photographs is a wonderful photo of labor leader Edgar Fernando García’s daughter from the 2010 trial, right after the verdict was read,” says Price. At last, the young woman, who was an infant when her father disappeared in 1984, could see those responsible brought to justice and made to pay for their actions. “I feel like that’s why we do this work,” Price concludes. “To provide some sort of closure to individuals like that.”

Federico Guerrini is an Italian journalist. He covers technology for a variety of publications including ZDnet, La Stampa, l’Espresso, and il Corriere della Sera, among others. He blogs at and tweets as @fede_guerrini.

Editor’s Note: The article has been revised to clarify the sentence that noted the controversy over whether 700,000 Rwandans had disappeared.

Photo credit: war graves in Kosovo nh53/flickr

Personal Democracy Media is grateful to the Omidyar Network and the UN Foundation for their generous support of techPresident’s WeGov section.