The Hunt for Open Data in China

Ed. note: This item originally appeared in TechPresident and is reprinted with permission.

By Rebecca Chao

Like water and oil, ‘open data’ and ‘China’ take a bit of engineering if you want them to mix. Stories like those of human rights advocate Xu Zhiyong, arrested for rallying citizens to demand public disclosure of their officials’ wealth, are more the norm.

But rather than ask for information, a group of young techies are going out and finding it, despite the challenges in its use and the risks of digging too deep.

It is nearly midnight in Beijing when I finally connect with Cui Anyong on Skype, but he chatters energetically about open data and how he thinks the concept is really starting to pick up in China. Cui recently graduated from Hong Kong University, where he studied communications and new media and coordinated a crowdsourced Chinese translation of the Data Journalism Handbook, created by the European Journalism Center and the Open Knowledge Foundation. He also helped organize China’s first open data journalism event, held at the beginning of September, which sought to “build the future of news and civic information.” It sprouted from the Beijing chapter of Hacks/Hackers, a U.S.-based initiative to bring the media and technology sectors together in informal meet-ups. The initiative now has chapters all over the world, including in Latin America, Europe and Australia.

“People always say there is no open data in China,” said Cui. “Actually, there is open data.”

The government publishes water quality, air quality and earthquake data in near real time, roughly every four hours, explains Cui. The three main open data sites are Beijing Data, Data Shanghai and the National Bureau of Statistics. Cui admits that the data available is still “very limited,” but he is hopeful “it is the beginning” of something more.

Despite the limits on open data, with some ingenuity, Cui is able to put the data to good use.

“Have you heard of cancer villages in China?” Cui asks. I nod.

In some regions of China, like the small rural community of Xinlong that sits next to a large industrial center in Yunnan province, the water sometimes runs red and yellow, crops turn black and an alarming number of villagers have some form of cancer or other serious health problems. It is an issue to which the Chinese government, particularly at the local level, has turned a blind eye. In February, when the Ministry of Environmental Protection used the term “cancer village” in a report, the minister was later reprimanded for “making a mistake.” Officials soon sent out memos to provincial governments, advising them to curb the use of the term “cancer village.”

Cui is using open data culled from the Chinese government’s websites to see if he can detect any patterns between water quality and health. He aims to use the data sets to create and overlay two maps so that he can visualize and draw links between health and water quality, like cancer villages and polluted water sources. He hopes to finish his online app by next month.


A screenshot of Cui’s water map, still in its initial stages.

Others who want to work with open data turn to non-governmental sources.

“The development of the Internet industry is really good in China so lots of people scrape data from Weibo and the news websites,” says Cui. “You can find lots of articles talking about how to scrape or how to visualize the information flow on Weibo.”

One site, created by a group of doctoral students, maps earthquakes and the H7N9 bird flu. In May 2009, reporter Deng Fei created a cancer map via Google Maps. It has nearly 50 locations but, according to some reports, there are as many as 400 cancer villages in China. Deng also launched a Weibo campaign last winter, asking users to upload photos of pollution. It went viral.

Danger Maps, which has attracted much attention and received support from Alibaba Group, considered the eBay of China, uses a combination of government data and crowdsourcing. TechPresident previously wrote about the app. It plots the location of industrial facilities like power plants and toxic-waste treatment centers and allows users to search for polluting factories near their homes.

The proliferation of environmental maps is partly because the environment is one area where the government has slowly opened a small space for debate, criticism and sometimes even outright protest. One example is the October 2012 demonstration in Ningbo, which drew three thousand people who rallied against the expansion of a chemical plant. The government later agreed to stop the expansion.

Liu Yan is the creator of a hackerspace called Xindanwei, meaning ‘new work unit,’ which is a play on the government work units. She recently helped to coordinate a government-led climate change hackathon-type event. Liu refrained from calling it a hackathon because that term is not very popular with the Chinese government. The event was sponsored by the U.K. government with support from the Swiss and Chinese.

“When I heard about this project I was very excited,” said Liu. “This is the first time that the government is providing all this data to the start-up and creative community and is working together with them by providing data sets. Also, top researchers from all over China are providing insights and knowledge. I was super excited because for start-ups, this is really a very important and unique opportunity to be connected to these data sets.”

This was also the first time that academic researchers were stepping out of their ivory tower to connect with non-academics and co-create something, “rather than providing you with data and good luck with it,” explains Sophia.

Opening Open Data

Even when data is available, and when the government chooses not to shut down these initiatives, the data may sit there like canned food, requiring a special utensil to open the data or turn it into a readable format.

“The situation here is very different from the experience in the U.S.,” says designer and researcher Clément Renaud. “In the U.S., it is a very government-driven project. In China that does not work at all.” Renaud is originally from Lyon, France, but is currently in Shanghai completing a Ph.D. in media and communications, studying the way “online social practices are shaping a forthcoming economy of sharing.” He and a community of Chinese bloggers independently created a research center they call the Sharism Lab.

In China, open data is not promoted by the government, explains Renaud. “You really have to dig for it.” Since the country is so large, data is also sometimes “too big” with no “granularity.” The figures for China’s provinces and their cities “have lots of holes,” says Renaud.

Another issue is that data is simply not in a readable format. Chinese websites put data up in JPEG or other image formats, making data difficult to cull or ‘scrape.’

Bu Shujian, originally from Yangzhou in Jiangsu Province (famed for its fried rice, she tells me), now works as a data design analyst in San Francisco. In 2011, as part of a thesis project at Hong Kong University, she took pollution reports from the U.S. Embassy in Beijing and compared them with those released by the Chinese government. But she was unable to scrape the data from the Chinese government’s website.

“It was impossible,” she explained. But rather than throw up her hands, she consulted a colleague who wrote a script to capture images of the online data every couple of hours and convert these images into data. “I’m not saying people don’t want to do it,” says Bu. “It’s just not data friendly.”

Her hard work paid off, however, as she made some surprising discoveries.

“The results show that the [Chinese] government data is pretty accurate,” Bu said, which ran counter to what she had expected. She believes the hype around China’s pollution ratings could have been spurred by the media, who often pick up on the time stamps that reveal particularly large differences between what is reported by the U.S. Embassy and the Chinese government. The difference can be explained by different methods of collecting data, says Bu. The U.S. Embassy collects data only from a few points in Beijing while the Chinese government collects from over 27 locations. Since the city is so big, collecting from the most polluted areas could lead to skewed results.


A screenshot of Bu’s pollution charts.

Still, Chinese citizens have long mistrusted pollution data, pointing to the brown, exhaust-filled air as proof that the government underreports pollution levels.

This mistrust comes in part from the government’s efforts to close down competing sources of information. In June of 2012, China asked other governments not to release their air quality data, an action that experts believe was specifically directed at the U.S. Embassy.

Renaud explains that, indeed, most of the data in China is difficult to corroborate. “There’s no way to check it,” said Renaud. As a result, he’s been delving more into hardware, creating tools to produce data. “How can we design very cheap hardware, like sensors and data fields to produce data? That’s why there has been a push for open source hardware, to create devices that will generate data.”

Over in Shanghai, developers and designers gather regularly at the Make+ studio to create open source hardware, like an air quality reader and a cheap air filter that can easily be assembled with materials purchased from the e-commerce giant Taobao. Air filters in China normally cost between US$600 and $1,500.

Sophia Lin is the creator of Make+, which sprouted like a nesting doll from within another hackerspace created by developer David Li of Xin Che Jian, the first hackerspace in China. Lin is an artist but became involved in technology through new media art. While interacting with members of Xin Che Jian, she decided to create a space of her own to bring together artists and technologists.

“I thought, how about if I get some artists to collaborate with technologists who may not know much about art,” explained Lin. “Maybe some innovative ideas could come out of it.”

Renaud is working on a project he calls the Ether Mashup that can grab data from wifi and turn it into a film. It will be exhibited at Shanghai’s Smart City Biennale in late September. While Renaud explains how easy it is to capture data in China because of a lack of privacy laws, the project is also about the dangers of surveillance.

“Wifi is not encrypted. Everyone can look at wifi,” Renaud explained. “We built a very small device that can spy on wifi. You put it in a room where several people are online. You can look at everything they are looking at, broadcast it and turn it into a movie.”

Renaud explains further, “The point of this work is that it is so easy to do this. It’s about spying on anything. You don’t need to have a complicated tech background.”

Staying Within the Lines

While privacy issues are unfortunately less of a concern for those handling data in China, those using data do have to worry about crossing the blurred line between acceptable and unacceptable data use.

As I wrote in TechPresident previously, websites that step too much on the toes of the government are quickly shut down.

The other catch is that data, if used to challenge officials, can alarm the government and make it wary of releasing more data.

China is well known for a phenomenon called the “human flesh search” where netizens comb the Internet for information on corrupt officials, such as how many properties they own or pictures of them with expensive cars and watches. The government has subsequently tightened publicly available information, particularly at the local level.

Cui also explained, with regard to his project, that he too must tread carefully. “Water quality is very sensitive in China because the coasts bring in a lot of development for local governments.”

In terms of the future of open data, Cui is quite optimistic because of the deep interest in its use by civil society, particularly journalists.

The Hacks/Hackers meeting he organized in early September, to introduce journalists to developers and the concept of open data, had a tremendous turnout. Cui had expected a crowd of 30 but over 100 turned up, with a balanced mix of journalists, developers, data techies and designers.

The event was held in the Zhongguancun neighborhood of Beijing, often referred to as China’s ‘Silicon Valley,’ which is surrounded by a triangle of the country’s most prestigious universities: Peking, Tsinghua and Renmin. Tsinghua also has a well-regarded journalism school.

“China’s media is highly interested in data journalism,” said Cui. “They started seeking the help of technologists but they have no idea about what open data means so we needed to tell them how data works for journalism.”

On the other hand, says Cui, the Chinese developers generally are more rigid and have “practical” skills since they work for big Internet companies. “They are willing to do something for the public for society. They just need to know how they can do it and what is useful.”

Bu shares that sentiment. “The Chinese community is getting really interested in data journalism,” Bu says a number of times during our conversation. While Bu is currently working in data visualization and design, she also has a background in journalism and has worked for The Wall Street Journal, among others. Bu is part of a team that received a Magic Grant from the Brown Institute for Media Innovation. Her team will build a tool that tracks news websites in authoritarian countries to see what information is censored.

At least among young journalists and civic-hacking enthusiasts like Cui and Bu, there is a sense of longing to emulate the digital media work of those in the U.S. Both mentioned their hope to see data journalism in China reach a level comparable to that of Western media. “Snow Fall is very popular here in China,” says Bu of The New York Times’ Pulitzer Prize-winning interactive reporting piece.

When it comes to innovation, however, the route is easier for technologists than it is for journalists. When I asked if these pioneering data journalists would venture beyond environmental issues and dig into the financial records of corrupt high-level officials, Cui exclaimed, “I would not go to that length. It is still too dangerous.”