China's Data - PingWest

Kai-fu Lee, the chairman of Sinovation Ventures, has been making the media rounds with the publication of his new book on AI, China, and the US. In an interview with the Washington Post he invoked a well-worn analogy, saying that the role of data in AI is “like [that] of oil in powering an industrial economy.”

It’s been an oft-repeated claim, a pithy way of asserting a pivotal change in the world economy: data is the new oil, the digital fuel to power a digital economy. And in this analogy, China looms large, as it is presumed to have a trove of data that other countries can only envy. Lee did immediately go on to qualify that not all data is the same, but that only undercuts the analogy, and if the point is simply that data is valuable to business, then a great deal risks getting overlooked. Because data isn’t oil, and China’s data isn’t necessarily what one might think.

It is true that China has a superabundance of data. As with so much else, the size of the country amplifies most statistics into superlatives. China has more than 800 million internet users. WeChat has just over a billion accounts. Alibaba ships 10 million orders per day, and ecommerce accounted for 20% of all retail shopping in China as of last year. Users of the short video app TikTok spend an average of more than 50 minutes a day in the app. A report by Ericsson last year projected that as of 2023 China’s mobile broadband usage would amount to 15 exabytes per month, or 180 billion gigabytes per year. And the president of the Chinese Academy of Sciences estimates that in just two years’ time China will account for fully 20% of the world’s data.

It’s not just the quantity of data, though. Because of the way that China’s tech ecosystem is cloistered, there may be more opportunity for creating what Lee refers to as depth of data. A news aggregation service will collect data on what news stories users follow, while an ecommerce platform will know what they buy online; both of those datasets can be mined separately, but if joined together they can sometimes reveal further insights, with user behaviors on one side contextualizing and predicting those on the other. China’s tech leaders, like Tencent and Alibaba, have the ubiquity and the alliances to be able to pull together just such disparate datasets and build far deeper profiles of any given individual’s online behavior. (This might be a moot point, though, as recent privacy and data protection laws are actually designed to prevent just that and should, in theory, keep data siloed.)

Data Isn’t Oil

But the problem isn’t the quantity of data at China’s disposal, or even how integrated it might be. It’s that data just isn’t oil. Petroleum industry experts might quibble over the subtleties of differing quality, viscosity, impurities and so forth, but one barrel of crude oil is basically the same as the next. An oil barrel from Texas yields more or less the same products as one from Saudi Arabia, with buyers for those products to be found most anywhere.

Data is rarely so interchangeable. For instance, Alibaba’s untold petabytes of data on ecommerce transactions are probably good mostly for analyzing, well, Alibaba’s ecommerce transactions. The data cannot be used to optimize city traffic, or aid medical imaging, or refine news aggregation engines, no matter how many CNNs or GANs you might wring it through. Data is always inherently about something, a measure or classification of some fact, whether it’s how much someone spent on movie tickets in a month, whether a pair of shoes they bought was black or brown, or how much time they spent reading an article. Data isn’t a generic, fungible substance, and its specificity both defines and constrains its usefulness.

But worse, data and the models built on it can be out of context in unexpected ways.

The last few years have seen plenty of cases where seemingly magical algorithms return ugly and inaccurate results because of unseen cultural biases in the original data. Many speech recognition algorithms, for instance, have proven less accurate at understanding women because their training data consists mainly of male voices. Facial recognition algorithms have suffered failures when shown faces of Asian or African people because the training data was skewed towards people of European descent. And China is not exempt from such problems. Last October, WeChat faced public embarrassment when it was discovered that the app translated the Chinese expression 黑老外, a relatively neutral term for “black foreigner,” to the n-word in English. It didn’t do so in every case, just when the context was negative, such as when the person in question was described as “late” or “lazy.”

Biases in data are rarely intentional, and usually go unnoticed until such glaring failures occur. But there can be many more subtle flaws lurking in data, or even more ambiguous faults that only arise when AI models are applied to unexpectedly different contexts. Would data on the grocery shopping decisions and preferences of Chinese consumers help to understand those of consumers in India? Would a fintech company’s data on Chinese spending and saving patterns be useful in Australia? Problems can be more subtle still: Google, in collecting hand-drawn images from people around the world, discovered basic cultural variations in how people typically envisioned trees or houses, for instance. For all the data that China may possess, its value may be largely limited to China. Chinese firms that hope to operate abroad will need new and localized data to retrain their models, or at least to confirm they are still applicable.

No Guarantees

Even when data is used in an appropriate context, it can still come to nothing.

One of the most deceptively complex decisions for researchers and data scientists is just deciding what data to collect in the first place and how to measure it. Virtually every company wants to gather data on its users or customers—but what should they ask? Whether a user is male or female? Their age? What city they live in? Those are the obvious and conventional variables, but they may not be what really matters most.

The American dating site OkCupid famously trawled hundreds of millions of user responses to nearly 300,000 distinct questions, and found that the best questions for predicting whether two people had potential as a long-term couple had nothing to do with values, personal backgrounds, or whether someone liked long walks on a beach. Instead, they concerned whether both partners liked horror movies, had traveled abroad alone, and would be willing to give up everything to live on a sailboat.

As practicing scientists well know, all too often data gets collected just because it’s readily obtainable, and the variables or data sources that might have the most “explanatory power” get passed over, either because they are too difficult to measure or no one even thought of them—it’s what’s known as the lamppost problem, and it affects machine learning and big data just as surely as it does scientific research at large. But data scientists also know that sometimes a large dataset can be subjected to every possible machine learning technique and turn up no worthwhile results. The equivalent here is not drilling for oil and coming up dry, but drilling for oil, filling a million barrels, and only then finding out whether or not it might even be usable.

Chinese companies may very well have troves of data that could prove invaluable (for their own purposes), but it depends very much on exactly how the data is collected and structured. Even then, there is no way to predict its real value before it has been analyzed.

For All the Data in China

There are still other problems. Not all data may be reliable, or even real (click farms and the like can muddy the metrics of any app). Advances in AI are assumed to depend on more and better data, but more sophisticated algorithms of the future may require less data to train. And even when more training data means enhanced accuracy, there are still diminishing returns, and at some point an additional data point or million will add nothing.

This isn’t to argue that data isn’t valuable. It is an absolute necessity for many emerging technologies, and China has plenty of it to bank on. But calling data the oil of the 21st century gives a misleading sense of its use and potential. There is little sense in calculating a dollar value for a terabyte; to know the real value of any piece of data, difficult and detailed questions will need to be asked about what it consists of, how it was first measured, and how and where it will be used.

There may be incredible value in China’s data, but what that value might be is yet to be seen.