BIG DATA COMMERCE VS BIG DATA SCIENCE [1]

First, a little about big data in science: big data in science is really magical. It's really astonishing that we can take in data from sensors and from other sources and analyze very large amounts of it effectively. This is something amazing, and one of the great gifts of the age of computation. Yet, even as it is magic, it's a hard kind of magic. It's not automatic. To understand why I say it's not automatic, one need only consider the familiar examples.

A great commonplace example of big data is weather. We can gather incredible amounts of weather data from satellites, from ground sensors, from all sorts of sources. We can analyze the data, and the result of that is improved weather forecasting. And yet, forecasting is not perfect. It's better, but every bit of improvement is hard won. It's the hard kind of magic. Another crucial point is that despite there being large amounts of data, it took years for a consensus to emerge that global climate change was an authentic phenomenon, and it is still somehow difficult to convey the scientific consensus to the public at large, because there's no single smoking gun. We face both a new scientific process that can take a long time to sort out, and also a new communication process with the general public that is still not sorted out.

I'll mention a few other examples of big data and science that I find personally exciting, and they all exhibit the same quality of being incredibly magical. I mean magic not in a supernatural sense, but just in the sense that it's so thrilling to have such a widened horizon in front of us.

Genomics is another fantastic example. We're in the age of genomics. We can gather huge amounts of data and work with it in ways that would have been unimaginable a few decades ago. But at the same time that doesn't mean that suddenly whole families of diseases have disappeared from the earth. The progress is slow, the progress is real, and I'm very optimistic that it will continue, but it's very hard to predict. You can't say that just because we have this much human genome data, in a given year all these diseases will suddenly be gone from the earth. It doesn't work that way. There's real work.

My favorite example from the last couple of years is probably from Jack Gallant's lab at UC Berkeley, where he was able to recover what people were seeing by directly observing activity in their brains, using big data and statistical methods. The nature of the experiment was actually very simple. He showed subjects a batch of movie clips and recorded functional magnetic resonance imaging (fMRI) data from their brains, then showed them new clips they'd never seen before, and simply correlated how similar their brains' responses to the new clip were to the responses to the various previously seen clips, and then differentially mixed in the previously seen clips according to that similarity. With sufficient numbers of clips, what comes out is what the person is seeing.
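The correlate-and-mix idea described above can be sketched in a few lines. This is only a toy illustration, not the lab's actual method: the function names, the tiny "clips" (lists of pixel values), and the made-up response vectors are all assumptions introduced here. Each previously seen clip is weighted by how strongly the brain response it produced correlates with the response to the new clip, and the weighted clips are averaged.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def reconstruct(new_response, library):
    """Blend previously seen clips, weighted by response similarity.

    library is a list of (clip_pixels, recorded_response) pairs; both the
    structure and the names are hypothetical. Clips whose recorded response
    correlates negatively with the new response get zero weight.
    """
    weights = [max(0.0, pearson(new_response, resp)) for _, resp in library]
    total = sum(weights)
    if total == 0:
        return None  # nothing in the library resembles the new response
    n_pixels = len(library[0][0])
    return [
        sum(w * clip[i] for w, (clip, _) in zip(weights, library)) / total
        for i in range(n_pixels)
    ]


# Two made-up "clips" of two pixels each, with made-up recorded responses.
library = [
    ([1.0, 0.0], [1.0, 2.0, 3.0]),
    ([0.0, 1.0], [3.0, 2.0, 1.0]),
]
guess = reconstruct([2.0, 4.0, 6.5], library)
```

In this toy case the new response rises like the first recorded response, so the blend is dominated by the first clip. With a large library, the blend becomes a blurry composite of many partially similar clips, which is the sense in which sufficient numbers matter.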

It's a form of mindreading enabled by big data, something that would have been unimaginable only very recently. Furthermore, the way it came about was more a statistics exercise than anything. In fact, in one case students in a statistics class were able to enter a competition to tweak the algorithms to improve the results. This is fantastic, and yet, if you talk to Jack Gallant, or other people involved in the research, they will not say, "This is fantastic. We now understand how the brain encodes visual information." Far from it. What they'll say is, "This is fantastic, because now we have a toehold with which to start doing the real science. This is not the end, but a beginning, and now the real work comes."

There's this very strange thing that can happen with big data, where you can get this huge payoff early in the game, prior to the scientific understanding. In this case, you have this amazing payoff, which is an ability to reconstruct what somebody's seen just by observing the blood flow in their brain, which is wild. And yet, that precedes the science. There's a crucial point here, which is that if anybody is making the argument that detecting statistical correlations can become a substitute for scientific understanding, they should look at all these cases and see that useful results still require understanding, which takes a long time to generate. I don't think this should be a controversial thing to say, and yet, there are some circles (usually not in the sciences but more in the periphery, more technologists and business tech types or tech journalists) who might promote the idea that some sort of correlative artificial intelligence thing can become a substitute for scientific understanding, and I don't think we're seeing evidence of that yet. We're seeing just the opposite, which is that, as incredibly valuable as big data is, it's still a tool towards the end of understanding, and that's where we really gain utility. So that's a snapshot of big data in science.

Now, to move to what I feel should be held in a different category, which is big data in business. In terms of the instrumentation and the sorts of professional backgrounds that are drawn into it, you see a lot of similarity between the two. If you go to a big data center at a physics experiment, or an astronomy experiment, or a genomics lab, and then to one at a business, you'll see the same kinds of computers and the same recent PhDs from MIT and Caltech, and there's a lot that looks the same, but there are some huge differences. There are differences in practice and in the epistemology, so let me just go through some of them.

One of the things is that when you gather data from people they don't sit there passively, as natural data does, waiting to be measured. They're not like the atmosphere or the genomes, sitting there. When you measure people in a business context, a significant percentage of them turn into scammers who are fighting back and trying to change the way they're measured for their own advantage. You're engaged in a world of game theory instead of measurement, which is fundamentally different.

In the sciences, of course, this isn't entirely unfamiliar. Anyone who does work in the social sciences or in psychology is used to having to trick their subjects in order to avoid these sorts of biases, and there's a whole discipline about that. But in business, it's quite a profound issue, which is not even addressed yet from a business perspective, much less a scientific one. There are so many examples.

Google has a huge problem with people gaming their system to get a high rank for valueless websites, simply because there's a profit motive. You have a whole world of marginal websites that maybe have a little bit of content but offer very little value, and that outpace the real websites because they really want to be high on the list of results. You also have completely bogus ones that offer nothing. You also have strange distortions, such as when internal memos at AOL and Huffington Post directed writers to change the way they wrote in order to appeal to Google's algorithm. You have this whole world distorting itself, curving itself, to look good to the algorithm that sorts it. Naturally enough, this is nothing new. This is what people do.

Google engineers have been kept very busy attempting to fight back, and the way they fight back is by tweaking the algorithm, and tweaking their policies, all these things. But the problem is that the population of scammers is adaptive and smart, and doesn't just sit there waiting. Instead, it fights back. There's this constant arms race that, if Google works very hard, they can stay a little ahead of, but they can never win 100 percent. There's always a taint there.

There's a world of other examples. I could mention the whole world of malware and viruses. These are people fighting back against software developers. Sites that rely on reviews often have fake reviews. In fact, you can for a low cost buy fake reviews for your product on commerce sites that rely on reviews. Facebook has announced that (I think 87 million is the number, I don't remember) a nation's worth of fake accounts exist on Facebook. It just goes on and on. Humans don't sit there passively, but they adapt if it's in their interest to. Therefore, the whole idea of big data on people is something that still has to evolve the practices that correspond to the small data practices that have always existed in the science of studying people. In the science of studying people we use double-blind experiments. We use deception in cognitive science experiments. We use placebos. There are all these techniques. These things don't really fully exist for big data yet; so, therefore, this is not really science.

An interesting question is: given the prevalence of scamminess in network-based business, why are businesses still investing in this kind of big data? There are a couple of interesting answers. Part of it is that business is not the seeking of truth. Business is the seeking of a workable result, a profit. Because of the gray area between the entertainment world and technology that has arisen with the Internet, an enormous amount of high-tech business has taken on the qualities of the entertainment world. With a company like Google, one of the big tech companies, which from a business point of view functions as an ad agency, and Facebook, essentially an upstart that has become big and gone public, competing over approximately the same customer base, you really have dynamics that are not as objective or hard-edged as either the sciences or many of the types of industrial businesses that have been associated with engineering and science in the past. They're much more like the entertainment and advertising businesses, where there's no sense of taste. Basically, if people respond, that's good enough. It doesn't matter why they responded.

This brings up an interesting point, which is that if you're running a big data operation as a business, and the business model has something to do with influencing people, getting them to buy through your site, or getting someone to buy advertising to reach them, or getting them to renew a subscription if it's a dating site, or any of the other business models, you really don't care whether your analysis of the big data was correct so long as the business is working, and the business might work for any number of reasons other than the scientific validity of your analysis of your data. For instance, the users or the population you're targeting might be captive to some degree because all their data is tied up in it (if it's Facebook), or because they're tied into it by their mobile contract, if it's something through a mobile device, or for whatever reason. Everybody in the tech business is trying to make their services sticky, so people have a bit of a hard time leaving, or in some cases a very hard time leaving.

You might simply be able to influence people because controlling the data that's right in front of their nose influences their behavior, whether or not it was ideally placed or based on analyzing them. The mere positioning is so valuable that even if the analysis is askew, that flaw might be masked or mitigated. You have all these other reasons you might be succeeding, but there's this sort of fetish for being technical, for the sciences. The tech business world, as much as it's become like an ad agency, still loves sciences, still loves technical people, still loves math. So you have this fetishistic elevation of the analysis part of the business, even though it can be awfully hard to tell whether it was the analytic results, or just the lock-in, or the positioning, or some other effect that really resulted in the business being successful.

Therefore, you can have this strange thing, which I'm tempted to call zombie science, where you can go on assuming that your analytic results are a source of value, and that there's a lot of validity there, even though you actually haven't closed the empirical loop and controlled for all the other reasons your business might be working. Furthermore, I mentioned before how the population of real people out there will be scheming back at you. If your business is to try to manipulate them by getting the right link in front of them, or positioning the right product for them to decide to purchase through you, they're also trying to manipulate you with fake reviews, or fake content to get a high rank in the results, or whatever it is.

There's also a positive feedback loop, where they might change themselves to just fit in with your system better and make themselves more easily targetable, which I think happens a little bit on dating sites, and maybe on Facebook, and whatnot. Sometimes you can get a false positive. From a business point of view, there's no such thing as a false positive. There's just a positive. If the money comes in, that's positive, and that's it. From a truth-seeking point of view, from the point of view of scientific method, you can have a false positive. But business can't really distinguish between those two. What I would like to argue for is to stop using the idea of big data as one big rubric that covers both science, where scientifically replicable and testable analytic results are really all we care about, and the practices within businesses like Google, which don't really have the structure to close the empirical loop and determine what part of their success is based on such results. Science is never, in my opinion, going to just get automatic, and it's very rarely easy.

I'll close with just one recent example. Not too many months ago the big data analysis of some physics experiments seemed to indicate a faster-than-light phenomenon. For months there was this strange sort of preternatural feeling in the physics community of "Wow, could this be real?" Until it finally turned out to be an instrumentation error. A very similar thing happened with a seeming anomaly in the position of some spacecraft that might violate relativity, and that went away when there was close scrutiny. You never see that in business. You don't see businesses saying, "Well, hey, we made a profit, but after a year of scrutinizing it we realized that maybe our analytic results weren't quite as valuable as we thought." A business can't afford to go through that degree of self-examination. It would just be impossible. Google can't do that. They'd go out of business. It would cost too much.

What we really have to learn to recognize is the difference between big data in science and big data applied in the middle of human affairs, which is more like a form of communication. If you don't like it, you might characterize it as a form of manipulation. You can formulate it as a kind of economic activity. Whatever it is, it's not science. Even if the machines look the same, and a lot of the people who work at them look the same, and even if the vocabulary is the same, we have to learn to distinguish them.

One question to ask is, if there's some ambiguity about whether the value in Internet business really comes from the analytic results of sifting big data, or whether it might come from other things like positioning, and lock-in, and all these other things, then why is there such a focus on it? I mentioned one reason before, which I think is a fetishizing of the sciences, which is great for the sciences, of course. But there are some other things, too. One thought I've had is that other people's data, data representing other people, has sort of turned into a virtual currency, if you like, that's traded only at the very highest levels of exclusive investment and power brokerage.

People who are very rich often need to invent among themselves new currencies in which to hold wealth and transfer it into great investments. A great example of that is modern art. I think Picasso probably first cracked this and then Andy Warhol refined it, but a successful modern artist figures out how to make their works function like a fancy high-denomination currency that can be traded, and that's the key. For our purposes here, I'm not judging that or bothering with that; it's just something that happens. What started to happen around the turn of the century in Silicon Valley, and had already happened earlier in the world of finance, is that the gathering of spy data dossiers on people at large, the hoi polloi, became a new form of virtual currency analogous to modern art.

It's a funny thing, because modern art can take on a value regardless of whether anybody in particular thinks it's great art or not. It becomes iconic and is able to serve as a high-value currency, and that's a different question from its value as art. In the same way, big data can start to function like a high-value currency, independent of its validity. This is an odd thing, but if you gather tons and tons of data, when the chips are really down and you have to be able to get a result from it, it often turns out that you discover you can't. A great example of that is when a site needs to be able to ferret out underage users. It turns out it's very difficult to do. I won't name names here, but some social networking sites have had difficulty doing that, and they complain that it's very hard to do it. It's hard because kids are smart and they're able to deceive some stupid server algorithm.

But the interesting thing about that is that the whole premise of the business is that there are these brilliant algorithms that can analyze people and accurately target them with ads or propositions for deals of different kinds, and yet you can't even tell if they're underage. It doesn't mean that the whole thing is invalid. It just means that when you really come to a hard test, as you would in the sciences, it often turns out that you're living in a bit of a fantasy that you have more analytic success than you really do. That's not to say you have none. It's just to say that when you don't really test your abilities, it's easy to maintain an illusion for a long time. There's this sudden, almost economic worship of gathering spy data about people at large. If you have a lot of data, you can turn it into financing for some scheme. It could be a financial scheme on Wall Street, or it could be some sort of a startup. If you have dossiers on 50 million people that you came to through some internet-based scheme, you can probably get money for that from somebody in Silicon Valley these days.

It becomes this high-denomination currency of questionable validity. There's a strange comedy about it. On the one hand, there's this enormous world of activism, of people concerned about violations of privacy and about these dossiers being kept on us all. If you really look into the amount of data gathering that's being directed at every person now, it's absolutely extraordinary: the many thousands of monitoring schemes online, if you browse even mainstream websites; the many, many times you're seen by a camera that's connected to a network with machine vision software of some kind. The sum of all these things is just gigantic, and yet, despite this extraordinary ogling of everybody, the actual data that comes out of it is of questionable utility, because there isn't really a scientific process of closing the loop and making sure how good the analytic techniques are. There's this sort of very weird situation of highly valuing sloppy spying. What a weird moment to be in. I don't think this will last forever, but it's a very strange moment in our economic and cultural history, isn't it?

The desire of people who gather data is to find themselves in an elevated position in terms of their awareness of the situation, their situational awareness, their information visibility. In general, it is true that if you have more information and a better ability to understand it, that will turn into wealth and power somehow, even if you don't know how in advance. Of course, information doesn't necessarily have to be valid or good information, and I suspect there are a lot of people who are collecting bad information who will discover that they actually don't reap that benefit. As an example of the sort of thing that can happen, let's think about the Amazon price bot, which looks out at the cost of books everywhere and makes sure that Amazon's never undersold. The result of that is that Amazon has this superior perch for gathering information. If there is some little bookseller somewhere who wants to do a sale or give away a book promotionally, the price bot will globally reduce the price of that book to remove that local information vantage from that seller. So there's no more local information vantage for that person. Locality goes away, and all you have is this global information vantage because of the superior information gathering capability. That's how information turns into money, in just one example.

One way to think about this extraordinary fetish for business big data is as a fantasy, which I don't believe can ever really be achieved in the long term, of turning a business into a Turing machine, approximately, or into a computer.

Let's ask what we mean by a computer in the first place. A computer is a sort of conceit that we apply to some little part of the universe. We draw a frame around some piece of reality and we say that within this piece of reality we're going to interpret things as being a representation of digital information, and what that means is we set thresholds. We say if the voltage is above some amount it's a one, and if it's below some amount it's a zero, or whatever it is, some sort of a scheme like that. By using this system of thresholds we can take this piece of reality, which is after all just a part of the same thermodynamic system as anything else, and we can interpret it as being something quite different, a deterministic system. In order to maintain it for any substantial period of time, we have to manipulate it to keep it within the specifications of these thresholds, and that means throwing out waste heat on the environment around it. That's why computers run hot.
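The threshold scheme just described can be made concrete with a minimal sketch. The specific voltages and function names below are hypothetical, chosen only for illustration and not taken from any particular hardware: a noisy analog voltage is read as a one or a zero, and a restoring step pushes it back to a clean level, which is the continual maintenance the passage refers to.

```python
# Hypothetical thresholds and clean signal levels, in volts.
HIGH_THRESHOLD = 2.0   # at or above this, we agree to read a one
LOW_THRESHOLD = 0.8    # at or below this, we agree to read a zero
CLEAN_HIGH = 3.3
CLEAN_LOW = 0.0


def read_bit(voltage):
    """Interpret an analog voltage as a bit, or None if it falls in between."""
    if voltage >= HIGH_THRESHOLD:
        return 1
    if voltage <= LOW_THRESHOLD:
        return 0
    return None  # the physical system has drifted outside the digital conceit


def restore(voltage):
    """Push a drifting voltage back to a clean level for the bit it represents.

    In a physical machine this correction is what costs energy.
    """
    bit = read_bit(voltage)
    if bit is None:
        raise ValueError("voltage is between thresholds; bit is undefined")
    return CLEAN_HIGH if bit == 1 else CLEAN_LOW


# A stored 'one' that has sagged to 2.4 V still reads as a one and can be
# restored; a signal at 1.4 V no longer means anything.
sagged = 2.4
restored = restore(sagged)
```

The deterministic behavior lives entirely in the interpretation: the same physical voltage of 1.4 V is neither a one nor a zero, and only the restoring work keeps the system inside the range where the interpretation holds.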

In the same way, if you could turn a business into a computer, where you're taking in data from the world, and you're generating a profit based on analyzing that data, wouldn't that be sweet? One of the things you see again and again in business schemes that are based on big data is that they attempt to remove themselves from any actual activity. Here's what I mean by that. If you're a big data scheme in finance, you do your best to never be the actual deal, but you're always the in-between system for the deal. If there's some fraudulent security passing through, you didn't do that. Similarly, if you're doing some consumer internet scheme, and there's copyrighted material uploaded to YouTube, or something, it's the problem of the person who uploaded it. You're sort of at arm's length, and lily white, and untainted, because all you want to have is that information superiority where you can analyze the data, and the risk is taken by all the other people, the hoi polloi, who are moving through your superior data position. This, in a sense, is attempting to make a giant computer as a form of commerce, and the risk that all those individuals are taking, which you are able to remove yourself from, is analogous to the heat that's given off at the back of a computer, the waste heat that comes about as a result of the work that's needed to maintain the illusion of there being this deterministic system that meets all these threshold requirements.

The problem with doing this is that the whole purpose of a market, the whole purpose of capitalism, is to package risk and reward so that people can hedge them, and if the risk is radiated outward and the reward is funneled into a simulated giant sort of computer system through big data, it makes the market dysfunctional.

There's no one 'capitalism'. I have found that capitalism comes in many varieties. I've come to admire Keynes more than I ever did before, and I've started to think of him as a computer scientist, in the sense that I think he acknowledged what we would understand today as some of the properties of an energy landscape in a way that some other economists perhaps don't. You can have a local peak in an energy landscape, and you can think of his idea of stimulus as trying to hop over to the next peak rather than just staying at your local optimum.

I find a tremendous sophistication in his work, and I think the real argument shouldn't be between liberal and conservative economists, but it should be between people who only understand local optimization and people who understand that optimization is a more complex problem, and that there isn't any magic formula for global optimization, but there can be better than local. You can jump over a valley to a better peak. I wish that that discussion was a little drier and more mathematical, instead of so charged because of the history of Marxism and everything, which I think is unfortunate, and has really tainted it.
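The difference between local optimization and "jumping over a valley" can be shown on a toy one-dimensional landscape. This is only an illustrative sketch, not a model of any economy or of Keynes's actual theory; the landscape values, the function names, and the fixed jump size are all assumptions made up for the example.

```python
def hill_climb(heights, start):
    """Greedy local optimization: step to a higher neighbor until none exists."""
    pos = start
    while True:
        neighbors = [i for i in (pos - 1, pos + 1) if 0 <= i < len(heights)]
        if not neighbors:
            return pos
        best = max(neighbors, key=lambda i: heights[i])
        if heights[best] <= heights[pos]:
            return pos  # no neighbor is higher: a local peak
        pos = best


def climb_with_jumps(heights, start, jump):
    """Local climbing plus occasional leaps of `jump` cells across valleys.

    A leap is kept only if climbing from the landing point reaches a strictly
    higher peak, so the loop always terminates.
    """
    pos = hill_climb(heights, start)
    while True:
        landings = [j for j in (pos - jump, pos + jump) if 0 <= j < len(heights)]
        peaks = [hill_climb(heights, j) for j in landings]
        better = [p for p in peaks if heights[p] > heights[pos]]
        if not better:
            return pos
        pos = max(better, key=lambda i: heights[i])


# A made-up landscape: a modest peak at index 2, a valley, a taller peak at 7.
landscape = [1, 3, 5, 2, 1, 4, 8, 9, 6]
local_peak = hill_climb(landscape, 0)
better_peak = climb_with_jumps(landscape, 0, jump=3)
```

The purely local climber stops at the first peak because every immediate step leads downhill. The jumping climber accepts a temporary descent into the valley and ends on the taller peak, though nothing guarantees it finds the global best on an arbitrary landscape, which matches the point that there is no magic formula for global optimization, only better than local.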

An interesting example of local optimization for me is the notion of the Pirate Party, and of file copying as the ultimate form of expression in democracy, which is so popular. Recently, we've had a file copying church that's apparently gained some sort of legitimacy and visibility in Sweden. The pirate parties, of course, have been able to gain seats in some European countries. To me, this is a local optimization, because what you're saying is that the more information is free, the more democracy there is, and that's true locally. But this is very much the fallacy of certain older experiments in physical production, where if everybody decides through consensus decision-making what to produce physically, or if there is some communist party that does it, it can seem as though you're pushing toward some sort of optimum, but in a market system it just seems that even more happens, and there's another peak that's better. I suspect that something like that is also true for information and expression. A market system for expression would actually produce more and better expression, of more beauty and use to people, but we're obsessed with the local peak of the free, Pirate Party style. That's an interesting argument, because there is such orthodoxy about the local peak that it becomes hard to talk about that distant better peak. This is always the problem you run into with people who are obsessed with the optimization they can see in front of their nose.

In this sense, what to me is a simplistic conservative or libertarian mindset about the economy and the simplistic mindset about the super free are both similar in that they're both attached to this immediate local optimization right in front of them, and failing to acknowledge that the overall landscape of possibility might be more complex.

The people who are saying there can't be any stimulus, there can only be austerity, and this is the only path to optimizing the world, sound very similar to the people who are saying we must not monetize information, there can't be any restrictions, there can only be free flow of information, and that's the path to perfect democracy, and expression, and culture. These people sound almost identical, even though traditionally they probably would cluster with conservative and liberal values respectively. But they're really the same. They're both forms of exactly the same mathematical naiveté.

I think Keynes was correct in pointing out that a certain kind of courage is needed to explore an energy landscape, and there's a bit of risk taking and empirical patience that's needed to find better peaks, but they're out there.