Data Scientists: Prospecting for Gold in the Mines of Big Data

17/07/2012 16:06 BST | Updated 15/09/2012 10:12 BST

Sci-fi movies like 'The Matrix' and the well-known cyberpunk anime 'Ghost in the Shell' envision a world where existence is divided between the "real world" that we live in and an alternative "virtual world" in the realm of computers. Science fiction writers and scientists have long speculated about the evolution of Artificial Intelligence, or AI, which would bridge the gap between the real world and this "virtual world" by creating "thinking" machines or programs that can freely negotiate the complexities of our world without constant guidance from humans.

In the nascent years of computing science in the 1950s, Alan Turing, one of the fathers of the field, predicted that by the year 2000, computers with 120 MB of memory would be created, and that they would be able to pass the Turing Test, which tests a machine's ability to exhibit "human-like" intelligent behaviour. While Turing's prediction regarding AI was well off the mark, the 120 MB landmark in computer memory was crossed years ago. But we are still at least decades away from creating a full-fledged AI.

Developments in computer hardware have not been matched by similar strides in software, at least not to the extent of creating "thinking programs". Since the development of the first integrated circuit in 1958, the processing speed and memory of computing devices have increased exponentially, doubling roughly every two years. This trend is popularly known as Moore's Law, which strictly speaking describes the number of transistors on a chip.
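To get a feel for what doubling every two years adds up to, here is a rough back-of-the-envelope sketch. It assumes a perfectly steady two-year doubling period from 1958 to 2012, which is a simplification for illustration and not a measured figure; the function names are made up for this example.

```python
def doublings(start_year, end_year, period_years=2.0):
    """Number of doubling periods between two years (assumes a constant period)."""
    return (end_year - start_year) / period_years


def growth_factor(start_year, end_year, period_years=2.0):
    """Overall growth multiple after repeated doubling."""
    return 2 ** doublings(start_year, end_year, period_years)


# 1958 (first integrated circuit) to 2012: 54 years, i.e. 27 doublings.
print(doublings(1958, 2012))      # 27.0
print(growth_factor(1958, 2012))  # 134217728.0, i.e. 2**27
```

Under that idealized assumption, 54 years of doubling gives a growth factor of more than a hundred million, which is why exponential trends so quickly outrun intuition.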

So where does all this lead us? At present, PCs with 3.6 GHz processors are available, more than 300 times the processing speed on offer in the 1980s. More importantly, data storage capacities have grown even faster, and terabytes of storage are commonplace even on mobile computing devices. Throw in an all-pervasive and ever-expanding internet and the increasing proliferation of mobile communication devices that generate a constant stream of data from the real world, and we have an ever-burgeoning pile of data in our midst.

To put things in perspective, as of 2012, according to IBM, 2.5 quintillion bytes of data are being created every day. (A quintillion is 1 followed by 18 zeroes.) This is the era of 'big data.' When organized collections of data grow so large and complex that it becomes difficult to work on them with ordinary database management tools, they are loosely defined as big data.

But it is not as if big data was not around before. Scientific institutions, telecommunication companies and other data-centric businesses, and government organizations playing 'big brother' have always had to deal with huge volumes of data. For example, when you shop at Wal-Mart, the data regarding your purchase, along with a million others every hour, is fed into the company's databases, which contain an estimated 2.5 petabytes of data. (1 petabyte = 1024 terabytes.) That is more data than the entire collection of books in the US Library of Congress.
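For a sense of scale, the two figures quoted so far can be put side by side with a little arithmetic. This is only a rough sketch: it takes the article's own conversion (1 petabyte = 1024 terabytes, i.e. 2**50 bytes) and treats the quoted estimates as exact, which they are not.

```python
# Binary units, following the article's "1 petabyte = 1024 terabytes".
BYTES_PER_PETABYTE = 2 ** 50

# Figures quoted in the article (estimates, treated as exact for illustration).
DAILY_BYTES = 25 * 10 ** 17   # 2.5 quintillion bytes created per day
WALMART_PETABYTES = 2.5       # estimated size of Wal-Mart's databases

daily_petabytes = DAILY_BYTES / BYTES_PER_PETABYTE
walmart_equivalents = daily_petabytes / WALMART_PETABYTES

print(f"Data created per day: about {daily_petabytes:,.0f} petabytes")
print(f"That is roughly {walmart_equivalents:,.0f} Wal-Mart databases every day")
```

On these numbers, the world generates a little over two thousand petabytes a day, or several hundred times the estimated contents of Wal-Mart's databases.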

In the age of the internet, big data has simply become more commonplace and accessible. And it is growing fast, at astronomical rates. Every time we log onto the internet, every time we post on Facebook or Twitter, purchase something on Amazon or eBay, or just do a Google search, we leave a trail of data. And all this goes into the databases of these websites or companies.

This is where data scientists enter the picture. Let us take the case of Facebook. They have a goldmine of information about millions of people, their preferences and tastes, in their database. In raw form, that data is useless. And since we haven't yet developed AI capable of autonomously sifting through complex information variables, we need humans who can guide computers to navigate the labyrinth of big data and extract ('dredge' or 'mine' would be more apt words) usable information from it.

The process is just like digging for gold. If data scientists are able to extract information such as users' product preferences from all the piles of user data, Facebook can sell that to advertisers and companies and save them the effort of identifying customers. And isn't that exactly what Facebook has been up to? All the technology giants, be it IBM, Google, Microsoft or Facebook, are into big data and data science.
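A toy sketch gives the flavour of this kind of digging. The activity log below is entirely made up, and the function name is hypothetical; real systems work on vastly larger data with far more sophisticated methods, so this only illustrates the basic idea of turning raw records into a per-user preference.

```python
from collections import Counter, defaultdict

# Made-up activity log of (user, product category) pairs. Purely illustrative.
events = [
    ("alice", "books"), ("alice", "books"), ("alice", "music"),
    ("bob", "games"), ("bob", "games"), ("bob", "books"),
]


def top_preferences(events, n=1):
    """Return each user's n most frequent categories from raw event pairs."""
    counts = defaultdict(Counter)
    for user, category in events:
        counts[user][category] += 1
    return {
        user: [category for category, _ in counter.most_common(n)]
        for user, counter in counts.items()
    }


prefs = top_preferences(events)
print(prefs)  # {'alice': ['books'], 'bob': ['games']}
```

Even in this tiny example, the raw log says little at a glance, while the summary is something an advertiser could act on.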

Most of the things that we now take for granted online, like trending topics on Twitter and instantaneous results on Google search, depend on data scientists to function efficiently. But data science is no cakewalk, as scientists involved in the field would testify. Database management at such massive scales is a constantly evolving field. Imagine trying to find needles in an ever-growing haystack. That is just one of the many problems that data scientists are addressing now.

Even though data science is still a somewhat obscure field in the popular consciousness, it is fast being touted as the next big thing. The economic potential of the field is huge. This can be gauged from the fact that IBM, Microsoft, HP, Oracle and others have pumped more than 15 billion dollars into software firms specializing exclusively in database management and handling. The future of data science looks promising indeed.

As an endnote, in the cult sci-fi movie 'Minority Report', when the hero enters a mall in the future, interactive ad boards on the walls recognize him by retinal scans and offer him products based on information about him stored in their databases. That is one vision of the future which has been partially realized in the virtual world, thanks to the advancements in data science.