Big Data: Businesses Must Remember to Take Out Their Rubbish

In the wake of the PRISM revelations and the consumer backlash that followed, big businesses around the world are coming under increasing pressure to give users more insight into, and guarantees about, how their data is being used. While the common wisdom on Big Data is that you should keep it all, the reality is far different. Whether your customers are businesses or consumers, there's a good case to be made for throwing data away, and here's why:

1. Personal information can end up in the wrong hands

Nobody wants their personally identifiable information (PII) to end up in the wrong hands or to be used for the wrong purpose. Organisations have a duty to their customers to guarantee this to the best of their ability.

Companies have historically used manually enforced policies to govern what data to keep and for how long. But with an estimated 90 percent of the world's data having been created in the last two years, this is becoming an increasingly difficult task.

The more customer financial records, patient health information and old e-mails you keep around, the more risk you run in the case of a data breach, whether accidental or intentional.
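To make the idea of a retention policy concrete, here is a minimal sketch in Python of what an automated version of the manual rules described above might look like. Everything in it is an assumption for illustration: the record categories, the retention periods and the names `Record`, `is_expired` and `split_by_retention` are hypothetical, and real retention periods depend on the regulations that apply to your organisation.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods per record category (illustrative values only;
# real periods depend on the regulations that apply to you).
RETENTION = {
    "financial": timedelta(days=7 * 365),
    "health": timedelta(days=10 * 365),
    "email": timedelta(days=2 * 365),
}


@dataclass
class Record:
    record_id: str
    category: str
    created: date


def is_expired(record, today):
    """Return True when a record has outlived its category's retention period."""
    period = RETENTION.get(record.category)
    if period is None:
        # Unknown category: keep the record and leave the decision to a person.
        return False
    return today - record.created > period


def split_by_retention(records, today):
    """Split records into (keep, purge) lists according to the policy above."""
    keep, purge = [], []
    for record in records:
        (purge if is_expired(record, today) else keep).append(record)
    return keep, purge
```

Calling `split_by_retention(records, date.today())` returns the records to keep and the candidates for deletion. Records in a category the policy does not recognise are deliberately kept, so that nothing is thrown away by accident.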

2. You can't find the needle in the data haystack

The benefit of having more data is often outweighed by the difficulty of finding the information you need, when you need it.

It's true that today's software can search through millions of documents, e-mails and other records. But that also means you get far more results returned, results that you then have to comb through to find the one you're looking for.

As it gets progressively harder to find what you need, it can be tempting to give up and simply hope the information you have is the best.

3. Simply storing data does not make it valuable

Big Data advocates are fond of saying you should store your data and figure out how to analyse it later. But just storing data doesn't get you more insights; using the right software to analyse that data does.

Put another way, stored data has only theoretical value, while analysed data has practical value.

For example, baseball fans had been tracking player and team statistics for years as a hobby. But the teams themselves made player decisions without that data, choosing players using the same non-data-driven techniques they had relied on for years.

It wasn't until general manager Billy Beane of the Oakland A's decided to put the analysis of those statistics to work that the data really became valuable, enabling the A's to build a winning team. Had that data remained stored and nothing more, its value would have been purely theoretical. Its practical value after the A's analysed it was far greater.


The amount of data being created is growing at a phenomenal rate. Every day, Twitter users send more than 400 million tweets while Facebook users post some 2.5 billion status updates. By some estimates that means people are creating more than half a trillion new words of unstructured data each month through social media. It is evident that new platforms are allowing for better communication, but they are also making it harder to separate the signal from the noise, what matters from what doesn't.

While we as humans rely on our personal preferences and our subconscious to process immense amounts of unstructured data every day - whether it's recognising someone's face and remembering their name, or reading between the lines to understand the nuance of what a friend is saying - for businesses it's a different challenge altogether.

Corporate data is subject to organisational requirements and to increasingly stringent regulatory ones. How do we protect our customers' information - what can we delete and what needs to be anonymised? Which document or email from among the many millions we and our colleagues generate each day might be relevant to a business decision tomorrow - or to an investigation a year from now? It's virtually impossible to tell without the right toolset.

While common wisdom may suggest otherwise, when it comes to data storage, adopting a contrarian point of view makes a lot of sense. The risks of not doing so are simply far too great to ignore.