Got big data? When should you delete?
It is almost impossible at present not to read, hear, or be warned about the Big Data phenomenon and the current explosion in data growth. Right? In fact, a quick review of some notable analysts’ predictions and priorities for 2012 gives us the following:
- Gartner – CIO technology priority #1 for 2012 is “Analytics and business intelligence” (up from #5 in 2011) 
- IDC – Prediction 10: The Big Data Opportunity becomes real in 2012 
- Wikibon – Predictions #1 ‘Big Data is here to stay’, #2 ‘Hadoop-based Big Data solutions will intersect with traditional enterprise data warehouse offerings’, #4 ‘In 2012 Big Data will be all about the Apps’ 
Even the White House is making big plans for investment in Big Data R&D: US$200M, in fact. And it is banking on a return on par with the Internet.
Lastly, depending on who we listen to, predictions of file-based data growth vary between 45% and 55% compounded annually. Since file-based data represents approximately 85% of our data holdings, this is our current data explosion.
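To put those growth rates in perspective, here is a rough sketch of the compounding arithmetic. Only the 45% and 55% rates come from the predictions quoted above; the 100 TB starting volume and the function names are assumptions made purely for illustration.

```python
import math


def projected_size(start_tb: float, annual_growth: float, years: int) -> float:
    """Return the data volume after `years` of compound growth.

    `annual_growth` is a fraction, so 0.45 means 45% per year.
    """
    return start_tb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Return how many years it takes for the volume to double."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    START_TB = 100.0  # assumed starting volume, for illustration only
    for rate in (0.45, 0.55):
        size = projected_size(START_TB, rate, years=5)
        print(
            f"{rate:.0%} growth: {size:.0f} TB after 5 years, "
            f"doubling every {doubling_time_years(rate):.1f} years"
        )
```

At either rate the volume doubles in under two years and grows roughly six- to nine-fold within five, which is why data that is never deleted becomes a problem so quickly.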
A different view…
However, I’d like to propose a slightly different point of view: we have always had Big Data. In itself, it is nothing new; there has just been some ‘fine tuning’ (or evolution) that warrants the current exposure.
If I cast back to the beginning of my IT career in 1985, the standard or default units of measure used for volumes, datasets and file sizes were MB and KB. Within around 10 years that had changed to GB and MB; by 2005 it was up to GB and TB; and today we generally assume TB as an almost standard unit of measure. My point: every generation of storage administrator has dealt with more data than the one before.
During this period, however, two aspects of data management have remained the same: 1) the data management lifecycle; and 2) the top five questions/challenges every generation of storage administrator has faced.
First, the data management lifecycle. For most purposes, this has remained something like:
Create –> Access –> Process –> Protect –> Delete –> Create –> …
I single out Delete above, as herein lies a significant portion of the problem with compounding data growth: we generally do not delete enough data. We hoard it, just in case, or because it is too difficult to identify what should be deleted.
Secondly, the top five questions or challenges every generation of storage administrator has faced:
- How to access data in a cost-effective way
- How to organise, search and process data effectively
- How to back up / protect the data
- How to delete data, and what to delete
- How to fit the power and space envelope [in the data center]
We have always actually had Big Data, and always had the challenges associated with data growth.
In my experience, the number one problem in the majority of organisations is the fourth item above (how and what to delete). Very few organisations have data management policies in place that stipulate data deletion. Naturally then, data just continues to accumulate and accumulate, just like the contents of our store rooms, sheds, and cupboards at home!
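As a hedged illustration of what the simplest possible deletion policy might look like in practice, the sketch below lists files that have not been modified within a retention period. Everything here is an assumption for the sake of example: the function names are hypothetical, last-modified time is used as a crude proxy for "no longer needed", and the retention period is whatever your own policy stipulates. The code only reports candidates; it deliberately deletes nothing.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional

SECONDS_PER_DAY = 86_400


def is_expired(mtime: float, now: float, max_age_days: int) -> bool:
    """True if a file last modified at `mtime` is older than the retention period."""
    return (now - mtime) > max_age_days * SECONDS_PER_DAY


def deletion_candidates(
    root: str, max_age_days: int, now: Optional[float] = None
) -> Iterator[Path]:
    """Yield files under `root` that the (assumed) age-based policy would flag.

    This is a dry run: it reports candidates and never removes anything.
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_expired(mtime, now, max_age_days):
                yield path


if __name__ == "__main__":
    # Hypothetical example: flag anything untouched for seven years.
    for candidate in deletion_candidates(".", max_age_days=7 * 365):
        print(candidate)
```

A real policy would also need to account for regulatory retention and business ownership of the data, which age alone cannot capture.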
Why the current exposure and hype then? What’s new?
Well, I did mention there had been some ‘fine tuning’ or evolution occurring with our data growth and our processing. To justify the current level of exposure the Big Data topic receives, one must understand two aspects which are different today compared with, say, just five years ago. These would be:
- Today we not only have data volumes growing; we also have a much larger variety of data arriving at an increased velocity. That variety makes the data more complex, because most of it is unstructured [it is hard to understand using traditional tools and techniques]. Variety refers to the wider range of data-generating sources: social media, video, audio, text messages, internet clicks, search engine requests, location sensing, etc. Velocity refers to the rapid arrival rate of that variety of data, in larger volumes. It all adds up to complexity!
- The mid-nineties were the heyday of data warehousing. Such systems are now best positioned as “descriptive analytics” – the ‘what’. They provide a ‘rear view mirror’ insight into our business. Today we have the maturing of “predictive analytics” – the ‘so what’ and the ‘now what’. Such modern analytic systems, capitalising on what I described previously as volume, variety, and velocity, are future-oriented in their insight. They help businesses obtain an edge by predicting what’s coming next, what will happen next, and what else they could sell a customer.
What to do? What to do?
Clearly, as with our home cupboards and sheds, it is not sustainable to keep storing everything forever. Granted, storage technologies are allowing more to be kept for longer with greater efficiency; but even so, nothing costs nothing (and it never did!)
Hence, ask yourself (or preferably your organisation!) some questions and create an effective data and information management framework. In fact, in this area alone we are seeing the beginning of a new trend: the creation of a senior role known as ‘Chief Data Officer’, reporting to or equal to the CIO. Such a role would be part data scientist and part head of the information management group.
The types of questions I refer to are, for example:
- Do I want to keep it all? Why?
- Will it be repeatedly used? How often?
- Must I use all of it, or are samples enough?
- How do I want to use it?
- Are there regulatory retention considerations?
- Will it be updated?
- Do I need it in near or real time?
These distil down to –> Understand your requirements to access ageing data.
Once this is understood, the next question is: where should I physically store it? The storage cost curve keeps going down, but storage still costs a LOT of money. Take into account and review:
- Hardware and software acquisition costs, power and cooling costs
- Skills, people, time to manage it all
- Onsite? Offsite? Primary and backup? How many copies?
- In a private cloud? A public one?
- Online? Near-line? Archive?
And I would summarise this as –> A place for everything, and everything in its place.
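The two halves of the framework, understanding access requirements and then choosing a place to store the data, can be sketched as a simple placement rule. This is only an illustration under stated assumptions: the `DatasetProfile` fields, the tier names beyond online / near-line / archive, and the access-frequency thresholds are all hypothetical, and any real organisation would set its own.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ONLINE = "online"
    NEAR_LINE = "near-line"
    ARCHIVE = "archive"
    DELETE = "delete"  # a candidate for deletion, subject to review


@dataclass(frozen=True)
class DatasetProfile:
    """Answers to the requirement questions for one dataset (assumed fields)."""
    name: str
    accesses_per_month: float  # how often it is used
    needs_real_time: bool      # needed in near or real time?
    retention_required: bool   # regulatory retention applies?


def choose_tier(profile: DatasetProfile) -> Tier:
    """Map access requirements to a storage tier using illustrative thresholds."""
    if profile.needs_real_time or profile.accesses_per_month >= 30:
        return Tier.ONLINE
    if profile.accesses_per_month >= 1:
        return Tier.NEAR_LINE
    if profile.accesses_per_month > 0 or profile.retention_required:
        return Tier.ARCHIVE
    # Never used and nothing obliges us to keep it.
    return Tier.DELETE


if __name__ == "__main__":
    examples = [
        DatasetProfile("clickstream-today", 500, True, False),
        DatasetProfile("quarterly-reports", 2, False, True),
        DatasetProfile("old-invoices", 0, False, True),
        DatasetProfile("scratch-exports", 0, False, False),
    ]
    for profile in examples:
        print(f"{profile.name}: {choose_tier(profile).value}")
```

The point of the sketch is the ordering of the checks: access requirements decide the tier first, and only data with no use and no retention obligation falls through to deletion.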
Once these two broad areas of an information management framework are in place, you will be ready for the current Big Data explosion, and ready to explore the accompanying world of predictive analytics.