This week we are proud to speak with Chuck Kelley, an internationally known expert in database technology. He has over 30 years of experience in the design and implementation of operational and production systems and data warehousing. Mr. Kelley has co-authored and contributed to four books about data warehousing and has been published in various trade magazines. More about Mr. Kelley and a full biography can be viewed after the transcript.
Q: How did you get into the field of database technology and data warehousing?
Very early on when I first started in the computer world I started leaning toward data. And back then we didn’t have databases, we had flat files, and we had to deal with the data that way. I’ve always enjoyed finding ways to manage data and making sure we can get to the data as fast as possible to do all kinds of analysis.
Then, in the late 1980s, I was reading a lot of books about it, and there was one author writing on this new topic called data warehousing, and I found it quite interesting. So I called him one day and learned that he lived about an hour away from me, so we met for lunch, had a good discussion, and have been friends ever since. That was my introduction to data warehousing. The first fifteen years of my career were spent building and managing operational systems, and the next fifteen years were spent building tools for data warehousing and analytics.
Q: Can you tell us a bit about some of the books you’ve put out on this subject?
Most of the books are older, from when data warehousing was just taking off in the late 90s and early 2000s. The first book was about building data warehousing in the DEC environment. The second one grew out of a DM Review ‘Ask The Expert’ column, where we noticed that the same questions were coming up over and over again. So we built the book around that, called it Impossible Data Warehouse Situations: Solutions from the Experts, and just had the experts answer the most common questions. However, there were plenty of differing opinions among the experts, so as a reader of the book you could see both sides of the issue and decide which way you wanted to go. At the time there was also a group of people writing the Internet Encyclopedia, and I wrote the section on data warehousing.
Q: What do you recommend for companies that can’t or don’t want to manage their own data infrastructure?
There are a lot of companies that offer those computing services, such as Amazon or Rackspace. And of course it’s a question of budgets and capabilities. I think that companies such as Amazon and Rackspace offer a reasonably priced option if you don’t want to manage your own data, and they can grow or shrink as needs change.
Q: What are the largest threats you see right now in terms of security, and how are we combating them?
Security threats are everywhere, and they’re increasing the more our devices are connected. These devices aren’t being protected, and so security is a huge area. There are small things you can do. I liken it to door locks on a car. They’re there to keep honest people honest, but if someone wants to steal your car they’ll steal it, and there’s not much you can do about it. Very professional hackers can get into almost anything, but as a first step, you should do things like encrypt your data, keep data on different subnets to protect your information, stop SQL injections from coming in, etc. Those protective measures are very important and very easy to do; you just have to do them. There’s a TV show called CSI: Cyber that’s all about hacking into printers and setting them on fire, and things like that. But the more things are connected to the internet, the more opportunities there are for hacking.
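Editor's illustration: one common way to stop SQL injection, as mentioned above, is to use parameterized queries. The following is a minimal sketch using Python's built-in sqlite3 module. The customers table, the function names, and the sample input are assumptions made for illustration and do not come from the interview.

```python
import sqlite3

# Hypothetical in-memory table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("alice",), ("bob",)])


def find_unsafe(name):
    # Vulnerable: user input is pasted directly into the SQL text,
    # so quote characters in the input can change the query itself.
    query = "SELECT id, name FROM customers WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_safe(name):
    # Parameterized: the value is passed separately from the SQL text,
    # so it is only ever treated as data.
    query = "SELECT id, name FROM customers WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


malicious = "nobody' OR '1'='1"
unsafe_rows = find_unsafe(malicious)  # the injected OR clause matches every row
safe_rows = find_safe(malicious)      # no customer has this literal name
```

With the string-built query, the crafted input returns every customer; with the parameterized query, the same input matches nothing.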
Q: What are the industries you see with the most potential growth in data warehousing?
Every industry. We are still at the tip of what we can do with data warehousing, and have been for a long time. The whole manufacturing process could have wild potential growth. Banking can still have wild growth in data warehousing. Data warehousing now is primarily used for reporting, but you’re still not seeing much of it used for analysis or projections, except in very rare cases. That’s where the future of data warehousing is, not only in financial reporting; though that’s important too. But we need to get people to think about how to apply twenty years of data to project what we can do in the future, and that is where I see the potential growth.
Q: What’s the difference between Data Warehousing and Big Data?
I believe that Data Warehousing is the concept of storing data in order to do analysis on it. How can I collect the information to have it ready for fast analysis? Big data is really more about moving the data to a common location, but not doing any integration or analysis until the time I begin accessing it. Big data is about collecting as much information as you can, and then having data scientists figure out how to use the data.
Q: What advice do you have for acquiring and analyzing a massive amount of data?
There are a lot of tools out there to be able to do this, but none of them are really for the business end user. A lot of big data is not ready for the end user today. There are some tools out there that are starting to be able to take the unstructured part of big data and tie it to the structured part, but for the most part it’s difficult. At the moment data scientists need to be able to develop data correlations in Python or some other language to match the data for analysis. In five or ten years we’ll have tools that will be able to access the massive amounts of data, but I think what will happen is that you’re not going to get sub-second response times.
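Editor's illustration: a very small sketch of what tying unstructured data to structured data in Python can look like. It counts how many free-text notes mention each product from a structured list. The product rows, the notes, and the function name are all hypothetical, and real projects would typically need far more robust matching than a whole-word search.

```python
import re

# Hypothetical structured data: product rows as they might sit in a warehouse table.
products = [
    {"sku": "A100", "name": "widget"},
    {"sku": "B200", "name": "gadget"},
]

# Hypothetical unstructured data: free-text support notes.
notes = [
    "Customer says the Widget arrived broken.",
    "Gadget works fine, but the widget is late.",
    "General billing question.",
]


def mentions_per_product(products, notes):
    """Count the notes that mention each product name (case-insensitive, whole word)."""
    counts = {}
    for product in products:
        pattern = re.compile(r"\b" + re.escape(product["name"]) + r"\b", re.IGNORECASE)
        counts[product["sku"]] = sum(1 for note in notes if pattern.search(note))
    return counts


result = mentions_per_product(products, notes)
```

Here the structured key (the SKU) becomes the link between the two kinds of data, so the counts could be joined back to sales or inventory tables.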
Q: What criteria should be used when deciding to host on-premise or within the cloud?
That’s a question I have thought about for a long time. There’s no firm answer. The big things to consider are whether or not you can manage it and adapt to it. Can you handle the possible security vulnerabilities? Do you want shared hardware or not? If you’re sharing hardware, you might not be able to say that your whole environment is SOX compliant, for example. You might not be sure whether to put it in the public cloud or the private cloud or to host it all yourself. There are different levels of customization. Security is the largest thing to consider. In a public cloud you have to work within that security environment, whereas in a private cloud you have much more control. The type of data you have will help you decide which to use.
Q: When people talk about the five Vs of big data: Volume, Velocity, Variety, Verification, and Value, is this the best framework to use to approach a big data project?
I’m not sure it’s a framework as much as it is understanding the problems that you’ll have with big data. Most systems aren’t set up to handle all the information that big data will bring to you. The Vs are about those things. Volume is really about collecting and storing data. The amount of data is massive. Sensors and video clips add up to a huge amount of data over a five-year period. Velocity is how fast that data is being generated. Variety is about all the different types of data you can get. Veracity or Verification is how well you can trust your data to be accurate, and Value is how useful that data is to you.
Q: What are the most common challenges in a Big Data project?
I think the biggest challenge is presenting the data. There are many people who build a big data system and want to turn their business community onto it, but that won’t happen if they don’t understand the tools or if the query takes five or six hours. So we have a good handle on the acquisition of the data and how to store and retrieve it. How we present it is where the fun part really begins.
Q: What is your advice in how to move forward in setting up Big Data?
The first thing is to sit down as a group and decide what you want to accomplish, rather than spending money for nothing. Once you think you know what you want to use the data for, then you can determine how to make it secure, do it in a reasonable timeframe, and break it down into smaller components. That’s the approach that has worked the best with the people I’ve worked with.
I just think that when you think about Big Data and Data Warehousing they can be seen as antagonistic, but I actually view them as complementary. Big Data is just a part of a data strategy for large enterprises.
Q: Is SQL really the right language to handle data analytics, and how do you see the future of SQL progressing?
In the last year so many SQL engines have been placed on top of the systems that deal with Big Data that I don’t see SQL going away. SQL is the language for grabbing data and merging it together so that you can present it to the user. That won’t stop in the next 20 years. SQL is going to be here for a long time.
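Editor's illustration: a minimal sketch of SQL grabbing data from two tables and merging it for presentation, run here through Python's built-in sqlite3 module. The customers and orders tables and their contents are hypothetical examples, not data from the interview.

```python
import sqlite3

# Hypothetical tables, created in memory only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers (id, region) VALUES (1, 'east'), (2, 'west');
INSERT INTO orders (customer_id, amount) VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# Join the two tables and summarize order totals per region.
rows = conn.execute("""
SELECT c.region, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
GROUP BY c.region
ORDER BY c.region
""").fetchall()
```

The same JOIN and GROUP BY pattern is what many SQL-on-big-data engines accept, which is part of why the language carries over so readily.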
Mr. Kelley is an internationally known expert in database technology. He has over 30 years of experience in the design and implementation of operational/production systems, operational data stores and data warehouses (data marts). Mr. Kelley has managed small teams and has also been responsible for all data within an organization. Mr. Kelley has co-authored or contributed to four books on data warehousing and has been published in numerous trade magazines on database technology, data warehousing, metadata management, master data management, data governance, and enterprise data strategies.