As data analytics, machine learning and AI continue to evolve rapidly, so does the need to acquire, access and catalogue the large amounts of data that power them. This need has given rise to the “data lake”.
The standard model for data storage has been the data warehouse. In a traditional data warehouse, however, the data must be classified and formatted carefully before it is loaded (schema on write). Because the data is so formally structured, the questions asked of it must be carefully defined as well. A data warehouse is expensive, too: it is affordable only to corporations large enough to absorb the enormous costs of designing, building, housing and maintaining the data center infrastructure and associated software.
The Data Lake Difference
The data lake is also a storage repository but with several significant differences:
- The data lake can hold all types of data: structured, semi-structured and unstructured.
- The data doesn’t have to be filtered or sorted before storage – that happens when the data is accessed (schema on read).
- The costs of a data lake are far lower thanks to scalable, on-demand storage on a cloud platform such as Microsoft Azure, which also eliminates costly on-premises infrastructure.
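To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal Python sketch. It is an illustration only, not how any particular warehouse or Azure service works: the sample events, the column list and all function names are hypothetical, and in-memory lists stand in for real storage.

```python
import json

# Hypothetical raw events, exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user": "a1", "action": "click", "page": "/pricing"}',
    '{"user": "b2", "action": "view", "page": "/home", "referrer": "search"}',
    '{"user": "a1", "action": "click"}',
]

# Schema on write: the structure is fixed before anything is stored.
WAREHOUSE_COLUMNS = ("user", "action", "page")


def write_to_warehouse(raw_lines):
    """Store only records that match the fixed schema; reject the rest."""
    table, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        if set(record) == set(WAREHOUSE_COLUMNS):
            table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))
        else:
            rejected.append(line)
    return table, rejected


def write_to_lake(raw_lines):
    """Schema on read: store every record untouched, whatever its shape."""
    return list(raw_lines)


def read_clicks_per_user(lake):
    """Apply structure at query time, using only the fields this question needs."""
    counts = {}
    for line in lake:
        record = json.loads(line)
        if record.get("action") == "click":
            counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts


table, rejected = write_to_warehouse(RAW_EVENTS)
lake = write_to_lake(RAW_EVENTS)
clicks = read_clicks_per_user(lake)
```

In this sketch the warehouse-style loader keeps one record and rejects the two that do not fit its columns, while the lake keeps all three and still answers the click question because the structure is applied only when the data is read.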
Optimus Information recently asked Ryan O’Connor, Chief Technical Strategist, and Milan Mosny, Chief Data Architect, to talk more about data lakes and how Optimus is using the technology to further the business goals of our clients.
Q. How do you define a data lake?
Milan: A data lake holds data that is large in volume, velocity or variety. This is data acquired from logs, clickthrough records, social media, web interactions and other sources.
Q. So, when would a business use a data lake versus a data warehouse?
Milan: A business unit will use a data lake to answer questions that a warehouse can’t answer. These are questions that need huge amounts of data that won’t necessarily be present in a warehouse. The data lake can supply answers that will increase the agility of decision making or of business processes. Without a data lake, a business will have to use an ETL (extract, transform and load) process: they will have to define the ETL, build it and then load the data into the warehouse before they can begin to create the questions to get the answers they’re looking for. The data lake eliminates the need for the whole ETL process and saves enormous amounts of time.
Q. Is there a minimum size or amount of data needed to start a data lake?
Milan: I wouldn’t worry about minimum sizes. The best way to approach creating your own data lake is to start with a variety of data and then grow the lake from that point of view. One of the strategic strengths of a lake is that it holds so many different kinds of data from multiple (and different) sources. Variety is the key and that’s where I would focus.
Q. Data lakes are typically on cloud platforms like Azure. Can a data lake be on premises?
Milan: It can be, but only really big companies can justify the cost of running the extra servers needed to store the data. Why would you even bother when Azure and other cloud platforms are so scalable and affordable? It doesn’t make much sense financially. Plus, Azure offers so many of today’s powerful data lake technologies, such as Spark (a lightning-fast unified analytics engine), Azure Databricks and Azure Data Lake Analytics. In fact, Microsoft has a suite of superb Azure analytics tools for data lakes. The nice thing about these tools is that they work on storage, which is extremely affordable with Azure. So, you dump your data into storage on Azure and then spin up the analysis tools as you need them, without having to spin up the Azure cluster at the same time.
Q. Since a data lake can hold all sorts of data from different sources, how do you manage a data lake?
Ryan: The key is how you organize the ecosystem of ETLs, jobs and tools. You can use Azure Data Factory or Azure Data Catalog, which lets you manage the documentation around the datasets: what’s in each dataset, how it can be used and so on. As Milan said, Microsoft has recognized the massive impact of data lakes and has already produced some tremendous tools specifically for them.
Q. How is Optimus going to introduce data lake technology to its customers?
Ryan: Well, we are already implementing data lakes in our analytics practice. What we’re offering clients right now is a one-week Proof of Concept (PoC) for $7,500 CAD in which Optimus will do the following:
- Identify a business question needing a large dataset that cannot be answered with a client’s current BI architecture
- Ingest the data into Azure Data Lake storage
- Define 1 curated zone
- Create a curated dataset using Spark, Azure Databricks or Azure Data Lake Analytics with R, Python or U-SQL
- Create 1 Power BI dashboard with visuals that reflect the answer to the business question
- Provide a Knowledge Transfer
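The ingest and curate steps in this list can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Optimus’s actual pipeline: a temporary local folder stands in for Azure Data Lake storage, plain Python stands in for Spark or Azure Databricks, and the zone names ("raw", "curated"), file names and sample events are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root, source_name, lines):
    """Land the source data unchanged in a 'raw' zone (hypothetical layout)."""
    raw_dir = Path(lake_root) / "raw" / source_name
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / "events.jsonl"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


def build_curated(lake_root, raw_file, dataset_name):
    """Aggregate raw events into page-view counts and save them in a 'curated' zone."""
    counts = {}
    with open(raw_file, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Events without a page are kept and counted under "unknown".
            page = event.get("page", "unknown")
            counts[page] = counts.get(page, 0) + 1

    curated_dir = Path(lake_root) / "curated" / dataset_name
    curated_dir.mkdir(parents=True, exist_ok=True)
    target = curated_dir / "page_views.csv"
    with open(target, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["page", "views"])
        for page in sorted(counts):
            writer.writerow([page, counts[page]])
    return target


def run_demo():
    """Ingest a few sample events, curate them and return the curated rows."""
    with tempfile.TemporaryDirectory() as lake_root:
        raw = ingest_raw(lake_root, "weblogs", [
            '{"page": "/home"}',
            '{"page": "/pricing"}',
            '{"page": "/home"}',
            '{"user": "x"}',
        ])
        curated = build_curated(lake_root, raw, "page_views")
        with open(curated, newline="", encoding="utf-8") as fh:
            return list(csv.reader(fh))


rows = run_demo()
```

The curated CSV is the kind of small, tidy dataset a dashboard tool such as Power BI would connect to, while the raw files stay untouched for future questions.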
Q. Speaking of Power BI, Optimus is a huge fan of this tool, correct?
Milan: That’s right. We love it because we can build out stuff quickly for our customers, especially when it comes to PoCs. For visualization of data, nothing beats Power BI, especially when it’s applied to data lakes. It can connect to Hadoop clusters, to large storage volumes – in fact, it can connect to just about anything, including APIs.
Q. What is the purpose of the “one-week PoC”? What will your customers get out of it?
Ryan: We are only doing one curated zone as part of our offer. A customer would have multiple business problems they would want to answer, of course, but this one-week PoC gives them a taste of what is possible. A large project would require a full methodology: analysis, architecture, build-out, testing and deployment. A platform would also need to be chosen to show the data.
Milan: Our customers can expect us to set up the basic structures on Azure for them, and we’ll give them examples built around the business questions they want to answer so they can see how to expand the lake to other areas as well.
A data lake can bring enormous opportunity for powerful data analytics that can drive significant business results. How it is set up and used determines how successful a role it can play in your company’s data analysis. Optimus Information can help by showing you what a data lake can do with our one-week PoC offer. Contact us today for more information.