Does it matter where you store your data? Is one home better than another? Brickendon Consultant Ramanan Venkat takes a look.
If a lake is a water-filled basin surrounded by land, with rivers or other outlets feeding or draining it, then it is quite easy to understand the concept of a data lake. According to James Dixon, CTO of software company Pentaho, who coined the phrase, a data lake is like a body of water in its natural state, with data flowing from various streams (the source systems) into the lake. Users can come to the lake to examine its contents, take samples or dive in.
Put simply, all data is loaded from the source systems and no data is turned away. Data is stored at the leaf level in its untransformed, or nearly untransformed, state, and a schema is applied only to fulfil the needs of a specific analysis. Transformation therefore happens only at the point a schema is applied, meaning the data held in the lake itself always remains untransformed.
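To make this concrete, here is a minimal Python sketch of the idea; the file layout and field names are illustrative assumptions, not any particular product's. Raw events land in a lake directory untouched, and a schema is applied only at the point of analysis:

```python
import json
from pathlib import Path

# Hypothetical landing zone: events are persisted exactly as they arrive,
# with no transformation or schema enforcement.
lake = Path("lake/events")
lake.mkdir(parents=True, exist_ok=True)

raw_events = [
    {"user": "u1", "action": "login", "ts": "2023-01-01T09:00:00"},
    {"user": "u2", "action": "purchase", "amount": "19.99"},  # different shape
]
with open(lake / "2023-01-01.jsonl", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# A schema is applied only when an analysis needs it: this query projects
# just the fields it cares about and coerces types on the fly, leaving
# the stored data untransformed.
def read_purchases(path):
    for line in open(path):
        event = json.loads(line)
        if event.get("action") == "purchase":
            yield {"user": event["user"], "amount": float(event["amount"])}

print(list(read_purchases(lake / "2023-01-01.jsonl")))
# [{'user': 'u2', 'amount': 19.99}]
```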
By contrast, Dixon likens a data mart (a subset of a data warehouse) to a bottle of water. It is cleansed, packaged and structured for easy consumption.
So, what are the main differences between a data lake and a data warehouse and why do they both matter?
A data lake retains all data: not just the data that is in use today, but data that may be used at some point in the future. When data goes into the lake, it stays there so that it can be drawn on for analysis at any time. This is possible because data lakes use commodity hardware instead of the proprietary hardware typically used by data warehouses: off-the-shelf servers, combined with the ever-falling cost of storage, make scaling a data lake to terabytes and petabytes economical.
By contrast, building a data warehouse requires a considerable amount of time spent analysing data sources, understanding business processes and profiling data to decide which entities should be included. The result is a highly structured data model designed for reporting. (This selection process simplifies the data model and conserves space on the expensive disk storage used to guarantee the warehouse's high performance.)
Data lakes are data-type agnostic and can store both structured and unstructured data. Non-traditional sources such as web server logs, sensor data, social network activity, text and images are all examples of unstructured data. New uses for these data types are found every day, so it is important that all of them are captured in the lake. Data scientists can then develop schemas over this data and add new dimensions aimed at improving the business, an approach popularly known as 'schema on read'.
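As an illustration of schema on read over unstructured data, a data scientist might impose structure on raw web server logs only at query time. The log line below is fabricated for illustration, in the common Apache-style format:

```python
import re

# A raw web-server log line, stored in the lake exactly as it arrived.
log_line = '203.0.113.5 - - [01/Jan/2023:09:00:00 +0000] "GET /home HTTP/1.1" 200 512'

# Structure is imposed at read time: extract only the dimensions this
# particular analysis needs, and nothing else.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d+) (?P<bytes>\d+)'
)
match = pattern.match(log_line)
print({k: match.group(k) for k in ("ip", "method", "path", "status")})
# {'ip': '203.0.113.5', 'method': 'GET', 'path': '/home', 'status': '200'}
```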
By contrast, data warehouses usually consist of structured data extracted from transactional systems: quantitative metrics and the attributes that describe them, popularly known as facts and dimensions respectively. Because a warehouse's structure is defined up front to fit the incoming data, this approach is known as 'schema on write'.
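For comparison, here is a minimal schema-on-write sketch using Python's built-in sqlite3 module; the table and column names are invented for illustration. The structure, with its facts and dimensions, is declared before any data arrives:

```python
import sqlite3

# Schema on write: the warehouse structure is declared before loading.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product (product_id),
        quantity   INTEGER NOT NULL,
        amount     REAL    NOT NULL
    );
""")

# Incoming records must already conform to the declared shape.
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 3, 59.97)")

print(conn.execute(
    "SELECT p.name, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_product p ON p.product_id = f.product_id GROUP BY p.name"
).fetchall())
# [('Widget', 59.97)]
```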
Data lakes support all users. In most organisations, 80 per cent or more of users are 'operational': they run reports, check their key performance indicators or slice the same set of data in a spreadsheet every day. The data warehouse is ideal for these users because the data is well structured, easy to use and understand, and purpose-built to answer their questions. Another 10 per cent or so analyse the data more deeply. They use reports from the data warehouse as a source, but often go back to the source systems for data that is not in the warehouse, or bring in new data from outside the organisation.
Finally, the remaining few per cent perform deep analysis. They may create new data sources based on their research, mash up different types of data and pose entirely new questions to be answered. These users are the data scientists, who apply advanced analytic techniques such as statistical analysis and predictive modelling, and who may or may not use the data warehouse, since their work usually takes them beyond it.
The data lake approach supports all of these users equally well. The data scientists get a very large and varied data set to explore, while everyone else can work with more structured views of the data in the lake.
Data lakes also adapt easily to change. All data in a lake is stored in its rawest form and is accessible to anyone who needs it. This empowers users to go beyond the structure of the data source, explore the data in novel ways and answer their own questions at their own pace. If an exploration proves useful and there is a desire to repeat it, a more formal schema can be applied to it, with automation and reusability built around it to extend the results to a broader audience, as sketched below. If the result turns out not to be useful, it can be discarded with no changes made to data structures and no development resources consumed.
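As a sketch of that promotion step, using the hypothetical record layout from the earlier example, an ad-hoc parsing routine can be hardened into a named, typed schema once it proves useful:

```python
from dataclasses import dataclass

# Hypothetical formalisation of an exploration that proved useful: the
# ad-hoc field picking is replaced by a named, typed record.
@dataclass
class Purchase:
    user: str
    amount: float

def to_purchase(raw: dict) -> Purchase:
    # The same projection used during exploration, now documented and
    # reusable; unexpected extra fields in the raw data are simply ignored.
    return Purchase(user=raw["user"], amount=float(raw["amount"]))

print(to_purchase({"user": "u2", "amount": "19.99", "channel": "web"}))
# Purchase(user='u2', amount=19.99)
```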
By contrast, a data warehouse's structures must be well defined up front to support the incoming data. Any change to the format of that data, or the addition or alteration of data sources, cannot be made straight away; keeping the warehouse aligned with business needs requires considerable development and testing effort.
Data lakes also provide faster insights, because users can access the data before it has been transformed, cleansed and structured, getting to it far sooner than under the traditional data warehouse approach. This easy access comes at a price, though. The structuring, normalisation and cleansing work typically performed by a warehouse development team may not be done in a data lake environment, leaving users in the driver's seat to explore and use the data as they see fit. The first tier of business users described above may not want to do that pre-analysis work; they simply want their reports and KPIs. These operational report consumers will make more use of structured views of the data in the lake, provided those views resemble what they have always had in the data warehouse. The difference is that in a data lake such views exist primarily as metadata sitting over the data, rather than as rigid physical tables that require a developer to change them.
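To illustrate the views-as-metadata idea, here is a minimal sketch assuming a PySpark installation and the lake directory from the earlier example. The registered view is just metadata over the raw files, not a physical table:

```python
from pyspark.sql import SparkSession

# Assumes PySpark and the raw JSONL files written in the earlier sketch.
spark = SparkSession.builder.getOrCreate()

events = spark.read.json("lake/events/")  # schema inferred at read time
events.createOrReplaceTempView("events")  # metadata only: no data copied,
                                          # no physical table built

# Operational users query the familiar tabular shape; redefining the view
# is a metadata change, not a rebuild of rigid physical tables.
spark.sql("SELECT action, COUNT(*) AS n FROM events GROUP BY action").show()
```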
To conclude, relational databases have been around for a long time, and it is natural to favour them given that they have been used to build complex applications for decades. However, they bring challenges of their own, including limited scalability, increasing complexity and the rising cost of maintaining the applications built on them.
Data lake schemas and NoSQL database engines offer more flexibility and control, making it possible to mould raw data to a variety of requirements and helping offset the overheads that relational databases bring. Relational DBMS vendors such as IBM, Oracle and Microsoft have identified this gap in the market and have since come up with their own flavours of NoSQL solutions.
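To illustrate that flexibility, here is a minimal sketch assuming a MongoDB instance running locally and the pymongo driver; the database and collection names are illustrative. Documents of different shapes can share a collection with no schema migration:

```python
from pymongo import MongoClient

# Assumes a MongoDB server on localhost; "demo" and "events" are
# illustrative names.
client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

events.insert_one({"user": "u1", "action": "login"})
events.insert_one({"user": "u2", "action": "purchase",
                   "amount": 19.99, "items": ["widget"]})  # extra fields,
                                                           # no DDL change

for doc in events.find({"action": "purchase"}):
    print(doc)
```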
In short, there is a need and a place for both types of data storage solution, depending on what you are doing and what you want your outcome to be. In fact, it is quite likely that in future the two will increasingly be used alongside each other as complementary products.