What is a data lake?
Unlike a data warehouse, a data lake is a central target system or repository that stores large volumes of all conceivable types of data from a wide variety of sources in their native, raw format. Data lake solutions accommodate structured, semi-structured and unstructured data, for example information from ERP or CRM systems and sensors, but also images or videos. In contrast to the data warehouse, the exact use of the unprocessed data stored in a data lake has not yet been determined at the time of loading. Nevertheless, as with the data warehouse, the purpose of the data lake architecture is to make large volumes of data available ahead of business analytics and thereby enable data-driven corporate decisions.
How does a data lake work?
A data lake is a repository designed to ingest high volumes of data, even in real time. Because the data is stored in its original format, without defining data formats, schemas or transformations in advance, loading time is reduced: data is quickly available and can be updated more easily for big data analyses. The organisational effort required to operate a data lake is therefore much lower than for a data warehouse. The impulse to process the loaded data always comes from the user, who can shape it flexibly and thereby use it for different or changing analysis goals. Since data lakes impose no structure and therefore no restrictions, they are also considered easily accessible. On the other hand, this lack of structure means that usually only data scientists can use lake solutions competently.
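The loading principle described above is often called “schema-on-read”: raw records land in the lake untouched, and a schema is only applied when someone reads them. A minimal sketch in Python, using only the standard library; the lake path, file layout and field names are illustrative assumptions, not a real product API:

```python
import json
from datetime import date
from pathlib import Path

LAKE_ROOT = Path("lake/raw/crm")  # hypothetical lake path for CRM data

def ingest(records):
    """Load raw records as-is: no schema, no transformation before the load."""
    target = LAKE_ROOT / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)
    out = target / "events.jsonl"
    with out.open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return out

def read_field(path, field):
    """Schema-on-read: interpret the raw data only when a consumer asks for it."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line).get(field) for line in f]

path = ingest([{"customer": "A", "value": 10}, {"customer": "B"}])
print(read_field(path, "customer"))  # ['A', 'B']
```

Note that `ingest` accepts the second record even though it is missing the `value` field; a warehouse load with a fixed schema would have to reject or repair it first.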
A common variant of the data lake is the Hadoop data lake. Using a cluster of computing nodes built from commodity hardware, it primarily provides data in the Hadoop file system HDFS. Hadoop lakes are used, for example, to bundle existing data sources, integrate network data from remote locations or temporarily store data from overloaded systems. In addition, a Hadoop data lake can supplement a data warehouse by taking over the transformation of data and then transferring the already processed information to the data warehouse.
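The lake-feeds-warehouse pattern in the last sentence can be sketched in a few lines: raw records are parsed and typed in the lake layer, and only the cleaned result is transferred to the warehouse. Here an in-memory sqlite3 database stands in for the warehouse, and the file path and fields are illustrative assumptions:

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical raw lake file: one JSON record per line, straight from the source
raw = Path("lake/raw/orders.jsonl")
raw.parent.mkdir(parents=True, exist_ok=True)
raw.write_text(
    '{"order_id": 1, "amount": "19.90"}\n'
    '{"order_id": 2, "amount": "5.00"}\n'
)

# Transform in the lake layer: parse and type the raw records ...
rows = []
for line in raw.read_text().splitlines():
    rec = json.loads(line)
    rows.append((rec["order_id"], float(rec["amount"])))

# ... then transfer the processed data to the warehouse (sqlite3 stands in here)
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
wh.executemany("INSERT INTO orders VALUES (?, ?)", rows)
total = wh.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))  # 24.9
```

In a real Hadoop setup the raw file would live in HDFS and the transformation would run on the cluster, but the division of labour is the same.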
In general, data lakes can be implemented on a wide variety of platforms, for example on-premises, but also via cloud environments such as Google Cloud, AWS or Microsoft Azure.
What needs to be considered with data lakes?
In order to take the first steps in the area of data lakes or in deciding for or against data warehouses, companies should consider the following points:
Data governance mechanisms
The fact that data lakes hold a wide variety of data in a wide range of formats is a great benefit, but it is also a disadvantage, or at least a challenge. For companies to actually use data lakes in the context of big data or analytics, they must first define mechanisms to quickly find the desired data and to be able to trust the heterogeneous data stock. If criteria for maintaining data quality and correct data governance are missing, the data lake degenerates into a data swamp and becomes unusable.
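One such mechanism is a data catalog: a registry that records, for every dataset landing in the lake, where it came from, who owns it and roughly what it contains. A minimal sketch, with a hypothetical catalog file and metadata fields chosen for illustration:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

CATALOG = Path("lake/_catalog.json")  # hypothetical catalog location

def register(dataset_path, source, owner, schema_hint):
    """Record provenance and a schema hint so data can later be found and trusted."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    catalog[dataset_path] = {
        "source": source,
        "owner": owner,
        "schema_hint": schema_hint,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    CATALOG.write_text(json.dumps(catalog, indent=2))

def find(source):
    """Find all registered datasets that originate from a given source system."""
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else {}
    return [p for p, meta in catalog.items() if meta["source"] == source]

register("raw/crm/2024-01-05", "CRM", "sales-team", {"customer": "str"})
register("raw/erp/2024-01-05", "ERP", "finance", {"invoice": "int"})
print(find("CRM"))  # ['raw/crm/2024-01-05']
```

Datasets that bypass the catalog are exactly the ones that turn a lake into a swamp: nobody can say what they are or whether they can be trusted.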
Sufficient storage capacity
Loading large amounts of unstructured data can also be disadvantageous in terms of the required storage capacity, so it is important to ensure that sufficient storage options are available. However, the argument that expensive storage space is wasted carries less weight today than it did some time ago, because the costs for cloud computing are falling or adapt to the respective use case via pay-as-you-go models. As the computing power of virtual systems becomes cheaper for post-load transformation, storing unstructured data becomes more attractive.
Availability of data scientists
One important argument against the use of data lakes is the shortage of IT specialists. Business users often find it difficult to analyse unprocessed data, so data scientists and specialised tools are needed to prepare the available data for company-specific analyses or to make it usable at all.
Benefits of data lakes
“Time is money” – this everyday saying also illustrates one of the main benefits of data lakes.
Savings potential through high data availability
As data lakes no longer focus on the structure of the data, but transfer the raw data, or only slightly pre-filtered data, directly to the target system, the data is available very quickly, historically complete and in larger quantities than in data warehouses. This yields a further time and cost gain if the focus shifts during an analysis: there is no need to “go back to the drawing board”, as extensive data is already available in the data lake.
Optimised data lake utilisation through cloud computing
While increasing computing and storage capacity used to entail high hardware costs, such up-front investments can now be avoided. By using virtual infrastructures or hosted services, companies sidestep the risk of both insufficient and unused capacity. With Microsoft Azure, for example, cost-intensive analysis services or storage capacities can be switched on and off as needed. But although prices for cloud computing are falling, the individual and additional services still add up to considerable amounts as data volumes grow, which makes a well-considered management concept worthwhile. Even with machine costs of, say, only 0.12 euros per minute in the cloud, monthly expenses can quickly reach several hundred euros; costs for disk space and transaction volume are added, as is the multiplication of the cent amounts through numerous, iterative queries by employees. Used sensibly, cloud computing offers real added value. Nevertheless, SMEs are often uncertain about the specific implementation, which is why a hybrid model is particularly suitable for businesses of this size.
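A back-of-envelope calculation makes the 0.12 euros per minute figure concrete. The usage pattern (two hours of analysis per working day) and the flat storage fee are assumptions for illustration:

```python
# Back-of-envelope cloud cost check for a 0.12 €/min machine rate.
compute_rate_per_min = 0.12   # € per minute of machine time (figure from the text)
minutes_per_day = 2 * 60      # assumed: two hours of analysis per day
working_days = 22             # assumed: working days per month

compute = compute_rate_per_min * minutes_per_day * working_days
storage = 5.0                 # assumed flat monthly storage fee in €
total = compute + storage
print(f"{total:.2f} €")  # 321.80 €
```

Even this modest usage pattern lands in the “several hundred euros” range, before transaction costs and repeated iterative queries by employees are counted.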
Data analysis as a competitive advantage
Companies that treat their data as an asset can position themselves much better against the competition. Artificial intelligence and machine learning are helpful analytical methods for identifying growth opportunities, because machine learning can draw on new data sources such as log files, click or social media-based information, or data from smart devices, i.e. devices connected to the internet. Interpreting this new type of data helps to orient customer acquisition and retention towards demand, to increase productivity through predictive machine maintenance and to recognise market trends at an early stage.
Future decisions through value-based data analysis
A value-oriented analysis of data from the data lake not only provides departments with the necessary KPIs once the data has been processed; it also gives data scientists the opportunity to gain new insights and discover new correlations based on truly large amounts of data. For the management level, this new capital means that strategic decisions for the future can be made, and innovation processes initiated, much earlier and, above all, on a much sounder basis than before.
What solution does Lobster offer?
It is evident from the above that there is no clear either/or answer to the question of data lakes or data warehouses.
This is why Lobster offers a hybrid approach, bringing together the best of both worlds in the ETL/ELT module of its data integration software Lobster_data. With the Lobster tool, it is possible to manage both data lakes and data warehouses, to write to a big data system such as Hadoop, and to perform data cleansing before the load, which simplifies MapReduce rules and prevents data contamination in the data lake.
Lobster also works in a two-tier system: document-oriented with Lobster_data profiles and row-oriented with the ETL/ELT module. This results in significantly improved performance in sequential processing and optimised memory consumption.
Lobster therefore offers a combined solution comprising Lobster_data profiles, the ETL/ELT module, the workflow module, monitoring in the “Control centre” and a cloud system. As a stand-alone solution, any database structure would merely be a kind of motorway from data source to data destination, without much added value. Only as an overall package does the solution deliver significant advantages for data preparation ahead of business intelligence and business analytics.