The importance of the data lake continues to grow as organizations contend with the proliferation of unstructured data and its sources. By one estimate, “80% of worldwide data will be unstructured by 2025” [1], meaning that companies must take a look at their data lakes now to ensure they’re prepared.

To set up a data lake that works well alongside existing databases and data warehouses, companies must ensure that the data lake has sufficient scalability, integration, and deployment options to deliver all data, including unstructured data, for analysis when and where it is required. Automated governance is also essential to help ensure the data can be trusted, as are storage options so companies can choose what best fits their current architecture and data lake needs. Each of these topics is covered in depth in the eBook, Building a robust, governed data lake for AI, but a sneak preview is also available below.

Data lake scalability, integration and deployment options

The data lake is undeniably one of the most important data management tools for collecting the newest and most varied sources of data. Data lakes help collect streaming audio, video, call log, sentiment and social media data “as is” to provide more complete, robust insights. This has considerable impact on the ability to perform AI, machine learning and data science; it isn’t too much of a stretch to say that the data lake is, at least in part, the foundation for the future of these capabilities. However, to make the most of the data lake, it must be scalable, integrated and widely deployable so that no data is missed and all data can be used easily.

Taken together, the concepts of data lake scalability, integration and deployment options typically fall under the umbrella of enterprise readiness. It’s easy to see why: each of these traits speaks directly to the data lake’s ability to perform its core duties of ingesting data when and where it is required and then providing it for analytics.

The data lake must scale extensively, quickly, and at low cost. This is often achieved by using clusters of inexpensive commodity hardware. Having this capacity available reduces the likelihood that real-time data ingestion will be interrupted and also opens the opportunity to economically store cold, historical data from a database or warehouse.

Federation capabilities should always be part of the data lake. Faster than enterprise service bus (ESB) or extract, transform, load (ETL) processes, federation provides an easier way to break down silos across data management by querying data where it lives rather than moving it first. A SQL-on-Hadoop engine like IBM Db2 Big SQL is recommended.
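The core idea of federation can be sketched with Python’s standard sqlite3 module: two separate databases stand in for a warehouse and a data lake, and a single SQL query spans both. This is a conceptual illustration only, not how Db2 Big SQL is implemented; the table and column names are invented for the example.

```python
import sqlite3

# Two separate stores: "main" stands in for a warehouse, "lake" for a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, event TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO lake.events VALUES (?, ?)",
                 [(1, "page_view"), (1, "purchase"), (2, "page_view")])

# One query joins across both stores -- no ETL job moved the data first.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS events
    FROM customers c JOIN lake.events e ON e.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 2), ('Globex', 1)]
```

The point of the sketch is the shape of the query: analytics code sees one logical schema even though the rows live in two places.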

And even when federation capabilities exist, multiple deployment options are helpful – particularly since 45% of businesses run at least one big data workload in the cloud. [2] On-premises, multicloud and hybrid solutions should all be offered so companies can address compliance needs by putting the data lake behind an on-premises firewall, or efficiency needs with a pay-as-you-go cloud model. The ability to combine data lake locations as needed in a hybrid environment gives businesses the flexibility to make choices that suit their unique needs as situations arise.

Automated governance for the data lake

Data lake governance covers multiple areas, ranging from the improvement of data integration discussed previously and data cataloging to more traditional data governance and self-service data access. Automation is key across all these areas to ensure that DBAs and data scientists spend their time on higher-value activities.

Though the function of integrating data is handled by capabilities such as federation, governance plays a vital role in facilitating the process with in-line data quality checks and active metadata policy enforcement. Experts expect data integration tasks to be reduced by 45% through the addition of ML and automated service-level management by 2022. [3] In addition, ML capabilities can help synchronize data from databases to cloud data warehouses. Look for AI-powered solutions like IBM DataStage to help deliver the best data lake integration.

Data cataloging, on the other hand, helps companies understand the data in their data lake better. It does so through defining data in business terms, enabling better visual exploration of the data and the ability to track data lineage. Seek solutions like IBM Watson Knowledge Catalog that take these capabilities to the next level with automated data discovery and metadata generation, ML-extracted business glossaries and automated scanning and risk assessments of unstructured data.
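The two catalog ideas above – business-term definitions and lineage tracking – can be sketched in a few lines of Python. The asset names, fields, and `lineage` helper are all hypothetical, invented for illustration; they are not Watson Knowledge Catalog’s actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                        # technical asset name in the lake
    business_term: str               # plain-language meaning for analysts
    upstream: list = field(default_factory=list)  # lineage: source assets

# A tiny catalog: a derived metric traces back through its sources.
catalog = {
    "raw_clicks": CatalogEntry("raw_clicks", "Unprocessed website click events"),
    "sessions": CatalogEntry("sessions", "Visits grouped per user",
                             upstream=["raw_clicks"]),
    "conversion_rate": CatalogEntry("conversion_rate",
                                    "Share of sessions ending in a purchase",
                                    upstream=["sessions"]),
}

def lineage(name):
    """Walk upstream links to answer 'where did this number come from?'"""
    chain = [name]
    for parent in catalog[name].upstream:
        chain += lineage(parent)
    return chain

print(lineage("conversion_rate"))  # ['conversion_rate', 'sessions', 'raw_clicks']
```

Even this toy version shows why cataloging matters: an analyst who distrusts `conversion_rate` can follow its lineage back to the raw events in seconds instead of asking around.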

What has traditionally been called data governance aligns most closely with security, including compliance and audit readiness. While these cannot be overlooked, automation is vital so that as little manual effort as possible is expended on them. Products like IBM InfoSphere Information Governance Catalog should be used to automate the classification and profiling of data assets and to automatically enforce the data protection rules established to anonymize and restrict access to sensitive information. Such products should also allow quick incident response by flagging sensitive data, identifying issues and enabling easy audit responses.

The self-service data access aspect of governance most directly helps data scientists. Instead of requiring them to spend time cleansing data manually, solutions like AutoAI automate the data preparation and modeling stages of the data science lifecycle. The result is insights that arrive more quickly and are trusted more thoroughly because clean data has been used.
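One representative preparation step that such tools automate is imputing missing numeric values. The sketch below shows mean imputation using only the standard library; it is a simplified stand-in for what AutoAI does, not its actual algorithm or API.

```python
from statistics import mean

def impute_means(rows):
    """Replace None in each numeric column with that column's mean."""
    cols = list(zip(*rows))                      # pivot rows into columns
    means = [mean(v for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, means)]
            for row in rows]

raw = [[1.0, 10.0], [None, 30.0], [3.0, None]]
print(impute_means(raw))  # [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
```

Automating even this single step across hundreds of columns is the kind of repetitive work that would otherwise eat a data scientist’s week.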

Selecting the best data lake storage option

Multiple storage options exist for the data lake, allowing companies to choose the one that best fits their current data management architecture and existing skill sets. Data lake vendors should offer object storage, file storage, and Apache Hadoop.

Object storage is based on units called objects that contain the data, metadata and a unique identifier. It provides the ability to scale computing power and storage independently – delivering cost savings in dynamic environments. Line-of-business applications, websites, mobile apps and long-term archives all benefit from object storage.
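The object model described above – data plus metadata plus a unique identifier, in a flat namespace – can be sketched as a toy store in Python. The class and method names are invented for illustration and do not correspond to any real object-storage API.

```python
import uuid

class ObjectStore:
    """Toy object store: each object bundles data, metadata, and an ID."""

    def __init__(self):
        self._objects = {}  # flat namespace keyed by object ID

    def put(self, data: bytes, **metadata) -> str:
        obj_id = str(uuid.uuid4())  # the unique identifier
        self._objects[obj_id] = {"data": data, "metadata": metadata}
        return obj_id

    def get(self, obj_id: str):
        obj = self._objects[obj_id]
        return obj["data"], obj["metadata"]

store = ObjectStore()
oid = store.put(b"call log audio bytes", source="call-center", format="wav")
data, meta = store.get(oid)
print(meta["source"])  # call-center
```

Because every object is addressed only by its ID rather than a position in a directory tree, storage nodes can be added behind the interface without clients noticing – which is why compute and storage can scale independently.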

The best file storage options allow transparent HDFS access and file access to the same storage capacity. Yottabyte scalability, flash acceleration and automated storage lifecycle management also speed performance and data access while providing opportunities for cost savings. Security features are also available, such as live notifications, end-to-end encryption and WORM (write once, read many) immutable data.

Apache Hadoop, used by 62% of the market, [4] is open source and relies on community support for improvement. Fault tolerance, reliability and high availability are key strengths. However, users should be aware that, unlike with object storage, processing and storage capacity are scaled together.

Access more in-depth information and data lake case studies

Data lake success is predicated on a wide range of factors; ignoring any of them could be the difference between having well-informed insights delivered in time to best the competition and lagging behind. Go in depth on a range of data lake topics in the eBook, Building a robust, governed data lake for AI, and see how others are using the data lake within three industries. If you’d like to talk to an expert about the data lake directly, you can also schedule a free 30-minute conversation.

Originally published on IBM Community Blogs.