From data lake to data lakehouse

For decades, data analysts have struggled to consolidate structured and unstructured data for analysis by machine learning (ML) and business intelligence (BI) tools. Now, the emerging concept of the “data lakehouse” is bringing new possibilities for data science applications. A data lakehouse combines the management and structure of a data warehouse with the flexibility, cost efficiency and scale of a data lake.
Proponents view data lakehouses as the future of data storage and analytics, poised to displace the traditional data warehouse. At KeyBanc Capital Markets’ recent Emerging Technology Summit, keynote presenter Ali Ghodsi, CEO of Databricks, joined by CFO David Conte, discussed how early innovator Databricks is advancing adoption of the data lakehouse concept across industries.
Founded in 2013 by a group of researchers at the University of California, Berkeley (UC Berkeley), including Ghodsi, Databricks is the world’s first and only lakehouse platform based in the cloud. It is also the only software-as-a-service (SaaS) company to appear in both the Gartner Magic Quadrant for cloud database management systems (DBMS) and the Magic Quadrant for data science and machine learning (DSML) platforms. The $38 billion startup completed a $1.6 billion Series H funding round in September 2021 to help it compete against companies like Snowflake Inc., Amazon.com Inc. and Alphabet Inc.’s Google.
The disruptive power of predictive data analytics and AI
In Ghodsi’s view, data analytics and artificial intelligence (AI) will disrupt every industry, following the lead of the FANG companies, Meta (formerly Facebook), Amazon, Netflix and Alphabet (Google), which were the first to leverage data and AI to drive their business strategies. Uber is another clear example: it used data and AI to transform the taxicab business.
The real benefits of AI and machine learning, said Ghodsi, come when an enterprise can not only generate reports about past events, but anticipate future customer actions and trends. For example, Nationwide, the insurance and financial services company, uses robotic process automation and machine learning to underwrite life insurance policies. Walmart uses AI and machine learning in many operational areas, including assessing produce to reduce food spoilage and waste.
In the industrial sector, Rolls-Royce used its Databricks platform to double the lifespan of the engines it manufactures for airplanes around the world. The company first unified supplier data and streaming data generated by wireless sensors placed in 13,000 Rolls-Royce jet engines in use around the world, and then began using predictive analytics to anticipate maintenance requirements.
Now, Rolls-Royce can learn, for example, that engines used on a particular Qatar Airways route require more frequent servicing than engines elsewhere, because of sand accumulation. Serviced to prevent excessive sand buildup, the engines last longer. Conversely, planes in other regions of the world may require less servicing than anticipated. Through data-driven maintenance, Rolls-Royce was able to eliminate 22 million tons of carbon emissions, while extending the interval between planned maintenance downtime by 50%.
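The idea of anticipating maintenance from sensor data can be illustrated with a small sketch. It assumes a hypothetical wear reading recorded once per flight and a hypothetical service threshold; the function name, units and numbers are invented for illustration and do not come from Rolls-Royce or Databricks. The sketch fits a straight line to the readings and estimates how many more flights remain before the threshold is reached.

```python
def flights_until_service(readings, threshold):
    """Estimate flights remaining before wear crosses the service threshold.

    readings:  wear values, one per flight, oldest first (hypothetical units)
    threshold: wear level at which the engine should be serviced (hypothetical)
    Returns None when the readings show no upward trend.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(readings) / n
    # Ordinary least-squares fit of wear against flight number.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, readings))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    if slope <= 0:
        return None  # wear is flat or falling, so no crossing can be projected
    crossing_flight = (threshold - intercept) / slope
    return max(0, round(crossing_flight - (n - 1)))


# Made-up readings for two routes: the sandy route accumulates wear faster.
sandy_route = [1.0, 2.0, 3.0, 4.0, 5.0]
mild_route = [1.0, 1.5, 2.0, 2.5, 3.0]

sandy_remaining = flights_until_service(sandy_route, threshold=10.0)
mild_remaining = flights_until_service(mild_route, threshold=10.0)
```

With these invented numbers, the engine on the sandy route is projected to need service sooner than the one on the mild route, which is the kind of route-specific difference described above. A production system would use far richer models and data.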
Why few enterprises fully leverage their data
However, few companies truly achieve the full potential of their data with the powerful analytics tools available today. Most continue to struggle because they use disparate, incompatible data platforms. Most store all of their data — whether from email messages, wireless sensors, e-commerce systems or other sources — in data lakes. To use business intelligence (BI) tools like Microsoft Power BI or Tableau to understand past events, they must copy and format their data in a data warehouse. The result is two copies of the data, which employees have to maintain.
To complicate matters, the same staff must manage data lineage and audit trails between the two databases for data privacy and security compliance – which is, if not impossible, very difficult. Credit card data, medical records, and customer data all pose potential regulatory compliance risks, which is why a data lakehouse company like Databricks is “essentially a security company,” as Ghodsi observed.
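To see why keeping two copies in step is burdensome, consider a minimal sketch of the reconciliation work involved. It assumes both copies can be read as dictionaries keyed by a record ID; the record layout, function names and sample data are hypothetical. Each record is hashed, and the two copies are compared to find records that never reached the warehouse or that differ from the lake.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable hash of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(lake, warehouse):
    """Compare two copies keyed by record ID and report where they disagree."""
    missing = sorted(set(lake) - set(warehouse))
    stale = sorted(
        key
        for key in set(lake) & set(warehouse)
        if fingerprint(lake[key]) != fingerprint(warehouse[key])
    )
    return {"missing_in_warehouse": missing, "stale_in_warehouse": stale}


# Hypothetical customer records: the warehouse copy has fallen behind the lake.
lake = {
    "c1": {"name": "Ada", "tier": "gold"},
    "c2": {"name": "Bo", "tier": "silver"},
    "c3": {"name": "Cy", "tier": "gold"},
}
warehouse = {
    "c1": {"tier": "gold", "name": "Ada"},   # same content, different key order
    "c2": {"name": "Bo", "tier": "bronze"},  # out of date
}

drift = find_drift(lake, warehouse)
```

Here the check flags one missing record and one stale record. Every such discrepancy is something staff must investigate and explain in an audit, which is the overhead a single shared copy is meant to avoid.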
The lakehouse concept
The data lakehouse overcomes the traditional challenges by eliminating data silos and combining data governance, security, data engineering, data streaming, BI and machine learning capabilities into one unified platform. Databricks’ foundational technology is Delta Lake, an open-format storage layer for streaming and batch data, whether structured, semi-structured or unstructured, making data reliable and accessible.
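One way to picture a single storage layer serving both batch and streaming data is a table built on an ordered log of commits. The toy class below is only an in-memory illustration of that general idea; it is not Delta Lake’s API or file format, and the class and method names are hypothetical. A bulk load and a single streamed event are written through the same commit path, and readers can ask for the table as of any version.

```python
class ToyTable:
    """In-memory stand-in for a table backed by an ordered commit log."""

    def __init__(self):
        self._commits = []  # each commit is a list of rows added together

    def commit(self, rows):
        """Append rows as one commit and return the new table version."""
        self._commits.append(list(rows))
        return len(self._commits)

    def snapshot(self, version=None):
        """Return all rows visible at the given version (latest by default)."""
        if version is None:
            version = len(self._commits)
        return [row for commit in self._commits[:version] for row in commit]


table = ToyTable()

# A batch job writes many rows in one commit.
v1 = table.commit([{"id": 1, "source": "batch"}, {"id": 2, "source": "batch"}])

# A streaming writer adds one event through the same commit path.
v2 = table.commit([{"id": 3, "source": "stream"}])

latest_rows = table.snapshot()
rows_at_v1 = table.snapshot(version=v1)
```

Because every write goes through the same log, BI queries and machine learning jobs read from one consistent copy rather than from separate lake and warehouse systems.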
Without a data lakehouse, the software stack today is “awful and super-confusing,” said Ghodsi. Typically, an enterprise must integrate multiple products from multiple data warehouse and data science vendors, or integrate 10 to 15 services from different cloud vendors. Often, an enterprise may need to assemble 40 or 50 different software elements to create the data and analytics platform needed.
In addition, many companies today depend on vendors they selected 20 years ago, even though newer and better solutions are now available, because migrating is costly and risky. With its open-source foundation, Databricks solves the problem of vendor lock-in because the platform can be integrated with “the big wide ecosystem” of services, from the popular Tableau and Microsoft Power BI tools to data governance vendors like Privacera and DBT, and various cloud-based services.
Where the market is headed
Many competitors are getting into the data lakehouse business, although Databricks is the current leader. In Ghodsi’s observation, it is difficult for data warehousing vendors to offer robust machine learning capabilities. However, the marketplace has room for many different data lakehouse vendors. Only an estimated 5% to 20% of all enterprise data has moved to the cloud, leaving a very large market, said Conte. The largest cloud-service providers, like Amazon Web Services, have been talking about data lakehouses for some time, but implementation means integrating multiple different services.
Ghodsi predicted that Microsoft and Google will be making big moves within the next six months or so, creating tools that enable users to directly access and analyze data stored in the cloud. However, users will need a data and analytics technology stack that works with multiple cloud services.
Development of Databricks
From its roots at UC Berkeley, where the founders helped companies like Facebook fine-tune their engagement algorithms, Databricks has since created the free, open-source Apache Spark™, Delta Lake and MLflow projects. While its mission to “democratize” the use of data and data science is unchanged since its founding, Databricks has widened its focus to help a broad range of enterprises leverage their AI, ML and BI tools, and address regulatory requirements pertaining to data management.
Databricks now offers both open-source software that developers can download and its cloud-based SaaS product. The company has exceeded 80% year-over-year growth in annual recurring revenue, with 3,000 employees and more than 6,000 customers, from small to very large, processing an exabyte of data per day and launching more than 10 million virtual machines in the cloud per day.
Databricks users historically have been data analysts trained in Structured Query Language (SQL), even though Databricks is not based on SQL. Now, Databricks is considering business users and what CFO David Conte calls “the ilities,” particularly simplicity and usability. Toward that aim, Databricks has acquired Redash and 8080 Labs, makers of no-code/low-code tools that empower the “citizen developer,” and released Databricks AutoML, which also offers a low-code approach to generating baseline models.
With potential use cases for data lakehouses across every industry, from healthcare, financial services and consumer packaged goods to oil and gas, automotive manufacturing and government, Databricks is investing in its expansion by building teams dedicated to different verticals.
The future of data lakehouses
Data lakehouse adoption is still in its very earliest phases, says Ghodsi, but few people realize how pervasive the concept will be in the future. Data lakehouses will “transform every job, every industry, every vertical, primarily in good ways,” he says.
The KeyBanc Capital Markets Technology group helps emerging technology companies compete in a rapidly changing world with in-depth events and seminars, actionable insights and market research, and access to capital. To learn more about opportunities in the database technology sector, contact your KeyBanc Capital Markets investment banker.
Stay ahead of the pulse on emerging technology with KeyBanc Capital Markets
From enterprise software to fintech and cryptocurrency, technology is rife with opportunity—when powered by in-depth expert market analysis. For more information, connect with one of our technology investment bankers at key.com/experts.
To learn more about attending one of our conferences, email the Corporate Access team.
About the 2022 Emerging Technology Summit
The 2022 Emerging Technology Summit attendees included 800+ institutional investors, 115+ private equity/venture capital and corporate development investors, 38 public companies and 110 private companies. The agenda included 68 Fireside Chats/Presentations, 10 thematic panels and 5 Keynotes.
This article is for general information purposes only and does not consider the specific investment objectives, financial situation, and particular needs of any individual person or entity.
KeyBanc Capital Markets is a trade name under which the corporate and investment banking products and services of KeyCorp® and its subsidiaries, KeyBanc Capital Markets Inc., Member FINRA/SIPC (“KBCMI”), and KeyBank National Association (“KeyBank N.A.”), are marketed. Securities products and services are offered by KeyBanc Capital Markets Inc. and its licensed securities representatives. Banking products and services are offered by KeyBank N.A.