1. Solution Value
A data lakehouse is an analytics repository combining the best of data lakes and data warehouses. Lakehouses store data in open standard formats accessible by a range of analytics and data science technologies and add an analytics engine imbued with query optimization and processing technologies once exclusive to data warehouses. Lakehouses facilitate using an array of technologies across the data lifecycle without having to copy or convert the underlying data. They pave a friction-free path to data-driven business operation by facilitating inclusion, sharing, and onboarding of more data, more quickly.
Cloudera Lakehouse, part of the Cloudera Data Platform (CDP), is an end-to-end solution with streaming, data engineering, analytics, data visualization, machine learning, query tooling, governance, and lineage, all working cooperatively. It is entirely deployment target-agnostic (on-premises, Kubernetes, hybrid, and multicloud); it is governed, managed, secured, and monitored from the Shared Data Experience (SDX) unified management and control plane; and it uses data warehouse-class query technology and is compatible with the Apache Iceberg table format. Iceberg is fast, updatable with ACID-compliant consistency, and open, with contributions not dominated by any single vendor.
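The ACID guarantees noted above rest on a simple mechanism: open table formats like Iceberg write immutable data and metadata files, and a commit is a single atomic swap of the table's current-snapshot pointer, so readers always see a complete snapshot. A purely conceptual Python sketch of that idea (not the Iceberg API; all names are hypothetical):

```python
# Conceptual sketch of snapshot-based ACID commits, as used by open
# table formats such as Iceberg. This is an illustration, not the
# Iceberg API: snapshots are immutable, and a commit is an atomic
# pointer move, so readers never observe a partial write.

class Table:
    def __init__(self):
        self.snapshots = [tuple()]   # immutable snapshots of row data
        self.current = 0             # pointer to the committed snapshot

    def read(self):
        # Readers only ever see the committed snapshot.
        return self.snapshots[self.current]

    def commit(self, new_rows):
        # Write a new snapshot alongside the old one...
        new_snapshot = self.read() + tuple(new_rows)
        self.snapshots.append(new_snapshot)
        # ...then commit by atomically moving the pointer.
        self.current = len(self.snapshots) - 1

table = Table()
table.commit(["row-1", "row-2"])
table.commit(["row-3"])
print(table.read())        # latest committed snapshot
print(table.snapshots[1])  # older snapshot remains readable ("time travel")
```

Retaining older snapshots is also what enables features like time travel and safe concurrent readers during writes.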
2. Urgency and Risk
Data-driven business operation has morphed from competitive advantage to necessity, and a combination of data lake and data warehouse capabilities is essential to get you there. The data lakehouse model provides the ideal combination and should be high on your list of priorities. Waiting is no longer an option, and benefits will come in short order. Without a truly open data platform, your organization will be hard-pressed to integrate with all data ecosystems, all tools and engines across the full data lifecycle, and all data, regardless of location.
Stitching together best-of-breed solutions can lead to integration and support headaches—even nightmares. However, it’s important to mitigate vendor lock-in. A cohesive end-to-end system that uses open technologies is, therefore, imperative. In addition, many organizations falter when deploying data governance, access, discoverability, and self-service in a coordinated fashion. Look for a platform with governance and access controls built into the core. All of this is especially important for large organizations with multiple divisions and/or subsidiaries, where flexibility and consistency are required.
3. Benefits
- Having a data lakehouse develops your team’s data literacy, habits, and hygiene. Making a variety of well-curated data available encourages the use of data as a go-to resource for virtually all business decisions.
- A data lakehouse evangelizes data use through discoverability of data resources and data products, enhancing trust in data through governance.
- Data-driven decisions translate to greater efficiency and competitiveness, ultimately leading to lower costs and higher revenue. This is especially helpful in recessionary environments, when market conditions are even more difficult to navigate through hunch and intuition.
- Platforms that support the development of data products incentivize business domains to share and evangelize their data, leading to greater adoption of the lakehouse and greater organizational participation in data-driven operations.
- Data-driven operation builds the best foundation for machine learning and predictive analytics once practices and processes are well-established.
Figure 1. Benefits of a Data Lakehouse
4. Best Practices
In the age of data lakes, it is of utmost importance to leverage open data formats for maximum compatibility and reusability, eliminating the need for data movement and duplication, both of which may run afoul of data protection regulations. It is also essential to leverage data warehouse-class query technology, using massively parallel processing and query optimization, rather than a passive engine that merely converts SQL to imperative code.
Additional best practices include:
- Implementing your data lake on a multicloud- and hybrid cloud-capable platform, which can provide a consistent experience across locations, with the freedom and portability to shift environments as and when needed.
- Selecting a platform that can ingest data from a variety of sources with end-to-end data lifecycle capabilities.
- Implementing bronze, silver, and gold “medallion zones” for the lakehouse and creating domain-specific data environments to facilitate the data mesh model and the development of data products.
- Ensuring your lakehouse can interface with your existing data platforms and technologies via connectors and federated query.
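The bronze/silver/gold medallion flow recommended above can be sketched in a few lines. The zone contents and cleaning rules here are illustrative assumptions, not a specific product's API:

```python
# Illustrative medallion-zone pipeline: raw data lands in bronze,
# is validated and standardized in silver, and is aggregated into a
# business-ready gold data product. Field names and rules are assumed.

raw_events = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "Bob", "amount": "80.00"},
    {"customer": " Alice ", "amount": None},   # bad record
]

# Bronze: land raw data as-is, preserving it for auditability.
bronze = list(raw_events)

# Silver: validate and standardize (drop bad rows, trim text, cast types).
silver = [
    {"customer": r["customer"].strip(), "amount": float(r["amount"])}
    for r in bronze
    if r["amount"] is not None
]

# Gold: a curated aggregate (revenue per customer) of the kind a
# business domain might publish as a data product in a data mesh.
gold = {}
for r in silver:
    gold[r["customer"]] = gold.get(r["customer"], 0.0) + r["amount"]

print(gold)   # {'Alice': 120.5, 'Bob': 80.0}
```

The key design point is that each zone is derived from the one before it, so lineage is traceable and raw data is never overwritten.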
5. Organizational Impact
The success of data lakehouse implementations is strongly influenced by engagement—facilitating personnel to incorporate data-driven analysis in their business decisions. Whether in the realm of investment, pricing, customer satisfaction, or strategy, getting access to the relevant data, understanding it, and blending it with data from other areas must become second nature. Initial training and ongoing mentoring will help; successful implementation of the lakehouse, its governance, and its usability will make a huge difference.
Customer satisfaction should rise, as data-driven operations tend to be more sensitive to customers’ needs and are more proactive overall. Executive management may initially struggle to deputize middle management and individual workers with more decision-making authority, even once confidence in the data-driven approach is established. It is imperative that all accountable stakeholders be guided through the process of granting more autonomy to team members, and there must be ways to measure and monitor the impact such changes are having.
Data lakehouses provide enhanced access, confidence, and comfort level with using and analyzing data for day-to-day business. Furthermore, a properly implemented data lakehouse can increase communication and information sharing between business units, especially when data products are developed within the data mesh approach to data management.
- Individual business users become more at home with data resources and analysis when a proper data catalog is implemented and open data formats are used, allowing analysis with users’ preferred tools. Initial training and ongoing mentoring are critical to success.
- This familiarity with data begets confidence and a corporate culture of citing data in decision-making.
- Non-technical users can participate fully and power users can emerge to act as data champions within their teams. Some may decide to upskill in certain data disciplines.
- IT can be freed from providing base-level support and implementing ad hoc analysis of data in response to frequent, ephemeral business user questions and requirements.
Pricing is calculated based on data services usage in the public cloud. The number and type of cluster master and worker nodes impact pricing, as does the particular workload, which can range from core data lakehouse to data engineering, streaming data processing, and data science. Annual 24/7 costs can range from as little as $15,000 for data hub workloads on small clusters with modest node instance types, to $300,000 on medium-sized clusters with more powerful node types, up to high six-figure pricing for very large, powerful data warehouse clusters.
24/7 operation may not be necessary: depending on requirements, one or multiple clusters can be provisioned, deprovisioned, or resized very efficiently in the public cloud, and even on-premises when container-based deployment is used. In general, modest implementations are affordable and still deliver significant value. Higher investment will yield bigger rewards.
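The cost dynamics described above reduce to simple arithmetic. The node counts and hourly rates below are hypothetical placeholders, not Cloudera list prices, chosen only to land near the quoted ranges; the point is that cost scales with node count, node size, and hours of operation, so right-sizing and deprovisioning idle clusters matter:

```python
# Back-of-envelope annual cost model for cloud cluster spend.
# All rates and cluster shapes are hypothetical illustrations.

HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_cost(nodes, hourly_rate_per_node, utilization=1.0):
    """Estimated annual cluster cost at a given utilization (1.0 = 24/7)."""
    return nodes * hourly_rate_per_node * HOURS_PER_YEAR * utilization

small = annual_cost(nodes=4, hourly_rate_per_node=0.45)        # modest data hub
medium = annual_cost(nodes=16, hourly_rate_per_node=2.20)      # medium warehouse
medium_40h = annual_cost(16, 2.20, utilization=40 / (24 * 7))  # business hours only

print(f"small, 24/7:    ${small:,.0f}")     # ~ $15,768
print(f"medium, 24/7:   ${medium:,.0f}")    # ~ $308,352
print(f"medium, 40h/wk: ${medium_40h:,.0f}")
```

As the last line suggests, scheduling a cluster for business hours only can cut its annual cost by roughly three quarters, which is why elastic provisioning matters.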
6. Solution Timeline
Concrete value—for example, faster performance, reduced analytic costs, increased user productivity—can be realized as soon as business domain data is inventoried and governed, which can happen in weeks. Broader value accumulates quickly and can continue for three to six months as an increasing number of business domains are included, analytics is enabled, and the data lakehouse platform is fully implemented. Additional, substantial value is achieved post-implementation when the organization has culturally transformed to data-driven practices and processes—about a year from initial roll-out. As customers implement more use cases on their lakehouse, they are able to launch new data products and add new revenue streams, realizing greater value from their investment.
Plan, Test, Deploy
Establishing baseline governance groundwork is critical and must precede any deployment. Implementation should be piloted with between one and three business units and tested rigorously before broader deployment is initiated.
Plan: Establishing a minimally viable data catalog and devising data access policies must precede any implementation. This governance work, along with use case and query requirements for one to three business units, will constitute the planning stage.
Test: The data lakehouse pilot should roll out to the first business units with monitoring of usage, performance, user satisfaction, and accuracy. Each business unit must have a rollout champion and additional team members participating and providing feedback.
Deploy: With learnings from pilot rollout(s) incorporated in the deployment plan, additional business units should be brought online in staged sequence. Catalog and access policies must be validated. One champion per business unit will receive train-the-trainer (TTT) coaching, then train their teammates in the use of the data catalog, query tools, and data visualization.
Cloudera’s open lakehouse on Iceberg with ACID compliance is available today across Azure and AWS, and when data services become available on Google Cloud, Iceberg support will be provided there as well. It works with the five CDP processing engines to provide true multi-function analytics. Iceberg support on the private cloud is in tech preview, and Cloudera is adding enhancements for multicloud workload portability, expanded data cataloging, improved performance, materialized views on Iceberg tables, and security and governance across hybrid deployments.
Technologies will inevitably change over time, but the modular makeup of CDP, the unified control plane provided by SDX, and the open compatibility of the Iceberg data format best orient a customer toward a future-proofed platform, architecture, and approach.
7. Analyst’s Take
A data lakehouse is an analytics repository that combines the best of data lakes and data warehouses, using an array of technologies across the data lifecycle critical to data-driven business operation. A properly implemented data lakehouse can increase communication and information sharing between business units. Customer satisfaction should increase, as should employee productivity, but management must be guided carefully through the process of granting more autonomy to team members.
Cloudera Lakehouse is an end-to-end platform that is deployment target-agnostic, with unified governance, security, and management/control planes. The company has a formidable customer roster and track record, enterprise sensibility, innovative platform engineering, and pioneering involvement in the modern data stack.
8. Report Methodology
A Gigaom CXO Decision Brief explores the value a technology can bring to an organization through the lens of the executive audience. It is focused on large impact zones that are often overlooked in technical research. We work with vendors directly and use our engineering and executive experience to highlight the most critical aspects, including the value statement, benefits, and best practices of important technologies. The aim is to help you, the end customer, find success in the decision process.
9. About Andrew Brust
Andrew Brust has held developer, CTO, analyst, research director, and market strategist positions at organizations ranging from the City of New York and Cap Gemini to GigaOm and Datameer. He has worked with small, medium, and Fortune 1000 clients in numerous industries and with software companies ranging from small ISVs to large vendors like Microsoft. The understanding of technology and the way customers use it that resulted from this experience makes his market and product analyses relevant, credible, and empathetic.
Andrew has tracked the Big Data and Analytics industry since its inception, as GigaOm’s Research Director and as ZDNet’s original blogger for Big Data and Analytics. Andrew co-chairs Visual Studio Live!, one of the nation’s longest-running developer conferences, and currently covers data and analytics for The New Stack and VentureBeat. As a seasoned technical author and speaker in the database field, Andrew understands today’s market in the context of its extensive enterprise underpinnings.
10. About GigaOm
GigaOm provides technical, operational, and business advice for IT’s strategic digital enterprise and business initiatives. Enterprise business leaders, CIOs, and technology organizations partner with GigaOm for practical, actionable, strategic, and visionary advice for modernizing and transforming their business. GigaOm’s advice empowers enterprises to successfully compete in an increasingly complicated business atmosphere that requires a solid understanding of constantly changing customer demands.
GigaOm works directly with enterprises both inside and outside of the IT organization to apply proven research and methodologies designed to avoid pitfalls and roadblocks while balancing risk and innovation. Research methodologies include but are not limited to adoption and benchmarking surveys, use cases, interviews, ROI/TCO, market landscapes, strategic trends, and technical benchmarks. Our analysts possess 20+ years of experience advising a spectrum of clients from early adopters to mainstream enterprises.
GigaOm’s perspective is that of the unbiased enterprise practitioner. Through this perspective, GigaOm connects with engaged and loyal subscribers on a deep and meaningful level.