Cloudera made a loud and clear statement with the takeover of Hortonworks in October 2018: it intended to be the undisputed leading name in the Big Data sector.
They already had one of the leading Big Data platforms in the world (CDH), and now they are taking the next step: Cloudera Data Platform (CDP), their newest offering, aims to be the ultimate data solution for all data-driven businesses.
The platform includes numerous components: developers have introduced some completely new ones, redesigned others, and will publish some in the future.
The entire ecosystem is too complicated to summarise in a single image, but here are the key components of the Cloudera Data Platform today:
ClearPeaks is a proud Cloudera partner. Our Cloudera-certified specialists stay up to speed on the latest technologies to remain one step ahead of the competition and provide the best solution for any situation.
We were fortunate to get an early look at CDP when we visited the Cloudera Conference in Dubai a few days ago. In this post we present it, outlining how it relates to CDH, what the first version looks like, and what features it will offer once fully released.
1. The Enterprise Data Cloud
To convey their goals, Cloudera coined the term Enterprise Data Cloud (EDC) to describe what they see as the present and future of the Big Data business, and claimed to be the first Enterprise Data Cloud company.
But what exactly is the EDC?
Put simply, it is the answer to the main data-related problems of the modern business:
- Data silos and a lack of tool compatibility, with data dispersed across many sources.
- Corporate and government security regulations that often impede productivity.
An Enterprise Data Cloud addresses these problems: it is a multi-cloud, elastic, secure, and open data solution.
2. Cloudera Data Platform
Cloudera Data Platform, dubbed the “First Enterprise Data Cloud,” is the result of combining the best CDH and HDP (Hortonworks Data Platform) technologies into one large, comprehensive PaaS platform for all things data.
But there’s a lot more to it. It also introduces Analytics Services: pre-defined cloud-based solutions designed to address the most frequent and critical analytics workloads. These include Data Warehouse, Machine Learning, Data Science, Data Flow & Streaming, and Operational Database.
CDP can handle data in various environments, including cloud environments such as AWS, Azure, and GCP (Google Cloud Platform). It can also automatically scale workloads and infrastructure up and down to maximize efficiency and reduce costs.
It is designed to be the ultimate end-to-end (or Edge2AI) solution for any analytics problem, and it provides a single-pane-of-glass view of all business data and workloads.
An Environment is a cloud-based elastic cluster offered by AWS/Azure/GCP. CDP provides a single pane of glass that lets businesses monitor, control, and use numerous environments from multiple providers in different parts of the world.
Data Lake:
A special cluster with no processing capacity that provides a common data lake to the services and workload clusters running in it. It is coupled to an environment. It comes with a Hive Metastore for sharing table metadata throughout the lake, Atlas as a comprehensive data catalog and lineage repository, and Ranger for security and authorization across the entire ecosystem.
Data Hub is the evolution of CDH/HDP into a cloud-based, elastic cluster that you can deploy with only a few clicks using a wizard-like UI. It is powered by Cloudera Runtime, a new Cloudera distribution. You can have several Data Hub clusters in each environment, all connected to the same Data Lake, each with its own specific services and infrastructure.
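The relationship between these three building blocks can be sketched as a small data model. This is only an illustration of the structure described above, not Cloudera's API: the class names, fields, and the `add_data_hub` helper are all hypothetical, and the environment and cluster names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Shared services of one environment; no processing capacity of its own."""
    name: str
    # Services named in the text: shared metadata, catalog/lineage, security.
    services: tuple = ("Hive Metastore", "Atlas", "Ranger")


@dataclass
class DataHub:
    """A workload cluster that attaches to the environment's Data Lake."""
    name: str
    data_lake: DataLake


@dataclass
class Environment:
    """One cloud environment coupled to a single Data Lake (hypothetical model)."""
    name: str
    provider: str  # "AWS", "Azure" or "GCP"
    data_lake: DataLake
    data_hubs: list = field(default_factory=list)

    def add_data_hub(self, name: str) -> DataHub:
        # Every Data Hub created here shares the environment's Data Lake.
        hub = DataHub(name=name, data_lake=self.data_lake)
        self.data_hubs.append(hub)
        return hub


env = Environment("prod-eu", "AWS", DataLake("prod-eu-lake"))
etl = env.add_data_hub("etl")
streaming = env.add_data_hub("streaming")
print(etl.data_lake is streaming.data_lake)  # True
```

Both clusters point at the same `DataLake` object, which mirrors the idea that metadata, lineage, and security are defined once per environment and shared by every workload cluster in it.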
The first Analytics Service is the Data Warehouse.
It’s a customized auto-scaling cluster designed exclusively for self-service DWHs. It uses the same public cloud and runs in the same environment, but it is not controlled by YARN. It comprises Database Catalogs and Virtual Warehouses, the latter being the compute resources.
The second Analytics Service is Machine Learning.
Machine Learning is CDP’s cloud-native edition of the Cloudera Data Science Workbench (CDSW). It brings together the most popular IDEs, such as Jupyter and Zeppelin, and the most popular languages and libraries (Spark, Python, TensorFlow, Scala, R, etc.) in a unified self-service machine learning system. It uses the same common data lake as the other services, but it is powered by Kubernetes.
Cloudera SDX stands for Cloudera Shared Data Experience. Thanks to Ranger and Atlas, it provides a single control plane for the overall environment, featuring shared metadata, data catalog and lineage, security, and governance.
Data Catalog is a cross-environment service that provides a data asset repository for all of an organization’s data lakes. It allows users to organize, understand, govern, and curate data across several locations.