As systems designed to support vast amounts and many types of data, Cloudera clusters must meet ever-changing security requirements set by regulatory agencies, companies, industries, and the public. Cloudera clusters are made up of Hadoop core and ecosystem components, all of which must be protected against a variety of threats to preserve the confidentiality, integrity, and availability of the cluster's services and data.
As a component of a GDPR compliance framework, Cloudera provides a common platform to help with the security and governance of personal consumer data, with the capacity to detect breaches, enforce policies, analyze data lineage, and conduct audits.
Security Requirements
The goals of a data management system, including confidentiality, integrity, and availability, require securing the system along several dimensions. These can be categorized in terms of both broad operational goals and technical concepts:
Perimeter: Access to the cluster must be protected against threats from internal and external networks and from a variety of actors. Network isolation can be provided by, for example, properly configured firewalls, routers, and subnets and the appropriate use of public and private IP addresses. Authentication mechanisms ensure that users, processes, and applications identify themselves to the cluster and prove they are who they say they are before gaining access.
Data: Data in the cluster must be protected from unauthorized exposure at all times. Similarly, communications between cluster nodes must be protected. Encryption mechanisms ensure that even if network packets are intercepted, or hard disk drives are physically removed from the system by malicious actors, the contents are unusable.
Access: Access to any specific service or piece of data inside the cluster must be granted explicitly. Once users have authenticated themselves to the cluster, authorization mechanisms ensure they can see only the data and use only the processes for which they have been granted explicit permission.
Visibility: Visibility means that the history of data changes is transparent and able to satisfy data governance policies. Auditing mechanisms ensure that all actions on data and its lineage (source, changes over time, and so on) are logged.
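The Perimeter, Access, and Visibility dimensions above can be sketched as a chain of checks. The following is only a minimal illustration in Python, not how Cloudera implements any of this: the subnet, the grant table, and every function and class name are hypothetical, and the user is assumed to have already authenticated (for example, through Kerberos).

```python
import ipaddress
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical values, for illustration only.
ALLOWED_SUBNETS = [ipaddress.ip_network("10.20.0.0/16")]  # private cluster subnet
GRANTS = {("alice", "sales_db"): {"read"}}                # explicit permissions


@dataclass
class AuditLog:
    """Visibility: keeps a record of every access decision."""
    entries: list = field(default_factory=list)

    def record(self, user, resource, action, allowed, reason):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "resource": resource,
            "action": action,
            "allowed": allowed,
            "reason": reason,
        })


def in_perimeter(client_ip):
    """Perimeter: accept only clients inside an allowed subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_SUBNETS)


def is_authorized(user, resource, action):
    """Access: deny by default; allow only explicitly granted actions."""
    return action in GRANTS.get((user, resource), set())


def handle_request(user, client_ip, resource, action, audit):
    """Run the perimeter and access checks, and audit the outcome either way."""
    if not in_perimeter(client_ip):
        allowed, reason = False, "outside perimeter"
    elif not is_authorized(user, resource, action):
        allowed, reason = False, "no explicit grant"
    else:
        allowed, reason = True, "granted"
    audit.record(user, resource, action, allowed, reason)
    return allowed
```

Calling `handle_request("alice", "10.20.3.4", "sales_db", "read", AuditLog())` returns `True`, while the same request for `"write"`, or from an address outside the subnet, returns `False`. Denied requests are logged as well, since failed attempts are often the most useful audit entries.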
The cluster can be secured to meet specific business requirements using security capabilities built into the Hadoop ecosystem together with external security infrastructure. Different security mechanisms can be applied at different levels.
Levels of Security
The diagram below depicts the range of security levels that can be configured for a Cloudera cluster, from non-secure to compliance-ready. The security level you choose for the cluster should rise as the sensitivity and volume of data on the cluster grow.
Security Architecture for Hadoop
The diagram below shows some of the many components at work in a production Cloudera enterprise cluster. The diagram emphasizes the importance of securing clusters that may span multiple data centers and that ingest data from both internal and external sources. Authentication and access controls must be applied across these many inter- and intra-connections and to all users who want to query, run jobs on, or even view the data held in the cluster.
External data streams are authenticated using the mechanisms built into Flume and Kafka. Data scientists and BI analysts can use interfaces such as Hue to work with data in Impala and Hive, for example, to create and submit jobs. All of these interactions can be secured with Kerberos authentication.
Transparent HDFS encryption with an enterprise-grade Key Trustee Server can be used to encrypt data at rest.
Ranger (for services such as Hive, Impala, and Search) and HDFS Access Control Lists (ACLs) can both be used to enforce authorization policies.
Apache Ranger can be used to provide auditing capabilities.
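As a rough illustration of what "secured with Kerberos authentication" looks like at the configuration level, the sketch below reads a Hadoop `core-site.xml` fragment and checks two settings. The property names `hadoop.security.authentication` and `hadoop.security.authorization` are standard Apache Hadoop properties, but they are not named in this article, and on a Cloudera cluster they are normally managed for you rather than edited by hand. The sample XML and the helper function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml fragment, for illustration only.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return the name -> value pairs of a Hadoop-style configuration file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def kerberos_enabled(props):
    # Hadoop falls back to "simple" (no real authentication) when unset.
    return props.get("hadoop.security.authentication", "simple").lower() == "kerberos"


def service_authorization_enabled(props):
    # Service-level authorization is off unless explicitly set to "true".
    return props.get("hadoop.security.authorization", "false").lower() == "true"
```

A check like this could run as part of a deployment review: `kerberos_enabled(read_properties(SAMPLE_CORE_SITE))` is true for the sample above and false for a configuration that omits the property.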
Non-secure
No security is configured. Non-secure clusters should never be used in production because they are vulnerable to attacks and exploits. Recommended only for proof-of-concept, demo, and disposable platforms.
1. Minimal Security
Authentication, authorization, and auditing are all configured. Authentication is set up first, so that users and services can access the cluster only after proving their identities. Authorization mechanisms are then applied to assign privileges to users and user groups. Auditing procedures keep track of who accesses the cluster (and how). Recommended for development platforms.
2. Data Governance and Security
Sensitive data is encrypted, and key management systems handle the encryption keys. Auditing is set up for data in metastores, and system metadata is reviewed and updated regularly. Ideally, the cluster is set up so that the lineage of any data object can be traced (data governance). Recommended for staging and production platforms.
3. Ready to Comply
In the secure CDP cluster, all data at rest and in transit is encrypted, and the access control solution is fault-tolerant. Auditing mechanisms comply with industry, government, and regulatory standards (for example, PCI, HIPAA, and NIST) and extend beyond CDP to the other systems that integrate with it. Cluster administrators are well trained, security procedures have been certified by an expert, and the cluster can pass technical review. Recommended for all production platforms.
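One way to reason about the levels described above is as a cumulative checklist: a cluster reaches a level only when it has that level's controls plus those of every level below it. The sketch below encodes this idea in Python. The level names (`BASIC`, `GOVERNED`, and so on), the control labels, and the cumulative interpretation are shorthand assumptions for this example, paraphrased from the descriptions above rather than taken from any Cloudera API.

```python
from enum import IntEnum


class SecurityLevel(IntEnum):
    NON_SECURE = 0
    BASIC = 1             # authentication, authorization, auditing
    GOVERNED = 2          # data security and governance
    COMPLIANCE_READY = 3


# Controls each level adds on top of the previous one (hypothetical labels).
ADDED_CONTROLS = {
    SecurityLevel.NON_SECURE: frozenset(),
    SecurityLevel.BASIC: frozenset(
        {"authentication", "authorization", "auditing"}),
    SecurityLevel.GOVERNED: frozenset(
        {"sensitive_data_encryption", "key_management", "lineage_tracking"}),
    SecurityLevel.COMPLIANCE_READY: frozenset(
        {"encryption_at_rest", "encryption_in_transit", "compliance_auditing"}),
}


def achieved_level(configured):
    """Return the highest level whose controls, and those of all lower
    levels, are present in the configured set."""
    configured = set(configured)
    level = SecurityLevel.NON_SECURE
    for candidate in sorted(ADDED_CONTROLS):
        if ADDED_CONTROLS[candidate] <= configured:
            level = candidate
        else:
            break  # a missing control caps the level here
    return level
```

For example, a cluster with only `authentication` and `authorization` configured is still `NON_SECURE` under this model, because auditing is missing. Encrypting everything without first configuring authentication does not raise the level either.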