A Hadoop data point is used to configure connectivity to a Hadoop database. A separate data point must be created for each Hadoop instance; multiple databases within the same instance can be added to a single data point. The Hadoop data point is associated with any Hadoop data object created and with data flows that define Hadoop as the native processing platform.
To work with a Hadoop data point, follow the steps below:
Step I: Create a New Data Point
- To open and edit an existing data point, refer to Opening Data Point.
- To create a new data point, refer to Create New Data Point.
Step II: Provide connection details
1. To connect to a Hadoop instance, provide the following details in the Properties tab.
- Process Engine: Specifies the framework to be used for accessing and executing queries in Hadoop.
- HiveServer2 HA Configured: Check this option if the HiveServer2 High Availability is enabled.
- If High Availability is not enabled, provide the host and port to connect:
- Host: Specify the hostname or the IP address of the Hadoop system.
- Port: Specify the HiveServer2 port associated with the Hadoop system. The port can be found in hive-site.xml against the property hive.server2.thrift.port. The default value is 10000.
- If High Availability is enabled, provide the ZooKeeper namespace:
- HiveServer2 Zookeeper NameSpace: Specify the ZooKeeper namespace; the default value is hiveserver2. If it is different, the value can be found in hive-site.xml against the property hive.server2.zookeeper.namespace.
- Hive Security Authentication: Specifies the type of security protocol used to connect to the instance. The authentication type set up for the instance can be found in hive-site.xml against the property hive.server2.authentication. Based on the type of security, select the authentication type and provide the details.
- Simple: Provide App User and Password to connect.
- Kerberos: Provide additional details under the Settings tab. This is covered in a later section.
- LDAP: Provide LDAP User and Password to connect.
- SSL: Provide App User, Password, Trust Store Path, and Trust Store Password details.
- Kerberos & SSL: Provide Kerberos details under the Settings tab. Also, provide Trust Store Path and Trust Store Password details.
Note: To use the project parameter for the password, check the Use Project Parameters option, and you can view and select the required Project Parameter from the Password drop-down.
- Hive Transport Mode: Applicable only for the Hive process engine. This specifies the network protocol clients use to communicate with HiveServer2. Select between Binary and HTTP. The Hive transport mode can be identified from hive-site.xml against the property hive.server2.transport.mode. When selecting HTTP, specify the value for the field HTTP Path; by default its value is cliservice. The HTTP Path value can be identified from hive-site.xml against the property hive.server2.thrift.http.path.
- Hadoop Distribution: Provide the distribution of Hadoop being connected to.
- Distribution Version: Provide the version of the Distribution chosen above.
- Execution Engine: Specify the type of execution engine to be used. Choose from MR (MapReduce), TEZ, and Spark. The execution engine used in the instance can be identified from hive-site.xml against the property hive.execution.engine. Most Hadoop distributions provide TEZ as their default execution engine.
- Using Hadoop config files for execution: To have the Hadoop cluster config files used during execution of data flows involving this data point, enable the check box against "Use Hadoop Conf files for execution". During execution, the config files present in the location specified in the field "Hadoop Conf Path" will be used to connect to the Hadoop cluster.
- Jdbc Options: Specify the options that should be used along with JDBC URL to connect to Hadoop.
For example, the following details are provided in JDBC Options to connect to Hadoop: user=diyotta, password=****, db=TEST_DB.
- Hadoop Conf Path: By default, Diyotta picks the Hadoop configuration files from HDFS. To override this, specify an alternate location from which to pick the files. This is normally done when Diyotta should use configuration settings different from those set globally for the Hadoop instance. The config files need to be accessible to all the agents that can be used in the Hadoop data point.
- Enable Transaction Manager: Enable or disable the transaction manager at runtime by selecting the check box. When enabled, ensure that the cluster allows this property to be modified at runtime. This property is applicable for CDH 7.x and above.
- Mandatory field names are suffixed with *. To establish the connection, provide all the mandatory property field values.
- It is required that for the Hadoop Distribution and the version chosen, the Diyotta Hadoop extension be installed on the Agents. For installation, refer to the page Installing Hadoop Extension.
- All the fields in the Properties tab can be parameterized using project parameters. To parameterize the fields, refer Working with Project Parameters.
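Several of the properties above are read from the cluster's hive-site.xml. The fragment below is a representative excerpt showing the properties referenced in this section; the values shown are common defaults for illustration only, not values from any particular cluster:

```xml
<!-- Illustrative hive-site.xml excerpt; actual values vary per cluster. -->
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hiveserver2</value>
</property>
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.transport.mode</name>
  <value>binary</value>
</property>
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
```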
2. Agent: To assign or change the associated agent, click Change. The Change Agent window appears and displays the list of available agents. From the list, select the required Agent Name.
Note: To search for a specific agent, enter the keyword in the search bar, and the window displays the search result list. Select the required agent and click Ok.
- If a Default agent is assigned to the Project, the Default agent is automatically associated with the newly created data point.
- If a Default agent is not assigned to the Project, no agent is assigned automatically and an appropriate agent needs to be assigned to the data point.
- When connecting to the Agent server, the agent installation user should have appropriate privileges to access the path where the file will be placed.
- When connecting to a remote server, the firewall needs to be opened from the Agent server to it, and the user specified to connect should have appropriate privileges to access the path where the file will be placed.
3. Provide additional settings that will be used to connect to Hadoop. For this, refer to Configuring Settings for Hadoop Data Point.
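The JDBC options described in the Properties tab are typically appended to a HiveServer2 JDBC URL as key=value pairs. The sketch below shows how such a URL is commonly assembled; the URL layout follows the generic Hive JDBC convention, and the host, database, and option values are placeholders rather than anything Diyotta itself generates:

```python
def build_hive_jdbc_url(host, port, database, options=None):
    """Assemble a HiveServer2 JDBC URL, appending options as ;key=value pairs."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if options:
        url += ";" + ";".join(f"{k}={v}" for k, v in options.items())
    return url

# Placeholder values for illustration only.
print(build_hive_jdbc_url("hadoop01.example.com", 10000, "TEST_DB",
                          {"user": "diyotta", "ssl": "true"}))
# → jdbc:hive2://hadoop01.example.com:10000/TEST_DB;user=diyotta;ssl=true
```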
Step III: Test the data point connection
To validate that the data point can connect to the Hadoop database using the details provided, refer to Test Data Point Connection.
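Independently of the built-in connection test, basic network reachability of the HiveServer2 host and port can be verified with a plain TCP connect, which helps distinguish firewall or DNS problems from authentication problems. A minimal sketch; the demonstration connects to a throwaway local listener so it runs anywhere, and in practice you would substitute the real host and port:

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstration against a local listener standing in for HiveServer2.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()
print(port_reachable(host, port))  # True: something is listening on that port
server.close()
```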
Step IV: Provide database connection details
Manage the databases required to be accessed through the Databases tab. Multiple databases can be added here.
1. To add a new database, click on Click Here.
A new entry for the database is generated and a success message is displayed at the bottom right corner of the screen.
- The "Name" field is a friendly name that can be assigned to the database for easy identification. This friendly name is displayed when a database needs to be chosen from the data point and when the database association with other components is displayed.
- Provide the physical name of the database in the "Database" field. When the entry in the "Database" field is clicked, a drop-down appears with the list of databases in the system. As you enter a keyword, the drop-down narrows to the matching databases. The database name can either be selected from this drop-down list or entered manually.
- To assign a database to be used for creating temporary tables as part of processing the data (generally referred to as tform database), select the check box under Transforms field.
- The Transforms field is available only for those types of data points that Diyotta supports as data processing platforms.
- It is mandatory to assign a database as transform in a data point when that data point needs to be assigned during data flow creation and used as the processing platform.
To view the database drop-down, it is a prerequisite to test the connection. For more information, refer Test Data Point Connection.
- To search for a specific database, enter the keyword in the search bar, and the page displays the related databases.
2. You can also copy a database from another data point of the same type and paste it here. To do that:
Click the drop-down arrow on Click Here to see the Paste option.
- Following operations are allowed on the database entries: Add, Cut, Copy, Paste, Up, Down, Delete, and Search.
- From the list of databases, multiple databases can be selected and these operations performed on them.
Step V: Save the data point
To save the changes made to the data point, refer Saving Data Point.
- If the changes made to the data point need to be reverted rather than saved, refer to Reverting changes in Data Point.
- Once the data point has been created and the changes have been saved, Close or Unlock the data point so that it is editable by other users. For more information, refer to Closing Data Point and Unlocking Data Point.
Step VI: Modify the configured Extract, Load and Run properties
- The default values for extract and load properties can be configured in the Admin module and these properties reflect in the Studio module.
- The extract and load properties set in the data point are used by default in the source and target instances of data flows and job flows.
- It is a good practice to set the extract and load properties as per the company standards in the data point.
- However, if needed any specific property can be overridden in the data flow or job flow.
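The precedence described above, where data point defaults apply unless a data flow or job flow overrides a specific property, can be sketched as a simple merge. The property names below are illustrative placeholders, not Diyotta's actual property keys:

```python
def effective_properties(datapoint_defaults, flow_overrides=None):
    """Data point values apply by default; data flow / job flow overrides win."""
    merged = dict(datapoint_defaults)
    merged.update(flow_overrides or {})
    return merged

# Illustrative property names only.
defaults = {"fetch_size": 10000, "field_delimiter": ","}
overrides = {"fetch_size": 50000}
print(effective_properties(defaults, overrides))
# → {'fetch_size': 50000, 'field_delimiter': ','}
```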