A Hive data object is always associated with a Hive data point. For details on creating a Hive data point, refer to Working with Hive Data Point.
To work with a Hive data object, follow the steps below:
Step I: Edit, create, or import a Hive data object
- To open and edit an existing data object, refer to the page Opening Data Object.
- To create a new data object, refer to the page Creating New Data Object.
- To import directly from database objects, refer to the page Importing Data Object.
- To create a Hive data object based on a data object of another type, refer to Creating copy of Data Object as other type.
It is recommended to create the data object by importing the structure from the database object. Note that SerDe-format Hive objects cannot be imported.
Step II: Configuring attributes
The Hive data object contains the following fields. Each field can be edited by clicking it and entering the required value. When the structure is imported directly from the database, these fields are auto-populated based on the column properties of the database object.
- Attribute: Displays the name of the field as in the database object. This is a text field and the name can be defined as needed. The attribute names in the data object must exactly match the column names of the database object they refer to.
- Data Type: Displays the data type associated with the field. This is a list of applicable data types from which you can select the appropriate type for the field.
- Precision: Displays the precision for the varchar, number, and decimal data types. This is a text field and the value can be defined as per the data in the file.
- Scale: Displays the scale for the number and decimal data types. This is a text field and the value can be defined as per the data in the file.
- Data Sensitivity: This is relevant when the data object is used as a source instance. Specify whether an attribute holds sensitive information whose data is masked in the source system. This is required when the data type provided in the data object is only indicative of the unmasked data and the target structure should map to the data type of the masked data. You can map the data type of data-sensitive attributes to a common data type in Diyotta Admin's data type mapping page. For more information, refer to Editing Data Type Mapping. The drop-down provides the category of sensitivity of the data. The categories are:
  - SPIRSH: Sensitive Personally Identifiable Information with a Real Risk of Significant Harm
  - SPII: Sensitive Personally Identifiable Information
  - PIIPD: Personally Identifiable Information in the public domain
  - PII: Personally Identifiable Information
- Not Null: Displays whether the field can have NULL values. This is a checkbox and can be set as per the data in the file.
- Key Type: Displays whether the field is a key field. This is a drop-down with the values Primary Key and Foreign Key. Select one of these if the field is such a key.
- Order: Used to rearrange the order of the columns for performance improvement.
- Partition: Specify the partition column. You can use partitioning to improve the performance of queries that restrict results by the partitioned column (see the DDL sketch after this list).
- Cluster: A clustering key is a subset of columns in a table (or expressions on a table) that are explicitly designated to co-locate the table's data in the same micro-partitions.
- Description: Displays any details provided for the field. This is a text field and you can enter any details as needed.
- The following operations are allowed on the attribute entries: Add, Cut, Copy, Paste, Up, Down, Delete, and Search.
- Multiple attributes can be selected from the list and these operations can be applied to them.
- To add a new attribute, click Add. By default, the new attribute is added in the last row. To add an attribute at a specific position, select the attribute just above that position and click Add.
- To search for a specific attribute, enter the keyword in the search bar; the page displays the matching attributes.
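For reference, here is a minimal sketch of the kind of Hive DDL these attribute settings translate into. The table and column names are hypothetical, and the exact DDL Diyotta generates may differ:

```sql
-- Hypothetical attributes: data types with precision/scale, one
-- partition column, and one cluster column mapped to buckets.
CREATE TABLE demo_stg.branch_example (
  branch_id   INT,            -- Key Type: Primary Key (informational in Hive)
  branch_name VARCHAR(100),   -- Precision 100
  balance     DECIMAL(18, 2)  -- Precision 18, Scale 2
)
PARTITIONED BY (load_date STRING)         -- attribute marked as Partition
CLUSTERED BY (branch_id) INTO 4 BUCKETS;  -- attribute marked as Cluster
```

Note that in Hive the partition column is declared in the PARTITIONED BY clause rather than in the column list, and the bucket count corresponds to the Bucket Size property described in Step III.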
Step III: Configuring properties
The Properties tab displays the object-level details relevant to extracting the data. The grid displays the following fields:
| Property | Description | Default value | Other possible values |
| --- | --- | --- | --- |
| External Table | Creates the target table as an external table. | No | Yes |
| Managed Table Warehouse Location (displays only when External Table is "No") | Location used when the target is created as an internal (managed) table. | hdfs://labhdp253/apps/hive/warehouse/demo_stg.db/branch | |
| External Table Warehouse Location (displays only when External Table is "Yes") | Location used when the target is created as an external table. | /tmp | |
| Row Format | Specifies the row format of the table (see the examples after this table). | Delimited | Serde |
| Serde Jar Path (displays only when Row Format is "Serde") | Defines the SerDe JAR location in the Hadoop cluster. | | |
| Serde Properties (displays only when Row Format is "Serde") | Comma-separated list of SerDe property/value pairs. These go into the WITH SERDEPROPERTIES clause of the table DDL. | "separatorChar"="\u0001","quoteChar"="'","escapeChar"="\u0004" | |
| Serde Class (displays only when Row Format is "Serde") | The fully qualified class name that implements the Hive SerDe interface org.apache.hadoop.hive.serde2.SerDe. | com.bizo.hive.serde.csv.CSVSerde | |
| Escape Character | The character immediately following the escape character is escaped. Specify this if a text qualifier is provided and the qualifier character can appear in the source data. | | Any ASCII character |
| Field Delimiter | Specifies the delimiter for structure fields. In Hive the default is '\001' (Ctrl-A). | , | |
| Collection Items Delimiter | Specifies the delimiter for array items. | \002 | |
| Map Keys Delimiter | A map is a collection of key-value pairs in which fields are accessed using array notation of keys. Map keys are delimited by this delimiter. | \003 | |
| Line Delimiter | Specifies the delimiter for lines. Only the newline character ('\n') is supported. Newlines must not exist within the data itself and cannot be escaped. | \n | |
| Table Properties | The TBLPROPERTIES clause lets you tag the table definition with your own metadata key/value pairs. | | |
| Null Value | Specifies the string literal that indicates a null value in the extracted data. During data load, column values matching this string are loaded as null in the target. | | Any string literal |
| File Format | Specifies a Hive built-in or custom-developed file format for the table. | ORC | |
| Orc Compression Type (displays only when File Format is "ORC") | Specifies the high-level compression codec. | ZLIB | |
| Orc Compression Size (displays only when File Format is "ORC") | Specifies the number of bytes in each compression chunk. | 262144 | |
| Orc Stripe Size (displays only when File Format is "ORC") | Specifies the number of bytes in each stripe. | 268435456 | |
| Orc Row Index Stride (displays only when File Format is "ORC") | Specifies the number of rows between index entries (must be >= 1000). | 10000 | |
| Orc Create Index (displays only when File Format is "ORC") | Specifies whether to create row indexes. | TRUE | |
| Orc Bloom Filter Columns (displays only when File Format is "ORC") | Specifies a comma-separated list of column names for which a bloom filter should be created. | | |
| Orc Bloom Filter Fpp (displays only when File Format is "ORC") | Specifies the false positive probability for the bloom filter (must be > 0.0 and < 1.0). | 0.05 | |
| Input Format Class (displays only when File Format is "CUSTOM") | The fully qualified name of the corresponding InputFormat class, for example 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. | | |
| Output Format Class (displays only when File Format is "CUSTOM") | The fully qualified name of the corresponding OutputFormat class. | | |
| Compression Type (displays only when File Format is "Sequence File") | On choosing a compression format, provide the corresponding compression codec and compression type. | NONE | |
| Compression | Specifies whether the data needs to be compressed before transfer from the source agent to the target load agent. | FALSE | TRUE |
| Intermediate Compression (displays only when Compression is "True") | Specifies whether outputs are compressed before being sent across the network. Uses SequenceFile compression. | FALSE | TRUE |
| Intermediate Compression Codec (displays only when Compression is "True") | If the outputs are compressed, specifies how they should be compressed (e.g., Snappy). | | |
| Bucket Size | Specifies the number of buckets assigned to the table; used together with the partitioned columns when executing queries. | 1 | |
| ACID Enabled | Enables delete and update queries in Hive SQL. Enable this flag only when the data object format is ORC. | No | Yes |
| Strict Names | Allows special characters in names; such names are enclosed within quotes ('). | No | Yes |
Note: To view specific property details, enter the keyword in the search bar, and the page displays the related properties.
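To make the Row Format settings concrete, below is a hedged sketch of how they surface in the table DDL. The table and column names are hypothetical, the generated DDL may differ, and the SerDe class and properties shown are the defaults listed in the table above:

```sql
-- Row Format "Delimited": the delimiter, escape, and null-value
-- properties map onto the ROW FORMAT DELIMITED clause.
CREATE TABLE demo_stg.branch_delimited (
  branch_id   INT,
  branch_name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'  -- Field Delimiter, Escape Character
  COLLECTION ITEMS TERMINATED BY '\002'     -- Collection Items Delimiter
  MAP KEYS TERMINATED BY '\003'             -- Map Keys Delimiter
  LINES TERMINATED BY '\n'                  -- Line Delimiter
  NULL DEFINED AS '\\N'                     -- Null Value
STORED AS TEXTFILE;

-- Row Format "Serde": the Serde Class and Serde Properties map onto
-- ROW FORMAT SERDE ... WITH SERDEPROPERTIES. The Serde Jar Path
-- property corresponds to making the JAR available, e.g. via ADD JAR.
CREATE TABLE demo_stg.branch_csv (
  branch_id   INT,
  branch_name STRING
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\u0001",
  "quoteChar"     = "'",
  "escapeChar"    = "\u0004"
)
STORED AS TEXTFILE;
```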
Step IV: Save the changes
- To save the changes made to the data object, refer to Saving Data Object.
- If the changes made to the data object need to be reverted rather than saved, refer to Reverting changes in Data Object.
- Once the data object has been created and the changes have been saved, close or unlock the data object so that it is editable by other users. For more information, refer to Closing Data Object and Unlocking Data Object.
The table DDL is generated based on the attributes defined in the Attributes tab and the settings in the Properties tab.
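As an illustration of that generated DDL with the default ORC settings and ACID Enabled set to "Yes", a hedged sketch follows. The table and column names are hypothetical; an ACID-enabled Hive environment is assumed, ACID tables must be managed (not external), and some Hive versions additionally require bucketing for ACID tables:

```sql
-- Managed ORC table: the ORC tuning properties and the ACID flag
-- surface in the STORED AS and TBLPROPERTIES clauses.
CREATE TABLE demo_stg.branch_orc (
  branch_id   INT,
  branch_name STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'ZLIB',       -- Orc Compression Type
  'orc.compress.size'        = '262144',     -- Orc Compression Size
  'orc.stripe.size'          = '268435456',  -- Orc Stripe Size
  'orc.row.index.stride'     = '10000',      -- Orc Row Index Stride
  'orc.create.index'         = 'true',       -- Orc Create Index
  'orc.bloom.filter.columns' = 'branch_id',  -- Orc Bloom Filter Columns
  'orc.bloom.filter.fpp'     = '0.05',       -- Orc Bloom Filter Fpp
  'transactional'            = 'true'        -- ACID Enabled
);
```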
Viewing sample data:
The View Data tab displays a preview of data from the associated object in the database. The following options are available in the View Data tab:
- Clicking Refresh reloads the displayed data.
- Clicking Export exports the displayed records and saves them as an Excel file in the machine's default download folder.