Data cleansing is the process of altering data in a given transform to make sure that it is accurate and correct. The Datacleanse can be applied based on the value in one or more attributes in the connected transform.
To work with Datacleanse transform in data flow, follow below steps.
Selecting the Datacleanse transform to be added to data flow
In the data flow canvas move to Data Flow pane and navigate to Transforms menu. Here, you can either select or drag and drop the Datacleanse transform to canvas. Now, link the required transform to Datacleanse Transform.
Configuring Datacleanse transform
General Tab: Provide the basic details for the Datacleanse transform.
1. The Name field auto populates the transform name and its editable.
2. In the Description text-box, provide a description and is optional.
3. By default, Diyotta does not create temporary tables during execution for the transformation in the data flow. If the temporary table needs to be created for a transform during execution, then, enable the checkbox Persist Data. The temporary table created will be dropped once the data flow executes successfully.
Derived Attributes tab: You can add attributes in the data cleanse transform which will be the attributes from the connected transform.
When there are no attributes present, then click on Click Here to add new attribute.
- When there are attributes already present, then click on (+) Select option and select attributes from transform.
Then Select Link Attributes wizard appears and displays the list of carry forward attributes available in the previous connected source/transform. Choose required attributes and click OK.
The added attributes get listed in the Derived Attributes tab and simultaneously by default all the attributes gets selected and the Data Cleanse window appears. If required, using the data cleanse window you can cleanse the data before proceeding further else select an attribute and click Data Cleanse link.
In both the scenarios, the Data Cleanse window appears and allows you to apply the following operations on the selected attributes:
- Trim: To remove the white spaces or other predefined characters from both the ends of a string
- Left trim: To remove white space or other predefined characters from the left side of a string
- Right trim: To remove white space or other predefined characters from the right side of a string
- Substring: To return a new string that is a substring of existing string excluding both the ends
- Convert to uppercase: To convert all the characters to uppercase
- Convert to lowercase: To convert all the characters to lowercase
- This section allows you to select either lower or upper case, as both the operations cannot be applied at a time on a attribute.
- In the same way, you can select either left or right padding as both the operations cannot be applied at a time on a attribute.
Add padding to left: To set the width of the padding on the left of an attribute
- Add padding to right: To set the width of the padding on the right of an attribute
- Replace nulls in numeric field with '0': To replace the numeric fields ( integer type) with '0'
- Replace nulls in string field with blank: To replace the numeric fields ( string type) with blank
- You can apply following operations on each attribute of string type: trimming the attributes, case conversion, padding the letters, and replace null values with blank.
- You can apply following operation on each attribute of integer type: Replace nulls with '0'
- To replace with existing functions, select the checkbox Replace existing functions and click Apply.
- The fields in the Attributes tab are specific to the data object type associated with the transform. For more information about these fields, refer Working with Data Object and select the specific database data object same as native processing platform.
- Following operations are allowed on the attributes: Select, Cut, Copy, Paste, Up, Down, Delete, Apply template, and Data Cleanse.
- To search for a specific attribute, enter the keyword in the search bar, and the grid displays the search results.
- From the list of attributes, multiple attributes can be selected and we can perform/apply these operations.
Here, each attribute has an associated expression editor to add expression to it. After adding attributes, you can add/edit associated expression in Expression Editor field.
- To add/edit the expression, click the icon beside the Expression Editor field.
- The Expression Editor wizard appears and it allows you to add the required expression. To verify that there are no syntax errors, click Validate. Upon successful validation, the success message appears and click OK.
The expression can include attributes from the transform, database functions, Parameters, Functions, Reusable expressions, UDFs, and Sequences.
- The attributes from the transform will be listed when selecting Transforms from the drop-down. Click on the attribute from the list and it will be added in the editor.
- The list of functions can be seen by selecting Functions from the drop down. The functions that can be used in the SQL is not limited by the list shown. All the hive database functions can be used in the SQL.
- The list of parameters can be seen by selecting Parameters from the drop-down. It will show those that can be used in data flow - Data Flow Parameter, Data Flow SQL Parameter, Project Parameter and System Parameter can be used in the SQL. For more information, refer Working with Data Flow Parameters, Working with Data Flow SQL Parameters, Working with Project Parameters, Diyotta System Parameters.
- List of expressions, UDFs and sequences can be seen under corresponding header in drop-down.
- The Expression field displays the added/edited condition as shown below:
You can use Apply Template option to apply the expression from one attribute to other attributes in the list. You can either apply expression from existing attribute or apply new condition to existing Attributes. As an example, here we are applying function lower( ) to multiple attributes.
- From the list of attributes, select an attribute which has the expression to be applied, and click Apply template icon.
You can also select any attribute and change the expression to be applied in subsequent steps.
- Apply Template wizard displays the derived attribute name along with expression. From the list of attributes, select the attributes to apply the expression.
- In the expression field, you can edit the expression before applying other attributes. If edited, then you can valid this expression by clicking Validate. The success or failure message appears on the bottom of the window.
- Click Apply, to apply the condition to the selected attributes.
- After applying the condition, validate it by clicking the Validate button at the bottom. The success or failure message appears on the bottom of the window.
- Click OK, to complete the changes.
Script tab: The script tab allows you to view the run time script of the transform. The script is generated based on the condition specified in the transform.
- To view the generated script, from the below pane, navigate to Script tab.