XML to Teradata - Data Ingestion

Introduction

Diyotta supports various types of Files such as COBOL, Apache Avro, Fixed Width, Fixed Length, Double Quotes, Fixed Block, Flat File, JSON and XML. It provides data ingestion from any of these File data sources to target platforms optimally. This document illustrates how Diyotta facilitates data ingestion from XML to Teradata.
For a better understanding of the use case, we have included sample XML file – CustomerOrders.xml and its defined schema as CustomerOrders.xsd.  Download the following files to use as reference. 

CustomerOrders.xml

CustomerOrders.xsd

Data Ingestion – XML to Teradata 

To create a Data Flow, the first step is to create a Data Point. The next sections will guide you through creating a Data Flow.

Create a Data Point

1. Click New Data Point on the right menu bar under the Actions tab. This launches a window prompting you to select the data source type.
2. Select the data source as File Server and click OK as illustrated in the screen shot given below.



3. The Data Point details are displayed on the canvas. You can change the name, give a short description or declare the Data Point as private under this General tab.



4. Click the Properties tab to select the location of the file. You have the option of changing the Agent from this page by clicking the Change link. A File type Data Point is now created.


Import a Data Object

The next step is to import schema of XML using XSD at the location previously mentioned while creating the Data Point.
1. On the right menu bar click Import Data Object. A new Import Data Objects window opens. Select the data source type as XML and from the list on the right, choose the Data Point which you created earlier.
2. Click Ok.



The Import XML window opens, and the wizard will take you through the process of importing the COBOL data file.
3. Select the location of the file, whether Local or Server. In this case, click Server.



4. A list of all the files on the server appear. Click to select the file to be imported. Import the XSD format.
XSD (XML Schema Definition), a recommendation of the World Wide Web Consortium(W3C), specifies how to formally describe the elements in an Extensible Markup Language (XML) document. XSD is used to express a set of rules to which an XML document must conform in order to be considered 'valid' according to that schema.



5. The details are now populated in the Import XML window and you can see a preview of the schema. Click Next.



6. You can see the attributes in this window. If required, you can modify the names and data types of the attributes.



7. Click Finish and do not forget to save the Data Object.
8. Open the Data Object on the canvas. Under the General tab, the general details pertaining to the Data Object such as the name, group name, associated Data Point etc. are displayed.



Attributes Tab 
The Attributes tab has the following fields. Enter the details as per the description against each field.

Sr. No

Field Name

Description

1

Level

Level number is used to specify the level of data in a record. Level numbers describe the hierarchy of the data items/ variables declared. They are used to differentiate between elementary items and group items.

2

Attribute

This field consists of all the attributes of the data file.

3

Type

This field specifies whether the attribute is an element or attribute.

4

Data Type

Data types describe the characteristics of the data.

5

Prec

Specifies the maximum number of characters that the selected attribute can accommodate.

6

Scale

Specifies the number of characters that the selected attribute can accommodate after the decimal.

7

Min Occurs

Specifies how many minimum times a field or group of fields is repeated.

8

Max Occurs

Specifies how many maximum times a field or group of fields is repeated.

Note: If Max Occurs is defined as unbounded in XSD then it becomes the source instance. Conversely, whatever source instances are required, are defined as unbounded in XSD.

9

Description

Specifies any other description of the attribute.

Properties Tab
Define each field according to the description given in the following table. 

Sr. No.

Fields

Description

1

File Name

Specifies the source XML file name.

2

Name Space

An XML namespace is declared using the reserved XML attribute xmlns or xmlns:prefix, the value of which must be a valid namespace name.

3

Root Element

Specifies which element is the Root

Creating a Data Flow

Once a Data Point and Data Object is created, you are ready to go with creating a Data Flow.

To create a Data Flow, follow the given steps:
1. On the right menu bar, click New Data Flow. This will launch New Data Flow window prompting you to choose the processing platform.
2. In this case, select Teradata.
3. Change the name of the Data Flow typically to suit the kind of Data Flow you are creating.
4. Choose the layer where your Data Flow will reside and click Ok.



5. The Data Flow will open in the canvas. You can now start creating the design. Next step is to create a source. On the right menu bar, click the Transformations tab.
6. From the Sources panel click or drag the XML icon on to the canvas.



7. A new window opens. Select the Data Object from the list and click Ok.



8. You can now see the source Data Objects on the canvas. In the lower pane, all the details of the source Data Object are visible. Here, you can see two source instances- Customer and Order. 


Two source instances appear because in schema (XSD), both- Customer and Order are defined as Unbounded.  

This is clearly indicated in Data Object > CustomerOrders > Attributes as shown in the screenshot given below. 

9. Make sure you define the Extract properties. Enter the File Name and set the File Type, Column Delimiter, and Row Delimiter as required.


Use Case - I

There are two source instances on the canvas. Each source instance can be linked to a separate table. 

In this case, we will create two separate target instances. 


1. Click to select the source Data Object (Customer) and on the right menu bar click Create as Target.
2. This will open a new window. Select the target database type. In this case, select Teradata.
3. From the right menu select the Data Point and click Ok.



4. Similarly create target for the other source instance. The Data Flow should now typically look like this- 


5. Click Run on the right menu bar to execute the Data Flow.
6. Select the target object. In the lower pane, click the Data tab to get a preview of the data.

You have successfully ingested data from XML File type to a Teradata database.

Use Case - II 

Two different source instances can be seen here on the canvas. To create a single target table we will use a Joiner transform here. We are joining the tables on the basis of CustomerID as both the source instances are independent of each other. 

1. On the right menu bar, double click or drag the Joiner icon on to the canvas.



2. Next, link the Customer source instance to the Master and the Order source instance to the Detail of the Joiner. The details/properties of the Joiner are seen in the lower pane. The window also displays a warning – Joiner condition is invalid.


3. The next step is to validate the Joiner by defining the Join conditions. Let's define the Joiner conditions.
Go to Properties and click the Joiner Condition icon.



4. Define the conditions in the Expression Editor window. In this example we are joining the Customer and the Order tables on the basis of the customer id. Once the expression is written, you must validate the syntax. Click Validate > Ok.



5. Go to Attributes > Select Attributes. In the new window select all the attributes you want in the Target table. Save the design.



6. Click to select the Joiner icon. On the right menu bar click Create as Target. Select the Data Point of database type where you need the table.



7. The Teradata target can be seen on the canvas linked to the joiner. Name the objects suitably.
The Data Flow would now typically look like the one given below-



8. Click Run on the right menu bar. You will see a message –



You can go to the Monitor module to check the detailed status report of the Data Flow executed or on the right menu bar click Interactive. Click the Data tab to get a preview of the result in the target table. 


The Run  Status tab gives a summarizes the execution status.



Use Case - III

We are joining two tables on the basis of Parent_ID.

In this use case, we will join both the tables on the basis of Parent_ID. The source instances - Customer and Order. 

Since both are independent source instances and are not related as parent and child, both can be joined on the basis of Parent_ID. So, just change the joiner condition in the Expression Editor

On the right menu bar click Run to execute the Data Flow. Check the result in the preview under the Data tab. 


Use Case - IV

We are joining two tables on the basis of Parent_ID and Group_ID as both are related as parent and child elements. 

In this use case, we will are considering an XML file which has two tables wherein one is a child of the parent. We will join both the tables on the basis of Parent_ID and Group_ID. To 

Since both the source instances are not independent are related as parent and child, both can be joined on the basis of Parent_ID and Group_ID.  For a better understanding of the use case, we have included sample XML file – Employees1.xml and its defined schema as Employees1.xsd for your ready reference.

Employees1.xml

Employees1.xsd

Step 1: Create File type and Teradata Data Points. 

Step 2: Import Data Object Employees1. Two source instances can be seen on the canvas - Address and Employee. 

Step 3: Create a Data Flow.

1 . Drag the Joiner on to the canvas. Link the Employee source instance as the primary object and Address source instance as the secondary object. Define and validate the conditions in the Expression Editor as shown in the screenshot given below. 

2. Select the Joiner as click Create As Target and create a Taradata target table. The Data Flow must typically look like the one given below. 


 3. Execute the Data Flow and see the preview under the Data tab in interactive mode. 

You can see all the columns.