Bookshelf v10.1.3.2: Best Practices For Teradata Deployments

Oracle^®Business Intelligence Applications Installation and Configuration Guide > Preinstallation and Predeployment Considerations for Oracle BI Applications > Teradata-Specific Database Guidelines for Oracle Business Analytics Warehouse >

Best Practices For Teradata Deployments

This sections contains the recommended best practises for maximising performance in Teradata deployments, and contains the following topics:

NOTE: The following best practices should be treated as customizations. You should follow standard customization methodology, such as copying the mapping to a Custom folder. You should never make these changes directly in the OTB objects.

LEFT OUTER JOIN Issue

Teradata redistributes the OUTER table according to the join key, unless the INNER table is tiny. If the INNER table is tiny, the INNER table could be simply copied to all AMPs and the OUTER table would not be redistributed.

Many fact SIL mappings need to get the dimensional ROW_WID from ROW_ID/INTEGRATION_ID. For example, W_PER_RANK_FS.ACCNT_ID needs to be converted to ACCNT_WID before loading into W_PER_RANK_F table. Since ACCT_ID is nullable, the join between W_PER_RANK_FS and W_ORG_D is defined as a LEFT OUTER JOIN.

However, the percentage of NULLs in ACCT_ID column can be as high as 50% or more depending on the data set. When redistributing the W_PER_RANK_FS according to ACCT_ID, all rows with ACCT_ID = NULL are put onto a single AMP.

Although a Teradata database usually has hundreds of gigabytes of spool space, the spool space is allocated across hundreds of AMPs. The spool space for each AMP is limited (for example, to two gigabytes).

When a large percentage of W_PER_RANK_FS is distributed to a single AMP, this can result in insufficient spool space being available. This happens when too much data is spooled onto one AMP, not because the spool space is too small.

To work with Teradata's mechanism of parallel processing, and resolving the LEFT OUTER JOIN, SQL must be re-written.

As an example, refer to the following original SQL:

SELECT ... FROM

W_PER_RANK_FS FS LEFT OUTER JOIN W_ORG_D ORG ON

FS.ACCNT_ID = ORG.INTEGRATION_ID AND

FS.DATASOURCE_NUM_ID = ORG.DATASOURCE_NUM_ID

The above SQL should be re-coded as follows:

SELECT ... FROM

W_PER_RANK_FS FS LEFT OUTER JOIN

(SELECT

FS.INTEGRATION_ID, FS.DATASOURCE_NUM_ID, ORG.ROW_WID, ORG.GEO_WID

FROM

W_PER_RANK_FS FS, W_ORG_D ORG

WHERE

FS.ACCNT_ID = ORG.INTEGRATION_ID AND FS.DATASOURCE_NUM_ID = ORG.DATASOURCE_NUM_ID AND FS.ACCNT_ID IS NOT NULL)

ORG ON

FS.DATASOURCE_NUM_ID = ORG.DATASOURCE_NUM_I AND

FS.INTEGRATION_ID = ORG.INTEGRATION_ID

The same SQL re-coding method can be used for other Source Qualifiers that run into spool space problems.

In addition, there is another solution to the lack of spool space issue. This solution is to convert the NULLs to some evenly distributed but non-matched values, as shown in the following SQL example:

SELECT ... FROM

W_PER_RANK_FS FS LEFT OUTER JOIN W_ORG_D ORG ON

CASE WHEN FS.ACCNT_ID IS NOT NULL

THEN FS.ACCNT_ID

ELSE '#' || FS.INTEGRATION_ID END

= ORG.INTEGRATION_ID

AND FS.DATASOURCE_NUM_ID = ORG.DATASOURCE_NUM_ID

NOTE: The re-coded SQL is shown in bold type.

Database Statistics

Additional Statistics Required

Oracle Business Intelligence provides a list of column and index statistics collection statements. These are generally applicable for all situations, but should be evaluated for each site and situation. Additional statistics might be required, depending on your requirements. For example, some workflows need more statistics to avoid an out of spool space error on Teradata server.

Collect Statistics As A Pre-requisite

Once the tables have been created in the staging and target databases, you must run the supplied statistics collection. Failure to do so can affect ETL performance, and possibly result in a spool space error (error number 2646)

The DAC does the re-collection of statistics as part of the ETL processes. However, the DAC issues the collect statistics statement at the table level only (for example, collect statistics on w_org_d), and only for existing statistics.

Group By Vs Distinct

When there is a low number of distinct values, it is more efficient to use the GROUP BY phrase. Do not use the DISTINCT phrase, unless the number of distinct values is high.

Pruning Of Mapping

If you do not use all of the OTB fields supplied, you can improve performance by stripping the extraneous fields from the mapping and the tables.

Loader Configurations

This section explains the loaders that are available in Teradata, and how they are used in Oracle BI Applications.

Teradata has 3 different types of Teradata loader processes, as follows:

Fastload
Mload
Tpump

Each loader process can be used in two different modes, as follows:

Staged Mode: The Informatica process does the following in this order:
- Reads from the source data.
- Creates a data file.
- Invokes the loader process to load the table using the data file created.
  Advantages: In the event of failures, you can recover using the Teradata recovery process.
  
  Disadvantages: Staged mode is slower than Piped mode, and you need more disk space, as it can create large data files.
Piped Mode: The Informatica process reads from the source and simultaneously pipes that data to the loader to start loading the target table.
Advantages: Quicker than Staged mode, and you do not require large amounts of disk space because no data files are created.

Disadvantages: In the event of failures, you cannot recover using the Teradata recovery process (because tpump does row commits unlike fastload and mload).

Fastoad

You use the Fastload process on empty tables, such as loading staging tables and in initial loads where the tables are empty.

When the Fastload process starts loading, it locks the target table, which means that processes (for example, lookups) cannot access that table. One solution to this problem is to specify dummy SQL for the look up overrides at the session level.

TIP: If a session fails during a Fastlload process, use SQL Assistant to run a simple SQL command (for example, count(*)), to determine whether the table is locked by a Fastload process.

If a table is locked (for example, (for W_ORG_DS), use the following script to release the lock:

LOGON SDCNCR1/Siebel_qa1,sqa1;

BEGIN LOADING Siebel_qa1.W_ORG_DS

ERRORFILES Siebel_qa1.ET_W_ORG_DS,siebel_qa1.UV_W_ORG_DS;

END LOADING;

If you save the above text in a file called test.ctl, you would run this process by entering the following command at a command prompt:

C:\fastload\test.ctl

TIP: To create a load script for a table, edit the test.ctl script above to change the login information, and replace all occurrences of W_ORG_DS with the required target table name.

After a load process script runs successfully, you should be able to run the command 'select count(*)' on the target table. If you are not able release the lock, you might need to drop and re-create the table to remove the lock. If you do so, you must re-create the statistics.

TIP: Fastload is typically used in piped mode to load staging tables and initial loads. In the event of errors, reload the entire data.

Mload

The Mload process is slower than Fastload but quicker than Tpump. The Mload process can work on both empty tables as well as on tables with data. In the event of errors when running in piped mode, you cannot recover the data.

Tpump

The Tpump process is slower than Mload but faster than ODBC. The Tpump process does row commits, which enables you to recover processed operations, even if you use piping mode. In other words, if you re-start the process, Tpump starts loading data from the last committed data.

Tpump can be used in the following modes:

Tpump_Insert : Use to do inserts.
Tpump_Update : Use to do updates (this mode requires you to define the primary key in the Informatica target table definition).
Tpump_Upsert : Use to do update otherwise insert (this mode requires you to define the primary key in the Informatica target table definition).
Tpump_Delete: Use to do deletes (this mode requires you to define the primary key in the Informatica target table definition).

Informatica uses the the actual target table name to generate the error table and log tables to be used as part of its control file generation. If you have two instances of Tpump loading into same target table at the same time, you need to modify the session to use a different error table and log table name.

The Tpump load process in piped mode is useful for incremental loads, and where the table is not empty. In the event of errors, restart the process and it starts re-loading from the last committed data.

Refer to Informatica documentation for information about configuring a session to use Teradata loaders.

Oracle® Business Intelligence Applications Installation and Configuration Guide		Copyright © 2007, Oracle. All rights reserved.