8 Monitoring and Troubleshooting

Performance tuning for an Oracle Real Application Clusters (Oracle RAC) database is very similar to performance tuning for a single-instance database. Many of the tuning tasks that you perform on single-instance Oracle databases can also improve performance of your Oracle RAC database. This chapter focuses on the performance tuning and monitoring tasks that are unique to Oracle RAC. For information about general performance tuning, refer to Oracle Database 2 Day DBA.

This chapter includes the following sections:

See Also:

Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information about diagnosing problems for Oracle Clusterware and Oracle Real Application Clusters components

Using Performance Views in Oracle Real Application Clusters

Oracle RAC contains a set of underlying views that are maintained by the database and accessible to the database administrator user SYS. These views are called dynamic performance views because they are continuously updated while a database is running and in use, and their contents relate primarily to performance.

Although these views appear to be standard database tables, they are not. These views provide data on internal disk structures and memory structures. You can select from these views, but you can never update or alter them.

GV$ Views

Each instance has a set of instance-specific views, which are prefixed with V$. You can also query global dynamic performance views to retrieve performance information from all the qualified instances. Global dynamic performance view names are prefixed with GV$.

Querying a GV$ view retrieves the V$ view information from all qualified instances. In addition to the V$ information, each GV$ view contains an extra column named INST_ID of data type NUMBER. The INST_ID column displays the instance number from which the associated V$ view information was obtained.

The INST_ID column can be used as a filter to retrieve V$ information from a subset of available instances. For example, the following query retrieves the information from the V$LOCK view for instances 2 and 5:

SQL> SELECT * FROM GV$LOCK WHERE INST_ID = 2 OR INST_ID = 5;

Monitoring Oracle RAC Database Performance

Both Oracle Enterprise Manager Database Control and Oracle Enterprise Manager Grid Control are cluster-aware and provide a central console to manage your cluster database. From the Cluster Database page, you can do all of the following:

  • View the overall system status, such as the number of nodes in the cluster database and their current status. This high-level view capability means that you do not have to access each individual database instance for details if you just want to see inclusive, aggregated information.

  • View alert messages aggregated across all the instances with lists for the source of each alert message. An alert message is an indicator that signifies that a particular metric condition has been encountered. A metric is a unit of measurement used to report the system's conditions.

  • Monitor performance metrics aggregated across all the instances or displayed side by side so you can readily compare instances.

  • Monitor cluster cache coherency statistics to help you identify processing trends and optimize performance for your Oracle RAC environment. Cache coherency statistics measure how well the data in caches on multiple instances is synchronized. If the data caches are completely synchronized with each other, then reading a memory location through the cache on any instance will return the most recent data written to that location through any cache on any instance.

Enterprise Manager accumulates data over specified periods of time, called collection-based data. Enterprise Manager also provides current data, known as real-time data. The following sections explain how to monitor both types of data:

Using the Cluster Database Performance Page

The Cluster Database Performance page provides a quick glimpse of the performance statistics for a database. Statistics are rolled up across all the instances in the cluster database. Using links that are next to the charts at the bottom of the page, you can get more specific information, allowing you to perform any of the following tasks:

  • Identify the causes of performance issues.

  • Decide whether resources need to be added or redistributed.

  • Tune your SQL plan and schema for better optimization.

  • Resolve performance issues.

The following screenshot shows a partial view of the Cluster Database Performance page. You access this page by clicking the Performance tab from the Cluster Database Home page.

Description of cldbperf001.gif follows
Description of the illustration cldbperf001.gif

Reviewing the Chart for Cluster Host Load Average

The Cluster Host Load Average chart in the Cluster Database Performance page shows potential problems that are outside the database. The chart shows maximum, average, and minimum load values for available hosts for the previous hour.

If the load average is higher than the average of the total number of CPUs across all the hosts in the cluster, then too many processes are waiting for CPU resources. SQL statements that are not tuned often cause high CPU usage. Compare the load average values with the values displayed for CPU Used in the Average Active Sessions chart. If the sessions value is low and the load average value is high, then this indicates that something else on the host, other than your database, is consuming the CPU.

Reviewing the Chart for Global Cache Block Access Latency

Each cluster database instance has its own buffer cache in their System Global Areas (SGAs). Using Cache Fusion, Oracle RAC environments logically combine each instance's buffer cache to enable the database instances to process data as if the data resided on a logically combined, single cache.

When a process attempts to access a data block, it first tries to locate a copy of the data block in the local buffer cache. If a copy of the data block is not found in the local buffer cache, a global cache operation is initiated. Before reading a data block from disk, the process attempts to find the data block in the buffer cache of another instance. If the data block is in the buffer cache of another instance, Cache Fusion transfers a version of the data block to the local buffer cache, rather than having one database instance write the data block to disk and requiring the other instance to reread the data block from disk. For example, after the sales1 instance loads a data block into its buffer cache, the sales2 instance can more quickly acquire the data block from the sales1 instance by using Cache Fusion rather than by reading the data block from disk.

The Global Cache Block Access Latency chart shows data for two different types of data block requests: current and consistent-read (CR) blocks. When you update data in the database, Oracle Database must locate the most recent version of the data block that contains the data, which is called the current block. If you perform a query, only data committed before the query began is visible to the query. Data blocks that were changed after the start of the query are reconstructed from data in the undo segments, and the reconstructed data is made available to the query in the form of a consistent-read block.

The Global Cache Block Access Latency chart on the Cluster Database Performance page shows the latency for each type of data block request, or the elapsed time it takes to locate and transfer each consistent read and current blocks between the buffer caches.

If the Global Cache Block Access Latency chart shows high latencies, this can be caused by any of the following:

  • A high number of requests caused by SQL statements that are not tuned.

  • A large number of processes in the queue waiting for CPU, or scheduling delays.

  • Slow, busy, or faulty interconnects. In these cases, check your network connection for dropped packets, retransmittals, or cyclic redundancy check (CRC) errors.

Concurrent read and write activity on shared data in a cluster is a frequently occurring activity. Depending on the service requirements, this activity does not usually cause performance problems. However, when global cache requests cause a performance problem, optimizing SQL plans and the schema to improve the rate at which data blocks are located in the local buffer cache, and minimizing I/O is a successful strategy for performance tuning. If the latency for consistent-read and current block requests reaches 10 milliseconds, then your first step in resolving the problem should be to go to the Cluster Cache Coherency page for more detailed information.

Reviewing the Cluster Cache Coherency Page

To access the Cluster Cache Coherency page, click Cluster Cache Coherency in the Additional Monitoring Links section of the Cluster Database Performance page. You can alternatively click either of the legends to the right of the Global Cache Block Access Latency chart.

The Cluster Cache Coherency page appears, as shown in the following screenshot. This page contains summary charts for cache coherency metrics for the cluster.

Description of cldbperf002.gif follows
Description of the illustration cldbperf002.gif

Table 8-1 provides a description of the Cluster Cache Coherency charts and the actions to perform to access more comprehensive information for problem resolution.

Table 8-1 Cluster Cache Coherency Charts

Name Description

Global Cache Block Access Latency

Shows the total elapsed time, or latency, for a block request. Click one of the legends to the right of the chart to view the average time it takes to receive data blocks for each block type (current or CR) by instance. On the Average Block Receive Time by Instance page, you can click an instance legend under the chart to go to the Block Transfer for Local Instance page, where you can identify which block classes, such as undo blocks, data blocks, and so on, are subject to intense global cache activity. This page displays the block classes that are being transferred, and which instances are transferring most of the blocks.Cache transfer indicates how many current and CR blocks per block class were received from remote instances, including how many transfers incurred a delay (busy) or an unexpected longer delay (congested).

Global Cache Block Transfer Rate

Shows the total aggregated number of blocks received by all instances in the cluster by way of an interconnect. Click one of the legends to the right of the chart to go to the Global Cache Blocks Received by Instance page for that type of block. From there, you can click an instance legend under the chart to go to the Segment Statistics by Instance page, where you can see which segments are causing cache contention.

Global Cache Block Transfers and Physical Reads

Shows the percentage of logical read operations that retrieved data from the buffer cache of other instances by way of Direct Memory Access and from disk. It is essentially a profile of how much work is performed in the local buffer cache, rather than the portion of remote references that incur some latency overhead. Click the legends to the right of the chart to go to the Global Cache Block Transfers vs. Logical Reads by Instance and Physical Reads vs. Logical Reads by Instance pages. From there, you can click an instance legend under the chart to go to the Segment Statistics by Instance page, where you can see which segments are causing cache contention.


Reviewing the Chart for Average Active Sessions

The Average Active Sessions chart in the Cluster Database Performance page shows potential problems inside the database. Categories, called wait classes, show how much of the database is waiting for a resource, such as CPU or disk I/O. Comparing CPU time to wait time helps to determine how much of the response time is consumed with useful work rather than waiting for resources that are potentially held by other processes.

The chart displays the load on the instance and identifies performance issues. At the cluster database level, this chart shows the aggregate wait class statistics across all the instances.

Compare the peaks on the Average Active Sessions chart with those on the Database Throughput charts. If the Average Active Sessions chart displays a large number of sessions waiting, indicating internal contention, but throughput is high, then the situation may be acceptable. The database is probably also performing efficiently if internal contention is low but throughput is high. However, if internal contention is high and throughput is low, then consider tuning the database.

If you click the wait class legends beside the Average Active Sessions chart, you can view instance-level information stored in Active Sessions by Instance pages. These pages show the service times for up to four instances. Using these pages, if you need to diagnose and fix problems that are causing the higher number of wait events in a specific category, you can select an instance of interest and view the wait events, as well as the SQL, sessions, services, modules, and actions that are consuming the most database resources.

See Also:

Oracle Database 2 Day DBA for more information about tuning a database and instance

Reviewing the Charts for Database Throughput

The Database Throughput charts summarize any contention that appears in the Average Active Sessions chart, and also show how much work the database is performing on behalf of the user or applications. The Per Second view is for databases that handle SQL queries, shown as Physical Reads in the bottom chart. The Per Transaction view is for databases that handle transactions, shown as Transactions in the top chart and Redo Size in the bottom chart. Logons show how many users are logged on to the database per second.

You can also obtain information at the instance level by clicking one of the legends to the right of the charts to access the Database Throughput by Instance page. This page shows the breakdown for all active instances of the aggregated Database Throughput chart on the Cluster Database Performance page. You can use this page to view the throughput for a particular instance, which may help you diagnose throughput problems.

You can drill down the list farther to see the sessions of an instance consuming the greatest resources. Click an instance name legend just under the chart to go to the Top Sessions page. For more information about this page, refer to the Enterprise Manager Help system.

Reviewing Other Performance-Related Charts

In the Additional Monitoring Links and Additional Instance Monitoring Links sections of the Cluster Database Performance page, there are links to other charts that are useful in evaluating the performance of your cluster database. This section describes the following charts:

Top Consumers Page

The Top Consumers page provides access to several tabs that enable you to view real-time or collection-based data for the services, modules, clients, and actions that are consuming the most system resources. You access the Top Consumers page by clicking Top Consumers in the Additional Monitoring Links section of the Cluster Database Performance page.

By default, the Top Consumers page initially displays the Overview tab, which shows aggregated summary data for the highest resource consumers. For instance-level information about a consumer, click the portion of a chart representing the consumer or click the link under the chart for that consumer. The page that appears shows the running instances that are serving the consumer. You can expand the names in the Module column to show data for individual instances.

Top Sessions Page

The Top Sessions page shows a real-time summary list of sessions based on aggregated data. You can see which sessions have consumed the greatest amount of system resources, referred to as the top sessions, and then decide whether or not you want to stop the sessions. You access the Top Sessions page by clicking Top Sessions in the Additional Monitoring Links section of the Cluster Database Performance page.

Instance Activity Page

The Instance Activity page enables you to view instance activity for several metrics within general metric categories, such as cursors, transactions, sessions, logical I/O, physical I/O, and net I/O. You can access top sessions statistics for a particular metric by clicking a metric legend under the chart if in Graphic mode, or by clicking a name in the summary table if in Tabular mode. You can also use the Switch Database Instance menu to toggle from one database instance to another. You can view data on a per-second or per-transaction basis. You access this page by clicking Instance Activity in the Additional Instance Monitoring Links section at the bottom of the Cluster Database Performance page.

Top Segments Page

Collecting and viewing segment-level statistics is an effective method for identifying frequently accessed tables or indexes in a database. The Top Segments page enables you to gather segment-level statistics to identify performance problems associated with individual segments. This page is particularly useful for Oracle RAC, because it also tracks the number of consistent read and current blocks received by an object. A high number of current blocks received plus a high number of buffer waits may indicate potential contention.

You access the Top Segments page by clicking Top Segments in the Additional Monitoring Links section. You can view segments for all instances, or use a filter to see segments for a specific instance.

Database Locks Page

Use the Database Locks page to determine if multiple instances are holding locks for the same object. The page shows user locks, all database locks, or locks that are blocking other users or applications. You can use this information to stop a session that is unnecessarily locking an object. You access the Database Locks page by clicking Database Locks in the Additional Monitoring Links section.

Using the Automatic Workload Repository in Oracle RAC Environments

In Oracle RAC environments, each Automatic Workload Repository (AWR) snapshot captures data from all active instances within the cluster. The data for each snapshot set that is captured for all active instances is from the same point in time. In addition, the data for each instance is stored separately and is identified with an instance identifier. For example, the buffer_busy_wait statistic shows the number of buffer wait events on each instance. AWR does not store data that is aggregated from across the entire cluster. In other words, the data is stored for each individual instance.

AWR automatically generates snapshots of the performance data once every hour and collects the statistics in the workload repository. You can also manually create snapshots, but this is usually not necessary. The data in the snapshot interval is then analyzed by the Automatic Database Diagnostic Monitor (ADDM).

See Also:

Oracle Database Performance Tuning Guide for more information about Automatic Workload Repository

Troubleshooting Configuration Problems in Oracle RAC Environments

Problems can occur when attempting to complete the installation or database creation process manually instead of using the Oracle Database management tools. Other problems occur due to the database administrator or system administrator missing important operating system or cluster configuration steps prior to installation. Both Oracle Clusterware and Oracle Database components have subcomponents that you can troubleshoot. The crsctl command check enables you to determine the status of several Oracle Clusterware components at one time.

This section contains the following topics:

Using CRSCTL to Diagnose Cluster Issues

You can use crsctl commands as the root operating system user to diagnose problems with your Oracle Clusterware installation, or to enable dynamic debugging for Oracle Clusterware. This section contains the following topics:

Obtaining Oracle Clusterware Component Names

Use the following command to obtain component names, where module_name is crs, evm, css or the name of the module:

# crsctl lsmodules module_name

For example, viewing the components of the css module might return the following results:

# crsctl lsmodules css
The following are the CSS modules :: 
CSSD
COMMCRS
COMMNS

Enabling Debugging of Oracle Clusterware Components

You can enable debugging for the Oracle Cluster daemons, Event Manager (EVM), and their modules by running crsctl commands as follows, where component_name is the name of an Oracle Clusterware component for which you want to enable debugging, such as crs, evm, or css, module is the name of module as it appears in the output for the crcstl lsmodules command, and debugging_level is a number from 1 to 5:

# crsctl debug log component module:debugging_level

For example, to enable tracing for the CSSD module of the css component, you could use the following command:

# crsctl debug log css CSSD:1

Enabling and Disabling Oracle Clusterware Daemons

When the Oracle Clusterware daemons are enabled, they start automatically when the node is started. To prevent the daemons from starting automatically, you can disable them using crsctl commands.

Run the following command to enable startup for all the Oracle Clusterware daemons:

# crsctl enable crs

Run the following command to disable the startup of all the Oracle Clusterware daemons:

# crsctl disable crs

Note:

The crsctl enable crs and crsctl disable crs commands are not supported on Microsoft Windows platforms.

Locating the Oracle Clusterware Alert Log

Oracle Clusterware posts alert messages when important events occur. For example, you might see alert messages from the Cluster Ready Services (CRS) daemon process when it starts, if it aborts, if the failover process fails, or if automatic restart of a CRS resource failed.

The location of the Oracle Clusterware log file is CRS_home/log/hostname/alerthostname.log, where CRS_home is the directory in which Oracle Clusterware was installed and hostname is the host name of the local node.

Enabling Debugging for an Oracle Clusterware Resource

You can use crsctl commands to enable resource debugging using the following syntax, where resource_name is the name of an Oracle Clusterware resource, such as ora.docrac1.vip, and debugging_level is a number from 1 to 5:

# crsctl debug log res resource_name:debugging_level

To obtain a list of the resources available for debugging, use the following command:

# crs_stat

Note:

When you enable debugging for an Oracle Clusterware resource using crsctl commands, this has the same effect as if you set the operating system environment variable USER_ORA_DEBUG to 1 before running the start, stop, or check action scripts for the specified resource.

Checking the Status of the Oracle Clusterware Installation

Use the crsctl check command to determine the condition of your clusterware installation, as shown in the following example:

# crsctl check crs

This command displays the status of the Cluster Synchronization Services (CSS), Event Manager (EVM), and the Cluster Ready Services (CRS) daemons. You can also check the status of an individual daemon using the following syntax, where daemon is one of crsd, cssd, or evmd:

# crsctl check daemon

Locating the Oracle Clusterware Component Log Files

Oracle RAC uses a unified log directory structure to store all the Oracle Clusterware component log files. This consolidated structure simplifies diagnostic information collection and assists during data retrieval and problem analysis.

The log files for the CRS daemon, crsd, can be found in the following directory:

CRS_home/log/hostname/crsd/

The log files for the CSS deamon, cssd, can be found in the following directory:

CRS_home/log/hostname/cssd/

The log files for the EVM deamon, evmd, can be found in the following directory:

CRS_home/log/hostname/evmd/

The log files for the Oracle Cluster Registry (OCR) can be found in the following directory:

CRS_home/log/hostname/client/

The log files for the Oracle RAC high availability component can be found in the following directories:

CRS_home/log/hostname/racg/
$ORACLE_HOME/log/hostname/racg

Note:

Each program that is part of the Oracle RAC high availability component has a subdirectory assigned exclusively for that program. The name of the program subdirectory is the same as the name of the program.

If any of the Oracle Clusterware components generates a core dump file, it is located is a subdirectory of the log directory for that component.

Running the Oracle Clusterware Diagnostics Collection Script

Run the diagcollection.pl script as the root user to collect diagnostic information from an Oracle Clusterware installation. The diagnostics provide additional information so that Oracle Support Services can resolve problems. Run this script from the operating system prompt as follows, where CRS_home is the home directory of your Oracle Clusterware installation:

# CRS_home/bin/diagcollection.pl --collect

This command displays the status of the Cluster Synchronization Services (CSS), Event Manager (EVM), and the Cluster Ready Services (CRS) daemons.

Using the Cluster Verification Utility to Diagnose Problems

The Cluster Verification Utility (CVU) can assist you in diagnosing a wide variety of configuration problems. Refer to the example of using the CVU in "Installing Oracle Clusterware 10g".

This section contains the following topics:

Enabling Tracing

You can enable tracing by setting the environment variable SRVM_TRACE to true. After setting this variable to true, run the command that you want to trace. The CVU trace files are created in the CRS_HOME/cv/log directory. Oracle RAC automatically rotates the log files, and the most recently created log file has the name cvutrace.log.0. You should remove unwanted log files or archive them to reclaim disk space, if needed. The CVU does not generate trace files unless you enable tracing.

Checking the Settings for the Interconnect

Cache Fusion enhances the performance of Oracle RAC by utilizing a high-speed interconnect to send data blocks to another instance's buffer cache. The high-speed interconnect should be a private network with the highest bandwidth to maximize performance.

For network connectivity verification, the CVU discovers all the available network interfaces if you do not specify an interface on the CVU command line.

To verify the accessibility of the cluster nodes from the local node or from any other cluster node, use the component verification command nodereach as follows:

cluvfy comp nodereach -n node_list [ -srcnode node ] [-verbose]

To verify that the other cluster nodes can be reached from the local node through all the available network interfaces or through specific network interfaces, use the component verification command nodecon as follows:

cluvfy comp nodecon -n node_list [ -i interface_list ] [-verbose]

You can also use the nodecon command without the -i option, as shown in the following example:

cluvfy comp nodecon -n all [-verbose]

When you issue the nodecom command as shown in the previous example, it instructs the CVU to perform the following tasks:

  • Discover all the network interfaces that are available on the cluster nodes.

  • Review the corresponding IP addresses and subnets for the interfaces.

  • Obtain the list of interfaces that are suitable for use as VIPs and the list of interfaces to private interconnects.

  • Verify the connectivity among all the nodes through those interfaces.

You can run the nodecon command in verbose mode to identify the mappings between the interfaces, IP addresses, and subnets. To verify the connectivity among the nodes through specific network interfaces, use the comp nodecon command with the -i option. For example, you can verify the connectivity among the nodes docrac1, docrac2, and docrac3, through interface eth0 by running the following command:

cluvfy comp nodecon -n docrac1, docrac2, docrac3 -i eth0 -verbose

Troubleshooting a Node with Status of UNKNOWN

If you run the cluvfy command using the -verbose argument and the CVU responds with UNKNOWN for a particular node, then this is because the CVU cannot determine whether a check passed or failed. The cause of this could be because a node is not reachable, or as a result of any system problem that was occurring on that node at the time that the CVU was performing a check.

The following is a list of possible causes for an UNKNOWN response:

  • The node is down.

  • Executable files that the CVU requires are missing in the CRS_home/bin directory or the $ORACLE_HOME/bin directory.

  • The user account that ran the CVU does not have privileges to run common operating system executable files on the node.

  • The node is missing an operating system patch or required package.

  • The kernel parameters on that node were not configured correctly and the CVU cannot obtain the operating system resources required to perform its checks.

See Also:

Oracle Database Oracle Clusterware and Oracle Real Application Clusters Installation Guide for Linux, or for a different operating system, for more information about using the CVU to troubleshoot your Oracle Clusterware setup

Verifying the Existence of Node Applications

To verify the existence of node applications, namely the virtual IP (VIP), Oracle Notification Services (ONS), and Global Service Daemon (GSD), on all the nodes, use the CVU comp nodeapp command, using the following syntax:

cluvfy comp nodeapp [ -n node_list] [-verbose]

Verifying the Integrity of Oracle Clusterware Components

To verify the existence of all the Oracle Clusterware components, use the component verification comp crs command, using the following syntax:

cluvfy comp crs [ -n node_list] [-verbose]

Verifying the Integrity of the Oracle Cluster Registry

To verify the integrity of the Oracle Cluster Registry, use the component verification comp ocr command, using the following syntax:

cluvfy comp ocr [ -n node_list] [-verbose]

Verifying the Integrity of Your Entire Cluster

To verify that all nodes in the cluster have the same view of the cluster configuration, use the component verification comp clu command, as follows:

cluvfy comp clu

Viewing Oracle RAC Database Alerts

Alert messages are displayed in Enterprise Manager on the Cluster Database Home page under the Alerts heading. The section Related Alerts displays nondatabase alert messages, for example, alert messages for Oracle Net.

The Alerts table is similar to that shown for single-instance databases, but in a cluster database, it includes columns for the target name and target type. For example, if a user connected to the sales1 instance exceeded his allotted login time, you would see an alert message with the following values:

  • Target name: sales_sales1

  • Target type: Database instance

  • Category: Response

  • Name: User logon time

  • Message: User logon time is 10250 microseconds

  • Alert triggered: Date and time when the alert condition occurred

The following screenshot shows an example of the Alerts display for a clustered database named docrac.

Description of racalert.gif follows
Description of the illustration racalert.gif

Viewing Oracle RAC Database Alert Log Messages

The alert log is associated with an instance. To view the alert log for your cluster database, you must use the Cluster Database Instance page in Enterprise Manager. On the Home page, in the Diagnostic Summary section, click the date string link next to the heading Alert Log to display the alert log entries containing ORA- errors. To view all the entries in the alert log, click Alert Log Content in the Related Links section on the Alert Log Errors page. Enterprise Manager displays the most recent alert log entries by default, but you can specify search criteria to display to alert log entries for a range of dates.

The following screenshot shows an example of the alert log entries for the docrac1 instance of a cluster database named docrac.

Description of racalert2.gif follows
Description of the illustration racalert2.gif