15
Recovery Manager Troubleshooting

This chapter describes how to troubleshoot Recovery Manager.

Interpreting RMAN Message Output

Recovery Manager provides detailed error messages that can aid in troubleshooting problems. Also, the Oracle database server and third-party media vendors generate useful debugging output of their own.

Identifying Types of Message Output

Output that is useful for troubleshooting failed RMAN jobs is located in several different places, as explained in the following table.

Type of Output	Produced By	Location	Description
RMAN messages	RMAN	Direct this output to one of the following: standard output (typically the terminal); a log file specified by `LOG` on the command line or by the `SPOOL` `LOG` command; or a file created by redirecting RMAN output by means of command line options.	Contains actions relevant to the RMAN job as well as error messages generated by RMAN, the server, and the media vendor. RMAN error messages have an `RMAN-`xxxxx prefix. Normal action descriptions do not have a prefix.
`alert_`SID`.log`	Oracle database server	The directory named in the `BACKGROUND_DUMP_DEST` initialization parameter.	Contains a chronological log of errors, initialization parameter settings, and administration operations. Records values for overwritten control file records (refer to "Monitoring the Overwriting of Control File Records").
Oracle trace file	Oracle database server	The directory specified in the `USER_DUMP_DEST` initialization parameter.	Contains detailed output generated by Oracle server processes. This file is created when an `ORA-600` or `ORA-3113` error message occurs, whenever RMAN cannot allocate a channel, and when Oracle fails to load the media management library.
`sbtio.log`	Third-party media management software	The directory specified in the `USER_DUMP_DEST` initialization parameter.	Contains vendor-specific information written by the media management software. Note that this log does not contain Oracle server or RMAN errors.
Media manager log file	Third-party media management software	The filenames for any media manager logs other than `sbtio.log` are determined by the media management software.	Contains information on the functioning of the media management device.

Recognizing RMAN Error Message Stacks

On various occasions it may be important for you to determine whether RMAN successfully executed a command. For example, if you are trying to write a script that performs an unattended backup using RMAN, you may want to know whether the backup was a success or failure.

One way to determine whether RMAN encountered an error is to examine its return code, as described in "Identifying RMAN Return Codes". A second way is to search the Recovery Manager output for the string RMAN-00569, which is the message number for the error stack banner. All RMAN errors are preceded by this error message. If you do not see an RMAN-00569 message in the output, then there are no errors. Following is sample output for a syntax error:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00558: error encountered while parsing input commands
RMAN-01005: syntax error: found "}": expecting one of: "archivelog, backup, backupset, 
channel, comma, controlfilecopy, current, database, datafile, datafilecopy, delete, 
diskratio, filesperset, format, include, (, parms, pool, ;, skip, setsize, tablespace, 
tag"
RMAN-01007: at line 1 column 58 file: standard input
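
For an unattended script, the banner check described above can be sketched in shell as follows. This is only a minimal sketch, not part of RMAN: the function name `rman_log_has_errors` and the log file name are hypothetical, and the sketch assumes that the RMAN output was saved to a file.

```shell
# Hypothetical helper: succeeds (exit status 0) if the RMAN log passed as $1
# contains the error stack banner RMAN-00569, and fails otherwise.
rman_log_has_errors() {
    grep -q 'RMAN-00569' "$1"
}

# Example use (backup.log is an assumed file name):
#   if rman_log_has_errors backup.log; then echo "backup failed"; fi
```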

Identifying Error Codes

Typically, you find the following types of error codes in RMAN message stacks:

Errors prefixed with RMAN-
Errors prefixed with ORA-
Errors preceded by the line Additional information:

See Also:
Oracle9i Database Error Messages for explanations of RMAN and ORA error codes

RMAN Error Message Numbers

Table 15-1 indicates the error ranges for common RMAN error messages, all of which are described in Oracle9i Database Error Messages.

Table 15-1 RMAN Error Message Ranges

Error Range	Cause
0550-0999	Command-line interpreter
1000-1999	Keyword analyzer
2000-2999	Syntax analyzer
3000-3999	Main layer
4000-4999	Services layer
5000-5499	Compilation of `RESTORE` or `RECOVER` command
5500-5999	Compilation of `DUPLICATE` command
6000-6999	General compilation
7000-7999	General execution
8000-8999	PL/SQL programs
9000-9999	Low-level keyword analyzer
10000-10999	Server-side execution
11000-11999	Interphase errors between PL/SQL and RMAN
12000-12999	Recovery catalog packages
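
When you scan long logs, it can help to map a message number to its row in Table 15-1 mechanically. The following shell function is a sketch of that lookup. The name `rman_error_cause` is hypothetical, and numbers outside the ranges listed in the table are reported as unknown.

```shell
# Hypothetical helper: print the Table 15-1 cause for an RMAN message number.
# $1 is the numeric part of the message, for example 06019 for RMAN-06019.
rman_error_cause() {
    # Strip leading zeros so that the shell compares plain decimal numbers.
    n=$(printf '%s' "$1" | sed 's/^0*//')
    case $n in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if   [ "$n" -lt 550 ];   then echo "unknown"; return 1
    elif [ "$n" -le 999 ];   then echo "Command-line interpreter"
    elif [ "$n" -le 1999 ];  then echo "Keyword analyzer"
    elif [ "$n" -le 2999 ];  then echo "Syntax analyzer"
    elif [ "$n" -le 3999 ];  then echo "Main layer"
    elif [ "$n" -le 4999 ];  then echo "Services layer"
    elif [ "$n" -le 5499 ];  then echo "Compilation of RESTORE or RECOVER command"
    elif [ "$n" -le 5999 ];  then echo "Compilation of DUPLICATE command"
    elif [ "$n" -le 6999 ];  then echo "General compilation"
    elif [ "$n" -le 7999 ];  then echo "General execution"
    elif [ "$n" -le 8999 ];  then echo "PL/SQL programs"
    elif [ "$n" -le 9999 ];  then echo "Low-level keyword analyzer"
    elif [ "$n" -le 10999 ]; then echo "Server-side execution"
    elif [ "$n" -le 11999 ]; then echo "Interphase errors between PL/SQL and RMAN"
    elif [ "$n" -le 12999 ]; then echo "Recovery catalog packages"
    else echo "unknown"; return 1
    fi
}
```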

Media Manager Error Numbers

When errors occur through the media management API, RMAN returns an error message number prefixed as follows:

Additional information:

Table 15-2 lists media manager message numbers and their corresponding error text. In the error codes, O/S stands for operating system. The errors marked with an asterisk are internal and should never be seen during normal operation.

Table 15-2 Media Manager Error Message Ranges

Cause	No.	Message
sbtopen	7000	Backup file not found (only returned for read)
sbtopen	7001	File exists (only returned for write)
sbtopen	7002*	Bad mode specified
sbtopen	7003	Invalid block size specified
sbtopen	7004	No tape device found
sbtopen	7005	Device found, but busy; try again later
sbtopen	7006	Tape volume not found
sbtopen	7007	Tape volume is in-use
sbtopen	7008	I/O Error
sbtopen	7009	Can't connect with Media Manager
sbtopen	7010	Permission denied
sbtopen	7011	O/S error for example malloc, fork error
sbtopen	7012*	Invalid argument(s) to sbtopen
sbtclose	7020*	Invalid file handle or file not open
sbtclose	7021*	Invalid flags to sbtclose
sbtclose	7022	I/O error
sbtclose	7023	O/S error
sbtclose	7024*	Invalid argument(s) to sbtclose
sbtclose	7025	Can't connect with Media Manager
sbtwrite	7040*	Invalid file handle or file not open
sbtwrite	7041	End of volume reached
sbtwrite	7042	I/O error
sbtwrite	7043	O/S error
sbtwrite	7044*	Invalid argument(s) to sbtwrite
sbtread	7060*	Invalid file handle or file not open
sbtread	7061	EOF encountered
sbtread	7062	End of volume reached
sbtread	7063	I/O error
sbtread	7064	O/S error
sbtread	7065*	Invalid argument(s) to sbtread
sbtremove	7080	Backup file not found
sbtremove	7081	Backup file in use
sbtremove	7082	I/O Error
sbtremove	7083	Can't connect with Media Manager
sbtremove	7084	Permission denied
sbtremove	7085	O/S error
sbtremove	7086*	Invalid argument(s) to sbtremove
sbtinfo	7090	Backup file not found
sbtinfo	7091	I/O Error
sbtinfo	7092	Can't connect with Media Manager
sbtinfo	7093	Permission denied
sbtinfo	7094	O/S error
sbtinfo	7095*	Invalid argument(s) to sbtinfo
sbtinit	7110*	Invalid argument(s) to sbtinit
sbtinit	7111	O/S error

Interpreting RMAN Error Stacks

Sometimes you may find it difficult to identify the useful messages in the RMAN error stack. Note the following tips and suggestions:

Because most of the messages in the error stack are not meaningful for the purposes of troubleshooting, try to identify the one or two errors that are most important.
Check for a line that says Additional information followed by an integer. This line indicates a media management error. The integer that follows refers to a code that is explained in the text of the error message.
Read the messages from the bottom up, because this is the order in which RMAN issues the messages. The first one or two errors issued are usually the most informative.
Identify the basic type of error according to the error range chart in Table 15-1 and then refer to Oracle9i Database Error Messages for information on the most important messages.
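
The bottom-up reading order suggested in these tips can be approximated with a short shell function. This is a sketch under assumptions: the name `rman_stack_bottom_up` is hypothetical, the RMAN output is assumed to be saved in a file, and only the last error stack in the file is printed.

```shell
# Hypothetical helper: print the lines that follow the last error stack banner
# in the RMAN log $1, in reverse order, so that the first line printed is the
# last line of the stack.
rman_stack_bottom_up() {
    awk '
        /RMAN-00569/ { n = 0; in_stack = 1; next }   # banner: start a new stack
        /RMAN-00571/ { next }                         # skip the ruler lines
        in_stack && NF { line[++n] = $0 }             # collect non-blank lines
        END { for (i = n; i >= 1; i--) print line[i] }
    ' "$1"
}

# Example use (backup.log is an assumed file name):
#   rman_stack_bottom_up backup.log | head -n 2
```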

Interpreting RMAN Errors: Example

You attempt the following backup of tablespace tbs_99 and receive the following message:

RMAN> BACKUP TABLESPACE tbs_99;

allocated channel: c1
channel c1: sid=8 devtype=DISK

RMAN-03026: error recovery releasing channel resources
RMAN-08031: released channel: c1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure during compilation of command
RMAN-03013: command type: backup
RMAN-06038: recovery catalog package detected an error
RMAN-20202: tablespace not found in the recovery catalog
RMAN-06019: could not translate tablespace name "TBS_99"

You read the last two messages in the stack first and immediately see the problem: no tablespace called tbs_99 appears in the recovery catalog. You conclude that either tbs_99 does not exist in the database or the recovery catalog has not been resynchronized to include this tablespace.

Interpreting Server Errors: Example

Assume that you attempt to recover a tablespace and receive the following errors:

RMAN> RECOVER TABLESPACE tbs_5;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure during compilation of command
RMAN-03013: command type: recover
RMAN-03006: non-retryable error occurred during execution of command: recover(3)
RMAN-12004: unhandled exception during command execution on channel default
RMAN-10032: unhandled exception during execution of job step 1: ORA-00283: recovery 
            session canceled due to errors
RMAN-11003: failure during parse/execution of SQL statement: alter database recover if 
            needed tablespace TBS_5
RMAN-11001: Oracle Error: ORA-00283: recovery session canceled due to errors
ORA-01124: cannot recover data file 21 - file is in use or recovery
ORA-01110: data file 21: '/ade/lashdown_main/oracle/dbs/tbs_53.f'

As suggested, you start reading from the bottom up. The ORA-01110 message identifies the datafile that has the problem, tbs_53.f. The next message up, ORA-01124, indicates that Oracle cannot recover the datafile because it is in use or already being recovered. The remaining RMAN errors indicate that the recovery session was canceled due to the server errors. Hence, you conclude that because you were not already recovering this datafile, the problem must be that the datafile is online, and you need to take it offline and restore a backup.

Interpreting Media Management Errors: Example

Media management errors in RMAN message output are not uncommon. Assume that you use a tape drive and receive the following output during a backup job:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03007: retryable error occurred during execution of command: allocate
RMAN-07004: unhandled exception during command execution on channel c4
RMAN-10032: unhandled exception during execution of job step 4: ORA-06512: at line 158 
RMAN-10035: exception raised in RPC: ORA-19624: operation failed, retry possible 
ORA-19506: failed to create sequential file, name="df_99_1", parms=""
ORA-27007: failed to open file
HP-UX Error: 1003: Unknown system error
Additional information: 7004
Additional information: 1
ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 410
RMAN-10031: ORA-19624 occurred during call to DBMS_BACKUP_RESTORE.DEVICEALLOCATE

Following the suggestions for reading error message stacks, you look for the Additional information line and notice:

Additional information: 7004

Table 15-2 shows that error 7004 means that no tape device was found. So, the media management software cannot write the files to the device because the device is unavailable or there is a problem with it. Note also that when you read from the bottom up, the first line says that an error occurred during a call to a PL/SQL program unit called DEVICEALLOCATE. Also, the first message below the stack banner says that there was an error executing the ALLOCATE command. All of this information indicates that RMAN was not able to allocate an sbt channel because of a problem with the device.

Note:

The sbtio.log contains information written by the media management software, not the Oracle database server. Hence, you must consult your media vendor documentation to interpret the error codes and messages.
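
Looking up an Additional information code in Table 15-2 can also be scripted. The following shell functions are a sketch only: the names `sbt_codes_in_log` and `sbt_error_text` are hypothetical, and the message text is copied from Table 15-2. Values that are not in the table, such as the second Additional information line in the example, are reported as unknown.

```shell
# Hypothetical helper: print the number from each "Additional information"
# line in the RMAN log $1, one number per line.
sbt_codes_in_log() {
    sed -n 's/^[[:space:]]*Additional information:[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1"
}

# Hypothetical helper: print the Table 15-2 message for media manager code $1.
sbt_error_text() {
    case $1 in
        7000) echo "Backup file not found (only returned for read)" ;;
        7001) echo "File exists (only returned for write)" ;;
        7002) echo "Bad mode specified" ;;
        7003) echo "Invalid block size specified" ;;
        7004) echo "No tape device found" ;;
        7005) echo "Device found, but busy; try again later" ;;
        7006) echo "Tape volume not found" ;;
        7007) echo "Tape volume is in-use" ;;
        7008|7022|7042|7063|7082|7091) echo "I/O error" ;;
        7009|7025|7083|7092) echo "Can't connect with Media Manager" ;;
        7010|7084|7093) echo "Permission denied" ;;
        7011) echo "O/S error for example malloc, fork error" ;;
        7023|7043|7064|7085|7094|7111) echo "O/S error" ;;
        7012) echo "Invalid argument(s) to sbtopen" ;;
        7020|7040|7060) echo "Invalid file handle or file not open" ;;
        7021) echo "Invalid flags to sbtclose" ;;
        7024) echo "Invalid argument(s) to sbtclose" ;;
        7041|7062) echo "End of volume reached" ;;
        7044) echo "Invalid argument(s) to sbtwrite" ;;
        7061) echo "EOF encountered" ;;
        7065) echo "Invalid argument(s) to sbtread" ;;
        7080|7090) echo "Backup file not found" ;;
        7081) echo "Backup file in use" ;;
        7086) echo "Invalid argument(s) to sbtremove" ;;
        7095) echo "Invalid argument(s) to sbtinfo" ;;
        7110) echo "Invalid argument(s) to sbtinit" ;;
        *) echo "unknown media manager code"; return 1 ;;
    esac
}

# Example use (backup.log is an assumed file name):
#   for code in $(sbt_codes_in_log backup.log); do sbt_error_text "$code"; done
```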

Identifying RMAN Return Codes

One way to determine whether RMAN encountered an error is to examine its return code. RMAN returns 0 to the operating system if no errors occurred, a nonzero value otherwise. For example, if you are running UNIX with the C shell, then, when RMAN completes, the return code will be in a shell variable called $status.
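
In Bourne-style shells, the return code of the last command is available in `$?` rather than `$status`. The following wrapper is a minimal sketch of a return code check. The function name `run_and_report` is hypothetical, and the RMAN command line shown in the comment, including the connect string and file names, is an assumption that you should adapt to your site.

```shell
# Hypothetical wrapper: run the command given as arguments and report whether
# it returned 0 (success) or a nonzero value (failure) to the shell.
run_and_report() {
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "succeeded"
    else
        echo "failed with return code $rc"
    fi
    return "$rc"
}

# Example use (connect string, command file, and log file are assumptions):
#   run_and_report rman TARGET / CMDFILE backup.rcv LOG backup.log
```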

Testing the Media Management API

On specific platforms, Oracle provides a diagnostic tool called sbttest. This utility performs a simple test of the media management software by acting as the Oracle database server and attempting to communicate with the media manager.

Obtaining the sbttest Utility

On UNIX, the sbttest utility is located in $ORACLE_HOME/bin. If for some reason the utility is not included with your platform, then contact Oracle Support to obtain the C version of the program. You can compile this version of the program on all UNIX platforms.

Note that on platforms such as Solaris, you do not have to relink when using sbttest. On other platforms, relinking may be necessary.

Obtaining Online Documentation for the sbttest Utility

For online documentation of sbttest, issue the following on the command line:

% sbttest

The program displays the list of possible arguments for the program:

Error: backup file name must be specified
Usage: sbttest backup_file_name        # this is the only required parameter
               <-dbname database_name>
               <-trace trace_file_name>
               <-remove_before>
               <-no_remove_after> 
               <-read_only>
               <-no_regular_backup_restore>
               <-no_proxy_backup>
               <-no_proxy_restore>
               <-file_type n>
               <-copy_number n>
               <-media_pool n>
               <-os_res_size n>
               <-pl_res_size n>
               <-block_size block_size> 
               <-block_count block_count>
               <-proxy_file os_file_name bk_file_name 
                           [os_res_size pl_res_size block_size block_count]>

The display also indicates the meaning of each argument. For example, following is the description for two optional parameters:

Optional parameters:
  -dbname  specifies the database name which will be used by SBT 
           to identify the backup file. The default is "sbtdb"
  -trace   specifies the name of a file where the Media Management 
           software will write diagnostic messages.

Using the sbttest Utility

Use sbttest to perform a quick test of the media manager. The following table explains how to interpret the output.

If sbttest returns . . .	Then . . .
0	The program ran without error. In other words, the media manager is installed and can accept a data stream and return the same data when requested.
a nonzero value	The program encountered an error. Either the media manager is not installed or it is not configured correctly.

To use sbttest:

Make sure the program is installed and included in the system path by typing sbttest at the command line:
```
% sbttest
```
If the program is operational, then you should see a display of the online documentation.
Execute the program, specifying any of the arguments described in the online documentation. For example, enter the following to create test file some_file.f and write the output to sbtio.log:
```
% sbttest some_file.f -trace sbtio.log
```
You can also test a backup of an existing datafile. For example, this command tests datafile tbs_33.f of database prod:
```
% sbttest tbs_33.f -dbname prod
```

Examine the output. If the program encounters an error, then it provides messages describing the failure. For example, if Oracle cannot find the library, you see:

libobk.so could not be loaded. Check that it is installed properly, and that LD_
LIBRARY_PATH environment variable (or its equivalent on your platform) includes the 
directory where this file can be found. Here is some additional information on the 
cause of this error:
ld.so.1: sbttest: fatal: libobk.so: open failed: No such file or directory

Note that in some cases sbttest can work but an RMAN backup does not. The reasons can be the following:

The user who starts sbttest is not the owner of the Oracle processes.
If the Oracle server is not linked with the media management library, then sbttest can still work.
The sbttest program passes all environment parameters from the shell but RMAN does not.

Terminating an RMAN Command

You have the following methods for terminating an RMAN command while it is executing:

Press CTRL+C (or the equivalent "attention" key combination for your system) in the RMAN interface, which is the preferred method. This operation also terminates allocated channels unless they are hung in the media management code, for example, when they are waiting for a tape to be mounted.
Kill the server session corresponding to the RMAN channel by running the SQL ALTER SYSTEM statement.
Terminate the server session corresponding to the RMAN channel on the operating system.

Terminating the Session with ALTER SYSTEM

You can identify the Oracle session ID for an RMAN channel by looking in the RMAN log for messages with the format shown in the following example:

channel ch1: sid=15 devtype=DISK

The sid and devtype are displayed for each allocated channel. Note that the Oracle sid is different from the operating system process ID. You can kill the session by specifying the sid in a SQL statement, which is not the same operation as killing the process with an operating system command.

To kill the session, use the SQL statement ALTER SYSTEM KILL SESSION. It takes two arguments (the sid printed in the RMAN message and a serial number), both of which can be obtained by querying V$SESSION. For example, run the following statement, where sid_in_rman_output is the number from the RMAN message:

SELECT SERIAL# FROM V$SESSION WHERE SID=sid_in_rman_output;

Then, run the following statement, substituting the sid_in_rman_output and serial number obtained from the query:

ALTER SYSTEM KILL SESSION 'sid_in_rman_output,serial#';

Note that this is no more effective than killing at the operating system level if the process is hung in the media manager.
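
If the RMAN output is saved in a log file, the channel messages can be extracted with a short shell function, and the statement text can be generated once you know the serial number. This is a sketch only: the names `rman_channel_sids` and `kill_session_sql` are hypothetical, and the functions only print text. They do not connect to the database.

```shell
# Hypothetical helper: for each channel message in the RMAN log $1, print
# the channel name, the sid, and the device type on one line.
rman_channel_sids() {
    sed -n 's/^channel \([^:]*\): sid=\([0-9][0-9]*\) devtype=\(.*\)$/\1 \2 \3/p' "$1"
}

# Hypothetical helper: print the statement that kills the session with
# sid $1 and serial number $2 (the serial number comes from V$SESSION).
kill_session_sql() {
    printf "ALTER SYSTEM KILL SESSION '%s,%s';\n" "$1" "$2"
}

# Example use (backup.log and the serial number 1234 are assumptions):
#   rman_channel_sids backup.log
#   kill_session_sql 15 1234
```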

Terminating the Session at the Operating System Level

Finding and killing the processes that are associated with the server sessions is operating system specific. Note that on some platforms the server sessions are not associated with any processes at all. See your operating system specific documentation for more information.

Terminating an RMAN Session That Is Hung in the Media Manager

You may sometimes need to kill an RMAN job that is hanging when RMAN is interacting with a media manager. The best way to terminate RMAN when the connections for the allocated channels are hung in the media manager is to abort the session in the media manager. If this action does not solve the problem, then the next step is to kill the Oracle processes of the connections. Note that killing the Oracle process can cause problems for the media manager.

Components of an RMAN Session

The nature of an RMAN session depends on the operating system. In UNIX, an RMAN session has the following processes associated with it:

The RMAN process itself
The catalog connection to the recovery catalog database (only if you use a recovery catalog)
An auxiliary connection to an auxiliary instance (only when you run DUPLICATE or perform TSPITR)
The initial connection to the target database, also called the default channel
A polling connection to the target database used for monitoring RMAN command execution on the various allocated channels. By default, RMAN makes one polling connection. RMAN makes additional polling connections if you use different connect strings in the ALLOCATE CHANNEL or CONFIGURE CHANNEL commands. One polling connection exists for each distinct connect string used in the ALLOCATE CHANNEL or CONFIGURE CHANNEL command.
One target connection to the target database corresponding to each allocated channel

Process Behavior During a Hung Job

RMAN usually hangs because one of the channel connections is waiting in the media manager code for a tape resource. The catalog connection and the default channel seem to hang because they are waiting for RMAN to tell them what to do. Polling connections seem to be in an infinite loop while polling the RPC under the control of the RMAN process.

If you kill the RMAN process itself, then you also kill the catalog connection, the auxiliary connection, the default channel, and the polling connections. Target and auxiliary connections that are not hung in the media manager code are also terminated; only the target and auxiliary connections executing in the media management layer remain active. You must manually kill the processes for these connections, because terminating their sessions does not kill them. Even after termination, the media manager may keep resources busy or continue processing because it does not realize that the Oracle process is gone. This behavior depends on which media manager you use.

Terminating the catalog connection does not cause RMAN to finish because RMAN is not performing catalog operations while the backup or restore is in progress. Removing the default channel and polling connections causes the RMAN process to detect that one of the channels has died and then to exit. In this case, the connections to the hung channels remain active as described previously.

Terminating an RMAN Session: Basic Steps

The best way to terminate RMAN when the connections for the allocated channels are hung in the media manager is to kill the Oracle process of the connections. The RMAN process detects this termination and proceeds to exit, removing all connections except target connections that are still operative in the media management layer. The caveat about the media manager resources still applies in this case.

To terminate an Oracle process that is hung in the media manager:

This procedure is operating system specific.

Obtain the current stack trace for the desired process ID by using an operating system specific utility. For example, on Solaris you can use the /usr/proc/bin/pstack command to obtain the stack.
After the stack is obtained, look for the process with sbtxxxx (normally sbtopen) in the stack trace.
Obtain the stack again after a few minutes. If the same stack trace is returned, then you have identified the hung process.
Kill the hung process with an operating system specific utility. For example, on Solaris execute a kill -9 command.
Repeat this procedure for all hung channels in the media management code.
Check that the media manager also clears its processes, otherwise the next backup or restore may still hang due to the previous hang. In some media managers, the only solution is to shut down and restart the media manager. If the documentation from the media manager is not helpful, ask the media manager technical support for the correct solution.
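
The comparison of two stack traces in this procedure can be sketched in shell. The function name `looks_hung_in_sbt`, the process ID, and the file names are hypothetical, and the check is only a heuristic: it tests that the two saved traces are identical and that they mention a function whose name starts with sbt.

```shell
# Hypothetical helper: given two stack traces of the same process saved a few
# minutes apart ($1 and $2), succeed only if they are identical and mention
# a media management call whose name starts with sbt.
looks_hung_in_sbt() {
    cmp -s "$1" "$2" && grep -q 'sbt[a-z]' "$1"
}

# Example on Solaris (the process ID 16226 and the file names are assumptions):
#   /usr/proc/bin/pstack 16226 > stack1.txt; sleep 300
#   /usr/proc/bin/pstack 16226 > stack2.txt
#   looks_hung_in_sbt stack1.txt stack2.txt && echo "process appears hung"
```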

See Also:
Your operating system specific documentation for the relevant commands

RMAN Troubleshooting Scenarios

After Installation of Media Manager, RMAN Channel Allocation Fails: Scenario

In this scenario, you install and test the media manager as explained in "Configuring RMAN to Make Backups to a Media Manager", but you still cannot make RMAN back up to tape. For example, after allocating the sbt channel, you receive an error stack similar to the following:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00579: the following error occurred at 03/02/2001 10:21:18
RMAN-03007: retryable error occurred during execution of command: allocate
RMAN-12004: unhandled exception during command execution on channel foo
RMAN-10035: exception raised in RPC: ORA-19554: error allocating device, 
            device type: SBT_TAPE,device name:
ORA-19557: device error, device type: SBT_TAPE, device name:
ORA-27211: Failed to load Media Management Library
           Additional information: 2
RMAN-10031: ORA-19624 occurred during call to DBMS_BACKUP_RESTORE.DEVICEALLOCATE

After Installation of Media Manager, RMAN Channel Allocation Fails: Diagnosis

The ORA-27211 error indicates that the channel allocation is failing because Oracle is not loading the media management library. If the channel allocation fails, then Oracle generates a trace file in the USER_DUMP_DEST location that contains the error that caused the channel allocation to fail. The trace file should have the complete path name of the media management library loaded by Oracle as well as any other media manager errors or operating system errors. For example, the trace file on UNIX may be called something like /oracle/rdbms/log/prod1_ora_16226.trc, and may contain information such as the following:

*** SESSION ID:(10.1) 2001-02-16 17:55:12.941 
SKGFQ OSD: Error in function sbtinit on line 2272 
SKGFQ OSD: Look for SBT Trace messages in file /oracle/rdbms/log/sbtio.log
SBT Initialize failed for oracle.static

The last line of this output indicates that Oracle is loading the default static library instead of the media management library that you installed.

To test the loading of the media management library, try allocating a channel by using the PARMS parameter SBT_LIBRARY to force the loading of the media management library. For example, if your library is called /vendor/lib/some_mm_lib.so and is pointed to by $ORACLE_HOME/lib/libobk.so, then run a command such as the following, making sure to specify whatever PARMS settings are required by your media manager:

RUN
{
  ALLOCATE CHANNEL c1 DEVICE TYPE sbt 
    PARMS='SBT_LIBRARY=/oracle/lib/libobk.so',
          'ENV=(NSR_SERVER=tape_svr,NSR_CLIENT=oracleclnt,NSR_GROUP=oracle_tapes)';
}

If the channel allocation fails, then check the trace file again to see whether you can learn anything new. If the channel allocation with SBT_LIBRARY succeeds, but an ordinary sbt channel allocation fails, then Oracle is probably trying to load a library ($ORACLE_HOME/lib/libobk.so on UNIX, %ORACLE_HOME%/bin/orasbt.dll on NT) other than the one you installed. You may have more than one library in the operating system path, and the one that Oracle is loading is the wrong one.

After Installation of Media Manager, RMAN Channel Allocation Fails: Solution

If the problem is that Oracle is not loading the correct library, then make sure that the library is named correctly. For example, on UNIX you should name it $ORACLE_HOME/lib/libobk.so or create a symbolic link with this name that points to your library, and on Windows NT you should name it %ORACLE_HOME%/bin/orasbt.dll. It is possible to place these files in nondefault directories, but they must be included in the system path so that Oracle can locate them.
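
On UNIX, a quick way to see what a library path refers to is sketched below. The function name `describe_mml_path` is hypothetical, and the sketch assumes that the `readlink` utility is available on your platform.

```shell
# Hypothetical helper: report whether the library path $1 is a symbolic link
# (and where it points), a regular file, or missing.
describe_mml_path() {
    if [ -L "$1" ]; then
        echo "symbolic link to $(readlink "$1")"
    elif [ -f "$1" ]; then
        echo "regular file"
    else
        echo "missing"
        return 1
    fi
}

# Example use:
#   describe_mml_path "$ORACLE_HOME/lib/libobk.so"
```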

See Also:

Oracle9i Recovery Manager Reference for descriptions of the legal PARMS parameters

Backup Job Is Hanging: Scenario

In this scenario, an RMAN backup job starts as normal and then pauses inexplicably:

Recovery Manager: Release 9.0.1.0.0 - Production
connected to target database: PROD1
connected to recovery catalog database

RMAN> BACKUP TABLESPACE SYSTEM, users;

allocated channel: t1
channel t1: sid=16 devtype=SBT_TAPE

channel t1: starting datafile backupset
set_count=15 set_stamp=338309600
channel t1: including datafile 2 in backupset
channel t1: including datafile 1 in backupset
channel t1: including current controlfile in backupset
# Hanging here for 30 minutes now

Backup Job Is Hanging: Diagnosis

If a backup job is hanging, that is, not proceeding, then several scenarios are possible:

The job abnormally terminated.
A server-side or media management error occurred.
RMAN is waiting for an event such as the insertion of a new cassette into the tape device.

Backup Job Is Hanging: Solution

Because the causes of a hung backup job can be varied, so are the solutions. The best practice is to look for the simplest solutions first. For example, backup jobs often hang simply because the tape device has completely filled the current cassette and is waiting for a new tape to be inserted.

If the media manager is not waiting for a new tape, then examine media manager process, log, and trace files for signs of abnormal termination or other errors (refer to the description of message files in "Identifying Types of Message Output").

On the Oracle side, check what the server sessions performing the backup are doing. The RMAN output prints the SID of each server session, as in this example:

channel ORA_DISK_1: sid=12 devtype=DISK

How many processes are hanging? If only one, check what it is doing by querying V$SESSION_WAIT. For example, to see what server session 12 is doing, enter:

SQL> SELECT * FROM V$SESSION_WAIT WHERE WAIT_TIME = 0 AND SID = 12;

See Also:

"Correlating Server Sessions with Channels" to learn how to obtain the SID value, and "Terminating an RMAN Session: Basic Steps" to learn how to kill an RMAN session that is hanging

RMAN Fails to Start RPC Call: Scenario

In this scenario, you run a backup job and receive message output similar to the following:

channel c8: including datafile number 47 in backupset
RPC call appears to have failed to start on channel c9
RPC call ok on channel c9
channel c3: including datafile number 18 in backupset

RMAN Fails to Start RPC Call: Diagnosis

The RPC call appears to have failed message does not usually indicate a problem. The message indicates one of the following:

The target database instance is slow.
A timing problem occurred.

Timing problems occur as follows. When RMAN begins an RPC, it checks the V$SESSION performance view. The RPC updates the information in the view to indicate when it starts and finishes. Sometimes RMAN checks V$SESSION before the RPC has indicated that it has started, which in turn generates the following message:

RPC call appears to have failed

If a message stating "RPC call ok" does not appear in the output immediately following the message stating "RPC call appears to have failed", then the backup job encountered a problem.
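
The check described above can be sketched as a scan of a saved RMAN log. The function name `unconfirmed_rpc_failures` is hypothetical, and the sketch assumes that the messages begin at the start of a line, as in the sample output.

```shell
# Hypothetical helper: print each "RPC call appears to have failed" line in
# the RMAN log $1 that is not immediately followed by an "RPC call ok" line.
# The exit status is 1 if any such line is found, and 0 otherwise.
unconfirmed_rpc_failures() {
    awk '
        pending != "" {
            # The previous line reported a possible failure: check this line.
            if ($0 !~ /^RPC call ok/) { print pending; bad = 1 }
            pending = ""
        }
        /^RPC call appears to have failed/ { pending = $0 }
        END {
            # A possible failure on the last line has no confirmation either.
            if (pending != "") { print pending; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}

# Example use (backup.log is an assumed file name):
#   unconfirmed_rpc_failures backup.log || echo "check the backup job"
```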

Backup Fails with Invalid RECID Error: Scenario

In this scenario, you attempt a backup and receive the following error messages:

RMAN-3014: Implicit resync of recovery catalog failed
RMAN-6038: Recovery catalog package detected an error
RMAN-20035: Invalid high RECID error

Backup Fails with Invalid RECID Error: Diagnosis

You probably restored a backup control file created through a non-Oracle mechanism, and then opened the database without performing a RESETLOGS operation. If you had created the backup control file through the RMAN BACKUP command or the ALTER DATABASE BACKUP CONTROLFILE statement, then Oracle would have required you to reset the logs.

The control file and the recovery catalog are no longer synchronized. The database control file is older than the recovery catalog, because at one time the recovery catalog resynchronized with the old current control file, and now the database is using a backup control file. RMAN detects that the control file currently in use is older than the control file previously used to resynchronize.

Backup Fails with Invalid RECID Error: Solution

You can follow either of these procedures, although the first procedure is safer and is strongly recommended:

To reset the database with RMAN:

Connect to the target database with SQL*Plus. For example, enter:
```
% sqlplus 'SYS/oracle@prod1 AS SYSDBA'
```
Mount the database if it is not already mounted. For example, enter:
```
SQL> ALTER DATABASE MOUNT;
```
Start cancel-based recovery by using the backup control file, then cancel it. The reason for canceling is that the USING BACKUP CONTROLFILE clause stamps the control file as a backup, which then permits OPEN RESETLOGS. For example, enter:
```
SQL> ALTER DATABASE RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE;
SQL> ALTER DATABASE RECOVER CANCEL;
```
Open the database with the RESETLOGS option. For example, enter:
```
SQL> ALTER DATABASE OPEN RESETLOGS;
```
Use RMAN to connect to the target database and recovery catalog. For example, enter:
```
% rman TARGET SYS/oracle@prod1 CATALOG rman/rman@rcat
```
Reset the database. For example, enter:
```
RMAN> RESET DATABASE;
```
Take new backups so that you can recover the database if necessary. For example, enter:
```
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;
```

To create the control file with SQL*Plus:

Connect to the target database with SQL*Plus. For example, enter:
```
% sqlplus 'SYS/oracle@prod1 AS SYSDBA'
```
Mount the database if it is not already mounted:
```
SQL> ALTER DATABASE MOUNT;
```

Back up the control file to a trace file:

SQL> ALTER DATABASE BACKUP CONTROLFILE TO TRACE;

Edit the trace file as necessary. The trace file looks something like the following:

*** SESSION ID:(8.1) 2000.12.09.13.26.36.000
*** 2000.12.09.13.26.36.000
# The following statements will create a new control file and use it
# to open the database.
# Data used by the recovery manager will be lost. Additional logs may
# be required for media recovery of offline data files. Use this
# only if the current version of all online logs are available.
STARTUP NOMOUNT
CREATE CONTROLFILE REUSE DATABASE "PROD1" NORESETLOGS ARCHIVELOG
    MAXLOGFILES 32
    MAXLOGMEMBERS 2
    MAXDATAFILES 32
    MAXINSTANCES 1
    MAXLOGHISTORY 1012
LOGFILE
  GROUP 1 '/oracle/dbs/t1_log1.f'  SIZE 200K,
  GROUP 2 '/oracle/dbs/t1_log2.f'  SIZE 200K
DATAFILE
  '/oracle/dbs/tbs_01.f',
  '/oracle/dbs/tbs_02.f',
  '/oracle/dbs/tbs_11.f',
  '/oracle/dbs/tbs_12.f',
  '/oracle/dbs/tbs_21.f',
  '/oracle/dbs/tbs_22.f'
 CHARACTER SET WE8DEC
;
# Configure snapshot controlfile filename
EXECUTE SYS.DBMS_BACKUP_RESTORE.CFILESETSNAPSHOTNAME('/oracle/dbs/snapcf_prod1.f');
# Recovery is required if any of the datafiles are restored backups,
# or if the last shutdown was not normal or immediate.
RECOVER DATABASE
# All logs need archiving and a log switch is needed.
ALTER SYSTEM ARCHIVE LOG ALL;
# Database can now be opened normally.
ALTER DATABASE OPEN;
# No tempfile entries found to add.

Shut down the database:
```
SHUTDOWN IMMEDIATE
```

Execute the script to create the control file, recover (if necessary), archive the logs, and open the database:

STARTUP NOMOUNT
CREATE CONTROLFILE ...;
EXECUTE ...;
RECOVER DATABASE
ALTER SYSTEM ARCHIVE LOG ALL;
ALTER DATABASE OPEN ...;

Caution:

If you do not open with the RESETLOGS option, then two copies of an archived redo log for a given log sequence number may exist, even though these two copies have completely different contents. For example, one log may have been created on the original host and the other on the new host. If you accidentally confuse the logs during a media recovery, then the database will be corrupted, and neither Oracle nor RMAN can detect the problem.

Backup Fails Because of Control File Enqueue: Scenario

In this scenario, a backup job fails because RMAN cannot make a snapshot control file. The message stack is as follows:

set_count=11 set_stamp=333299261
channel dev1: including datafile 1 in backupset
waiting for snapshot controlfile enqueue
waiting for snapshot controlfile enqueue
cannot make a snapshot controlfile
error recovery releasing channel resources
released channel: dev1

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03006: non-retryable error occurred during execution of command: backup
RMAN-07004: unhandled exception during command execution on channel dev1
RMAN-10032: unhandled exception during execution of job step 1: ORA-06512: at line 90
RMAN-10035: exception raised in RPC: ORA-00230: operation disallowed: snapshot controlfile 
            enqueue unavailable
ORA-06512: at "SYS.DBMS_BACKUP_RESTORE", line 1826
RMAN-10031: ORA-230 occurred during call to DBMS_BACKUP_RESTORE.CFILEMAKEANDUSESNAPSHOT

Backup Fails Because of Control File Enqueue: Diagnosis

When RMAN needs to back up or resynchronize from the control file, it first creates a snapshot or consistent image of the control file. If one RMAN job is already backing up the control file while another needs to create a new snapshot control file, then you may see the following message:

waiting for snapshot controlfile enqueue

Under normal circumstances, a job that must wait for the control file enqueue waits for a brief interval and then successfully obtains the enqueue. Recovery Manager makes up to five attempts to get the enqueue and then fails the job. The conflict is usually caused when two jobs are both backing up the control file, and the job that first starts backing up the control file waits for service from the media manager.
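The wait-and-retry behavior described above can be modeled as a bounded retry loop. The following is only an illustrative sketch and is not how RMAN is implemented. The callable `try_acquire`, the function name, and the `wait_seconds` parameter are assumptions, and the interval RMAN actually waits between attempts is not stated here.

```python
import time


def acquire_with_retries(try_acquire, max_attempts=5, wait_seconds=0.0):
    """Call try_acquire() until it returns True or the attempts run out.

    Returns the number of the attempt that succeeded. Raises RuntimeError
    after max_attempts failures, mirroring a job that fails because the
    enqueue never became available.
    """
    for attempt in range(1, max_attempts + 1):
        if try_acquire():
            return attempt
        if attempt < max_attempts:
            # Hypothetical pause between attempts.
            time.sleep(wait_seconds)
    raise RuntimeError("cannot make a snapshot controlfile")


if __name__ == "__main__":
    # Simulate an enqueue that is released before the third attempt.
    outcomes = iter([False, False, True])
    print(acquire_with_retries(lambda: next(outcomes)))
```

The model shows why a first job that is stalled in the media manager causes the second job to fail: the enqueue is not released within the retry window, however many times the second job asks.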

To determine which job is holding the conflicting enqueue:

After you see the first message stating "RMAN-08512: waiting for snapshot controlfile enqueue", start a new SQL*Plus session on the target database:
```
% sqlplus 'SYS/sys_pwd@prod1 AS SYSDBA'
```

Execute the following query to determine which job is causing the wait:

SELECT s.SID, USERNAME AS "User", PROGRAM, MODULE, ACTION, LOGON_TIME AS "Logon", l.*
FROM V$SESSION s, V$ENQUEUE_LOCK l
WHERE l.SID = s.SID AND l.TYPE = 'CF' AND l.ID1 = 0 AND l.ID2 = 2;

You should see output similar to the following (the output in this example has been truncated):

SID User Program              Module                    Action           Logon
--- ---- -------------------- ------------------------- ---------------- ---------
  9 SYS  rman@h13 (TNS V1-V3) backup full datafile: c1  0000210 STARTED  21-JUN-01

Backup Fails Because of Control File Enqueue: Solution

After you have determined which job is creating the enqueue, you can do one of the following:

Wait until the job creating the enqueue completes
Cancel the current job and restart it after the job creating the enqueue completes
Cancel the job creating the enqueue

Commonly, enqueue situations occur when a job is writing to a tape drive, but the tape drive is waiting for a new cassette to be inserted. If you start a new job in this situation, then you will probably receive the enqueue message because the first job cannot complete until the new tape is loaded.

RMAN Fails to Delete All Archived Logs: Scenario

In this scenario, the database archives automatically to two directories: /oracle/arch/dest1 and /oracle/arch/dest2. You tell RMAN to perform a backup and delete the input archived redo logs afterward in the following script:

BACKUP ARCHIVELOG ALL DELETE INPUT;

You then run a crosscheck to make sure the logs are gone and find the following:

CROSSCHECK ARCHIVELOG ALL;

validation succeeded for archived log
archivelog filename=/oracle/arch/dest2/arcr_1_964.arc recid=19 stamp=368726072

RMAN deleted one set of logs but not the other.

RMAN Fails to Delete All Archived Logs: Diagnosis

This problem is not an error. When you specify DELETE INPUT without the ALL keyword, RMAN deletes only one copy of each input log. Even if you archive to five destinations, RMAN deletes logs from only one directory.

RMAN Fails to Delete All Archived Logs: Solution

To force RMAN to delete all existing archived redo logs, use the DELETE ALL INPUT clause of the BACKUP command. For example, enter:

BACKUP ARCHIVELOG ALL DELETE ALL INPUT;

Backup Fails Because RMAN Cannot Locate an Archived Log: Scenario

In this scenario, you schedule regular incremental backups of the database. The next time you make a backup, you receive this error:

RMAN-6089:  archive log NAME not found or out of sync with catalog

Backup Fails Because RMAN Cannot Locate an Archived Log: Diagnosis

This problem occurs when the archived log that RMAN is looking for cannot be accessed by RMAN, or the recovery catalog needs to be resynchronized. Often, this error occurs when you delete archived logs with an operating system command, which means that RMAN is unaware of the deletion. The RMAN-6089 error occurs because RMAN attempts to back up a log that the repository indicates still exists.

Backup Fails Because RMAN Cannot Locate an Archived Log: Solution

Make sure that the archived logs exist in the specified directory and that the RMAN catalog is synchronized. Check the following:

Make sure the archived log file that is specified by the RMAN-6089 error exists in the correct directory.
Check that the operating system permissions are correct for the archived log (owner = oracle, group = DBA) to make sure that RMAN can access the file.
If the file appears to be correct, then try synchronizing the catalog by running the following command from the RMAN prompt:
```
RESYNC CATALOG;
```
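The file-level checks in this list can be scripted when many logs are involved. The following is a minimal sketch that tests only whether a file exists and whether the current operating system user can read it. It does not verify owner or group, and the function name `check_archived_log` is hypothetical. Run it as the operating system user that owns the Oracle software so that the result reflects what the server session can see.

```python
import os


def check_archived_log(path):
    """Return a list of problems found for one archived log file."""
    problems = []
    if not os.path.isfile(path):
        problems.append("missing")
    elif not os.access(path, os.R_OK):
        problems.append("not readable")
    return problems


if __name__ == "__main__":
    # Path taken from the crosscheck example earlier in this chapter.
    log_path = "/oracle/arch/dest2/arcr_1_964.arc"
    print(log_path, check_archived_log(log_path) or "ok")
```

A file reported as missing is a candidate for the crosscheck described next. A file reported as not readable points to a permissions problem instead.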

If you know that the logs are unavailable because you deleted them by using an operating system utility, then run the following command at the RMAN prompt to update RMAN metadata:

CROSSCHECK ARCHIVELOG ALL;

It is always better to use RMAN to delete logs than to use an operating system utility. The easiest method to remove unwanted logs is to specify the DELETE INPUT option when backing up archived logs. For example, enter:

BACKUP DEVICE TYPE sbt 
  ARCHIVELOG ALL 
  DELETE ALL INPUT;

RMAN Cannot Set Target Database Character Set: Scenario

In this scenario, you are running a release 8.1.5 version of the RMAN executable and trying to connect to a release 8.0.4 target database. You receive the following error messages when you try to connect to the target database:

% rman CATALOG rman/rman@rcat 
 
Recovery Manager: Release 8.1.5.0.0 - Production 
RMAN-06008: connected to recovery catalog database 
 
RMAN> CONNECT TARGET sys/oscar123@nc0d 
RMAN-00571: =========================================================== 
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: =========================================================== 
RMAN-04005: error from target database: ORA-06550: line 1, column 7: 
PLS-00201: identifier 'DBMS_BACKUP_RESTORE.SET_CHARSET' must be declared 
ORA-06550: line 1, column 7: PL/SQL: Statement ignored 
RMAN-04015: error setting target database character set to WE8ISO8859P1

RMAN Cannot Set Target Database Character Set: Diagnosis

Typically, this error message means that the DBMS_BACKUP_RESTORE package was not created during the installation of the database. Here are possible causes:

The installation scripts contained errors.
The PL/SQL option, which is required for RMAN, was never installed.

RMAN Cannot Set Target Database Character Set: Solution

If you did not install the PL/SQL option, then install it. If you did install the PL/SQL option, then create the required packages by connecting to SQL*Plus with SYSDBA privileges and running the following scripts:

SQL> @$ORACLE_HOME/rdbms/admin/dbmsbkrs.sql 
SQL> @$ORACLE_HOME/rdbms/admin/prvtbkrs.plb

RMAN Does Not Recognize Character Set Name: Scenario

In this scenario, you are connected to the target database while it is not open and attempting to perform an RMAN operation. You receive the following error:

PLS-00553: character set name is not recognized

RMAN Does Not Recognize Character Set Name: Diagnosis

Typically, this message means that the character set in the client environment, that is, the environment in which you are running the RMAN executable, is different from the character set in the target database environment.

RMAN Does Not Recognize Character Set Name: Solution

Query the target database to determine the value of the NLS_CHARACTERSET parameter. For example, run this query:
```
SQL> SELECT VALUE FROM V$NLS_PARAMETERS WHERE PARAMETER='NLS_CHARACTERSET';
```
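Set the character set environment variable in the client to the same value as the variable in the server. The NLS_LANG environment variable has the form language_territory.charset, so the character set is the part after the period. For example, you can set NLS_LANG on a UNIX system as follows:
```
% setenv NLS_LANG american_america.WE8ISO8859P1
```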
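The comparison between the client and server settings can be sketched as follows, assuming that NLS_LANG uses the language_territory.charset form and that the server value comes from the V$NLS_PARAMETERS query above. The function names are hypothetical.

```python
def nls_lang_charset(nls_lang):
    """Return the character set part of an NLS_LANG value, or None if absent."""
    _, dot, charset = nls_lang.partition(".")
    return charset.upper() if dot and charset else None


def client_matches_server(nls_lang, server_charset):
    """True when the client NLS_LANG names the same character set as the server."""
    client = nls_lang_charset(nls_lang)
    return client is not None and client == server_charset.upper()


if __name__ == "__main__":
    import os

    # The server value would come from V$NLS_PARAMETERS; it is hard-coded here.
    server = "WE8ISO8859P1"
    client = os.environ.get("NLS_LANG", "")
    print(client_matches_server(client, server))
```

A value of NLS_LANG with no period carries no character set component, so the sketch treats it as a mismatch.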

RMAN Denies Logon to Target Database: Scenario

RMAN fails with the following errors when trying to connect to the target database:

% rman
Recovery Manager: Release 9.0.1.0.0 - Production

RMAN> CONNECT TARGET sys/change_on_install@inst1

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-04005: error from target database: 
ORA-01017: invalid username/password; logon denied

RMAN Denies Logon to Target Database: Diagnosis

RMAN automatically requests a connection to the target database as SYSDBA. In order to connect to the target as SYSDBA, you must do one of the following:

Be part of the operating system DBA group with respect to the target database (that is, have the ability to connect with SYSDBA privileges to the target database without a password).
Create a password file with the orapwd command and set the initialization parameter REMOTE_LOGIN_PASSWORDFILE to EXCLUSIVE or SHARED.

If the target database does not have a password file, then the user you are logged in as must be validated with operating system authentication.

RMAN Denies Logon to Target Database: Solution

Either create a password file for the target database or add yourself to the administrator list in the operating system.

See Also:

Oracle9i Database Administrator's Guide to learn how to create a password file

Database Duplication Fails with RMAN-20240: Scenario

In this scenario, you attempt to duplicate a database to the same host (although it could also be a remote host) using the DUPLICATE command, but get the following error stack during compilation of the RECOVER command:

starting media recovery
unable to find archivelog
archivelog thread=1 sequence=6

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00601: fatal error in recovery manager
RMAN-03012: fatal error during compilation of command
RMAN-03028: fatal error code: 3015
RMAN-03013: command type: Duplicate Db
RMAN-03015: error occurred in stored script Memory Script
RMAN-03002: failure during compilation of command
RMAN-03013: command type: recover
RMAN-03002: failure during compilation of command
RMAN-03013: command type: recover(4)
RMAN-06038: recovery catalog package detected an error
RMAN-20242: specification does not match any archivelog in the recovery catalog

Database Duplication Fails with RMAN-20240: Diagnosis

The problem is probably that the backup of the datafiles is not consistent, that is, the following SQL statement was not issued after the datafile backup:

SQL> ALTER SYSTEM ARCHIVE LOG CURRENT;

Consequently, the DUPLICATE command is attempting to read the online redo logs for the necessary redo records.

Database Duplication Fails with RMAN-20240: Solution

When creating the duplication script, use the SET UNTIL command to specify a log sequence number for incomplete recovery. For example, to stop recovery at log sequence 5, enter:

RUN
{
  SET UNTIL SEQUENCE 5 THREAD 1;
  DUPLICATE TARGET DATABASE TO dupdb;
}

See Also:

"Creating a Non-Current Duplicate Database: Example" for more information about performing incomplete recovery during the duplication operation

UNKNOWN Database Name Appears in Recovery Catalog: Scenario

In this scenario, you list the database incarnations registered in the recovery catalog and see a database with the name UNKNOWN:

LIST INCARNATION OF DATABASE;  
 
RMAN-03022: compiling command: list  
List of Database Incarnations  
DB Key  Inc Key   DB Name   DB ID       CUR    Reset SCN    Reset Time
------- -------   -------   ------      ---    ----------   ----------
56      57        SKDHRA    4052472287  YES    1            Sep 03 2000 06:45:51  
1       19        UNKNOWN   4141147584  NO     1            Jan 08 2000 14:47:28  
1       2         SKDHRC    4141147584  YES    14602        Jan 15 2000 15:32:57

UNKNOWN Database Name Appears in Recovery Catalog: Diagnosis

One way to get a DB_NAME of UNKNOWN is to register a database that was once opened with the RESETLOGS option. The DB_NAME can be changed during a RESETLOGS operation, so RMAN does not know what the DB_NAME was for the old incarnations of the database, because the database was not registered in the recovery catalog at the time. Consequently, RMAN sets the DB_NAME column to UNKNOWN when creating the DBINC record.

UNKNOWN Database Name Appears in Recovery Catalog: Solution

The UNKNOWN name entry is expected behavior after a RESETLOGS operation. You should not attempt to remove UNKNOWN entries from the recovery catalog.
