4 Managing Outages

This chapter describes unscheduled and scheduled outages and the Oracle operational best practices that can tolerate or manage each outage type and minimize downtime.

This chapter contains these topics:

4.1 Outage Overview

This section describes the types of possible outages and the recommended methods to repair or minimize the downtime associated with each outage.

This section contains these topics:

4.1.1 Unscheduled Outages

Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:

  • Hardware

  • Software

  • Network infrastructure

  • Naming services infrastructure

  • Database

Your monitoring and high-availability infrastructure should provide rapid detection and recovery from downtime. Chapter 3, "Monitoring Using Oracle Grid Control" describes detection, while this chapter focuses on reducing downtime.

Table 4-1 describes unscheduled outages that affect the primary or secondary site components.

Table 4-1 Unscheduled Outages

Outage Type Description Examples

Site failure

The entire site where the current production database resides is unavailable. This includes all tiers of the application.

Disaster at the production site such as a fire, flood, or earthquake

Malicious attack on the site

Power outages. If there are multiple power grids and backup generators for critical systems, then this should affect only part of the data center.

Clusterwide failure

The whole cluster hosting the RAC database is unavailable or fails. This includes failures of nodes in the cluster, and any other components that result in the cluster being unavailable and the Oracle database and instances on the site being unavailable.

The last surviving node on the RAC cluster fails and cannot be restarted

Both redundant cluster interconnects fail, or the clusterware fails or has a problem

Database corruption is severe enough to disallow continuity on the current data server

Disk storage fails

Computer failure (node)

A node of the RAC cluster is unavailable or fails

A database tier node fails or has to be shut down because of bad memory or bad CPU

The database tier node is unreachable

Both of the redundant cluster interconnects fail, resulting in another node taking ownership

Computer failure (instance)

A database instance is unavailable or fails

An instance of the RAC database on the data server fails because of a software bug, an operating system problem, or a hardware problem.

Storage failure

Storage holding some or all of the database contents becomes unavailable, because it has shut down or is no longer accessible

Disk drive failure

Disk controller failure

Storage array failure

Data corruption

Parts of the database are unavailable because of media corruption, inaccessibility, or inconsistencies

A datafile is accidentally removed or is unavailable

Media corruption affects blocks of the database

Oracle block corruption is caused by operating system or other node-related problems

Human error

Parts of the database are unavailable, and transactional or logical data inconsistencies arise. Usually caused by operator error or by bugs in the application code.

Localized damage (needs surgical repair): human error results in a table being dropped or in rows being deleted from a table

Widespread damage (needs drastic action to avoid downtime): application errors result in logical corruption in the database, or operator error results in a batch job being run more times than specified.

Note: This category focuses on human errors that affect database availability and, in particular, cause transactional or logical data inconsistencies.


The best practice recommendations for reducing unscheduled outages on the primary site and the secondary site, estimated recovery times, and recovery steps appear in the following sections:

4.1.1.1 Managing Unscheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for these outages are critical for maximum availability of the system. Only the Oracle Database 10g with Data Guard, and the Oracle Database 10g with RAC and Data Guard (MAA) architectures have a secondary site to protect from site disasters.

Table 4-2 summarizes the recovery steps for unscheduled outages on the primary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages".

Table 4-2 Recovery Times and Steps for Unscheduled Outages on the Primary Site

Outage Type Oracle Database 10g Oracle Database 10g with RAC Oracle Database 10g with Data Guard Oracle Database 10g - MAA

Site failure

Hours to days

  1. Restore site.

  2. Restore from tape backups.

  3. Recover database.

Hours to days

  1. Restore site.

  2. Restore from tape backups.

  3. Recover database.

Seconds to 5 minutes (see Footnote 1)

  1. Database Failover with a Standby Database

  2. Complete Site Failover

  3. Application Failover

Seconds to 5 minutes (see Footnote 1)

  1. Database Failover with a Standby Database

  2. Complete Site Failover

  3. Application Failover

Clusterwide failure

Not applicable

Hours to days

  1. Restore cluster or restore at least one node.

  2. Optionally restore from tape backups if the data is lost or corrupted.

  3. Recover database.

Not applicable

Seconds to 5 minutes

  1. Database Failover with a Standby Database

  2. Application Failover

Computer failure (node)

Minutes to hours (see Footnote 2)

  1. Restart node and restart database.

  2. Reconnect users.

No downtime (see Footnote 3)

Managed automatically by RAC Recovery for Unscheduled Outages

Seconds to 5 minutes (see Footnote 2)

  1. Database Failover with a Standby Database

  2. Application Failover

or

Minutes to hours (see Footnote 2)

  1. Restart node and restart database.

  2. Reconnect users.

No downtime (see Footnote 3)

Managed automatically by RAC Recovery for Unscheduled Outages

Computer failure (instance)

Minutes (see Footnote 2)

  1. Restart instance.

  2. Reconnect users.

No downtime (see Footnote 3)

Managed automatically by RAC Recovery for Unscheduled Outages

Minutes (see Footnote 2)

  1. Restart instance.

  2. Reconnect users.

or

Seconds to 5 minutes (see Footnote 1)

  1. Database Failover with a Standby Database

  2. Application Failover

No downtime (see Footnote 3)

Managed automatically by RAC Recovery for Unscheduled Outages

Storage failure

No downtime (see Footnote 4)

ASM Recovery After Disk and Storage Failures

No downtime (see Footnote 4)

ASM Recovery After Disk and Storage Failures

No downtime (see Footnote 4)

ASM Recovery After Disk and Storage Failures

No downtime (see Footnote 4)

ASM Recovery After Disk and Storage Failures

Data corruption

HARD prevents data corruption (see Footnote 5)

Potentially hours

Recovering from Data Corruption (Data Failures)

HARD prevents data corruption (see Footnote 5)

Potentially hours

Recovering from Data Corruption (Data Failures)

HARD prevents data corruption (see Footnote 5)

or

Seconds to 5 minutes (see Footnote 1)

  1. Database Failover with a Standby Database

  2. Application Failover

HARD prevents data corruption (see Footnote 5)

or

Seconds to 5 minutes (see Footnote 1)

  1. Database Failover with a Standby Database

  2. Application Failover

Human error

< 30 minutes (see Footnote 6)

Recovering from Human Error

< 30 minutes (see Footnote 6)

Recovering from Human Error

< 30 minutes (see Footnote 6)

Recovering from Human Error

< 30 minutes (see Footnote 6)

Recovering from Human Error


Footnote 1: Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.

Footnote 2: Recovery time consists largely of the time it takes to restore the failed system.

Footnote 3: The database is still available, but the portion of the application connected to the failed system is temporarily affected.

Footnote 4: Storage failures are prevented by using Automatic Storage Management (ASM) with mirroring and its automatic rebalance capability.

Footnote 5: Not all types of data corruption are prevented. For the most recent information about the HARD initiative, refer to Section 2.1.6, "Consider HARD-Compliant Storage".

Footnote 6: Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, then it typically requires only seconds to flash back the appropriate transactions.

4.1.1.2 Managing Unscheduled Outages on the Secondary Site

For most cases, outages on the secondary site can be managed with no effect on availability of the primary database located on the primary site. However, if the configuration is in maximum protection mode, then an unscheduled outage of the last surviving standby database causes an outage on the production database to ensure no data loss when failing over to the standby database. After downgrading the data protection mode, you can restart the production database even without access to the standby databases. Outages on the secondary site might affect the mean time to recovery (MTTR) if there are concurrent failures on the primary site.

Table 4-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages".

Table 4-3 Recovery Steps for Unscheduled Outages on the Secondary Site

Outage Type Oracle Database 10g with Data Guard Oracle Database 10g - MAA

Computer failure (instance)

  1. Restart node and standby instance.

  2. Restart recovery.

If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database.

There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance.

Broker will automatically restart the apply process.

Restart node and instance when they are available.

Data corruption

Restoring Fault Tolerance After a Standby Database Data Failure

Restoring Fault Tolerance After a Standby Database Data Failure

Primary database opens with RESETLOGS because of Flashback Database operations or point-in-time media recovery

Restoring Fault Tolerance After the Production Database Was Opened Resetlogs

Restoring Fault Tolerance After the Production Database Was Opened Resetlogs


4.1.2 Scheduled Outages

Scheduled outages are required for regular maintenance of the technology infrastructure that supports the application, including tasks such as:

  • Hardware maintenance, repair, and upgrades

  • Software upgrades and patching

  • Application changes and patching

  • Changes to improve performance and manageability of systems

These tasks should be scheduled at times best suited for continual application availability.

Table 4-4 describes scheduled outages that affect either the primary or secondary site.

Table 4-4 Scheduled Outages

Outage Scope Description Examples

Site-wide

The entire site where the current production database resides is unavailable. Usually known well in advance.

Scheduled power outages

Site maintenance

Regular planned switchovers to test infrastructure

Hardware maintenance (node impact)

Hardware maintenance on a database server. Restricted to a node of the database cluster.

Repair of a failed component such as a memory card or CPU board

Addition of memory or CPU to an existing node in the database tier

Hardware maintenance (clusterwide impact)

Hardware maintenance on a database server cluster

Some cases of adding a node to the cluster

Upgrade or repair of the cluster interconnect

Upgrade to the storage tier that requires downtime on the database tier

System software maintenance (node impact)

System software maintenance on a database server. The scope of the downtime is restricted to a node.

Upgrade of a software component such as the operating system

Changes to the configuration parameters for the operating system

System software maintenance (clusterwide impact)

System software maintenance on a database server cluster

Upgrade or patching of the cluster software

Upgrade of the volume management software

Oracle patch upgrade for the database

Scheduled outage for installation of an Oracle patch

Patch Oracle software to fix a specific customer issue

Oracle patch set or software upgrade for the database

Scheduled outage for Oracle patch set or software upgrade

Patching Oracle software with a patch set

Upgrading Oracle software

Database object reorganization

Changes to the logical structure or the physical organization of Oracle Database objects, primarily to improve performance or manageability.

Using Oracle Database online reorganization features enables objects to be available during the reorganization.

Moving an object to a different tablespace

Converting a table to a partitioned table

Renaming or dropping columns of a table

Storage maintenance

Maintenance of storage where database files reside

Converting to ASM

Adding or removing storage

Platform migration

Changing operating system platform of the primary and standby databases

Moving to the Linux operating system

Location migration

Changing physical location of the primary database

Moving the primary database from one data center to another.


The following sections provide best practice recommendations and preparations for reducing scheduled outages on the primary and secondary sites:

4.1.2.1 Managing Scheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for these outages are critical for continued availability of the system.

Table 4-5 shows the high-level recovery steps for scheduled outages on the primary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.4, "Eliminating or Reducing Downtime for Scheduled Outages".

Table 4-5 Recovery Steps for Scheduled Outages on the Primary Site

Outage Scope Cause Oracle Database 10g Oracle Database 10g with RAC Oracle Database 10g with Data Guard Oracle Database 10g - MAA

Site

Site shutdown

Restart database after outage

Restart database after outage

  1. Database Switchover with a Standby Database

  2. Complete Site Failover

  3. Application Failover

  1. Database Switchover with a Standby Database

  2. Complete Site Failover

  3. Application Failover

Primary database

Hardware maintenance (node impact)

Restart database after outage

Managed automatically, see System Maintenance

  1. Database Switchover with a Standby Database

  2. Application Failover

Managed automatically, see System Maintenance

Primary database

Hardware maintenance (clusterwide impact)

System software maintenance (clusterwide impact)

Not applicable

Restart database after outage

Not applicable

  1. Database Switchover with a Standby Database

  2. Application Failover

Primary database

System software maintenance (node impact)

Restart database after outage

Managed automatically by System Maintenance

  1. Database Switchover with a Standby Database

  2. Application Failover

Managed automatically by System Maintenance

Primary database

Real Application Clusters Cluster Ready Services (CRS) upgrades

Not applicable

In general, CRS upgrades can be done online and do not require downtime.

Not applicable

In general, CRS upgrades can be done online and do not require downtime.

Primary database

Oracle patch upgrade for the database

Restart database after outage

RAC Database Patches

  1. Database Switchover with a Standby Database

  2. Application Failover

RAC Database Patches

Primary database

Oracle patch set or software upgrade for the database, including some Automatic Storage Management (ASM) and CRS upgrades

Restart database after outage

Restart database after outage

Database Upgrades

Database Upgrades

Primary database

Database object reorganization

Database Object Reorganization

Database Object Reorganization

Database Object Reorganization

Database Object Reorganization

Primary database

Storage maintenance

Storage Maintenance

Storage Maintenance

Storage Maintenance

Storage Maintenance

Primary database

Platform migration

Database Platform or Location Migration

Database Platform or Location Migration

Database Platform or Location Migration

Database Platform or Location Migration

Primary database

Location migration

Database Platform or Location Migration

Database Platform or Location Migration

Database Platform or Location Migration

Database Platform or Location Migration


4.1.2.2 Managing Scheduled Outages on the Secondary Site

Outages on the secondary site do not affect availability because clients always access the primary site. However, outages on the secondary site might affect the RTO if there are concurrent failures on the primary site. If the maximum protection mode is configured, then downgrade the protection mode before scheduled outages on the standby instance or database so that there is no downtime on the production database.

Table 4-6 describes the recovery steps for scheduled outages on the secondary site.

Table 4-6 Managing Scheduled Outages on the Secondary Site

Outage Type Cause Oracle Database 10g with Data Guard Oracle Database 10g - MAA

Site

Site shutdown

Before the outage:

Preparing for Scheduled Outages on the Secondary Site

After the outage:

Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage

Before the outage:

Preparing for Scheduled Outages on the Secondary Site

After the outage:

Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage

Standby database

Hardware or software maintenance on the node that is running the managed recovery process (MRP)

Before the outage:

Preparing for Scheduled Outages on the Secondary Site

Before the outage:

Preparing for Scheduled Outages on the Secondary Site

Standby database

Hardware or software maintenance on a node that is not running the MRP

Not applicable

No effect because the primary standby node or instance receives redo logs that are applied with the managed recovery process

After the outage: Restart node and instance when available.

Standby database

Hardware or software maintenance (clusterwide impact)

Not applicable

Before the outage:

Preparing for Scheduled Outages on the Secondary Site

After the outage:

Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage

Standby database

Oracle patch and software upgrades

Downtime is needed for the upgrade, but there is no effect on the primary database unless the configuration is in maximum protection mode.

Downtime is needed for the upgrade, but there is no effect on the primary database unless the configuration is in maximum protection mode.


4.1.2.3 Preparing for Scheduled Outages on the Secondary Site

To achieve continued service during scheduled outages on a secondary site when in maximum protection mode, downgrade the maximum protection mode to maximum availability or maximum performance temporarily. When scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time that the standby database lags behind the production database, which in turn lengthens the time to restore fault tolerance. See Section 2.4.2, "Data Protection Mode" for an overview of the Data Guard protection modes.
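For example, the protection mode can be temporarily downgraded from the primary database with SQL*Plus. The following is a minimal sketch; note that returning to maximum protection mode requires restarting the primary database:

    -- On the primary database, before the scheduled outage on the secondary site
    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

    -- After the standby database is available again and has resynchronized,
    -- return to maximum protection (requires a restart of the primary database)
    SHUTDOWN IMMEDIATE;
    STARTUP MOUNT;
    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;
    ALTER DATABASE OPEN;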

4.2 Recovering from Unscheduled Outages

This section describes best practices for recovering from various types of unscheduled outages.

This section contains these topics:

4.2.1 Complete Site Failover

With complete site failover, the database, the middle-tier application server, and all user connections fail over to a secondary site that is prepared to handle the production load.

4.2.1.1 When to Use Complete Site Failover

If the standby site meets the prerequisites, then complete site failover is recommended for the following scenarios:

  • Primary-site disaster, such as natural disasters or malicious attacks

  • Primary network-connectivity failures

  • Primary site power failures

4.2.1.2 Best Practices for Complete Site Failover

Site failover can be accomplished in minutes by using the following practices:

  • Use Data Guard configuration best practices

  • Use Data Guard fast-start failover to automatically fail over to the standby database, with a recovery time objective (RTO) of less than 30 seconds

  • Maintain a running middle-tier application server on the secondary site to avoid the startup time

  • Automate the DNS failover procedure

Data loss is dependent on the Data Guard configuration and the use of synchronous or asynchronous redo shipping.
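As an illustration of the fast-start failover best practice, the feature can be enabled with the Data Guard broker command-line interface, assuming the configuration already meets the fast-start failover prerequisites (for example, Flashback Database enabled on both databases). The threshold value shown is an example only:

    DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
    DGMGRL> ENABLE FAST_START FAILOVER;
    DGMGRL> SHOW CONFIGURATION;

Then start the observer from a DGMGRL session on a separate host:

    DGMGRL> START OBSERVER;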

4.2.1.3 Repair Solution

A wide-area traffic manager on the primary and secondary sites provides the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site, or a specific application on the primary site, is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site automatically.

Figure 4-1 illustrates the possible network routes before site failover:

  1. Client requests enter the client tier of the primary site and are routed by the WAN traffic manager.

  2. Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier.

  3. Requests are forwarded through the active load balancer to the application servers.

  4. Requests are sent through another firewall and into the database server tier.

  5. The application requests, if required, are routed to a RAC instance.

  6. Responses are sent back to the application and clients by a similar path.

Figure 4-1 Network Routes Before Site Failover

Description of Figure 4-1 follows
Description of "Figure 4-1 Network Routes Before Site Failover"

Figure 4-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow exactly the same path on the secondary site that they followed on the primary site.

Figure 4-2 Network Routes After Site Failover

Description of Figure 4-2 follows
Description of "Figure 4-2 Network Routes After Site Failover"

The following steps describe the effect on network traffic of a failover or switchover:

  1. The administrator has failed over or switched over the production database to the secondary site. This is automatic if you are using Data Guard fast-start failover.

  2. The administrator starts the middle-tier application servers on the secondary site, if they are not already running.

  3. The wide-area traffic manager selection of the secondary site can be automatic in the case of an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site and clients are directed automatically on the subsequent reconnect. In this scenario, the site failover is accomplished by an automatic domain name system (DNS) failover.

    Alternatively, a DNS administrator can manually change the wide-area traffic manager selection to the secondary site for the entire site or for specific applications. The following is an example of a manual DNS failover:

    1. Change the DNS to point to the secondary site load balancer:

      The master (primary) DNS server is updated with the new zone information, and the change is announced with DNS NOTIFY.

      The slave DNS servers are notified of the zone update with a DNS NOTIFY announcement, and the slave DNS servers pull the new zone information.

      Note:

      The master and slave servers are authoritative name servers. Therefore, they contain trusted DNS information.
    2. Clear affected records from caching DNS servers.

      A caching DNS server is used primarily for performance and fast response. The caching server obtains information from an authoritative DNS server in response to a host query and then saves (caches) the data locally. On a second or subsequent request for the same data, the caching DNS server responds with its locally stored data (the cache) until the time-to-live (TTL) value of the response expires. At this time, the server refreshes the data from the zone master. If the DNS record is changed on the primary DNS server, then the caching DNS server does not pick up the change for cached records until TTL expires. Flushing the cache forces the caching DNS server to go to an authoritative DNS server again for the updated DNS information.

      Flush the cache if the DNS server being used supports such a capability. The following is the flush capability of common DNS BIND versions:

      BIND 9.3.0: The command rndc flushname name flushes individual entries from the cache.

      BIND 9.2.0 and 9.2.1: The entire cache can be flushed with the command rndc flush.

      BIND 8 and BIND 9 up to 9.1.3: Restarting the named server clears the cache.

    3. Refresh local DNS service caching.

      Some operating systems might cache DNS information locally in the local name service cache. If so, this cache must also be cleared so that DNS updates are recognized quickly.

      Solaris: nscd

      Linux: /etc/init.d/nscd restart

      Microsoft Windows: ipconfig /flushdns

      Apple Mac OS X: lookupd -flushcache

  4. The secondary site load balancer directs traffic to the secondary site middle-tier application server.

  5. The secondary site is ready to take client requests.

Failover also depends on the client's Web browser. Most browser applications cache the DNS entry for a period of time. Consequently, sessions in progress during an outage might not fail over until the cache timeout expires. To resume service to such clients, close the browser and restart it.

4.2.2 Database Failover with a Standby Database

Failover is the operation of transitioning one of the standby databases to the role of production database. A failover operation can be invoked when an unplanned failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.

Data Guard enables you to fail over by using Oracle Enterprise Manager, by using the Data Guard broker command-line interface, or by issuing SQL statements.

See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line interface for database failover

Data Guard failover is a series of steps to transition a standby database into a production database. The standby database essentially assumes the role of production. A Data Guard failover is accompanied by an application failover and, in some cases, preceded by a site failover. After the Data Guard failover, the secondary site contains the production database. The former production database must be reinstated as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".

With Data Guard the process of failover can be completely automated using fast-start failover, or the failover operation can be user driven, also referred to as manual failover. Fast-start failover eliminates the uncertainty of a process that requires manual intervention and automatically executes a zero data loss failover within seconds of an outage being detected. A manual failover allows for a failover process where decisions are user driven. Manual failover can be accomplished by using Oracle Enterprise Manager, by issuing commands at the Oracle Data Guard broker command-line interface, or by issuing the SQL statements described in subsequent sections.

A failover operation typically occurs in under a minute, and with little or no data loss. The complete description of a failover can be found in Oracle Data Guard Concepts and Administration.

See Also:

The following white papers for information about optimizing the Failover operation:

4.2.2.1 When To Perform a Data Guard Failover

When a primary database failure cannot be repaired in time to meet your recovery time objective (RTO) using local backups or Flashback technology, you should perform a Data Guard failover.

A manual failover, which is user initiated, should be performed after an unplanned outage such as:

  • Site disaster which results in the primary database becoming unavailable

  • User errors that cannot be repaired in a timely fashion

  • Data failures, which impact the production application

A failover requires that the initial production database be reinstated as a standby database to restore fault tolerance to your environment. The standby database can be quickly reinstated by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".

4.2.2.2 Best Practices for Implementing Data Guard Failover

A fast-start failover is completely automated and requires no user intervention. A manual failover, being user-driven, can be performed using Enterprise Manager, the Data Guard broker command-line interface, or SQL*Plus commands:
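For example, a manual failover with the broker command-line interface is a single command issued while connected to the target standby database. This sketch assumes the standby database named DR_Sales that is used in the figures later in this section:

    DGMGRL> CONNECT sys/password@DR_Sales
    DGMGRL> FAILOVER TO 'DR_Sales';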

See Also:

Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

This section contains these topics:

4.2.2.2.1 Using Enterprise Manager to Perform a Data Guard Failover

The procedure for Data Guard failover is the same for both physical and logical standby databases. The following screen shots illustrate how to perform a failover using Oracle Enterprise Manager:

In Figure 4-3 the Data Guard Overview page shows the ORA-16625 error status that indicates problems accessing the primary database.

Figure 4-3 Data Guard Overview Page Showing ORA-16625 Error

Description of Figure 4-3 follows
Description of "Figure 4-3 Data Guard Overview Page Showing ORA-16625 Error"

To transition DR_Sales into the primary role, select DR_Sales in the Standby Databases table and click Failover.

Figure 4-4 shows the Failover Confirmation page.

Figure 4-4 Failover Confirmation Page

Description of Figure 4-4 follows
Description of "Figure 4-4 Failover Confirmation Page"

If you determine that a failure occurred on the primary database and there is no possibility of recovering the primary database in a timely manner, you can start the failover operation. In configurations with both physical and logical standby databases, Oracle recommends using the physical standby database as the failover target because it will allow the logical standby database to continue to function as a logical standby to the new primary database. If the failover is made to the logical standby database, any physical standby databases in the configuration must be re-created from a backup of the new primary database.

The failover operation enables you to choose one of the following two types of failover operations:

  • Complete

    This operation attempts to minimize data loss by applying all available redo on the standby database.

  • Immediate

    No additional data is applied on the standby database; data might be lost. This is the fastest type of failover.

Figure 4-5 shows the progress of the failover operation.

Figure 4-5 Failover Progress Page

Description of Figure 4-5 follows
Description of "Figure 4-5 Failover Progress Page"

During the failover, the selected standby database (also referred to as the target standby database) transitions into the primary role. If the failover target is a physical standby database, it is restarted. When completed, the Data Guard Overview page reflects the updated configuration, as shown in Figure 4-6.

Figure 4-6 Data Guard Overview Page After a Failover Completes

Description of Figure 4-6 follows
Description of "Figure 4-6 Data Guard Overview Page After a Failover Completes"

In the figure, the Data Guard Status column indicates that the original primary database (North_Sales) is disabled and can no longer be managed through Enterprise Manager until it has been re-enabled as a physical standby database.

4.2.2.2.2 Using SQL to Fail Over to a Physical Standby Database

Follow these steps to fail over to a physical standby database:

  1. If the standby database is a Real Application Clusters database, then issue a SHUTDOWN ABORT on all additional standby instances.

  2. Initiate the failover by issuing the following SQL command on the target standby database:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
    
    

    If the network between the primary and standby sites is unavailable, then the standby RFS processes will wait for the network connections to time out through normal TCP timeout processing before shutting down. While the RFS processes are in this TCP timeout processing, the standby database cannot fail over unless you include the FORCE keyword on the RECOVER MANAGED STANDBY DATABASE FINISH statement (a sketch of this variant appears after these steps).

  3. Convert the physical standby database to the primary role:

    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    
    
  4. If the standby database was never opened read-only since the last time it was started, then open the new primary database by issuing the following SQL statement:

    ALTER DATABASE OPEN;
    
    
  5. If the standby database has been opened read-only, then restart the new primary database before starting Redo Apply.
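The following statements are a supplementary sketch, not additional required steps: the FORCE variant mentioned in step 2, and a query to confirm the role change after the new primary database is opened:

    -- Step 2 variant when RFS processes are waiting on TCP timeouts
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;

    -- After opening the new primary database, confirm its role
    SELECT DATABASE_ROLE, OPEN_MODE FROM V$DATABASE;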

4.2.2.2.3 Using SQL to Fail Over to a Logical Standby Database

Follow these steps to fail over to a logical standby database:

  1. If the standby database is a Real Application Clusters database , then issue a SHUTDOWN ABORT on all additional standby instances.

  2. Initiate the failover by issuing the following SQL command on the target standby database:

    ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE FINISH APPLY;
    
    

    This statement stops the RFS process, applies any remaining redo data, stops SQL Apply, and activates the logical standby database in the primary role. To avoid waiting for the redo in the standby redo log file to be applied prior to performing the failover, omit the FINISH APPLY clause on the statement.

    Although omitting the FINISH APPLY clause will accelerate failover, omitting it will also cause the loss of any unapplied redo data in the standby redo log. To gauge the amount of redo that will be lost, query the V$LOGSTDBY_PROGRESS view. The LATEST_SCN column value indicates the last SCN received from the primary database, and the APPLIED_SCN column value indicates the last SCN applied to the standby database. All SCNs between these two values will be lost.
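For example, the following query (a sketch) reports the gap between the last SCN received and the last SCN applied, which indicates how much redo would be lost if the FINISH APPLY clause is omitted:

    SELECT LATEST_SCN, APPLIED_SCN, LATEST_SCN - APPLIED_SCN AS UNAPPLIED_SCNS
      FROM V$LOGSTDBY_PROGRESS;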

4.2.3 Database Switchover with a Standby Database

A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance is active for each database, production and standby. At times the term switchback is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.

Data Guard enables you to change these roles dynamically by using Oracle Enterprise Manager, by using the Data Guard broker command-line interface, or by issuing SQL statements.

See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database switchover

4.2.3.1 When to Perform a Data Guard Switchover

Switchover is a planned operation. Switchover is the capability to switch database roles between the production and standby databases. Switchover can occur whenever a production database is started, the target standby database is available, and all the archived redo logs are available. Switchovers are typically completed in less than five minutes and, in some cases, are optimized to be less than a minute. Switchovers are useful in the following situations:

  • Scheduled maintenance such as hardware maintenance or firmware patches on the production host

  • Resolution of data failures when the production database is still open

  • Testing and validating the secondary resources, as a means to test disaster recovery readiness

Switchover is not possible or practical under the following circumstances:

  • Missing archived redo logs

  • Point-in-time recovery is required

  • The production database is not open and cannot be opened

4.2.3.2 Best Practices for Implementing Data Guard Switchover

Before performing a switchover the following procedural best practices should be followed in addition to the configuration best practices outlined in Section 2.4.7.1.1, "Switchover Best Practices":

  1. Disconnect as many sessions as possible by issuing the SQL statement ALTER SYSTEM KILL SESSION (see the example after this list).

  2. Stop job processing by setting the AQ_TM_PROCESSES initialization parameter to 0.

  3. Cancel any specified apply delay by using the NODELAY keyword to stop and restart log apply services on the standby database:

    • On a physical standby database:

      ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
      ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE NODELAY;
      
      
    • On a logical standby database:

      ALTER DATABASE STOP LOGICAL STANDBY APPLY;
      ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE NODELAY;
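
The first two steps might look like the following SQL*Plus sketch; the session identifier values are examples only and must be taken from V$SESSION:

    -- Identify remaining user sessions, then terminate them (SID and SERIAL# values are examples)
    SELECT SID, SERIAL#, USERNAME FROM V$SESSION WHERE USERNAME IS NOT NULL;
    ALTER SYSTEM KILL SESSION '133,4721';

    -- Stop Advanced Queuing job processing
    ALTER SYSTEM SET AQ_TM_PROCESSES=0;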
      
      

See Also:

Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

The following sections describe how to perform a switchover:

4.2.3.2.1 Using Enterprise Manager to Perform a Data Guard Switchover

The procedure for Data Guard switchover using Enterprise Manager is the same for both physical and logical standby databases:

  1. Select the standby database that you want to become the primary database.

  2. Click Switchover.

  3. Click Yes to continue with the switchover. Click No to cancel.

Figure 4-7 shows the switchover confirmation page.

Figure 4-7 Switchover Operation Confirmation

Description of Figure 4-7 follows
Description of "Figure 4-7 Switchover Operation Confirmation"

Figure 4-8 shows the processing page during the switchover.

Figure 4-8 Processing Page During Switchover

Description of Figure 4-8 follows
Description of "Figure 4-8 Processing Page During Switchover"

Figure 4-9 shows the Data Guard overview page after a successful switchover.

Figure 4-9 New Primary Database After Switchover

Description of Figure 4-9 follows
Description of "Figure 4-9 New Primary Database After Switchover"

4.2.3.2.2 Using SQL for Data Guard Switchover to a Physical Standby Database

If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be performed with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Administration.

Follow these steps for a switchover to a physical standby database:

  1. If possible, disconnect user sessions and disable or stop application processing.

  2. If the primary and standby databases are RAC, then cleanly shut down all instances except one. To expedite this operation, issue a SHUTDOWN ABORT.

  3. Issue the following SQL statement on the primary database to convert it to a standby database (a readiness check is sketched after these steps):

    ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY WITH SESSION SHUTDOWN;
    
    
  4. After the statement in the previous step completes:

    1. Issue the following SQL statement on the old standby database:

      ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
      
      
    2. Immediately after issuing the COMMIT TO SWITCHOVER TO PRIMARY statement, restart the old primary database as the new standby database and bring it to the mount state.

  5. When the switchover command completes, issue the ALTER DATABASE OPEN statement on the new primary database to bring it to the open state.

    Opening the new primary database from the mount state is possible only if the standby database was never opened read-only since the last time the database was started. If the database has been opened read-only, it will need to be restarted.

  6. If the primary and standby databases are RAC, then start all instances.

  7. Restart user sessions and application processing.
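Before issuing the statement in step 3, it is common practice to verify that the primary database is ready to change roles. This query is a sketch and not one of the required steps; a value of TO STANDBY (or SESSIONS ACTIVE, if sessions remain connected) indicates the primary database can be converted to the standby role:

    SELECT SWITCHOVER_STATUS FROM V$DATABASE;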

4.2.3.2.3 Using SQL for Data Guard Switchover to a Logical Standby Database

If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be executed with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Administration.

When performing a switchover using SQL*Plus commands it is possible for the old standby database that is to become the new primary database to build and transmit the LogMiner dictionary to the current primary database (the new standby database) prior to the actual switchover. This reduces the total time needed to perform the switchover. The following steps describe how to perform this optimized method:

  1. Issue the following SQL statement on the primary database to enable receipt of redo from the current standby database:

    ALTER DATABASE PREPARE TO SWITCHOVER TO LOGICAL STANDBY;
    
    
  2. On the current logical standby database, build the LogMiner dictionary and transmit this dictionary to the current primary:

    ALTER DATABASE PREPARE TO SWITCHOVER TO PRIMARY;
    
    
  3. If possible, disconnect user sessions and disable or stop application processing.

  4. If the primary and standby databases are RAC, then cleanly shut down all instances except one. To optimize the shutdown operations, use SHUTDOWN ABORT.

  5. When the SWITCHOVER_STATUS column of the V$DATABASE view returns TO LOGICAL STANDBY, convert the primary database to a standby by issuing:

    ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY WITH SESSION SHUTDOWN;
    
    
  6. Issue the following statement on the old standby database:

    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    
    
  7. If the primary and standby databases are configured in a RAC, then start all instances.

  8. Restart user sessions and application processing.

4.2.4 RAC Recovery for Unscheduled Outages

This solution is leveraged automatically when there is a node or instance failure. Surviving instances automatically recover the failed instances and can aid in the automatic client failover. Recovery times can be bounded by adopting the database and RAC configuration best practices, and instance recovery typically completes in seconds to minutes, with no data loss, even in very large and busy systems.

The following recovery methods can be used:

4.2.4.1 Automatic Instance Recovery for Failed Instances

Instance failure occurs when software or hardware problems disable an instance. After instance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.

Instance recovery in RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using service relocation and fast application notification (as described in Section 4.2.4.2, "Automatic Service Relocation").

When one instance performs recovery for another instance, the recovering instance:

  • Reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost

  • Rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions

When multiple node failures occur, as long as one instance survives, RAC performs instance recovery for any other instances that fail. If all instances of a RAC database fail, then on subsequent restart of any one instance, crash recovery will occur and all committed transactions will be recovered. If Data Guard is available, you can fail over automatically with Data Guard fast-start failover once all instances are down.

4.2.4.2 Automatic Service Relocation

Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to provide a service than would otherwise be needed. If a hardware failure occurs and adversely affects a RAC database instance, then CRS automatically moves any services on that instance to another available instance, as configured with DBCA or Enterprise Manager. Then, Cluster Ready Services (CRS) attempt to restart the failed nodes and instances.

CRS recognizes when a failure affects a service and automatically fails over the service and redistributes the clients across the surviving instances supporting the service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Notification of failures using fast application notification (FAN) events occurs at various levels within the Oracle server architecture. The response can include notifying external parties through Oracle Notification Service (ONS), Advanced Queuing, or FAN callouts; recording the fault for tracking; event logging; and interrupting applications. Notification occurs from a surviving node when the failed node is out of service. The location and number of nodes serving a service is transparent to applications. Restart and recovery are automatic, including all the subsystems, such as the listener and the ASM instance, not just the database.
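For illustration, a service with preferred and available instances can be defined and managed with the srvctl utility; the database, service, and instance names in this sketch are hypothetical:

    $ srvctl add service -d SALES -s oltp -r "SALES1,SALES2" -a "SALES3"
    $ srvctl start service -d SALES -s oltp
    $ srvctl status service -d SALES -s oltp

If an instance that hosts the oltp service fails, CRS relocates the service to an available instance and FAN events notify connected clients.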

4.2.4.3 Oracle Cluster Registry

Loss of the Oracle Cluster Registry (OCR) file affects the availability of RAC and Cluster Ready Services. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig tool. Additionally, starting with Oracle Database 10g Release 2 (10.2), Oracle can optionally mirror the OCR so that a single OCR device failure can be tolerated.
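The following ocrconfig commands sketch the backup and restore operations described above. The file locations are examples only, and restore operations must be run as the root user with CRS stopped on all nodes:

    # List the automatically generated physical backups of the OCR
    ocrconfig -showbackup

    # Create a manual logical export of the OCR
    ocrconfig -export /backup/ocr_backup.exp

    # Restore the OCR from a physical backup after a failure
    ocrconfig -restore /u01/app/oracle/product/10.2.0/crs/cdata/crs/backup00.ocr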

4.2.5 Application Failover

With proper configuration, applications can receive fast and efficient notification when application services become unavailable. Once notified, application connections are redirected transparently to surviving resources, such as other nodes in a RAC database or a standby database in a remote data center that has assumed the primary role following a failover.

In a RAC configuration, services are essential to achieving fast and transparent application failover. If a service becomes unavailable for a particular instance, the service will fail over to an available instance in the cluster, thereby allowing applications to continue processing. Clients are notified of the service relocation through Fast Application Notification (FAN).

Services are also essential for client failover across sites in a Data Guard configuration. After a site failure in a Data Guard configuration, the new production database will also be configured to automatically publish the production service while notifying affected clients that the services are no longer available on the failed production database through FAN events.

FAN notifications and service relocation enable automatic and fast redirection of clients in the event of a failure in both RAC and Data Guard environments.
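As an illustration of the client side, an Oracle Net alias can list the listener addresses of both sites so that new connections reach whichever site currently offers the service. The host and service names in this sketch are hypothetical:

    SALES =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (FAILOVER = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = prod1-vip)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = stby1-vip)(PORT = 1521))
        )
        (CONNECT_DATA = (SERVICE_NAME = oltp))
      )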


4.2.6 ASM Recovery After Disk and Storage Failures

The impacts and recommended repairs for various ASM failure types are summarized in Table 4-7.

Table 4-7 Types of ASM Failures and Recommended Repair

Failure Description Impact Recommended Repair

ASM instance failure

ASM instance fails

All database instances accessing ASM storage from the same node will shut down.

Automatic RAC Recovery for Unscheduled Outages

If RAC is not used, use Data Guard failover (see Section 4.2.2.2, "Best Practices for Implementing Data Guard Failover")

If RAC and Data Guard are not used, fix the underlying problem and then restart ASM and the database instances

ASM disk failure

One or more ASM disks fail, but all disk groups remain online.

All data remains accessible. This is possible only with normal or high redundancy disk groups.

ASM automatically rebalances to the remaining disk drives and reestablishes redundancy. There must be enough free disk space on the remaining disk drives to restore the redundancy, or the rebalance may fail with an ORA-15041 error.

Note: External redundancy disk groups should use mirroring in the storage array to protect from disk failure. Disk failures should not be exposed to ASM.

Data area disk-group failure

One or more ASM disks fail, and data area disk group goes offline.

Databases accessing the data area disk group will shut down.

Perform Data Guard failover or local recovery as described in Section 4.2.6.3, "Data Area Disk Group Failure"

Flash recovery area disk-group failure

One or more ASM disks fail, and the flash recovery area disk group goes offline.

Databases accessing the flash recovery area disk group will shut down.

Perform local recovery or Data Guard failover as described in Section 4.2.6.4, "Flash Recovery Area Disk Group Failure"


4.2.6.1 ASM Instance Failure

If the ASM instance fails, database instances accessing ASM storage from the same node will shut down. If the primary database is using RAC, then application failover will occur automatically, and clients connected to the failed database instance will reconnect to the remaining instances providing the required service in the cluster and continue processing. The recovery time is typically in seconds.

If the primary database does not use RAC, then an ASM instance failure will shut down the entire database. If Data Guard is being used and Data Guard fast-start failover is configured, a database failover will be triggered automatically and clients will automatically reconnect to the new primary database following the failover. The recovery time is the time it takes to complete an automatic Data Guard fast-start failover operation. If fast-start failover is not configured, then recovering from this outage is a manual process, which can be accomplished by either restarting the ASM and database instances manually, or by performing a Data Guard failover. If neither RAC nor Data Guard is being used, then restart the ASM instance and restart database instances manually. The recovery time depends on the length of time to start the ASM instance, and the length of time to start the database instances and perform crash recovery.
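When neither RAC nor Data Guard is in use, the manual restart might look like the following sketch; the instance names are hypothetical:

    $ export ORACLE_SID=+ASM
    $ sqlplus / as sysdba
    SQL> STARTUP;
    SQL> EXIT

    $ export ORACLE_SID=SALES
    $ sqlplus / as sysdba
    SQL> STARTUP;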

4.2.6.2 ASM Disk Failure

If an ASM disk group is configured as an external redundancy type, then a failure of a single disk should be handled by the storage array and should not be seen by the ASM instance, and all operations of ASM and databases using the disk group will continue normally. However, if the failure of an external redundancy disk group is seen by the ASM instance, then the ASM instance will take the disk group offline immediately, causing Oracle instances accessing the disk group to crash. If the disk failure is temporary, then ASM and the database instances can be restarted, and crash recovery will occur after the disk group is brought back online.

If an ASM disk group is configured as a normal or a high-redundancy type, then disk failure is handled transparently by ASM and the databases accessing the disk group are not affected. An ASM instance automatically starts an ASM rebalance operation to distribute the data on one or more failed disks to alternative disks in the ASM disk group. While the rebalance operation is in progress, subsequent disk failures may affect disk group availability if the disk contains data that has yet to be re-mirrored. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a subsequent failure. Multiple disk failures are handled similarly, provided the failures affect only one failure group in an ASM disk group.

The failure of multiple disks in multiple failure groups, where a primary extent and all of its mirror extents have been lost, will cause the disk group to go offline.

The following recovery methods can be used:

4.2.6.2.1 Using Enterprise Manager to Repair ASM Disk Failure

Figure 4-10 shows Enterprise Manager reporting disk failures. Three alerts appear at 11:19:29. The first alert is an Offline Disk Count. The second and third alerts are Disk Status messages for data area disk DATA.XBBT1D06_DATA and recovery area disk RECO.XBBT1D06_RECO:

2 disks are offline
Disk DATA.XBBT1D06_DATA is offline.
Disk RECO.XBBT1D06_RECO is offline.

Figure 4-10 Enterprise Manager Reports Disk Failures

Description of Figure 4-10 follows
Description of "Figure 4-10 Enterprise Manager Reports Disk Failures"

Figure 4-11 shows Enterprise Manager reporting the status of data area disk group DATA and recovery area disk group RECO. The red arrows under Member Disks indicate that one disk has failed in each disk group. The numbers under Pending Operations indicate that one operation is pending for each disk group.

Figure 4-11 Enterprise Manager Reports ASM Disk Groups Status

Description of Figure 4-11 follows
Description of "Figure 4-11 Enterprise Manager Reports ASM Disk Groups Status"

Figure 4-12 shows Enterprise Manager reporting a pending REBAL operation on the DATA disk group. The operation is about one-third done, as shown in % Complete, and the Remaining Time is estimated to be 16 minutes.

Figure 4-12 Enterprise Manager Reports Pending REBAL Operation

Description of Figure 4-12 follows
Description of "Figure 4-12 Enterprise Manager Reports Pending REBAL Operation"

4.2.6.2.2 Using SQL to Add Replacement Disks Back to the Disk Group

Perform these steps after one or more failed disks have been replaced, and access to the storage has been restored:

  1. Add the one or more replacement disks to the failed disk group with the following SQL command:

    ALTER DISKGROUP disk_group 
       ADD FAILGROUP failure_group 
       DISK 'disk1','disk2',...;
    
    
  2. Check the progress of the operation:

    SELECT * FROM V$ASM_OPERATION;
    

4.2.6.3 Data Area Disk Group Failure

A data area disk group failure should occur only when there have been multiple failures. For example, if the data-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple disk failures in a storage array may be seen by ASM causing the disk group to go offline. Similarly, multiple disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.

When one or more disks fail in a normal or high redundancy disk group, but the ASM disk group is accessible, there is no loss of data and no immediate loss of accessibility. An ASM instance automatically starts an ASM rebalance operation to distribute the data on the one or more failed disks to alternative disks in the ASM disk group. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a second failure. There must be enough disk space on the remaining disks in the disk group for the rebalance to complete successfully.

The two possible solutions when the data area disk group fails are summarized in Table 4-8.

Table 4-8 Recovery Options for Data Area Disk Group Failure

Recovery Option Recovery Time Objective (RTO) Recovery Point Objective (RPO)

Data Guard failover

Five minutes or less

Varies depending on the data protection level chosen

Local recovery

Database restore and recovery time

Zero


If Data Guard is being used and fast-start failover is configured, then an automatic failover will occur when the database shuts down due to the data-area disk group going offline. If fast-start failover is not configured, then perform a manual failover.

If you decide to perform a Data Guard failover, then the recovery time objective (RTO) will be expressed in terms of minutes or perhaps seconds, depending on the presence of the Data Guard observer process and fast-start failover. However, if a manual failover occurs and not all data is on the standby site, then data loss might result.

After Data Guard failover has completed, and the application is available, the data area disk group failure must still be resolved. Continue with the following "Local Recovery Steps" procedure to resolve the ASM disk group failure.

The RTO for local recovery only is based primarily on the time required to:

  1. Repair and replace the failed storage components

  2. Restore and recover the database

Because the loss affects only the data-area disk group, there is no loss of data. All transactions are recorded in the Oracle redo log members that reside in the flash recovery area, so complete media recovery is possible.
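
You can confirm that each online redo log group is multiplexed with a member in the flash recovery area before relying on this behavior; for example:

SELECT GROUP#, TYPE, MEMBER FROM V$LOGFILE ORDER BY GROUP#;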

If Data Guard is not being used, then perform the following local recovery steps. The time required to perform local recovery depends on how long it takes to restore and recover the database. There is no data loss when performing local recovery.

Local Recovery Steps

Perform these steps after one or more failed disks have been replaced and access to the storage has been restored:

Note:

If you have performed an Oracle Data Guard failover to a new primary database, then you can now use the following procedure to reintroduce the database into the Data Guard environment. Also, see Section 4.3.2, "Restoring a Standby Database After a Failover".
  1. Rebuild the ASM disk group using the new storage location:

    SQL> CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK 'path1','path2',...;
    
    
  2. Start the instance NOMOUNT:

    RMAN> STARTUP FORCE NOMOUNT;
    
    
  3. Restore the control file from the surviving copy located in the recovery area:

    RMAN> RESTORE CONTROLFILE FROM 'recovery_area_controlfile';
    
    
  4. Start the instance MOUNT:

    RMAN> STARTUP FORCE MOUNT;
    
    
  5. Restore the database:

    RMAN> RESTORE DATABASE;
    
    
  6. Recover the database:

    RMAN> RECOVER DATABASE;
    
    
  7. If you use block change tracking, then disable and reenable the block change tracking file:

    SQL> ALTER DATABASE DISABLE BLOCK CHANGE TRACKING;
    SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;
    
    
  8. Open the database:

    SQL> ALTER DATABASE OPEN;
    
    
  9. Re-create the log file members on the failed ASM disk group:

    SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename';
    SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
    
    
  10. Perform a new incremental level 0 backup:

    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
    

4.2.6.4 Flash Recovery Area Disk Group Failure

When the flash recovery-area disk group fails, the database crashes because the control file member usually resides in the flash recovery area and Oracle requires that all control file members are accessible. The flash recovery area can also contain the flashback logs, redo log members and all backups.
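
You can confirm which control file members the instance depends on, and therefore whether a member resides in the flash recovery area, with a query such as the following:

SELECT NAME FROM V$CONTROLFILE;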

A flash recovery area disk group failure should occur only when there have been multiple failures. For example, if the flash recovery-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple-disk failures in a storage array may be seen by ASM causing the disk group to go offline. Similarly, multiple-disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.

Table 4-9 summarizes the two possible solutions when the flash recovery-area disk group fails.

Table 4-9 Recovery Options for Flash Recovery Area Disk Group Failure

Recovery Option Recovery Time Objective (RTO) Recovery Point Objective (RPO)

Local Recovery

Five minutes or less

Zero

Data Guard Failover or Switchover

Five minutes or less

Zero


Because the loss affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area. A fast local restart is possible by starting up the primary database after removing the control file member located in the failed flash recovery area and pointing to a new flash recovery area for local archiving (see the "Local Restart Steps" discussion later in this section for the step-by-step procedure). However, this is a temporary fix until a new flash recovery area is created to replace the failed storage components. Oracle recommends following the "Local Recovery Steps" discussion later in this section.

If you decide to perform a Data Guard failover, then the RTO will be expressed in terms of minutes or perhaps seconds depending on the presence of the Data Guard observer process and fast-start failover. After Data Guard failover has completed, and the application is available, the flash recovery area disk group failure must still be resolved. Continue with the instructions in the following "Local Recovery Steps" section to resolve the ASM disk group failure.

If the protection level is maximum performance or the standby database is unsynchronized with the primary database, then temporarily start up the primary database by removing the controlfile member and pointing to a temporary flash recovery area (file system) in the SPFILE. Issue a Data Guard switchover to ensure no data loss. After Data Guard switchover has completed, and the application is available, the flash recovery area disk group failure must still be resolved. Shut down the affected database and continue with the instructions in the following "Local Recovery Steps" section to resolve the ASM disk group failure.

The RTO for local recovery only is based primarily on the time to repair and replace the failed storage components and then on the time to restore the control-file copy. Because the loss affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area. As mentioned previously, you can start up the primary database by removing the controlfile member and pointing to a new flash recovery area. However, this is a temporary fix filled with availability and performance risks unless the flash recovery area is configured properly. Therefore, Oracle recommends the "Local Recovery Steps" that follow.

Local Restart Steps

For a fast local restart, perform the following steps on the primary database:

  1. Change the CONTROL_FILES initialization parameter to refer only to members in the Data Area. For example:

    ALTER SYSTEM SET CONTROL_FILES='+DATA/sales/control1.dbf' SCOPE=spfile;
    
    
  2. Change local archive destinations and/or the flash recovery area to the local redundant, scalable destination. For example:

    ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+DATA' SCOPE=spfile;
    
    
  3. Startup with new settings:

    STARTUP MOUNT;
    
    

You may need to disable and reenable Flashback Database because the flashback logs were damaged or lost:

ALTER DATABASE FLASHBACK OFF;
ALTER DATABASE FLASHBACK ON;
ALTER DATABASE OPEN;

Local Recovery Steps

Note:

If you performed an Oracle Data Guard failover to a new primary database, then you cannot use this procedure to reintroduce the old primary database as a standby database. This is because Flashback Database log files that are required as part of reintroducing the database have been lost. You must perform a full reinstantiation of the standby database.
  1. Replace the failed storage or obtain access to new storage to use as the flash recovery area.

  2. Rebuild the ASM disk group using the new storage location:

    SQL> CREATE DISKGROUP RECO NORMAL REDUNDANCY DISK 'path1','path2',...;
    
    
  3. Start the instance NOMOUNT:

    RMAN> STARTUP FORCE NOMOUNT;
    
    
  4. Restore the control file from the surviving copy located in the data area:

    RMAN> RESTORE CONTROLFILE FROM 'data_area_controlfile';
    
    
  5. Start the instance MOUNT:

    RMAN> STARTUP FORCE MOUNT;
    
    
  6. If you use Flashback Database, then disable it:

    SQL> ALTER DATABASE FLASHBACK OFF;
    
    
  7. Open the database and allow instance recovery to complete:

    SQL> ALTER DATABASE OPEN;
    
    
  8. Issue the following statements only if Flashback Database is required:

    SQL> SHUTDOWN IMMEDIATE;
    SQL> STARTUP MOUNT;
    SQL> ALTER DATABASE FLASHBACK ON;
    SQL> ALTER DATABASE OPEN;
    
    
  9. Re-create the log file members on the failed ASM disk group:

    SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename';
    SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
    
    
  10. Synchronize the control file and the flash recovery area:

    RMAN> CATALOG RECOVERY AREA;
    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> CROSSCHECK BACKUPSET;
    RMAN> CROSSCHECK DATAFILECOPY ALL;
    RMAN> LIST EXPIRED type;
    RMAN> DELETE EXPIRED type;
    
    
  11. Because the backups and archived redo log files in the flash recovery area have been lost, take a new backup:

    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
    

4.2.7 Recovering from Data Corruption (Data Failures)

Recovering from data corruption is an unscheduled outage scenario. Data corruption is usually—but not always—caused by some activity or failure that occurs outside the database, even though the problem might be evident within the database.

Data corruption in data files has two categories:

  • Data file block corruption

    A corrupt data file block can be accessed, but the contents in the block are invalid or inconsistent. The typical cause of data file corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.

    The database usually remains available when corrupt blocks have been detected, but some corrupt blocks might cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.

    A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application because it has affected the availability of the application. For example:

    • A single corrupt data block in a user table that cannot be read by the application because of a bad spot on the physical disk

    • A single corrupt data block because of block inconsistencies detected by Oracle. The block will be marked corrupted and any application accessing the block will receive an ORA-1578 error.

    • A database that automatically shuts down because of the invalid blocks of a data file in the SYSTEM tablespace caused by a failing disk controller

  • Media failure

    This category of data corruption results from a physical hardware problem or user error. The system cannot successfully read or write to a file that is necessary to operate the database.

In all environments, you can resolve a data corruption outage by one of the following methods:

  • RMAN block media restoration and recovery

  • Data Guard switchover or failover to a standby database

  • RMAN datafile media restoration and recovery

  • Manually re-create the object

RMAN block media restoration and recovery provides the highest application availability if targeted blocks are not critical to application functionality. Data Guard switchover or failover to standby database provides the fastest predictable RTO.

Other outages that result in database objects becoming unavailable or inconsistent are caused by human error, such as dropping a table or erroneously updating table data. Information about recovering from human error can be found in Section 4.2.8, "Recovering from Human Error".

If the data corruption affects files other than datafiles, then the repair may be slightly different. Table 4-10 provides a matrix of corruption in key nondatafile database components and the recommended repairs.

Table 4-10 Non Database Object Corruption and Recommended Repair

Object or Component Affected Impact Repair

Any control file

Database fails

Data Guard fast-start failover will automatically fail over to the standby database

Redo log member

None

  1. Investigate failure and check system

  2. Drop and re-create redo log member

Active log group that is archived and not needed for crash recovery

Database fails

Restart database after dropping affected redo log group

Active redo log group not archived and not needed for crash recovery

Database fails

  1. Restart database after dropping affected log group

  2. Create a new backup

  3. Refresh the standby database either by applying an incremental backup or re-creating the standby database from the primary or a backup of the primary database

Active or current redo log group that is still needed for crash recovery

Database fails

Use one of the following solutions:

  • Data Guard failover

  • Flashback Database—flash the database back to a consistent time and then issue an OPEN RESETLOGS

Archived redo log file

None

  1. Create database backup

  2. Refresh the standby database either by applying an incremental backup or re-creating the standby database from the primary or a backup of the primary database

SPFILE

None

Restore SPFILE from a backup and revise


The following recovery methods can be used:

4.2.7.1 Use Data Guard to Recover From Data Corruption and Data Failure

Failover is the operation of transitioning a standby database to the role of the new production database. A database switchover is a planned transition in which a standby database and a production database switch roles. Either of these operations can occur in less than 5 minutes and with no data loss. Use Data Guard switchover or failover for data corruption or data failure when:

  • The database is down or when the database is up but the application is unavailable because of data corruption or failure, and the time to restore and recover locally is long or unknown.

  • Recovering locally will be longer than the business SLA or RTO.

4.2.7.2 Use RMAN Block Media Recovery

Block media recovery (BMR) recovers one block or a set of data blocks marked "media corrupt" in a data file by using the RMAN BLOCKRECOVER command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole data files. This results in lower RTO because only blocks that need recovery are restored and only necessary corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected data files to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.
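
Before using block media recovery, you typically identify the corrupt blocks. One approach, sketched below, is to run an RMAN validation and then query V$DATABASE_BLOCK_CORRUPTION, which RMAN populates with any corrupt blocks it finds (BACKUP VALIDATE reads and checks the blocks but does not create backup pieces):

RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE;

SQL> SELECT FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE
     FROM V$DATABASE_BLOCK_CORRUPTION;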

Use block media recovery when:

  • A small number of blocks require media recovery and the blocks that need recovery are known. If a significant portion of the datafile is corrupt, or if the amount of corruption is unknown, then a different recovery method should be used.

  • Blocks are marked corrupt (verified with the RMAN BACKUP VALIDATE command) and only when complete recovery is required.

  • Backup of the data file containing the corrupted blocks is available locally or can be retrieved from a remote location, including from a physical standby database.

Block media recovery cannot be used to recover from the following:

  • User error or software bugs that cause logical corruption where the data blocks are intact. See Section 4.2.8, "Recovering from Human Error" for additional details for this type of recovery.

  • Changes caused by corrupt redo data. Block media recovery requires that all available redo data be applied to the blocks being recovered.

For example, to recover a specific corrupt block using RMAN block media recovery:

RMAN> BLOCKRECOVER DATAFILE 7 BLOCK 3;

When the corruption is detected, you can also recover the block easily through Grid Control.
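
If the corrupt blocks have already been recorded in V$DATABASE_BLOCK_CORRUPTION (for example, by a previous BACKUP VALIDATE), you can recover all of them in a single operation:

RMAN> BLOCKRECOVER CORRUPTION LIST;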

See Also:

Oracle Database Backup and Recovery Advanced User's Guide and the white paper titled Using Recovery Manager with Oracle Data Guard in Oracle Database 10g available at http://www.oracle.com/technology/deploy/availability/pdf/RMAN_DataGuard_10g_wp.pdf

4.2.7.3 Use RMAN Data File Media Recovery

Data file media recovery recovers an entire datafile or set of data files for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked media-corrupt and require media recovery, or when an entire file is lost, the affected data files must be restored and recovered.

Use RMAN data file media recovery when the following conditions are true:

  • The number of data blocks requiring media recovery is large or unknown

  • Block media recovery is not possible, for example, because an entire data file has been lost
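
For example, the following is a minimal sketch of restoring and recovering a single damaged data file while the rest of the database remains open. Data file 7 is used only as an illustration; the database can stay open during the operation provided the file does not belong to the SYSTEM tablespace:

RMAN> SQL 'ALTER DATABASE DATAFILE 7 OFFLINE';
RMAN> RESTORE DATAFILE 7;
RMAN> RECOVER DATAFILE 7;
RMAN> SQL 'ALTER DATABASE DATAFILE 7 ONLINE';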

4.2.7.4 Re-Create Objects Manually

Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.

Use manual object re-creation when:

  • You must re-create a small index because of media corruption. Creating an index online enables the base object to be used concurrently (see the example following this list).

  • You must re-create a look-up table and the scripts to re-create the table are readily available. Dropping and re-creating the table might be the fastest option.
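
For example, a small corrupted index could be dropped and re-created online so that the base table remains available to the application. The table and index names below are illustrative only:

DROP INDEX sales_lookup_idx;
CREATE INDEX sales_lookup_idx ON sales_lookup (lookup_code) ONLINE;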

4.2.8 Recovering from Human Error

Oracle flashback technology revolutionizes data recovery. In the past it took seconds to damage a database but hours to days to recover it. With flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing human errors that require rewinding the database, table, transaction, or row-level changes to a previous point in time is easy and does not require any database or object restoration. Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback technology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, flashback technology is typically much faster than restoring the database from a backup.

Flashback technologies are applicable only to repairing the following human errors:

  • Erroneous or malicious update, delete or insert transactions

  • Erroneous or malicious DROP TABLE statements

  • Erroneous or malicious batch jobs or widespread application errors

Flashback technologies cannot be used for media or data corruption such as block corruption, bad disks, or file deletions. See Section 4.2.7, "Recovering from Data Corruption (Data Failures)" and Section 4.2.2, "Database Failover with a Standby Database" to repair these outages.

Table 4-11 summarizes the flashback solutions for each type of outage.

Table 4-11 Flashback Solutions for Different Outages

Impact of Outage Examples of Human Errors Flashback Solutions

Row or transaction

See Also: "Resolving Row and Transaction Inconsistencies"

Accidental deletion of row

Erroneous transaction

Flashback Query

Flashback Version Query

Flashback Transaction Query

Table

See Also:"Resolving Table Inconsistencies"

Dropped table

Erroneous transactions affecting one table or a set of tables

Flashback Drop

Flashback Table

Tablespace or database

See Also: "Resolving Database-Wide Inconsistencies"

Erroneous batch job affecting many tables or an unknown set of tables

Series of database-wide malicious transactions

Flashback Database


Table 4-12 summarizes each flashback feature.

Table 4-12 Summary of Flashback Features

Flashback Feature Description

Flashback Query

Flashback Query enables you to view data at a point in time in the past. It can be used to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors.

Note: Changes are propagated to physical and logical standby databases.

Flashback Version Query

Flashback Version Query uses undo data stored in the database to view the changes to one or more rows along with all the metadata of the changes.

Note: Changes are propagated to physical and logical standby databases.

Flashback Transaction Query

Flashback Transaction Query enables you to examine changes to the database at the transaction level. As a result, you can diagnose problems, perform analysis, and audit transactions.

Note: Changes are propagated to physical and logical standby databases.

Flashback Drop

Flashback Drop provides a way to restore accidentally dropped tables.

Note: Changes are propagated to physical standby databases.

Flashback Table

Flashback Table enables you to quickly recover a table to a point in time in the past without restoring a backup.

Note: Changes are propagated to physical and logical standby databases.

Flashback Database

Flashback Database enables you to quickly return the database to an earlier point in time by undoing all of the changes that have taken place since that time. This operation is fast because you do not have to restore the backups.


Flashback Database uses the Oracle Database flashback logs, while all other flashback features use the unique undo and multiversion read consistency capabilities of Oracle Database. See the configuration best practices for the database—as documented in Section 2.2, "Configuring Oracle Database 10g"—for configuring flashback technologies to ensure that the resources from these solutions are available at a time of failure.

See Also:

Oracle Database Administrator's Guide, Oracle Database Backup and Recovery Basics, and Oracle Database Concepts for more information about flashback technology and automatic undo management

In general, the recovery time when using flashback technologies is equivalent to the time it takes to cause the human error plus the time it takes to detect the human error.

Flashback technologies allow recovery up to the point that the human error occurred.

The following recovery methods can be used:

4.2.8.1 Resolving Table Inconsistencies

Oracle provides a FLASHBACK DROP statement to recover from an accidental DROP TABLE statement, and a FLASHBACK TABLE statement to restore a table to a previous point in time.

Flashback Table provides the DBA with the ability to recover a table, or a set of tables, to a specified point in time quickly and easily. In many cases, Flashback Table eliminates the need for more complicated point-in-time recovery operations. For example:

FLASHBACK TABLE orders, order_items 
      TO TIMESTAMP 
      TO_DATE('28-Jun-06 14:00:00','dd-Mon-yy hh24:mi:ss');

This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that have been done between the current time and specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.
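
Note that flashing back a table to a timestamp or SCN requires that row movement be enabled on each affected table beforehand; for example:

ALTER TABLE orders ENABLE ROW MOVEMENT;
ALTER TABLE order_items ENABLE ROW MOVEMENT;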

Dropping or deleting database objects by accident is a common mistake. Users soon realize their mistake, but by then it is too late and there has been no way to easily recover a dropped table and its indexes, constraints, and triggers. Objects once dropped were dropped forever. Loss of very important tables or other objects (like indexes, partitions, or clusters) required DBAs to perform a point-in-time recovery, which can be time-consuming and lead to loss of recent transactions.

Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a recycle bin. Objects in the recycle bin remain there until the user decides to permanently remove them or until space limitations begin to occur on the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users can view the recycle bin and undrop the dropped table and its dependent objects. For example, the employees table and all its dependent objects would be undropped by the following statement:

FLASHBACK TABLE employees TO BEFORE DROP;
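
Before undropping, you can confirm that the table is still available in the recycle bin; for example:

SELECT ORIGINAL_NAME, OBJECT_NAME, TYPE, DROPTIME
  FROM USER_RECYCLEBIN
  WHERE ORIGINAL_NAME = 'EMPLOYEES';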

4.2.8.2 Resolving Row and Transaction Inconsistencies

Resolving row and transaction inconsistencies might require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the compensating SQL statements constructed from undo statements to rectify the problem. This section describes a general approach using a human resources example to resolve row and transaction inconsistencies caused by erroneous or malicious user errors.

Flashback Query, a feature introduced in the Oracle9i Database, enables an administrator or user to query any data at some point in time in the past. This powerful feature can be used to view and reconstruct data that might have been deleted or changed by accident. For example:

SELECT * FROM EMPLOYEES 
       AS OF TIMESTAMP 
       TO_DATE('28-Jun-06 14:00','DD-Mon-YY HH24:MI')
WHERE ...

This partial statement displays rows from the EMPLOYEES table as they existed at 2 p.m. on June 28, 2006. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, rather than burdening administrators to perform this task. Flashback Query is very simple to manage, because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.

Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:

SELECT * FROM EMPLOYEES
       VERSIONS BETWEEN TIMESTAMP
       TO_DATE('28-Jun-06 14:00','dd-Mon-YY hh24:mi') AND
       TO_DATE('28-Jun-06 15:00','dd-Mon-YY hh24:mi')
WHERE ...

This statement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. on June 28, 2006. A DBA can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. This enables the DBA to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.

Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transaction. For example:

SELECT UNDO_SQL
FROM FLASHBACK_TRANSACTION_QUERY 
WHERE XID = HEXTORAW('000200030000002D');

This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.

Consider a human resources (HR) example involving the SCOTT schema. The HR manager reports to the DBA that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he instructed his staff to reset the salary to the previous level of $1250. This was completed around 9:15 a.m.

The following steps show how to approach the problem.

  1. Assess the problem.

    Fortunately, the HR manager has provided information about the time when the change occurred. You can query the information as it was at 9:00 a.m. with Flashback Query.

        SELECT EMPNO, ENAME, SAL
        FROM EMP
        AS OF TIMESTAMP TO_DATE('28-JUN-06 09:00','dd-Mon-yy hh24:mi')
        WHERE ENAME = 'WARD';
    
             EMPNO ENAME             SAL
        ---------- ---------- ----------
              7521 WARD             1875
    
    

    You can confirm that you have the correct employee because Ward's salary was $1875 at 9:00 a.m. Rather than using Ward's name, you can now use the employee number for subsequent investigation.

  2. Query past rows or versions of the data to acquire transaction information.

    Although it is possible to restrict the row version information to a specific date or SCN range, you might want to query all the row information that is available for the employee WARD using Flashback Version Query.

        SELECT EMPNO, ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME
        FROM EMP
        VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
        WHERE EMPNO = 7521
        ORDER BY NVL(VERSIONS_STARTSCN,1);
    
    EMPNO ENAME          SAL VERSIONS_STARTTIME     VERSIONS_ENDTIME
    ----- ---------- ------- ---------------------- ----------------------
     7521 WARD          1250  28-JUN-06 08.48.43 AM  28-JUN-06 08.54.49 AM
     7521 WARD          1875  28-JUN-06 08.54.49 AM  28-JUN-06 09.10.09 AM
     7521 WARD          1250  28-JUN-06 09.10.09 AM
    
    

    You can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same morning and was subsequently reset to $1250 at approximately 09:10:09.

    Also, you can modify the query to determine the transaction information for each of the changes affecting WARD using a similar Flashback Version Query. This time use the VERSIONS_XID pseudocolumn.

        SELECT EMPNO, ENAME, SAL, VERSIONS_XID
        FROM EMP
        VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
        WHERE EMPNO = 7521
        ORDER BY NVL(VERSIONS_STARTSCN,1);
    
             EMPNO ENAME             SAL VERSIONS_XID
        ---------- ---------- ---------- ----------------
              7521 WARD             1250 0006000800000086
              7521 WARD             1875 0009000500000089
              7521 WARD             1250 000800050000008B
    
    

    The ID of the erroneous transaction that increased WARD's salary to $1875 was "0009000500000089".

  3. Query the erroneous transaction and the scope of its effect.

    With the transaction information (VERSIONS_XID pseudocolumn), you can now query the database to determine the scope of the transaction, using Flashback Transaction Query.

        SELECT UNDO_SQL
        FROM FLASHBACK_TRANSACTION_QUERY
        WHERE XID = HEXTORAW('0009000500000089');
    
        UNDO_SQL                                                                    
        ----------------------------------------------------------------------------
        update "SCOTT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL';      
        update "SCOTT"."EMP" set "SAL" = '1500' where ROWID = 'AAACV4AAFAAAAKtAAJ';     
        update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AAFAAAAKtAAF';      
        update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE';    
        update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB';     
                                                                                    
        5 rows selected.
    
    

    You can see that WARD's salary was not the only change that occurred in the transaction. The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager for review.

  4. Determine if the corrective statements should be executed.

    If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the DBA can execute those statements individually.

  5. Query the FLASHBACK_TRANSACTION_QUERY view for additional transaction information. For example, to determine the user that performed the erroneous update, issue the following query:

    SELECT LOGON_USER FROM FLASHBACK_TRANSACTION_QUERY
    WHERE XID = HEXTORAW('0009000500000089');
    
    LOGON_USER
    ----------------------------
    MSMITH
    
    

    In this example, the query shows that the user MSMITH was responsible for the erroneous transaction.

4.2.8.3 Resolving Database-Wide Inconsistencies

To bring an Oracle database to a previous point in time, the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, because it requires the whole database to be restored from backup and then recovered to the point in time just before the error was introduced. With database sizes constantly growing, the restore step alone can account for most of that downtime.

Flashback Database is a new strategy for doing point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error. Flashback logs are used to capture old versions of changed blocks. One way to think of it is as a continuous backup or storage snapshot. When recovery must be performed, the flashback logs are quickly replayed to restore the database to a point in time before the error, and only the changed blocks are restored. It is extremely fast and reduces recovery time from hours to minutes. In addition, it is easy to use. A database can be recovered to 2:05 p.m. by issuing a single statement. Before the database can be recovered, all instances of the database must be shut down and one of the instances subsequently mounted. The following is an example of a FLASHBACK DATABASE statement.

FLASHBACK DATABASE TO TIMESTAMP SYSDATE-1;

No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a production database, a physical standby database, or a logical standby database.

These steps are recommended for using Flashback Database:

  1. Determine the time or the SCN to which to flash back the database.

  2. Verify that there is sufficient flashback log information.

         SELECT OLDEST_FLASHBACK_SCN, 
           TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH:MI:SS') 
           FROM V$FLASHBACK_DATABASE_LOG;
    
    
  3. Flash back the database to a specific time or SCN. (The database must be mounted to perform a Flashback Database.)

    FLASHBACK DATABASE TO SCN scn;
    
    

    or

    FLASHBACK DATABASE TO TIMESTAMP TO_DATE('date_string','format');
    
    
  4. Open the database in read-only mode to verify that it is in the correct state.

    ALTER DATABASE OPEN READ ONLY;
    
    

    If more flashback data is required, then issue another FLASHBACK DATABASE statement. (The database must be mounted to perform a Flashback Database.)

    If you want to move forward in time, then issue a statement similar to the following:

    RECOVER DATABASE UNTIL [TIME | CHANGE] date | scn;
    
    
  5. Open the database:

    ALTER DATABASE OPEN RESETLOGS;
    
    

Other considerations when using Flashback Database are as follows:

  • If there are not sufficient flashback logs to flash back to the target time, then use one of the following alternatives:

    • Use Data Guard to recover to the target time if the standby database lags behind the primary database, or flash back the standby to the target time if it has sufficient flashback logs.

    • Restore from backups.

  • After flashing back a database, any dependent database such as a standby database must be flashed back. See Section 4.3, "Restoring Fault Tolerance".

Flashback Database does not automatically fix a dropped tablespace, but it can be used to dramatically reduce the downtime. You can flash back the production database to a point before the tablespace was dropped and then restore a backup of the corresponding datafiles from the affected tablespace and recover to a time before the tablespace was dropped.

Follow these recommended steps to use Flashback Database to repair a dropped tablespace:

  1. Determine the SCN or time you dropped the tablespace.

  2. Flash back the database to a time before the tablespace was dropped. You can use a statement similar to the following:

    FLASHBACK DATABASE TO BEFORE SCN drop_scn;
    
    
  3. Restore, rename, and bring datafiles online.

    1. Restore only the datafiles from the affected tablespace from a backup.

    2. Rename the unnamed files to the restored files:

        ALTER DATABASE RENAME FILE '.../UNNAMED00005' to 'restored_file';
    
    
  4. Bring the datafiles online.

    ALTER DATABASE DATAFILE 'name' ONLINE; 
             
    
  5. Query and recover the database.

    SELECT CHECKPOINT_CHANGE# FROM V$DATAFILE_HEADER WHERE FILE#=1;
    RECOVER DATABASE UNTIL CHANGE checkpoint_change#;
    
    
  6. Open the database.

    ALTER DATABASE OPEN RESETLOGS;
    

4.3 Restoring Fault Tolerance

Whenever a component in a high-availability architecture fails, then the full protection—or fault tolerance—of the architecture is compromised and possible single points of failure exist until the component is repaired. Restoring the high-availability architecture to full fault tolerance to reestablish full RAC, Data Guard, or MAA protection requires repairing the failed component. While full fault tolerance might be sacrificed during planned downtime, the method of repair is well understood because it is planned, the risk is controlled, and it ideally occurs at times best suited for continued application availability. However, for unplanned downtime, the risk of exposure to a single point of failure must be clearly understood.

This section provides the following topics that describe the steps needed to restore database fault tolerance:

4.3.1 Restoring Failed Nodes or Instances in a RAC Cluster

Ensuring that application services fail over quickly and automatically in a RAC cluster—or between primary and secondary sites—is important when planning for both scheduled and unscheduled outages. To ensure that the environment is restored to full fault tolerance after any errors or issues are corrected, it is also important to understand the steps and processes for restoring failed instances or nodes within a RAC cluster or databases between sites.

Adding a failed node back into the cluster or restarting a failed RAC instance is easily done after the core problem that caused the specific component to originally fail has been corrected. However, you should also consider:

  • When to perform these tasks so as to incur minimal or no effect on the current running environment

  • Resetting network components (such as a load balancer) that were modified for failover

  • Failing back or rebalancing existing connections

After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the RAC environment at any time. Processing to complete the reconfiguration of a node may require additional system resources.

Table 4-13 summarizes additional processing that may be required when adding a node.

Table 4-13 Additional Processing When Restarting or Rejoining a Node or Instance

Action Additional Resources

Restarting a node or rejoining a node into a cluster

When using only Oracle Clusterware, there is no impact when a new node joins the cluster.

When using vendor clusterware, there may be performance degradation while reconfiguration occurs to add a node back into the cluster. The impact on current applications should be evaluated with a full test workload.

Restarting or rejoining a RAC instance

When you restart a RAC instance, there might be some potential performance impact while lock reconfiguration takes place. The impact on current applications is usually minimal, but it should be evaluated with a full test workload.


See Also:

The following recovery methods can be used:

4.3.1.1 Recovering Service Availability

After a failed node has been brought back into the cluster and its instance has been started, Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance. A particular service might or might not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.

For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.

Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.

4.3.1.2 Considerations for Client Connections After Restoring a RAC Instance

After a RAC instance has been restored, additional steps might be required, depending on the current resource utilization and performance of the system, the application configuration, and the network load balancing that has been implemented.

Existing connections (which might have failed over or started as a new session) on the surviving RAC instances are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users might or might not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle and provide acceptable response times for the workload. If the surviving RAC instances do not have adequate resources to run a full workload or to provide acceptable response times, then it might be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.

New connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the new connections are automatically load-balanced over time.

An application service can be:

  • Partitioned with services running on a subset of RAC instances

  • Nonpartitioned so that all services run equally across all nodes

This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For the cases where an application is partitioned or has a combination of partitioning and nonpartitioning, the response time and availability aspects for each service should be considered. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads manually with the DBMS_SERVICE.DISCONNECT_SESSION PL/SQL procedure. You can use this procedure to disconnect sessions associated with a service while the service is running.
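
For example, assuming a service named SALES, the following call disconnects the sessions for that service on the instance where it is run, so that they can reconnect and be load balanced across the instances currently offering the service (provided the application is configured to reconnect, for example through Transparent Application Failover or application retry logic):

EXECUTE DBMS_SERVICE.DISCONNECT_SESSION('SALES');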

For load-balancing application services across multiple RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. This feature does not require changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there might be limitations in distinguishing separate application services (which is understood by Oracle Net Services) and restoring an instance or a node. For example, when a node or instance is restored and available to start receiving new connections, a manual step might be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.

Table 4-14 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or are a combination of both. The actual redistribution of existing connections might or might not be required depending on the resource utilization and response times.

Table 4-14 Restoration and Connection Failback

Application Services Failback or Restore Existing Connections Failback or Restore New Connections

Partitioned

Existing sessions are not automatically relocated back to the restored instance. Use DBMS_SERVICE.DISCONNECT_SESSION to manually disconnect sessions and allow them to be reestablished on one of the remaining instances that provides the service.

Automatically routes to the restored instance by using the Oracle Net Services configuration.

Nonpartitioned

No action is necessary unless the load must be rebalanced, because restoring the instance means that the load there is low. If the load must be rebalanced, then the same problems are encountered as if application services were partitioned.

Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration


Figure 4-13 shows a two-node partitioned RAC database. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.

Figure 4-13 Partitioned Two-Node RAC Database

Figure 4-14 shows what happens when one RAC instance fails.

Figure 4-14 RAC Instance Failover in a Partitioned Database

If one RAC instance fails, then the service and existing client connections can be automatically failed over to another RAC instance. In this example, the HR and Sales services are both supported by the remaining RAC instance. In addition, new client connections for the Sales service can be routed to the instance now supporting this service.

After the failed instance has been repaired and restored to the state shown in Figure 4-13, and the Sales service is relocated to the restored instance, failed-over clients and any new clients that had connected to the Sales service on the failed-over instance might have to be identified and failed back. New client connections, which are started after the instance has been restored, should automatically connect back to the original instance. Therefore, over time, as older connections disconnect, and new sessions connect to the Sales service, the client load migrates back to the restored instance. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.

Figure 4-15 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.

Figure 4-15 Nonpartitioned RAC Instances

If one RAC instance fails, then CRS moves the services that were running on the failed instance. In addition, new client connections are routed only to the available RAC instances that offer that service.

After the failed instance has been repaired and restored to the state shown in Figure 4-15, some clients might have to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, rebalancing is necessary only if the remaining instances cannot adequately service the requests.

New client connections that are started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load will again evenly balance across all available RAC instances. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.

4.3.2 Restoring a Standby Database After a Failover

Following unplanned downtime on a production database that requires a failover, full fault tolerance is compromised until the standby database is reestablished. Full database protection should be restored as soon as possible. The steps for restoring fault tolerance differ slightly between physical and logical standby databases.

Reinstating databases is automated if you are using Data Guard fast-start failover. After a fast-start failover completes, the observer automatically attempts to reinstate the former primary database as a standby database. Reinstatement restores high availability to the broker configuration so that, in the event of a failure of the new primary database, another fast-start failover can occur. The reinstated database can act as the fast-start failover target for the new primary database, making a subsequent fast-start failover possible. The new standby database is a viable target of a failover when it begins applying redo data received from the new primary database.

The broker can reinstate the former primary database as a standby database without the need to reenable the primary database or to manually perform a Flashback Database operation. To reinstate the former primary database, the database must be started and mounted, but it cannot be opened. The broker reinstates the database as a standby database of the same type (physical or logical) as the former standby database.

If the former primary database cannot be reinstated automatically, you can manually reinstate it using either the DGMGRL REINSTATE command or Enterprise Manager. Step-by-step instructions for manual reinstatement are described in Oracle Data Guard Broker.

Standby databases do not have to be reinstantiated if you use the Oracle Flashback Database feature. Flashback Database has the following advantages:

  • Saves hours of database restoration time

  • Reduces overall complexity in restoring fault tolerance

  • Reduces the time that the system is vulnerable because the standby database is re-created more quickly
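
For a physical standby configuration, the Flashback Database approach can be sketched as follows; the complete procedure, including restarting Redo Apply after the conversion, is described in Oracle Data Guard Concepts and Administration:

-- On the new primary database, determine the SCN at which it became primary
SELECT TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM V$DATABASE;

-- On the former primary database
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
FLASHBACK DATABASE TO SCN standby_became_primary_scn;
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;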

See Also:

The following topics in Oracle Data Guard Concepts and Administration:
  • Flashing Back a Failed Primary Database into a Physical Standby Database

  • Flashing Back a Failed Primary Database into a Logical Standby Database

This section includes the following topics:

4.3.2.1 Restoring a Standby Database After a Fast-Start Failover

Following a fast-start failover, the observer periodically attempts to reconnect to the original primary database. When the observer regains network access to the original primary database, it initiates a request for the Data Guard broker to automatically reinstate it as a standby database to the new primary. This quickly restores disaster protection and high availability for the new primary database.

You can enable fast-start failover from any site, including the observer site, in Enterprise Manager while connected to any database in the broker configuration. The broker simplifies switchovers and failovers by allowing you to invoke them with a single click in Oracle Enterprise Manager, as shown in Figure 4-16.

Figure 4-16 Fast-Start Failover and the Observer Are Successfully Enabled

Data Guard page showing that fast-start failover is enabled.
Description of "Figure 4-16 Fast-Start Failover and the Observer Are Successfully Enabled"

4.3.2.2 Reinstating a Standby Database Using Enterprise Manager After a Failover

You can also use Enterprise Manager to reinstate the old primary database as the new standby database. Figure 4-17 shows an example of the warning message that appears in Enterprise Manager when a reinstatement is needed.

Figure 4-17 Reinstating the Former Primary Database After a Fast-Start Failover

4.3.3 Restoring ASM Disk Groups after a Failure

Follow the steps in Section 4.2.6.3, "Data Area Disk Group Failure" or Section 4.2.6.4, "Flash Recovery Area Disk Group Failure".

4.3.4 Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage

After performing the planned maintenance on the secondary site, the standby database and log apply services must be restarted, and then Data Guard will automatically catch up. You can leverage Enterprise Manager and Data Guard broker to monitor the Data Guard state.

The following steps are required to restore full fault tolerance after planned downtime on a secondary site or clusterwide outage:

Note:

The following steps can be accomplished manually (as described below) or automatically using Enterprise Manager.
  1. Start the standby database

    You might have to restore the standby database from local backups, local tape backups, or from the primary site backups if the data in the secondary site has been damaged. Alternatively, re-create the standby database from the new production database by following the steps for creating a standby database in Oracle Data Guard Concepts and Administration.

    After the standby database has been reestablished, start the standby database.

    Table 4-15 SQL Statements for Starting Physical and Logical Standby Databases

    Type of Standby Database SQL Statement

    Physical

    STARTUP MOUNT;

    Logical

    STARTUP;


  2. Start Redo Apply (physical standby) or SQL Apply (logical standby):

    Table 4-16 SQL Statements to Start Redo Apply and SQL Apply

    Type of Standby Database SQL Statement

    Physical

    RECOVER MANAGED STANDBY DATABASE DISCONNECT;

    Logical

    ALTER DATABASE START LOGICAL STANDBY APPLY;


  3. Verify redo transport services on production database

    You might have to reenable the production database remote archive destination. Query the V$ARCHIVE_DEST_STATUS view first to see the current state of the archive destinations:

    SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL 
          FROM V$ARCHIVE_DEST_STATUS;
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE; 
    ALTER SYSTEM ARCHIVE LOG CURRENT;
    
    

    Verify redo transport services between the production and standby databases by checking for errors. Query the V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS views:

    SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR 
        FROM V$ARCHIVE_DEST; 
    SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';
    
    
  4. Verify that recovery is progressing on standby database

    • For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied the redo from the archived redo log files:

      SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
      SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS
          FROM V$MANAGED_STANDBY;
      
      
    • For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the redo from the archived redo logs:

      SELECT THREAD#, SEQUENCE# SEQ# 
          FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG 
          WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# 
          ORDER BY NEXT_CHANGE#;
      
      
  5. Restore production database protection mode

    If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection depending on your business requirements.

    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
    
    

4.3.5 Restoring Fault Tolerance After a Standby Database Data Failure

Following unplanned downtime on the standby database that requires a full or partial datafile restoration (such as data or media failure), full fault tolerance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible.

To repair data corruption and data failures on a logical standby database, you need a backup of the logical standby database and not a backup from the primary database. Otherwise, you must reinstantiate or re-create the objects affected by the corruption. To repair data corruption or data failures on the standby database, you can leverage the following repair solutions:

After the standby database is repaired, if you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection or maximum availability, depending on your business requirements:

ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];

4.3.6 Restoring Fault Tolerance After the Production Database Was Opened Resetlogs

If the production database is activated because it was flashed back to correct a logical error or because it was restored and recovered to a point in time, then the corresponding standby database might require additional maintenance. No additional work is required if the production database completed recovery without a RESETLOGS operation.

After opening the production database with the RESETLOGS option, execute the queries shown in Table 4-17.

Table 4-17 Queries to Determine RESETLOGS SCN and Current SCN After OPEN RESETLOGS

Database Query

Production

SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) FROM V$DATABASE;

Physical standby

SELECT TO_CHAR(CURRENT_SCN) FROM V$DATABASE;

Logical standby

SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;


Table 4-18 shows the actions you take to restore fault tolerance if the standby database is behind the primary database's resetlogs SCN.

Table 4-18 SCN on Standby Database is Behind Resetlogs SCN on the Production Database

Database Action

Physical standby

  1. Ensure that the standby database has received a new archived redo log file from the production database.

    See Also: "Verify redo transport services on production database"

  2. Restart Redo Apply.

Logical standby

Ensure that the standby database has received a new archived redo log file from the production database.

See Also: "Verify redo transport services on production database"


Table 4-19 shows the actions you take to restore fault tolerance if the standby database is ahead of the primary database's resetlogs SCN.

Table 4-19 SCN on the Standby is Ahead of Resetlogs SCN on the Production Database

Database Action

Physical standby

  1. Ensure that the standby database has received a new archived redo log file from the production database.

    See Also: "Verify redo transport services on production database"

  2. Issue the SHUTDOWN IMMEDIATE statement, if necessary.

  3. Issue the STARTUP MOUNT statement.

  4. Issue the FLASHBACK DATABASE TO SCN flashback_scn statement, where flashback_scn is the SCN returned by the production database query in Table 4-17 (2 less than the RESETLOGS_CHANGE#).

  5. Restart Redo Apply with or without real-time apply:

    With real-time apply:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
    
    

    Without real-time apply:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
    

Logical standby

  1. Retrieve the production database flashback time or SCN. The flashback time or SCN must be extracted from the production database alert log. This assumes that the clocks on the database machines are synchronized; otherwise, the flashback time must be adjusted.

  2. Stop SQL Apply:

    ALTER DATABASE STOP LOGICAL STANDBY APPLY;
    SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;
    
  3. Issue the following SQL statements to flash back the logical standby database to the same time that was used to flash back the primary database:

    SHUTDOWN;
    STARTUP MOUNT EXCLUSIVE;
    FLASHBACK DATABASE TO TIMESTAMP time_of_primary_database_flashback;
    ALTER DATABASE OPEN READ ONLY;
    SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;
    
    

    The last SQL statement queries the APPLIED_SCN column of the DBA_LOGSTDBY_PROGRESS view. The results of this query should confirm that SQL Apply has applied less than or up to the APPLIED_SCN obtained in step 2. If not, you must flash back the database further.

  4. Open the logical standby database with resetlogs:

    SHUTDOWN;
    STARTUP MOUNT EXCLUSIVE;
    ALTER DATABASE OPEN RESETLOGS;
    
  5. Archive the current log on the primary database:

    ALTER SYSTEM ARCHIVE LOG CURRENT;
    
  6. Start SQL Apply:

    ALTER DATABASE START LOGICAL STANDBY APPLY;
    

4.3.7 Restoring Fault Tolerance After Dual Failures

If a dual failure affecting both the standby and production databases occurs, then you must re-create the production database first. Because the sites are identical, the production database can be created wherever the most recent backup resides.

Table 4-20 summarizes the recovery strategy depending on the type of backups that are available.

Table 4-20 Re-Creating the Production and Standby Databases

Available Backups Re-Creating the Production Database

Local backup on production and standby databases

Restore backup from the production database. Recover and activate the database as the new production database.

Local backup only on standby database. Tape backups on standby database.

Restore the local standby backup to the standby database. Recover and activate the database as the new production database.

Tape backups only

Restore tape backups locally. Recover the database and activate it as the new production database.


See Also:

After the production database is re-created, follow the steps for creating a new standby database that are described in Oracle Data Guard Concepts and Administration.

4.4 Eliminating or Reducing Downtime for Scheduled Outages

This section describes best practices for eliminating or reducing downtime due to scheduled outages and contains the following topics:

4.4.1 Storage Maintenance

Use the procedures in the following sections when you add storage to the system. These procedures assume that you are adding storage to an ASM disk group.

4.4.1.1 Migrating to ASM Storage

If you have an existing Oracle database that stores database files on a file system or on raw devices, you can migrate some or all of these database files to ASM. The DBMS_FILE_TRANSFER package can be used for this purpose.
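
For example, the following is a minimal sketch of moving a single datafile into ASM with the DBMS_FILE_TRANSFER.COPY_FILE procedure. The directory paths, the +DATA disk group, and the USERS tablespace and file names are assumptions for illustration, and the ASM destination directory is assumed to already exist in the disk group; RMAN can also be used for this migration.

    -- Hypothetical names: the source path, the +DATA disk group, and users01.dbf are assumptions.
    CREATE DIRECTORY src_dir AS '/u01/oradata/orcl';
    CREATE DIRECTORY dst_dir AS '+DATA/orcl/datafile';
    
    -- Take the tablespace offline so its datafile can be copied and renamed.
    ALTER TABLESPACE users OFFLINE;
    
    BEGIN
      DBMS_FILE_TRANSFER.COPY_FILE(
        source_directory_object      => 'SRC_DIR',
        source_file_name             => 'users01.dbf',
        destination_directory_object => 'DST_DIR',
        destination_file_name        => 'users01.dbf');
    END;
    /
    
    -- Point the control file at the copy inside ASM, then bring the tablespace back online.
    ALTER DATABASE RENAME FILE '/u01/oradata/orcl/users01.dbf'
          TO '+DATA/orcl/datafile/users01.dbf';
    
    ALTER TABLESPACE users ONLINE;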


4.4.1.2 Adding and Removing Storage

Disks can be added to and removed from ASM with no downtime. When disks are added or removed, ASM automatically starts a rebalance operation to evenly spread the disk group contents over all drives in the disk group. The best practices for adding or removing storage include:

  • Investigate methods of adding storage to, and removing storage from, the host operating system with no downtime.

  • Use a single ALTER DISKGROUP command when adding or removing multiple disk drives.

    For example, if the storage maintenance is to add new drives and remove existing drives, use a single ALTER DISKGROUP command with the ADD DISK clause to add the new drives and the DROP DISK clause to remove the existing drives:

    ALTER DISKGROUP data
           DROP DISK diska5
           ADD FAILGROUP failgrp1 DISK '/devices/diska9' NAME diska9;
    
    
  • When dropping disks from a disk group, specify the WAIT option in the REBALANCE clause so the ALTER DISKGROUP statement does not return until the contents of the drives being dropped have been moved to other drives. Once the statement completes, the drives can be safely removed from the system. For example:

    ALTER DISKGROUP data
    DROP DISK diska5
    ADD FAILGROUP failgrp1 DISK '/devices/diska9' NAME diska9
    REBALANCE WAIT;
    
    
  • When dropping disks in a normal or high redundancy disk group, ensure there is enough free disk space in the disk group to reconstruct full redundancy.

  • Monitor the progress of rebalance operations using Enterprise Manager or by querying V$ASM_OPERATION.

  • For long-running rebalance operations that occur during periods of low database activity, increase the rebalance power limit to reduce the rebalance time, as shown in the example following this list.
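
For example, the following sketch monitors a rebalance operation and raises its power limit; the DATA disk group name and the power value 8 are assumptions for illustration.

    -- Check the state and estimated completion time of any rebalance in progress.
    SELECT GROUP_NUMBER, OPERATION, STATE, POWER, SOFAR, EST_WORK, EST_MINUTES
        FROM V$ASM_OPERATION;
    
    -- Raise the rebalance power limit for the disk group to shorten the rebalance time.
    ALTER DISKGROUP data REBALANCE POWER 8;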

4.4.2 RAC Database Patches

With RAC, you can apply certain database patches to one node or instance at a time, which enables continual application and database availability. "One-off" patches or interim patches to database software are usually applied to implement known fixes for software problems an installation has encountered or to apply diagnostic patches to gather information regarding a problem. Such patch application is often carried out during a scheduled maintenance outage.

Oracle now provides the capability to do rolling patch upgrades with Real Application Clusters with little or no database downtime. The tool used to achieve this is the opatch command-line utility.

The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched must be brought down. The other instances can continue to remain available. This means that the effect on the application downtime required for such scheduled outages is further minimized. Oracle's opatch utility enables the user to apply the patch successively to the different instances of the RAC installation.

Rolling upgrade is available only for patches that have been certified by Oracle to be eligible for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:

  • Patches that do not affect the contents of the database such as the data dictionary

  • Patches not related to RAC internode communication

  • Patches related to client-side tools such as SQL*Plus, Oracle utilities, development libraries, and Oracle Net

  • Patches that do not change shared database resources such as datafile headers, control files, and common header definitions of kernel modules

Rolling upgrade of patches is currently available for one-off patches only. It is not available for patch sets.

Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared volumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle Database software.

4.4.2.1 Best Practices To Minimize Downtime

Use the following recommended practices for all database patch upgrades:

  • Always confirm with Oracle Support Services that the patch is valid for your problem and for your deployment environment.

  • Have a plan for applying the patch as well as a plan for backing out the patch.

  • Apply the patch to your test environment first and verify that it fixes the problem.

  • When you plan the elapsed time for applying the patch, include time for starting up and shutting down the other tiers of your technology stack if necessary.

  • If the patch is not a candidate for RAC rolling upgrade and you can incur the downtime for applying the patch, go to Section 4.4.3, "Database Upgrades" to assess whether or not other solutions are feasible.

The following are additional recommended practices for RAC rolling upgrades.

  • If multiple instances share an Oracle home, then all of them will be affected by application of a patch. The DBA should verify that this will not cause unintentional side effects. Also, all such instances on a node must be shut down during the patch application. Scheduled outage planning should take this into account. As a best practice, only similar applications should share an Oracle home on a node. This provides greater flexibility for patching.

  • The Oracle inventory on each node is a repository of the Oracle Database software installed on the node. The inventory is node-specific. It is shared by all Oracle software installed on the node. It is similar across nodes only if all nodes are exactly the same in terms of the Oracle Database software deployed, the deployment configuration, and patch levels. Because the Oracle inventory greatly aids the patch application and patch management process, it is recommended that its integrity be maintained. Oracle inventory should be backed up after each patch installation to any Oracle software on a specific node. This applies to the Oracle inventory on each node of the cluster.

  • Use the Oracle Universal Installer to install all Oracle database software. This creates the relevant repository entries in the Oracle inventory on each node of the cluster. Also, use the Oracle Universal Installer to add nodes to an existing RAC cluster.

    However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach option of the opatch utility. Node information can be also added with this option.

  • The nature of the rolling patch upgrade enables it to be applied to only some nodes of the RAC cluster. So an instance can be operating with the patch applied, while another instance is operating without the patch. This is not possible for nonrolling patch upgrades. Apply nonrolling patch upgrades to all instances before the RAC deployment is activated. A mixed environment is useful if a patch must be tested before deploying it to all the instances. Applying the patch with the -local option is the recommended way to do this.

    In the interest of keeping all instances of the RAC cluster at the same patch level, it is strongly recommended that after a patch has been validated, it should be applied to all nodes of the RAC installation. When instances of a RAC cluster have similar patch software, services can be migrated among instances without running into the problem a patch might have fixed.

  • All patches (including those applied by rolling upgrades) should be maintained online and not removed once they have been applied. This is useful if a patch must be rolled back or applied again.

    The patches should be stored in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.

  • Rolling patch upgrades, just like any other patch upgrade, should be done when no other patch upgrade or Oracle installation is being done on the node. Application of multiple patches is a sequential process. The scheduled outage should be planned accordingly.

  • If multiple patches have to be applied and they must be applied at the same time, and if only some of these patches are eligible for rolling upgrade, then apply all of them in a nonrolling manner. This reduces the overall time required to get through the patching process.

  • For patches that are not eligible for rolling upgrade, the next best option for RAC deployments is the minimize_downtime option of the apply command.

  • Perform the rolling upgrade when system usage is low. This ensures minimal disruption of service for the end user.

See Also:

Oracle Universal Installer and OPatch User's Guide for more information on the opatch utility

4.4.3 Database Upgrades

The following Oracle features are available to perform database upgrades:

  • Database Upgrade Assistant (DBUA)

  • Data Guard SQL Apply (logical standby database)

  • Oracle Streams

  • Transportable tablespaces

The method you choose to perform database upgrades can vary depending on the following considerations:

  • Downtime required to complete the upgrade

  • Setup time and effort required prior to the downtime

  • Temporary additional resources necessary (for example, disk space or CPU)

  • Complexity of the steps needed to complete the upgrade

Table 4-21 lists the methods that can be used for platform migrations and database upgrades, and recommends what method to use.

Table 4-21 Platform Migration and Database Upgrade Options

Upgrade Method Use This Method When...

Database Upgrade Assistant

Recommended method when maintenance window is sufficient

Data Guard SQL Apply (Logical Standby)

DBUA will not finish within the maintenance window and the database is not a candidate for RAC rolling patch upgrade

Oracle Streams

Already a Streams implementation or Data Guard SQL Apply rolling upgrade does not support database versions in use

Transportable Tablespaces

Database is using data types unsupported by Data Guard SQL Apply or Streams


4.4.3.1 Database Upgrade Assistant

Database Upgrade Assistant (DBUA) is used to upgrade a database in place from an earlier software version.

When deciding if DBUA is the proper tool to use when performing a database upgrade with minimal downtime, consider the following:

  • DBUA upgrades the database dictionary and all components (for example: Java, XDB, Streams, and so on) that have been installed while the database is unavailable for normal user activity.

  • Downtime required for a database upgrade when using DBUA is determined by the time needed to:

    • Upgrade all database dictionary objects to the new version

    • Recompile all PL/SQL

    • Reconnect the clients to the upgraded database

Use DBUA for a database upgrade when the time to perform the upgrade with this method fits within the maintenance window.

See Also:

Oracle Database Upgrade Guide for more information on DBUA and upgrading your Oracle Database software

4.4.3.2 Data Guard SQL Apply (Logical Standby)

Data Guard SQL Apply can be used to upgrade a database with minimal downtime by means of a process called rolling upgrade. Data Guard currently supports homogeneous environments where the primary and standby databases are running on the same platform.

Note the following points when deciding if Data Guard SQL Apply is the appropriate method for minimizing downtime during a database upgrade:

  • The Data Guard SQL Apply infrastructure uses Oracle Streams and therefore inherits Oracle Streams data type restrictions on user-defined types, such as object types, REF values, varrays, and nested tables.

  • Support for rolling upgrade starts with Oracle Database 10g release 1 (10.1.0.3). The supported versions, for both the source database and the target database, are more restrictive than Oracle Streams.

  • Downtime required for a database upgrade (rolling upgrade) when using Data Guard SQL Apply is determined by the time needed to:

    • Perform a Data Guard switchover

    • Reconnect the clients to the new database

Use Data Guard SQL Apply for rolling database upgrade when DBUA will not complete the upgrade within the maintenance window and the application does not use user-defined types.
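
To help make this assessment, you can query the DBA_LOGSTDBY_UNSUPPORTED view on the production database to list tables that contain data types SQL Apply does not support. A simple sketch:

    -- Tables with columns that SQL Apply cannot maintain on a logical standby database
    SELECT DISTINCT OWNER, TABLE_NAME
        FROM DBA_LOGSTDBY_UNSUPPORTED
        ORDER BY OWNER, TABLE_NAME;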

See Also:

Oracle Data Guard Concepts and Administration for more information on SQL Apply

4.4.3.3 Oracle Streams

Oracle Streams can be used to upgrade the database software from one version to another with minimal downtime. This is because Oracle Streams supports a configuration in which the primary database and its replica are running on different database versions.

Note the following points when deciding if Oracle Streams is an appropriate method for a database upgrade:

  • Oracle Streams does not support user-defined types, such as object types, REF values, varrays, and nested tables. However, shadow tables can be created on the primary database that do not have the unsupported data types and the shadow tables can be replicated.

  • The source database must be running Oracle9i release 2 or higher.

  • The administrative effort required to set up and maintain the Oracle Streams environment is more than if using Data Guard SQL Apply for a database upgrade.

  • There might be a performance impact on the source database while the source and target databases run in parallel as changes are propagated to the target database.

  • Downtime required for a database upgrade when using Oracle Streams is determined by the time needed to apply the remaining transactions in the queue and to reconnect the clients to the new database.

Consider using Oracle Streams if the application already uses Streams or when clients do not use user-defined types and the extra administrative effort is worth the opportunity for a very small outage time.
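
Similarly, you can identify tables that Oracle Streams cannot replicate by querying the DBA_STREAMS_UNSUPPORTED view on the source database. A simple sketch:

    -- Tables that a Streams capture process cannot replicate, and the reason why
    SELECT OWNER, TABLE_NAME, REASON
        FROM DBA_STREAMS_UNSUPPORTED
        ORDER BY OWNER, TABLE_NAME;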

See Also:

Oracle Streams Concepts and Administration for more information on database upgrading using Oracle Streams

4.4.3.4 Transportable Tablespaces

Transportable tablespaces can be used to accomplish a database upgrade by transporting all user datafiles into a pre-created, prepared target database.

Note the following points when deciding if transportable tablespaces is the appropriate method for performing a database upgrade:

  • The SYSTEM tablespace cannot be moved with transportable tablespaces. The target database SYSTEM tablespace contents, including user definitions and objects necessary for the application, must be built manually. Use Data Pump to move the contents of the SYSTEM tablespace.

  • Downtime required for a database upgrade when using transportable tablespaces is determined by the time needed to:

    • Place the source database tablespaces in read-only mode

    • Perform a network import of the transportable metadata

    • If the target database is on a remote system, then include the time to transfer all datafiles from the source system to the target system

      The time it takes to transfer the datafiles can be reduced significantly by using a storage infrastructure that can make the datafiles available to the target system without the need to physically move the files, or by using a physical standby database.

Use transportable tablespaces to perform a database upgrade when DBUA will not complete within the maintenance window, and Oracle Streams or Data Guard SQL Apply cannot be used due to data type restrictions.
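
For example, before the maintenance window you can verify that the tablespaces to be transported form a self-contained set, and during the window place them in read-only mode before exporting the transportable metadata. The USERS and SALES tablespace names below are assumptions for illustration.

    -- Check that the tablespace set is self-contained (hypothetical tablespace names).
    EXECUTE DBMS_TTS.TRANSPORT_SET_CHECK('USERS,SALES', TRUE);
    SELECT * FROM TRANSPORT_SET_VIOLATIONS;
    
    -- During the maintenance window, make the tablespaces read-only before the metadata export.
    ALTER TABLESPACE users READ ONLY;
    ALTER TABLESPACE sales READ ONLY;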

See Also:

Oracle Database Administrator's Guide for more information on transportable tablespaces

4.4.4 Database Platform or Location Migration

The following Oracle features are available to perform platform or location migrations:

  • Transportable Database

  • Oracle Streams

  • Oracle Data Pump

  • Transportable tablespaces

  • Data Guard Redo Apply (physical standby database)

The method you choose to perform these database maintenance tasks depends on the following considerations:

  • Downtime required to complete the maintenance operations

  • Setup time and effort required prior to the downtime

  • Amount of temporary additional resources necessary, such as disk space or CPU

  • Complexity of the steps needed to complete maintenance operations

The following table summarizes the methods that can be used for platform migrations and database upgrades, and recommends which method to use for each operation.

Table 4-22 Platform and Location Migration Options

Operation Method When to Use

Platform migration to same endian platform

Transportable Database

Recommended method

Oracle Streams

Transportable Database will not finish within the maintenance window

Platform migration to different endian platform

Oracle Data Pump

Recommended method

Oracle Streams

Data Pump will not finish within the maintenance window

Transportable Tablespaces

Database is using data types unsupported by Oracle Streams

Location Migration

Data Guard Redo Apply (Physical Standby Database)

Recommended method


Note:

Query the V$TRANSPORTABLE_PLATFORM view to determine the endian format of all platforms. Query the V$DATABASE view to determine the platform ID and platform name of the current system.
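
For example, the following queries list the endian format of each supported platform and show the platform of the current database:

    SELECT PLATFORM_ID, PLATFORM_NAME, ENDIAN_FORMAT
        FROM V$TRANSPORTABLE_PLATFORM
        ORDER BY PLATFORM_NAME;
    
    SELECT PLATFORM_ID, PLATFORM_NAME FROM V$DATABASE;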

4.4.4.1 Transportable Database

Transportable Database is a new feature in Oracle Database 10g Release 2 (10.2) that is the recommended method for migrating an entire database to another platform that has the same endian format.

Note the following points when deciding if Transportable Database is the appropriate method to use when moving a database to another platform:

  • Transportable Database supports moving databases between platforms that are of the same endian format

  • Downtime required for a platform migration when using Transportable Database is determined by the time needed to:

    • Place the source database in read-only mode

    • Convert all data files to the new platform format

    • Transfer all data files from the source system to the target system

      You can significantly reduce this time by using a storage infrastructure that can make the data files available to the target system without the need to physically move the files.

Transportable Database is the recommended method for migrating an entire database to another platform that has the same endian format because it is the simplest approach.

See Also:

Oracle Database Backup and Recovery Advanced User's Guide for more information on cross-platform use of Transportable Database

4.4.4.2 Oracle Streams

Oracle Streams can be used to move a database from one platform to another with minimal downtime. This is because Oracle Streams supports a configuration in which the primary database and its replica are running on different platforms.

Note the following points when deciding if Oracle Streams is an appropriate method for a platform migration:

  • Oracle Streams does not support user-defined types, such as object types, REF values, varrays, and nested tables.

  • To perform a platform migration using Oracle Streams, the source database must be running Oracle9i release 2 or higher.

  • The administrative effort required to set up and maintain the Oracle Streams environment is more than if using Data Guard SQL Apply for a database upgrade.

  • There might be a performance impact on the source database while the source and target databases run in parallel as changes are propagated to the target database.

  • Downtime required for a platform migration when using Oracle Streams is determined by the time needed to apply the remaining transactions in the queue and to reconnect clients to the new database.

Consider using Oracle Streams when the application does not use user-defined types and the extra administrative effort is worth the opportunity for a very small outage time.

See Also:

Oracle Streams Concepts and Administration for more information on database upgrading using Oracle Streams

4.4.4.3 Oracle Data Pump

Oracle Data Pump technology enables very high-speed movement of data and metadata from one database to another, across different platforms and different database versions.

Note the following points when deciding if Data Pump is an appropriate method for a platform migration:

  • Oracle Data Pump is available only on Oracle Database 10g Release 1 (10.1) and later releases.

  • Downtime required for a platform migration when using Data Pump is determined by the time needed to perform a full database network import. A network import uses a database link between the target system and the remote source system to retrieve data and write it directly into the target system, without the use of dump files.

Use Data Pump when moving a database to a platform with different endian format when the network import time is acceptable.
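
The following PL/SQL sketch illustrates a full network import through the DBMS_DATAPUMP API; the SOURCE_LINK database link name is an assumption, and the same operation can be run from the impdp command-line client instead.

    -- Minimal sketch: SOURCE_LINK is an assumed database link from the target to the source database.
    DECLARE
      h         NUMBER;
      job_state VARCHAR2(30);
    BEGIN
      h := DBMS_DATAPUMP.OPEN(
             operation   => 'IMPORT',
             job_mode    => 'FULL',
             remote_link => 'SOURCE_LINK');
      DBMS_DATAPUMP.START_JOB(h);
      DBMS_DATAPUMP.WAIT_FOR_JOB(h, job_state);
    END;
    /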


4.4.4.4 Transportable Tablespaces

Transportable tablespaces can be used to accomplish a platform migration by transporting all user datafiles into a pre-created, prepared target database.

Note the following points when deciding if transportable tablespaces is the appropriate method for performing a platform migration:

  • The SYSTEM tablespace cannot be moved with transportable tablespaces. The target database SYSTEM tablespace contents, including user definitions and objects necessary for the clients, must be built manually. Use Data Pump to move the necessary contents of the SYSTEM tablespace.

  • Downtime required for a platform migration or database upgrade when using transportable tablespaces is determined by the time needed to:

    • Place the source database tablespaces in read-only mode

    • Perform a network import of the transportable metadata

    • Transfer all datafiles from the source system to the target system

      This time can be reduced significantly by using a storage infrastructure that can make the datafiles available to the target system without the need to physically move the files

    • Convert all datafiles to the new platform format using RMAN

Use transportable tablespaces to migrate to a new platform when Oracle Data Pump will not complete within the maintenance window, and Oracle Streams or Data Guard SQL Apply cannot be used due to data type restrictions.

See Also:

Oracle Database Administrator's Guide for more information on transportable tablespaces

4.4.4.5 Data Guard Redo Apply (Physical Standby Database)

Data Guard Redo Apply can be used to change the location of a database to a remote site with minimal downtime by setting up a temporary standby database at a remote location and performing a switchover operation.

Downtime required for a location migration when using Data Guard Redo Apply is determined by the time required to perform a switchover operation.

See Also:

Oracle Data Guard Concepts and Administration for more information on Redo Apply and physical standby databases

4.4.5 Online Database and Application Upgrades

An Oracle database upgrade is the process of transforming an existing, prior release of an Oracle Database system into the current release, and it can be a very lengthy process. An application upgrade may include a database upgrade plus any required application code and schema changes. If a database upgrade with Data Guard is not applicable and zero or minimal downtime is required for the database or application upgrade, then configure Oracle Streams to perform the upgrade with little or no downtime. To do so, you use Oracle Streams to configure a single-source replication environment with the following databases:

  • Source Database: The original database that is being upgraded

  • Capture Database: The database where a capture process captures changes made to the source database during the upgrade

  • Destination Database: The copy of the source database where an apply process applies changes made to the source database during the upgrade process. The apply process can apply changes to the same or a different schema and object structure by using Streams transformations.

Specifically, you can use the following general steps to perform a database upgrade while the database is online:

  1. Create an empty destination database.

  2. Configure an Oracle Streams single-source replication environment where the original database is the source database and a copy of the database is the destination database for the changes made at the source.

  3. Perform the database upgrade on the destination database. During this time the original source database is available online.

  4. Use Oracle Streams to apply the changes made at the source database to the destination database.

  5. When the destination database has caught up with the changes made at the source database, take the source database offline and make the destination database available for applications and users.

If the schema or object structure is different at the destination database, then Streams transformations must be incorporated to convert the changes to the new structure.

Figure 4-18 provides an overview of this process.

Figure 4-18 Online Database Upgrade with Oracle Streams

Description of Figure 4-18 follows
Description of "Figure 4-18 Online Database Upgrade with Oracle Streams"

See Also:

Appendix C "Online Database Maintenance with Streams" in Oracle Streams Concepts and Administration

4.4.6 Database Object Reorganization

Many scheduled outages related to the data server involve some reorganization of the database objects. The database object reorganization must be accomplished with continued availability of the database. Oracle's online object reorganization capabilities have been available since Oracle8i. These capabilities enable object reorganization to be performed even while the underlying data is being modified.

Table 4-23 describes a few of the object reorganization capabilities available with Oracle Database 10g.

Table 4-23 Some Object Reorganization Capabilities

Object Type Example of Object Reorganization Solution Description of Solution

Table

DBMS_REDEFINITION PL/SQL package

A PL/SQL package that provides a mechanism to redefine tables online. This is Oracle's recommended best practice.

Index

Rebuild index

Rebuild an index that has previously been marked as unusable.

Tablespace

Rename tablespace

Enables an existing tablespace to be renamed without rebuilding the tablespace and its contents.


In highly available systems, it is occasionally necessary to redefine large tables that are constantly accessed to improve the performance of queries or DML. The Online Reorganization and Redefinition feature in Oracle Database 10g offers administrators unprecedented flexibility to modify table physical attributes and to transform both data and table structure while allowing users full access to the database. This capability improves data availability, query performance, response time, and disk space usage, all of which are important in a mission-critical environment, and it can make the application upgrade process easier, safer, and faster.

Oracle's recommended practice is to reorganize tables using the DBMS_REDEFINITION PL/SQL package, because it provides a significant increase in availability compared to traditional methods of redefining tables that require tables to be taken offline. Whether you call DBMS_REDEFINITION manually at the command line or automatically through Oracle Enterprise Manager, the entire reorganization process occurs while users have full access to the table, thus ensuring system availability.

Figure 4-19 shows the Reorganize Objects Wizard in Oracle Enterprise Manager that you can use as an alternative to calling the DBMS_REDEFINITION package at the SQL*Plus command line. After you answer a few questions in the wizard, it automatically generates the script and performs the reorganization.

Figure 4-19 Database Object Reorganization Using Oracle Enterprise Manager

Description of Figure 4-19 follows
Description of "Figure 4-19 Database Object Reorganization Using Oracle Enterprise Manager"

Using the DBMS_REDEFINITION approach, an interim table is created that contains all the desired attributes. The reorganization begins by calling the procedure START_REDEF_TABLE, which is where the column mappings between the current and new version of the table are described. All the dependent objects such as triggers, constraints and indexes are automatically copied to the interim table using the procedure COPY_TABLE_DEPENDENTS. During the reorganization, any changes made to the original table are added to the interim table by calling the procedure SYNC_INTERIM_TABLE. The reorganization is complete when the procedure FINISH_REDEF_TABLE is called and the interim table is renamed as the main table.
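
The following sketch shows this sequence of calls for a hypothetical HR.EMPLOYEES table that is redefined through an interim table named EMPLOYEES_INT; the schema, table, and interim table names are assumptions for illustration.

    -- Verify that the table can be redefined online using its primary key.
    EXECUTE DBMS_REDEFINITION.CAN_REDEF_TABLE('HR', 'EMPLOYEES', DBMS_REDEFINITION.CONS_USE_PK);
    
    -- After creating the interim table HR.EMPLOYEES_INT with the desired attributes, start the redefinition.
    EXECUTE DBMS_REDEFINITION.START_REDEF_TABLE('HR', 'EMPLOYEES', 'EMPLOYEES_INT');
    
    -- Copy dependent objects such as indexes, triggers, constraints, and privileges to the interim table.
    DECLARE
      num_errors PLS_INTEGER;
    BEGIN
      DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS('HR', 'EMPLOYEES', 'EMPLOYEES_INT',
          DBMS_REDEFINITION.CONS_ORIG_PARAMS, TRUE, TRUE, TRUE, FALSE, num_errors);
    END;
    /
    
    -- Apply changes made to the original table during the redefinition, then complete it.
    EXECUTE DBMS_REDEFINITION.SYNC_INTERIM_TABLE('HR', 'EMPLOYEES', 'EMPLOYEES_INT');
    EXECUTE DBMS_REDEFINITION.FINISH_REDEF_TABLE('HR', 'EMPLOYEES', 'EMPLOYEES_INT');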

A tablespace can be renamed in Oracle Database 10g, similar to the ability to rename a column, table and datafile. Previously, the only way to change a tablespace name was to drop and re-create the tablespace, but this meant that the contents of the tablespace had to be dropped and rebuilt later. With the ability to rename a tablespace online, there is no interruption to the users.

ALTER TABLESPACE USERS RENAME TO new_tablespace_name;

Tablespace altered.

Additionally, consider the following when performing data reorganization:

  • Concurrent activity on the table during an online operation.

    During an online operation, Oracle recommends that you minimize activity on the base table. Database activity should affect less than ten percent of the table while the online operation is in progress. The database administrator can also use the Database Resource Manager to limit the impact of the data reorganization on users by allocating sufficient resources to them.

  • Oracle does not recommend running online operations at peak times or running a batch job that modifies large amounts of data during an online data reorganization.

    In fact, parallel DML, direct load and import/export cannot be performed during an online operation.

  • Rebuilding index online vs. dropping an index and then re-creating a new index online.

    Rebuilding an index online requires additional disk space for the new index during the operation, whereas dropping an index and then re-creating an index does not require additional disk space.

  • Coalescing an index online vs. rebuilding an index online.

    Online index coalesce is an in-place data reorganization operation, hence does not require additional disk space like index rebuild does. Index rebuild requires temporary disk space equal to the size of the index plus sort space during the operation. Index coalesce does not reduce the height of the B-tree. It only tries to reduce the number of leaf blocks. The coalesce operation does not free up space for users but does improve index scan performance.

    If a user needs to move an index to a new tablespace, use online index rebuild (see the example following this list).

  • Local and global indexes.

    Oracle Database 10g supports both local and global partitioned indexes with online operations. When tables and indexes are partitioned, this allows administrators to perform maintenance on these objects, one partition at a time, while the other partitions remain online.
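
For example, the following statements rebuild an index online into a different tablespace and coalesce another index in place; the index and tablespace names are assumptions for illustration.

    -- Rebuild the index online and move it to a new tablespace (hypothetical names).
    ALTER INDEX hr.emp_name_ix REBUILD ONLINE TABLESPACE indx2;
    
    -- Coalesce the index in place to defragment its leaf blocks without extra disk space.
    ALTER INDEX hr.emp_dept_ix COALESCE;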


4.4.7 System Maintenance

For a scheduled outage that requires an instance, node, or other component to be isolated, RAC provides the ability to relocate, disable, and enable services. Relocation migrates a service to another instance. Services and instances can be selectively disabled while repair, change, or upgrade is performed on hardware or system software, and re-enabled after the maintenance is complete. This ensures that the service or instance is not started during the maintenance outage: the service and instance are disabled at the beginning of the planned outage and enabled again at the end of it.

See Also:

Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for information about administering services with Enterprise Manager, DBCA, PL/SQL, and SRVCTL

When using RAC, Oracle Clusterware daemons start automatically at the time the node is started. When performing maintenance that requires one or more system reboots or requires that all non-operating system processes be shut down, use the crsctl command to stop and disable the startup of the Oracle Clusterware daemons. Once maintenance is complete, enable and start the Oracle Clusterware daemons with crsctl commands.