Getting ORA-15204 on full database restore

Last week I was testing our backups and got this error: ORA-15204: database version 11.2.0.0.0 is incompatible with diskgroup DATA_EDT1. This surprised me because all of our databases and Oracle homes on this Exadata are at version 11.2.0.3. I was testing a full restore/recovery without a control file or an spfile, so the first step was to restore the spfile and controlfile from autobackup to their ASM locations.
I queried V$ASM_DISKGROUP in ASM and saw that all of our diskgroups have compatibility set to 11.2.0.2.0. To start the RMAN restore I had created a minimal pfile and had not bothered setting the compatible parameter, so the instance defaulted to a lower compatibility (the 11.2.0.0.0 in the error message) than the diskgroup requires. I edited the pfile and set compatible='11.2.0.3.0', and that fixed it; the restore/recovery worked fine.
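For the record, the check and the fix looked roughly like this. The db_name, DBID and pfile path below are placeholders; the compatible line is the part that matters:

SQL> select name, compatibility, database_compatibility from v$asm_diskgroup;

# minimal pfile, e.g. /tmp/initEDT1.ora
db_name='EDT1'
compatible='11.2.0.3.0'

RMAN> startup nomount pfile='/tmp/initEDT1.ora';
RMAN> set dbid 1234567890;
RMAN> restore spfile from autobackup;
RMAN> restore controlfile from autobackup;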

High load average, high IO wait, slow process startup, slow RAC communications

We have been having performance problems on our test Exadata for several months, and I have opened five Oracle service requests for the various symptoms. While CPU utilization was fairly low, Oracle background processes would hang, the OEM 12c agent would hang, backup jobs would hang, communication between RAC nodes was slow, half of the CPUs would sit in 100% I/O wait, and the system load average would exceed 6000. We noticed one of the NFS mounts was unreachable, and it happened to be the mount point where we keep our DBA scripts. Processes were stuck in uninterruptible I/O wait (state "D"), and several of them were DBA scripts running from NFS. We could clear the problem by killing those scripts, so I moved the backup scripts to local drives and eliminated some of the issues.
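A quick way to spot the stuck processes is to list everything in uninterruptible sleep, something like this (the exact ps output fields are just a convenient choice):

$ ps -eo state,pid,wchan:32,cmd | awk '$1=="D"'
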
The problem kept returning, though, so I kept opening more SRs with no solution. Yesterday I escalated and had one of my SRs reopened, and I finally reached the right person in Oracle support, who gave me two things to try. The first was to add the "noac" option to the NFS mounts, the idea being that it would resolve issues where synchronous writes are induced. Since we back up to NFS with RMAN and tar, this seemed a good bet, and it did help a lot. But we could still bring the problem back by tarring to NFS.
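The change itself is just one extra mount option. In the sketch below the server, share and mount point are placeholders, and the other options are typical Oracle-over-NFS settings shown only for context; noac is the addition:

# /etc/fstab
zfs-ib:/export/backup  /backup  nfs  rw,bg,hard,nointr,rsize=131072,wsize=131072,tcp,vers=3,timeo=600,noac  0 0

$ umount /backup && mount /backup
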
These NFS mounts are across InfiniBand to an Exalogic ZFS storage system. The second fix was based on the fact that the new OEL kernel in the 11.2.3.2.1 update has memory management changes that can result in high TCP/IP traffic starving the system of contiguous free memory. See knowledge base article 1546861.1, "System hung with large numbers of page allocation failures with order:5", which notes: "Future Exadata releases will be changing the MTU size on the InfiniBand Interfaces to 7000 (down from 65520) for new installations, so the 7000 MTU for Exadata environments is known to be appropriate." So I changed the InfiniBand MTU from 65520 to 7000 and restarted the network service. That finally fixed the issue.
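On our nodes that came down to editing the bonded InfiniBand interface configuration and bouncing the network service (the interface name below is only an example; use whatever ib/bondib interface your nodes actually have):

# /etc/sysconfig/network-scripts/ifcfg-bondib0
MTU=7000

# service network restart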

Using OEM 12c canned clone process

Today we got a request to refresh a development database from a copy of production. I decided to use the canned clone process that ships with OEM 12c instead of the usual scripts or manual processes. I was pleasantly surprised with the results. There was a failure related to an NFS problem and I was able to restart the process easily.

Pro:

  • Minimizes the amount of scratch disk space by backing up and restoring one datafile at a time.
  • No scripting or manual process required.
  • Uses standard RMAN under the covers.
  • Works with older versions (this one was 10.2).
  • Works with existing backups or directly from the source database (this database had no previous backups and was not in archivelog mode).
  • Can duplicate a live database if it is in archivelog mode and running at version 11.
  • Easy to restart after errors.
  • Runs as a database job in the background, unattended.
  • Works on the same machine or across the network.

Con:

  • More time required: one datafile at a time with no parallelization. 763G ran for 6.5 hours.
  • No email notification for errors or completion.
  • Black box: you really need to know how to do the steps manually, and this will not teach you.
  • Needs scratch space on local disks, roughly the size of the largest datafile, to avoid NFS performance problems (the query after this list shows how to size it).
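
To size that scratch area up front, a quick check against the source database (standard views, nothing OEM-specific):

select max(bytes)/1024/1024/1024 as largest_gb,
       sum(bytes)/1024/1024/1024 as total_gb
from v$datafile;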

WAIT_FOR_GAP status on standby

This morning I noticed one of the managed standby databases was in a pending shutdown state. I restarted it and verified that logs were being applied from production again. However, I noticed a WAIT_FOR_GAP status on one of the old logs: some of the logs had never been applied and were no longer on the production system. So I found out which logs were in the gap and used RMAN to restore them in production; Oracle automatically shipped them to the standby and applied them.

Discover a gap:
 1* SELECT PROCESS, CLIENT_PROCESS, SEQUENCE#, STATUS FROM V$MANAGED_STANDBY 
PROCESS   CLIENT_P  SEQUENCE# STATUS
--------- -------- ---------- ------------
ARCH      ARCH         136626 CLOSING
ARCH      ARCH              0 CONNECTED
ARCH      ARCH         136627 CLOSING
ARCH      ARCH         136628 CLOSING
RFS       LGWR         136629 IDLE
RFS       UNKNOWN           0 IDLE
RFS       UNKNOWN           0 IDLE
MRP0      N/A          130107 WAIT_FOR_GAP
RFS       UNKNOWN           0 IDLE

Find out the extent of the gap:

  1  SELECT high.thread#, "LowGap#", "HighGap#"
  2  FROM
  3       (
  4       SELECT thread#, MIN(sequence#)-1 "HighGap#"
  5       FROM
  6       (
  7           SELECT a.thread#, a.sequence#
  8           FROM
  9           (
 10               SELECT *
 11               FROM v$archived_log
 12           ) a,
 13           (
 14               SELECT thread#, MAX(next_change#)gap1
 15               FROM v$log_history
 16               GROUP BY thread#
 17           ) b
 18           WHERE a.thread# = b.thread#
 19           AND a.next_change# > gap1
 20       )
 21       GROUP BY thread#
 22   ) high,
 23   (
 24       SELECT thread#, MIN(sequence#) "LowGap#"
 25       FROM
 26       (
 27           SELECT thread#, sequence#
 28           FROM v$log_history, v$datafile
 29           WHERE checkpoint_change# <= next_change#
 30           AND checkpoint_change# >= first_change#
 31       )
 32       GROUP BY thread#
 33   ) low
 34*  WHERE low.thread# = high.thread#
11:13:08 BIPROD1SYS>/
   THREAD#    LowGap#   HighGap#
---------- ---------- ----------
         1     130106     130153

Restore the missing logs in production:

RMAN>  restore archivelog from sequence 130106 until sequence 130153;

I had to repeat this sequence of commands several times because there were multiple gaps.
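A quicker check for any remaining gap on the standby is the V$ARCHIVE_GAP view; it only reports the gap that is currently blocking recovery, which is why several passes can be needed:

SQL> select thread#, low_sequence#, high_sequence# from v$archive_gap;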

Oracle error 19505 returned by remote Oracle server

I was re-running an RMAN duplicate from active database today and got this error. I had assumed that the duplicate command would overwrite the files from the previous run; instead it created new filenames and filled up the ASM disk group.

Here's the RMAN message on the target system:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 11/16/2010 10:08:27
RMAN-03015: error occurred in stored script Memory Script
RMAN-03009: failure of backup command on ORA_DISK_6 channel at 11/16/2010 10:00:25
ORA-17628: Oracle error 19505 returned by remote Oracle server

Here’s the alert log entry on the auxiliary system:

ORA-19505: failed to identify file "+data"
ORA-17502: ksfdcre:4 Failed to create file +data
ORA-15041: diskgroup "DATA" space exhausted

The target (source) is on ASM using OMF (Oracle Managed Files). The auxiliary (destination) is also using ASM and OMF. I looked through the RMAN log and found a lot of SET NEWNAME commands like this:

executing command: SET NEWNAME

No filename is specified, so OMF generates new filenames that differ from the previous run's, and that is why I ran out of space.

I used asmcmd to remove all of the files and reclaim the space.
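The cleanup was just asmcmd against the duplicate's OMF datafile directory. The diskgroup and database names below are placeholders, and asmcmd prompts for confirmation before a wildcard delete:

$ asmcmd
ASMCMD> cd +DATA/TESTDB/DATAFILE
ASMCMD> ls -l
ASMCMD> rm *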

Easy fix, but initially confusing.

ORA-38794: Flashback target time not in current incarnation

I got this error this morning when I attempted to flashback a standby database to an earlier time.

 flashback database to timestamp
   to_timestamp('05/01/10 04:00:00', 'MM/DD/YY HH24:MI:SS')
SQL> /
flashback database to timestamp
*
ERROR at line 1:
ORA-38794: Flashback target time not in current incarnation

This is Bug 6843554, "ORA-38794 during Flashback Database on Standby previously activated with Restore Point" [ID 6843554.8]. It is fixed in 11.2, but we are on 11.1.0.7. It is caused by "orphaned incarnations", which you can see with this query:
select * from v$database_incarnation;

There is an easy work-around: flash back to the SCN.
 flashback database to scn 44062315534;

Also, flashback database does a recovery, so it needs the archived redo log from before the target SCN. We had to restore that from tape because it had already been backed up and deleted by RMAN.
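Restoring it was an ordinary RMAN archivelog restore, something like the following, where the starting SCN is only an illustrative value safely before the flashback target:

RMAN> restore archivelog from scn 44062315000 until scn 44062315534;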