Last week I was testing our backups and got this error: ORA-15204: database version 18.104.22.168.0 is incompatible with diskgroup DATA_EDT1. This surprised me because all of our databases and Oracle homes on this Exadata are at version 22.214.171.124. I was testing full restore/recovery without a control file or an spfile. The first step was to restore spfile and controlfile from autobackup to their ASM locations.
I queried V$ASM_DISKGROUP in ASM and saw that all of our diskgroups have compatibility set to 126.96.36.199.0. In order to start RMAN restore I had created a minimal pfile. I didn’t bother setting the compatible parameter. I edited the pfile and set compatible=”188.8.131.52.0″ – that fixed it and the restore/recovery worked fine.
We have been having performance problems on our test Exadata for several months. I have opened five Oracle service requests for multiple symptoms. While the cpu utilization was fairly low, Oracle background processes would hang, the OEM 12c agent would hang, backup jobs would hang, we would experience slow communications between RAC nodes, half of the cpus would be in 100% IO wait state, the system load average would exceed 6000, etc. We noticed one of the NFS mounts was unreachable and it happened to be the mount point where we keep our DBA scripts. Processes would be in IO wait state “D” and we noticed several were the DBA scripts running from NFS. We could resolve the problem by killing these scripts. So I moved the backup scripts to local drives and eliminated some issues.
The problem kept returning though so I kept opening more SRs with no solution. Yesterday I escalated and had one of my SRs reopened. I was finally able to get to the correct person in Oracle support who gave me two things to try. The first was to add the “noac” option for the NFS mounts. The idea was that this would resolve issues where synchronous writes are induced. Since we are backing up to NFS using RMAN and tar this seemed a good bet. And it did help a lot. But we were still able to bring the problem back by tarring to NFS.
These NFS mounts are across Infiniband to an Exalogic ZFS storage system. The second fix was based on the fact that the new OEL kernel 184.108.40.206.1 update has memory management changes that may result in high TCP/IP traffic causing memory starvation for contiguous memory free space. See Knowledge base article 1546861.1 System hung with large numbers of page allocation failures with “order:5” : <Future Exadata releases will be changing the MTU size on the InfiniBand Interfaces to 7000 (down from 65520) for new installations, so the 7000 MTU for Exadata environments is known to be appropriate> So I changed the Infiniband MTU from 65520 to 7000 and restarted the network service. That finally fixed the issue.
Today we got a request to refresh a development database from a copy of production. I decided to use the canned clone process that ships with OEM 12c instead of the usual scripts or manual processes. I was pleasantly surprised with the results. There was a failure related to an NFS problem and I was able to restart the process easily.
- Minimizes the amount of scratch disk space by backing up and restoring one datafile at a time.
- No scripting or manual process required.
- Uses standard RMAN under the covers.
- Works with older versions (This one was 10.2)
- Works with existing backups or directly from the source database (This database was not previously backed up and was not in archivelog mode)
- Can duplicate a live database if it is in archivelog mode and running at version 11
- Easy to restart after errors.
- Runs as a database job in the background, unattended.
- Works on the same machine or across the network.
- More time required – one datafile at a time, no parallelization. 763G ran for 6.5 hours.
- No email notification for errors or completion.
- Black box – you really need to know how to do the steps manually and this will not teach you.
- Needs scratch space on local disks to avoid NFS performance problems. (size of largest datafile)
This morning I noticed one of the managed standby databases was in a pending shutdown state. I restarted it and verified the logs were getting applied from production. However, I noticed the WAIT_FOR_GAP status on one of the old logs. Some of the logs had not been applied and were no longer on the production system. So I found out which logs were in the gap and used RMAN to recover them in production. Oracle automatically shipped them to the standby and applied them.
Discover a gap:
1* SELECT PROCESS, CLIENT_PROCESS, SEQUENCE#, STATUS FROM V$MANAGED_STANDBY
PROCESS CLIENT_P SEQUENCE# STATUS
--------- -------- ---------- ------------
ARCH ARCH 136626 CLOSING
ARCH ARCH 0 CONNECTED
ARCH ARCH 136627 CLOSING
ARCH ARCH 136628 CLOSING
RFS LGWR 136629 IDLE
RFS UNKNOWN 0 IDLE
RFS UNKNOWN 0 IDLE
MRP0 N/A 130107 WAIT_FOR_GAP
RFS UNKNOWN 0 IDLE
Find out the extent of the gap:
1 SELECT high.thread#, "LowGap#", "HighGap#"
4 SELECT thread#, MIN(sequence#)-1 "HighGap#"
7 SELECT a.thread#, a.sequence#
10 SELECT *
11 FROM v$archived_log
12 ) a,
14 SELECT thread#, MAX(next_change#)gap1
15 FROM v$log_history
16 GROUP BY thread#
17 ) b
18 WHERE a.thread# = b.thread#
19 AND a.next_change# > gap1
21 GROUP BY thread#
22 ) high,
24 SELECT thread#, MIN(sequence#) "LowGap#"
27 SELECT thread#, sequence#
28 FROM v$log_history, v$datafile
29 WHERE checkpoint_change# <= next_change#
30 AND checkpoint_change# >= first_change#
32 GROUP BY thread#
33 ) low
34* WHERE low.thread# = high.thread#
THREAD# LowGap# HighGap#
---------- ---------- ----------
1 130106 130153
Recover the logs in Production:
RMAN> restore archivelog from sequence 130106 until sequence 130153;
I had to repeat this sequence of commands several times because there were multiple gaps.
I was re-running an rman duplicate from active database today and got this error. I had assumed that the duplicate command would overwrite the files from the previous run. Instead it created new filenames and caused the ASM disk group to fill up.
Here’s the rman message on the target system:
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-03002: failure of Duplicate Db command at 11/16/2010 10:08:27
RMAN-03015: error occurred in stored script Memory Script
RMAN-03009: failure of backup command on ORA_DISK_6 channel at 11/16/2010 10:00:25
ORA-17628: Oracle error 19505 returned by remote Oracle server
Here’s the alert log entry on the auxiliary system:
ORA-19505: failed to identify file "+data"
ORA-17502: ksfdcre:4 Failed to create file +data
ORA-15041: diskgroup "DATA" space exhausted
The target (source) is on ASM using OMF (Oracle Managed Files). The auxiliary (destination) is also using ASM and OMF. I looked through the rman log and found a lot of set newname commands like this:
executing command: SET NEWNAME
No filename is specified so it makes up new filenames, which are different from the previous run. So that’s why I ran out of space.
I used asmcmd to remove all of the files and reclaim the space.
Easy fix, but initially confusing.
I got this error this morning when I attempted to flashback a standby database to an earlier time.
flashback database to timestamp
to_timestamp(’05/01/10 04:00:00′, ‘MM/DD/YY HH24:MI:SS’)
flashback database to timestamp
ERROR at line 1:
ORA-38794: Flashback target time not in current incarnation
This is Bug 6843554 – ORA-38794 during Flashback Database on Standby previously activated with Restore Point [ID 6843554.8]. It’s fixed in 11.2 but we are on 220.127.116.11. It’s caused by “orphaned incarnations” which you can see using this query:
select * from v$database_incarnation;
There is an easy work-around, flash back to the scn.
flashback database to scn 44062315534;
Also, flashback database does a recovery so it needs the archived redo log from before the scn. We had to restore that from tape because it had already been backed up and deleted by RMAN.