itspace

Oracle RAC Node Failure: File table overflow

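At around 9 a.m. on April 26, 2010, a customer's database suffered a single-node failure. On the back end, the database instance on node 1 (hisdb01) terminated abnormally, which in turn caused the node 1 host to reboot. On the front end, part of the business applications became unavailable. Because no host performance tracking script had been deployed, the initial diagnosis, based on the logs collected on site, was that the host had run short of resources (for example, file handles not being released), causing the Oracle instance to terminate abnormally. At around 9 a.m. on June 7, 2010, the single-node failure occurred again.
Backend log analysis
Review the logs from before and after the failure.
1. Operating system log
Quote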
Jun  7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow
Jun  7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow
Jun  7 09:08:18 hisdb01 cmclconfd[2878]: Unable to open /etc/cmcluster/cmclconfig, File table overflow
Jun  7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow
Jun  7 09:08:18 hisdb01 cmclconfd[2878]: Unable to resolve local hostname hisdb01 to determine the domain name
Jun  7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow
Jun  7 09:08:19 hisdb01 cmcld[7603]: Sending file $SGRUN/frdump.cmcld.8 (167257 bytes) to file assistant daemon.
Jun  7 09:08:18 hisdb01 cmclconfd[2878]: Unable to open /etc/cmcluster/cmclconfig, File table overflow
Jun  7 09:08:19 hisdb01  above message repeats 3 times
Jun  7 09:08:19 hisdb01 cmfileassistd[2894]: Updated file /var/adm/cmcluster/frdump.cmcld.8 (length = 167257).
Jun  7 09:09:00 hisdb01 inetd[1018]: hacl-cfg/tcp: accept: File table overflow
Jun  7 09:09:19 hisdb01 cmcld[7603]: Service cmfileassistd terminated due to an exit(0).
Jun  7 09:12:16 hisdb01 syslog: Unable to open the /etc/utmpx file, to sync the records from file->/usr/sbin/utmpd
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
Jun  7 09:12:17 hisdb01  above message repeats 13576 times
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
Jun  7 09:12:17 hisdb01 syslogd: utmp database: Bad file number
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
Jun  7 09:12:17 hisdb01  above message repeats 10 times
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
Jun  7 09:12:17 hisdb01  above message repeats 17 times
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
Jun  7 09:12:17 hisdb01 vmunix: file: table is full
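
To get a feel for how often the file table was reported full, the syslog excerpt above can be tallied with a short script. The following is a minimal sketch, not part of the original investigation: the function name `count_file_table_errors` is hypothetical, and it assumes that an "above message repeats N times" line always refers to the message logged immediately before it.

```python
import re

# Messages that indicate the system-wide open file table was exhausted.
PATTERNS = ("File table overflow", "file: table is full")
REPEAT_RE = re.compile(r"above message repeats (\d+) times")


def count_file_table_errors(lines):
    """Count file-table errors, expanding syslog 'above message repeats N times' lines."""
    total = 0
    previous_matched = False
    for line in lines:
        repeat = REPEAT_RE.search(line)
        if repeat:
            # Assumption: the repeat line refers to the message right before it.
            if previous_matched:
                total += int(repeat.group(1))
            continue
        previous_matched = any(p in line for p in PATTERNS)
        if previous_matched:
            total += 1
    return total


# Sample lines copied from the syslog excerpt above.
sample = [
    "Jun  7 09:12:17 hisdb01 vmunix: file: table is full",
    "Jun  7 09:12:17 hisdb01  above message repeats 13576 times",
    "Jun  7 09:12:17 hisdb01 vmunix: file: table is full",
    "Jun  7 09:12:17 hisdb01 syslogd: utmp database: Bad file number",
]
print(count_file_table_errors(sample))  # 1 + 13576 + 1 = 13578
```

In practice the function could be fed the lines of the full syslog file instead of the four sample lines.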


2. CRS background log:
Quote
2010-06-06 21:15:09.225: [  CRSEVT][167223] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/db10g/bin/racgwrap(check) for ora.orcl.orcl1.inst
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:15:09.225: [  CRSAPP][167223] CheckResource error for ora.orcl.orcl1.inst error code = -1
2010-06-06 21:15:19.211: [  CRSEVT][167224] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/crs/bin/racgwrap(check) for ora.hisdb01.ons
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:15:19.211: [  CRSAPP][167224] CheckResource error for ora.hisdb01.ons error code = -1
2010-06-06 21:16:18.020: [  CRSEVT][167225] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/db10g/bin/racgwrap(check) for ora.hisdb01.ASM1.asm
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:16:18.021: [  CRSAPP][167225] CheckResource error for ora.hisdb01.ASM1.asm error code = -1


3. Log of instance orcl1:
Quote
Sun Jun  6 21:08:42 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_2915.trc:
ORA-00603: Message 603 not found; No message file for product=RDBMS, facility=ORA
ORA-27544: Message 27544 not found; No message file for product=RDBMS, facility=ORA
ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [socket] [23]
ORA-27301: Message 27301 not found; No message file for product=RDBMS, facility=ORA; arguments: [File table overflow]
ORA-27302: Message 27302 not found; No message file for product=RDBMS, facility=ORA; arguments: [sskgxpcre1]

Sun Jun  6 21:40:52 2010
WARNING: kfk failed to open a disk[/dev/vgdata/rasm_disk5]
Sun Jun  6 21:40:52 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_4809.trc:
ORA-15025: could not open disk '/dev/vgdata/rasm_disk5'
ORA-27041: unable to open file
HPUX-ia64 Error: 23: File table overflow
Additional information: 3
Sun Jun  6 21:40:52 2010
WARNING: kfk failed to open a disk[/dev/vgdata/rasm_disk5]
Sun Jun  6 21:40:52 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_4809.trc:
ORA-15025: could not open disk '/dev/vgdata/rasm_disk5'
ORA-27041: unable to open file
HPUX-ia64 Error: 23: File table overflow
Additional information: 3

4. Background log of instance asm1:
Quote
Sun Jun  6 21:14:26 2010
Errors in file /oracle/app/product/admin/+ASM/udump/+asm1_ora_3254.trc:
ORA-00603: Message 603 not found; No message file for product=RDBMS, facility=ORA
ORA-27504: Message 27504 not found; No message file for product=RDBMS, facility=ORA
ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [ioctl] [23]
ORA-27301: Message 27301 not found; No message file for product=RDBMS, facility=ORA; arguments: [File table overflow]
ORA-27302: Message 27302 not found; No message file for product=RDBMS, facility=ORA; arguments: [skgxpvaddr1]

5. nfile usage before the failure
Quote
root@hisdb01:/sbin/init.d # kcusage nfile
Tunable                 Usage / Setting     
=============================================
nfile                   51795 / 65536
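
Since no host tracking script was in place, one low-cost option is to record this figure at regular intervals. The sketch below only parses `kcusage`-style output and computes utilization; the helper names `parse_kcusage` and `utilization` are hypothetical, and the row layout is assumed to match the output shown above.

```python
import re

# Matches rows such as "nfile   51795 / 65536"; header and separator rows do not match.
USAGE_RE = re.compile(r"^(\w+)\s+(\d+)\s*/\s*(\d+)\s*$")


def parse_kcusage(output):
    """Return {tunable: (usage, setting)} from 'name usage / setting' rows."""
    result = {}
    for line in output.splitlines():
        m = USAGE_RE.match(line.strip())
        if m:
            result[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return result


def utilization(usage, setting):
    """Fraction of the tunable's configured limit currently in use."""
    return usage / setting


# Output copied from the kcusage run on hisdb01 above.
sample = """\
Tunable                 Usage / Setting
=============================================
nfile                   51795 / 65536
"""
usage, setting = parse_kcusage(sample)["nfile"]
print(f"nfile utilization: {utilization(usage, setting):.1%}")  # 79.0%
```

A scheduled job could capture the output of `kcusage nfile`, pass it through this parser, and append a timestamped line to a file, so that the growth of the file table can be followed over time.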

6. imon_orcl1.log
Quote
2010-06-17 17:38:17.168: [    RACG][30] [9233][30][ora.orcl.orcl1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13

2010-06-17 17:39:17.178: [    RACG][30] [9233][30][ora.orcl.orcl1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
"/oracle/app/product/db10g/log/hisdb01/racg/imon_orcl.log" 158031 lines, 9229057 characters
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
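
The numeric OS error codes scattered through these logs can be pulled out and mapped to symbolic names, which makes it easier to see that the CRS "out of memory" text actually carries error 23. This is a minimal sketch with hypothetical helper names; it assumes it runs on a platform whose errno table agrees with HP-UX for these two codes (23 = ENFILE, 12 = ENOMEM), which holds on Linux.

```python
import errno
import re

# One pattern per log format seen above; each captures the numeric OS error code.
OS_ERROR_RES = (
    re.compile(r"OS error: (\d+)"),                             # CRS log
    re.compile(r"HPUX-ia64 Error: (\d+)"),                      # instance log
    re.compile(r"failed with status: (\d+)"),                   # imon log
    re.compile(r"ORA-27300: .*arguments: \[\w+\] \[(\d+)\]"),   # ORA-27300 arguments
)


def extract_os_errors(lines):
    """Return the set of numeric OS error codes found in the given log lines."""
    codes = set()
    for line in lines:
        for rx in OS_ERROR_RES:
            m = rx.search(line)
            if m:
                codes.add(int(m.group(1)))
    return codes


def symbolic_name(code):
    """Map an error number to its symbolic name using the local errno table."""
    return errno.errorcode.get(code, "unknown")


# Sample lines copied from the logs above.
sample = [
    "category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory",
    "HPUX-ia64 Error: 23: File table overflow",
    "GIM-00090: OS-dependent operation:mmap failed with status: 12",
    "ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [socket] [23]",
]
for code in sorted(extract_os_errors(sample)):
    print(code, symbolic_name(code))
```

On Linux this prints `12 ENOMEM` and `23 ENFILE`. ENFILE is the system-wide file table limit, as opposed to EMFILE, the per-process descriptor limit.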


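The logs above (notably the File table overflow, OS error: 23, and Not enough space entries) indicate that the failure was very likely caused by Oracle running into an operating system resource limit. The next step is to look at operating system resource usage before and after the failure.
1. nfile usage
Quote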
root@hisdb02:/ # kcusage  nfile
Tunable                 Usage / Setting     
=============================================
nfile                   12089 / 65536


2. Host memory and CPU resources
Quote
zzz ***Sun Jun 6 21:17:20 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
    1     0     0  2311458  1996119  172   18     0    0     0    0     0   2620  21313   834   0  1 99
    1     0     0  2311458  1996103  191   21     0    0     0    0     0   2408  16170   709   0  1 99
    1     0     0  2311458  1995210  166   18     0    0     0    0     0   2403  14823   700   1  0 99
zzz ***Sun Jun 6 21:17:30 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
    1     0     0  2285994  1996297  172   18     0    0     0    0     0   2620  21313   834   0  1 99
    1     0     0  2285994  1996297  171   20     0    0     0    0     0   2426  11112   710   1  1 98
    1     0     0  2285994  1995404  150   17     0    0     0    0     0   2398  10711   694   0  1 99
zzz ***Sun Jun 6 21:17:40 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
    2     0     0  2196419  1996297  172   18     0    0     0    0     0   2620  21313   834   0  1 99
    2     0     0  2196419  1995404  170   19     0    0     0    0     0   2372  10075   698   0  1 99
    2     0     0  2196419  1995386  149   17     0    0     0    0     0   2380  10401   715   0  0 100
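
The vmstat snapshots above can be summarized programmatically rather than read by eye. The following is a rough sketch with a hypothetical `parse_vmstat` helper; it assumes the 18-column layout shown above, with `free` in the fifth column and CPU `id` in the last, and assumes a 4 KB page size when converting pages to GiB.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages; adjust to the actual system page size


def parse_vmstat(text):
    """Return a list of (free_pages, idle_pct) for each numeric data row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 18 all-numeric columns; header and timestamp rows do not.
        if len(fields) == 18 and all(f.isdigit() for f in fields):
            rows.append((int(fields[4]), int(fields[17])))
    return rows


# First snapshot copied from the vmstat output above.
sample = """\
zzz ***Sun Jun 6 21:17:20 EAT 2010
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
    1     0     0  2311458  1996119  172   18     0    0     0    0     0   2620  21313   834   0  1 99
    1     0     0  2311458  1996103  191   21     0    0     0    0     0   2408  16170   709   0  1 99
    1     0     0  2311458  1995210  166   18     0    0     0    0     0   2403  14823   700   1  0 99
"""
rows = parse_vmstat(sample)
min_free = min(free for free, _ in rows)
avg_idle = sum(idle for _, idle in rows) / len(rows)
print(f"min free: {min_free} pages (~{min_free * PAGE_SIZE / 2**30:.1f} GiB), avg idle: {avg_idle:.0f}%")
```

With CPU idle at about 99% and free memory essentially flat across the samples, the numbers agree with the conclusion that the host itself was not under load.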


3. Disk I/O
Quote
zzz ***Sun Jun 6 21:06:37 EAT 2010

  device    bps     sps    msps 

  c1t0d0      0     0.0     1.0 
  c6t0d1      0     0.0     1.0 
  c6t0d2      0     0.0     1.0 
  c6t0d3      0     0.0     1.0 
  c6t0d4      0     0.0     1.0 
  c6t0d5      0     0.0     1.0 
  c8t0d1      0     0.0     1.0 
  c8t0d2      0     0.0     1.0 
  c8t0d3      0     0.0     1.0 
  c8t0d4      0     0.0     1.0 
  c8t0d5      0     0.0     1.0 
c10t0d1      0     0.0     1.0 
c10t0d2      0     0.0     1.0 
c10t0d3      0     0.0     1.0 
c10t0d4      0     0.0     1.0 
c10t0d5      0     0.0     1.0 
c12t0d1      0     0.0     1.0 
c12t0d2      0     0.0     1.0 
c12t0d3      0     0.0     1.0 
c12t0d4      0     0.0     1.0 
c12t0d5      0     0.0     1.0 
  c6t0d6      0     0.0     1.0 
  c6t0d7      0     0.0     1.0 
  c6t1d0      0     0.0     1.0 
  c6t1d1      0     0.0     1.0 
  c6t1d2      0     0.0     1.0 
  c6t1d3      0     0.0     1.0 
  c8t0d6      0     0.0     1.0 
  c8t0d7      0     0.0     1.0 
  c8t1d0      0     0.0     1.0 
  c8t1d1      0     0.0     1.0 
  c8t1d2      0     0.0     1.0 
  c8t1d3      0     0.0     1.0 
c10t0d6      0     0.0     1.0 
c10t0d7      0     0.0     1.0 
c10t1d0      0     0.0     1.0 
c10t1d1      0     0.0     1.0 
c10t1d2      0     0.0     1.0 
c10t1d3      0     0.0     1.0


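These three checks give a rough picture of host resource usage before and after the failure, and they show clearly that the host was fairly idle when the failure occurred.
When resource contention of this kind (such as being unable to obtain a file handle) occurs while host resources are plentiful, an Oracle bug is a likely cause. A search of Oracle's official documentation turned up an unpublished bug (unpublished Bug 6931689) that closely matches this failure; see Metalink Doc 739557.1.
The bug mainly affects the following platforms:
Quote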
HP-UX PA-RISC (64-bit)
HP-UX Itanium
HP IA64 HPUNIX
HP 9000 Series HP-UX (64-bit)

Affected database versions: 10.2.0.3 to 11.1.0.6
Quote
- 10.2.0.3, 10.2.0.3 +  CRS Bundle Patch #2  or CRS Bundle Patch #3
- 10.2.0.4
- 11.1.0.6


Solution:
On top of the current version, apply one of the following patches
Quote
- CRS 10.2.0.4 Bundle Patch #2 (Patch 7493592) or above. See Note 405820.1
- Latest 10.2.0.4 CRS PSU Patch as per Note 756671.1
The fix has to be applied to both CRS and RAC Database home to fix the problem.
The BUG is fixed in 11.1.0.7 and will be fixed in 10.2.0.5.

Recommendations:
1. The database is currently at version 10.2.0.4; apply the latest PSU patch (10.2.0.4.4) on top of it.
2. Increase the nfile parameter to 131072.
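
As a quick sanity check on the second recommendation, the observed usage on hisdb01 can be compared against the current and proposed nfile settings. This is simple illustrative arithmetic using the figures quoted earlier; the function name is hypothetical.

```python
def projected_utilization(observed_usage, setting):
    """Fraction of the file table that the observed usage would occupy."""
    return observed_usage / setting


OBSERVED = 51795   # nfile usage seen on hisdb01 before the failure
CURRENT = 65536    # current nfile setting
PROPOSED = 131072  # recommended nfile setting

print(f"current:  {projected_utilization(OBSERVED, CURRENT):.1%}")   # 79.0%
print(f"proposed: {projected_utilization(OBSERVED, PROPOSED):.1%}")  # 39.5%
```

Doubling the limit halves the utilization for the same usage. If the usage keeps growing because handles are not released, a larger table would presumably only postpone exhaustion, which is why the patch is listed first.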