`
itspace
  • 浏览: 959094 次
  • 性别: Icon_minigender_1
  • 来自: 杭州
社区版块
存档分类
最新评论

某客户数据库性能诊断报告

阅读更多

一、故障描述
***数据库近期由于业务量加大,持续出现事务响应缓慢,在业务高峰期(9:00左右)甚至出现卡住现象,导致前台无法及时响应。通过远程拨号持续观察3天,并对数据库性能做出了响应的临时性优化,在一定程度上得到了缓解。以目前数据库性能缓慢主要有两方面原因:
1、 数据库soft parse过多,导致library cache争用。
2、 数据库部分SQL执行效率不高,在大并发环境下导致热点块(hot block)争用。

二、故障分析
1、library cache争用
Library cache争用主要集中体现在2011年3月15日9:00-11:00,由于此原因(此原因在此次故障占主导地位),导致前台业务hang住,后台SQLPLUS无法连接。最终通过后台杀进程,重启数据库解决。
故障发生时,数据库主要体现为:
Cache Sizes
Begin End
Buffer Cache: 2,512M 2,512M Std Block Size: 8K
Shared Pool Size: 3,584M 3,584M Log Buffer: 14,336K

Top 5 Timed Events
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
latch: library cache 4,216 1,056 251 25.6 Concurrency
db file sequential read 190,794 658 3 15.9 User I/O
db file scattered read 87,955 313 4 7.6 User I/O
read by other session 147,808 291 2 7.0 User I/O
CPU time 279 6.7
Wait Events
? s - second
? cs - centisecond - 100th of a second
? ms - millisecond - 1000th of a second
? us - microsecond - 1000000th of a second
? ordered by wait time desc, waits desc (idle events last)
Event Waits %Time -outs Total Wait Time (s) Avg wait (ms) Waits /txn
latch: library cache 4,216 0.00 1,056 251 0.39
db file sequential read 190,794 0.00 658 3 17.54
db file scattered read 87,955 0.00 313 4 8.09
read by other session 147,808 0.00 291 2 13.59
latch free 587 1.87 129 220 0.05
latch: shared pool 1,046 0.00 73 70 0.10
enq: TX - row lock contention 25 96.00 70 2813 0.00
latch: library cache lock 180 0.00 51 281 0.02
buffer busy waits 72 58.33 47 648 0.01
direct path read temp 10,960 0.00 45 4 1.01
db file parallel write 4,561 0.00 38 8 0.42
latch: cache buffers chains 2,259 0.00 34 15 0.21
latch: session allocation 97 0.00 26 265 0.01

SQL ordered by Version Count
? Only Statements with Version Count greater than 20 are displayed
Version Count Executions SQL Id SQL Module SQL Text
11,265 gdy6tymxpwjz0
  ** SQL Text Not Available **
10,658 7zw1mj3v5wqcq
  ** SQL Text Not Available **
2,194 bfp3m30c0c936
  ** SQL Text Not Available **
1,965 461mvk5ra20xg
  ** SQL Text Not Available **
892 dvqq0hguxt4ua
  SELECT COUNT(:"SYS_B_00") FROM...

? 说明:
(1) Oracle数据库shared pool设置为3,584M,过大的shared pool设置,导致过高的latch管理成本。
(2) Oracle数据库cursor_sharing参数设置为similar,此参数的设置在避免过多硬解析的同时,在表格选择性列存在柱状图的条件下,又带来执行计划产生的低效率。在version count高达11265的SQL下,又带来了library cache扫描时间的加长,又进一步加重了library cache的争用。

? 解决办法:
(1) 将Oracle数据库SGA设置为3G,即从6G降至3G,在不影响业务前提下,降低latch管理成本。
SQL> show sga

Total System Global Area         3456106496 bytes
Fixed Size                          2087968 bytes
Variable Size                    1107297248 bytes
Database Buffers                 2332033024 bytes
Redo Buffers                       14688256 bytes
(2) 将cursor_sharing参数设置为force,同时将session_cached_cursors设置为100,进一步降低library cache的争用。
通过以上临时解决办法,数据库library cache的争用得到了极大的缓解。

3、 热点块争用
目前Oracle数据库存在着严重的热点块争用,即大量的并发会话同时读取表格中同一个block,导致会话间相互等待。由于会话间等待时间较高,进而加剧了事务行级锁(enq: TX - row lock contention)的发生。目前系统中等待事件主要如下:
Top 5 Timed Events
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
read by other session 1,811,592 2,797 2 50.0 User I/O
enq: TX - row lock contention 311 895 2,879 16.0 Application
db file scattered read 428,596 857 2 15.3 User I/O
CPU time 561 10.0
db file sequential read 221,017 385 2 6.9 User I/O

Instance Efficiency Percentages (Target 100%)
Buffer Nowait %: 94.72 Redo NoWait %: 99.99
Buffer Hit %: 86.97 In-memory Sort %: 100.00
Library Hit %: 99.62 Soft Parse %: 99.43
Execute to Parse %: -16.93 Latch Hit %: 90.88
Parse CPU to Parse Elapsd %: 6.05 % Non-Parse CPU: 99.91
? 说明:
由红色字体标出可以看出目前系统中存在着大量的热点块争用,而这些热点块争用是主要是由于SQL语句执行效率不高(全表扫描,Buffer Hit为86.97)导致的。以下为引起该问题的SQL。
SQL1:
select 1 from hz_qyhznr where  trim(zch)=:1   and  qylxdl  not between21 and 26。
由于表格W_HZ_QYHZNR字段ZGH没有建立函数索引trim,导致了全表扫描。
解决办法:
在字段ZGH建立了函数索引trim(zch),SQL执行效率比原先得到了极大提高,但SQL中存在判断条件qylxdl  not between21 and 26,使得执行效率不理想。
在表格W_HZ_QYHZNR存在着如下数据分布,可以看到此判断条件显得多余,如由于业务逻辑关系,必需此判断条件,建议在程序代码中加以判断,而不是在SQL中判断。
SQL> select QYLXDL,count(*) from W_HZ_QYHZNR group by QYLXDL order by QYLXDL;

QYLXDL       COUNT(*)
---------- ----------
11              86698
12              13811
13               4517
14               2074
16                  3
31              29132
32                638
33               7268
34                129

9 rows selected.

SQL2:
select 1 from hz_qyhznr where  trim(zch) =:1  and  trim(qymc) = :2
解决办法:
SQL> select count(distinct qymc) from  W_HZ_QYHZNR;

COUNT(DISTINCTQYMC)
-------------------
             149858

SQL> select count(*) from W_HZ_QYHZNR;

  COUNT(*)
----------
149858
由于在列qymc选择性较好,在其上建立函数索引,效率得到极大提高。

SQL3:
SELECT T.ZCH, T.BGZXRQ FROM BF_QYHZNR T WHERE T.ZCH = :B1 AND T.ZT = 'D';
SELECT COUNT(1) FROM BF_QYHZNR T WHERE T.ZCH = :B1 AND T.ZT = 'X';
解决办法:
SQL> conn qn_yc/xxb1002
Connected.
SQL> select count(*) from BF_QYHZNR;

  COUNT(*)
----------
    527350

SQL> select zt,count(*) from BF_QYHZNR group by zt;

ZT   COUNT(*)
-- ----------
            9
D       54164
Q          41
B      338511
X      134621
C           4

6 rows selected.
SQL> explain plan for select  T.ZCH, T.BGZXRQ FROM BF_QYHZNR T WHERE T.ZCH = :B1 AND T.ZT = 'D';

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

----------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |     1 |    24 |  9428   (1)|
|*  1 |  TABLE ACCESS FULL| W_BF_QYHZNR |     1 |    24 |  9428   (1)|
----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------


PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   1 - filter("T"."ZCH"=:B1 AND "T"."ZT"='D')
可以看到ZT列选择性不佳,建议在程序代码中加以判断,而不是在SQL中判断。
否则执行计划为全表扫描,容易导致热点块冲突。
临时解决办法:
SQL> conn hz_yc/xxb1002
Connected.
SQL> select count(*) from W_BF_QYHZNR;

  COUNT(*)
----------
    527350

SQL> create index W_BF_QYHZNR_zch_zt on W_BF_QYHZNR(zch,zt);

Index created.
SQL> conn qn_yc/xxb1002
Connected.
SQL> explain plan for select  T.ZCH, T.BGZXRQ FROM BF_QYHZNR T WHERE T.ZCH = :B1 AND T.ZT = 'D';

Explained.

SQL>  select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
-------
| Id  | Operation                   | Name               | Rows  | Bytes | Cost
(%CPU)|
--------------------------------------------------------------------------------
-------
|   0 | SELECT STATEMENT            |                    |     1 |    24 |     4
   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID| W_BF_QYHZNR        |     1 |    24 |     4
   (0)|

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
|*  2 |   INDEX RANGE SCAN          | W_BF_QYHZNR_ZCH_ZT |     1 |       |     3
   (0)|
--------------------------------------------------------------------------------
-------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T"."ZCH"=:B1 AND "T"."ZT"='D')
目前在T"."ZCH"=:B1 AND "T"."ZT"='D'建立联合索引,执行计划效率较高。

SQL4:
仔细检查存储过程CALL pkg_check.chk_grade(:1,:2,:3),根据统计资料显示,该执行计划是引起数据库性能缓慢的主要原因,需要开发商进一步优化。
RECOMMENDATION 1: SQL Tuning, 64% benefit (3566 seconds)
      ACTION: Investigate the SQL statement with SQL_ID "57xrg07tjm0qn" for possible performance improvements.
              RELEVANT OBJECT: SQL statement with SQL_ID 57xrg07tjm0qn
               CALL pkg_check.chk_grade(:1,:2,:3)
    RATIONALE: SQL statement with SQL_ID "57xrg07tjm0qn" was executed 30
         times and had an average elapsed time of 117 seconds.
      RATIONALE: Waiting for event "read by other session" in wait class "User
         I/O" accounted for 49% of the database time spent in processing the
         SQL statement with SQL_ID "57xrg07tjm0qn".
      RATIONALE: Waiting for event "enq: TX - row lock contention" in wait
         class "Application" accounted for 26% of the database time spent in
         processing the SQL statement with SQL_ID "57xrg07tjm0qn".
      RATIONALE: Waiting for event "db file scattered read" in wait class
         "User I/O" accounted for 9% of the database time spent in processing
         the SQL statement with SQL_ID "57xrg07tjm0qn".

分享到:
评论
1 楼 hctech 2011-06-14  
关于version count过高的问题,不知博主是否看过eygle的这篇文章:
http://www.eygle.com/archives/2004/10/shared_pool-6.html

不知是否是一样的问题?

相关推荐

Global site tag (gtag.js) - Google Analytics