19 April 2015

梦醒十分

清晨明媚的阳光透过窗帘间隙射入房间,伴随着手机的闹铃声,我揉了揉双眼,(不对,周末我是不设置闹铃的,那是什么),一种焦虑的情绪油然而生,随后精神抖擞的查看手机:

02:57:58 [C] xxxxxhost[ET2] xxxxx010010010127.n.et2 system (load1:240.2)

03:29:27 [C] xxxxxhost[ET2] xxxxx010010010127.n.et2 check_http (time:0 state:2)http健康检查不通过

......

果不其然,是持续的告警,随着一次02:57:58的load过高,有台机器的应用服务貌似挂了,后续有不停的告警短信。

消失的进程

匆匆来到飘窗前,启动电脑,泡上一杯不怎么有品的绿茶,掏出手机:登录系统、登录VPC、登录跳板机、登录目标机器,一切行云流水一气呵成,but,接下来的情况让这一个上午就这么过去,不可轻敌,不可轻敌啊~

查看jvm进程: ps -ef|grep java

没有!

查看其他机器的jvm进程确认dump包路径配置: -XX:HeapDumpPath=/home/admin/logs/java.hprof

切换到目标机器查看dump文件,没有!!

风筝断了线,查看应用日志和nginx日志(grep "2015:02:57" /home/admin/cai/logs/cronolog/2015/04/2015-04-19-taobao-access_log),确认了在02:57之后应用日志全无信息,nginx日志也没有了信息,说明jvm消失的同时,机器正常退出集群不再提供服务,这时少了一份焦虑,可以安心找问题了。

上帝之手

时间退回若干年前,那时GG还能够自如的访问,项目使用的机器为了节约成本使用的1U、32位、2G内存,曾经为移动内控审计做搜索审核应用的我曾经被这称为上帝之手的神一般的招数指点过。如初一折,tomcat进程挂,没有异常日志,没有core,没有一丝线索。误打误撞在经过数小时GG之后,确认是Linux系统通过“上帝之手”使它消失了~

这里说的上帝之手就是Linux系统的保护机制,如果系统即将无法承载运行其上的程序,比如load过重、内存不能再分配时,会由系统主动杀掉应用程序,俗称:OOM killer。

是,就是她~

查看上帝之手的痕迹: sudo dmesg|grep -i kill|less

[28039563.695579] Out of memory (oom_kill_allocating_task): Kill process 4057 (DragoonAgent) score 0 or sacrifice child
[28039563.706553] Killed process 174493, UID 120063, (get_busbar_volt) total-vm:65924kB, anon-rss:168kB, file-rss:580kB
[28039563.717620] Out of memory (oom_kill_allocating_task): Kill process 4403 (td_connector) score 0 or sacrifice child
[28039563.741738] Killed process 4403, UID 0, (td_connector) total-vm:539836kB, anon-rss:340kB, file-rss:964kB
[28039563.785043] Out of memory (oom_kill_allocating_task): Kill process 4050 (DragoonAgent) score 0 or sacrifice child
[28039563.796011] Killed process 4050, UID 0, (DragoonAgent) total-vm:266992kB, anon-rss:880kB, file-rss:772kB
[28039563.807471] Out of memory (oom_kill_allocating_task): Kill process 4041 (DragoonAgent) score 0 or sacrifice child
[28039563.818456] Killed process 4041, UID 0, (DragoonAgent) total-vm:170336kB, anon-rss:308kB, file-rss:668kB
[28039563.829292] Out of memory (oom_kill_allocating_task): Kill process 4391 (watchdog_tdc.py) score 0 or sacrifice child
[28039563.840537] Killed process 4392, UID 0, (td_connector) total-vm:539836kB, anon-rss:340kB, file-rss:964kB
[28039563.854190] Out of memory (oom_kill_allocating_task): Kill process 51649 (DragoonAgent) score 0 or sacrifice child
[28039563.865263] Killed process 51649, UID 0, (DragoonAgent) total-vm:170336kB, anon-rss:296kB, file-rss:660kB
[28039563.882654] Out of memory (oom_kill_allocating_task): Kill process 21584 (DragoonAgent) score 0 or sacrifice child
[28039563.893734] Killed process 21584, UID 0, (DragoonAgent) total-vm:170336kB, anon-rss:348kB, file-rss:660kB
[28039563.904209] Out of memory (oom_kill_allocating_task): Kill process 174549 (crond) score 0 or sacrifice child
[28039563.914752] Killed process 174549, UID 0, (crond) total-vm:102004kB, anon-rss:520kB, file-rss:556kB
[28039563.924729] Out of memory (oom_kill_allocating_task): Kill process 260704 (java) score 0 or sacrifice child
[28039563.935457] Killed process 17884, UID 578, (java) total-vm:4756208kB, anon-rss:3088368kB, file-rss:832kB
[28039563.957518] Out of memory (oom_kill_allocating_task): Kill process 174548 (crond) score 0 or sacrifice child
[28039563.968120] Killed process 174548, UID 0, (crond) total-vm:102004kB, anon-rss:520kB, file-rss:556kB
[28039563.978443] Out of memory (oom_kill_allocating_task): Kill process 198135 (java) score 0 or sacrifice child
[28039563.988934] Killed process 198135, UID 689, (java) total-vm:3079880kB, anon-rss:2428916kB, file-rss:788kB
[28039564.007801] Out of memory (oom_kill_allocating_task): Kill process 174494 (tsar) score 0 or sacrifice child
[28039564.018254] Killed process 174494, UID 0, (tsar) total-vm:6264kB, anon-rss:432kB, file-rss:356kB
[28039564.042353] Out of memory (oom_kill_allocating_task): Kill process 4016 (DragoonAgent) score 0 or sacrifice child
[28039564.053457] Killed process 4033, UID 0, (DragoonAgent) total-vm:34244kB, anon-rss:1568kB, file-rss:724kB
[28039564.105082] Out of memory (oom_kill_allocating_task): Kill process 255982 (java) score 0 or sacrifice child
[28039564.115580] Killed process 255982, UID 686, (java) total-vm:5494328kB, anon-rss:3315848kB, file-rss:776kB
[28039564.126503] Out of memory (oom_kill_allocating_task): Kill process 174524 (bash) score 0 or sacrifice child
[28039564.137572] Killed process 174524, UID 0, (bash) total-vm:65924kB, anon-rss:260kB, file-rss:124kB
[28039564.147704] Out of memory (oom_kill_allocating_task): Kill process 174343 (bash) score 0 or sacrifice child
[28039564.158286] Killed process 174343, UID 0, (bash) total-vm:65924kB, anon-rss:260kB, file-rss:120kB
[28039564.168944] Out of memory (oom_kill_allocating_task): Kill process 117302 (java) score 0 or sacrifice child
[28039564.179444] Killed process 117302, UID 553, (java) total-vm:6004256kB, anon-rss:4836988kB, file-rss:1284kB
[28039564.196873] Out of memory (oom_kill_allocating_task): Kill process 199064 (DragoonAgent) score 0 or sacrifice child
[28039564.208037] Killed process 199064, UID 0, (DragoonAgent) total-vm:267016kB, anon-rss:940kB, file-rss:772kB
[28039564.221509] Out of memory (oom_kill_allocating_task): Kill process 4016 (DragoonAgent) score 0 or sacrifice child
[28039564.232664] Killed process 4034, UID 0, (DragoonAgent) total-vm:170336kB, anon-rss:308kB, file-rss:668kB
[28039564.246970] Out of memory (oom_kill_allocating_task): Kill process 105674 (java) score 0 or sacrifice child
[28039564.257520] Killed process 105674, UID 513, (java) total-vm:6798240kB, anon-rss:4997820kB, file-rss:824kB
[28039564.272282] Out of memory (oom_kill_allocating_task): Kill process 3692 (java) score 0 or sacrifice child
[28039564.282571] Killed process 3692, UID 514, (java) total-vm:803792kB, anon-rss:247348kB, file-rss:760kB
[28039564.293882] Out of memory (oom_kill_allocating_task): Kill process 174558 (bash) score 0 or sacrifice child
[28039564.304333] Killed process 174558, UID 0, (bash) total-vm:66056kB, anon-rss:272kB, file-rss:80kB
[28039564.314285] Out of memory (oom_kill_allocating_task): Kill process 152580 (java) score 0 or sacrifice child
[28039564.324774] Killed process 152580, UID 512, (java) total-vm:5554124kB, anon-rss:4668184kB, file-rss:1328kB
[28039564.338605] Out of memory (oom_kill_allocating_task): Kill process 131546 (DragoonAgent) score 0 or sacrifice child
[28039564.349778] Killed process 131547, UID 0, (DragoonAgent) total-vm:34184kB, anon-rss:1556kB, file-rss:684kB
[28039564.360323] Out of memory (oom_kill_allocating_task): Kill process 113851 (java) score 0 or sacrifice child
[28039564.370793] Killed process 113851, UID 514, (java) total-vm:830064kB, anon-rss:133872kB, file-rss:736kB
[28039564.381018] Out of memory (oom_kill_allocating_task): Kill process 4016 (DragoonAgent) score 0 or sacrifice child
[28039564.392089] Killed process 4038, UID 0, (DragoonAgent) total-vm:170336kB, anon-rss:312kB, file-rss:668kB
[28039564.408267] Out of memory (oom_kill_allocating_task): Kill process 199052 (DragoonAgent) score 0 or sacrifice child
[28039564.419419] Killed process 199052, UID 0, (DragoonAgent) total-vm:170336kB, anon-rss:300kB, file-rss:660kB
[28039564.438057] Out of memory (oom_kill_allocating_task): Kill process 2895 (java) score 0 or sacrifice child
[28039564.448458] Killed process 2895, UID 59844, (java) total-vm:487192kB, anon-rss:83076kB, file-rss:792kB
[28039564.471120] Out of memory (oom_kill_allocating_task): Kill process 37715 (java) score 0 or sacrifice child
[28039564.481599] Killed process 37715, UID 561, (java) total-vm:6526856kB, anon-rss:5386220kB, file-rss:1304kB
[28039564.492579] Out of memory (oom_kill_allocating_task): Kill process 48766 (java) score 0 or sacrifice child
[28039564.502995] Killed process 48766, UID 664, (java) total-vm:6031200kB, anon-rss:5068312kB, file-rss:1296kB
[28039564.516763] Out of memory (oom_kill_allocating_task): Kill process 65512 (java) score 0 or sacrifice child
[28039564.527199] Killed process 65512, UID 643, (java) total-vm:6242024kB, anon-rss:3375488kB, file-rss:1280kB
[28039564.540955] Out of memory (oom_kill_allocating_task): Kill process 167800 (DragoonAgent) score 0 or sacrifice child
[28039564.552162] Killed process 171748, UID 120063, (sec_net_log) total-vm:67952kB, anon-rss:200kB, file-rss:588kB
[28039564.570315] Out of memory (oom_kill_allocating_task): Kill process 171144 (java) score 0 or sacrifice child
[28039564.580870] Killed process 171144, UID 577, (java) total-vm:7372320kB, anon-rss:6192336kB, file-rss:1320kB
[28039564.593497] Out of memory (oom_kill_allocating_task): Kill process 171642 (java) score 0 or sacrifice child
[28039564.603968] Killed process 171642, UID 570, (java) total-vm:7270104kB, anon-rss:6485948kB, file-rss:924kB
[28039564.622703] Out of memory (oom_kill_allocating_task): Kill process 174557 (DragoonAgent) score 0 or sacrifice child
[28039564.633861] Killed process 174557, UID 0, (DragoonAgent) total-vm:201520kB, anon-rss:1228kB, file-rss:104kB
[28039564.644548] Out of memory (oom_kill_allocating_task): Kill process 34356 (java) score 0 or sacrifice child
[28039564.654949] Killed process 34356, UID 647, (java) total-vm:3213964kB, anon-rss:1898792kB, file-rss:756kB
[28039564.677242] Out of memory (oom_kill_allocating_task): Kill process 178247 (java) score 0 or sacrifice child
[28039564.687705] Killed process 178247, UID 59844, (java) total-vm:496812kB, anon-rss:87332kB, file-rss:752kB
[28039564.700958] Out of memory (oom_kill_allocating_task): Kill process 27494 (python) score 0 or sacrifice child
[28039564.711507] Killed process 27494, UID 0, (python) total-vm:279584kB, anon-rss:14948kB, file-rss:1152kB
[28039564.721906] Out of memory (oom_kill_allocating_task): Kill process 248396 (houyi_superviso) score 0 or sacrifice child
[28039564.733358] Killed process 248437, UID 0, (python) total-vm:182292kB, anon-rss:712kB, file-rss:1188kB
[28039564.751009] Out of memory (oom_kill_allocating_task): Kill process 148455 (java) score 0 or sacrifice child
[28039564.761550] Killed process 148455, UID 604, (java) total-vm:5853636kB, anon-rss:4152656kB, file-rss:788kB
[28039564.774479] Out of memory (oom_kill_allocating_task): Kill process 162416 (DragoonAgent) score 0 or sacrifice child
[28039564.785651] Killed process 162416, UID 0, (DragoonAgent) total-vm:34128kB, anon-rss:1564kB, file-rss:688kB
[28039564.803555] Out of memory (oom_kill_allocating_task): Kill process 151483 (nginx-proxy) score 0 or sacrifice child
[28039564.820032] Killed process 151483, UID 604, (nginx-proxy) total-vm:41308kB, anon-rss:10304kB, file-rss:368kB
[28039564.830798] Out of memory (oom_kill_allocating_task): Kill process 73465 (java) score 0 or sacrifice child
[28039564.841212] Killed process 73465, UID 602, (java) total-vm:6792364kB, anon-rss:5370312kB, file-rss:1300kB
[28039564.871106] Out of memory (oom_kill_allocating_task): Kill process 235358 (java) score 0 or sacrifice child
[28039564.881595] Killed process 235358, UID 560, (java) total-vm:5281704kB, anon-rss:4475344kB, file-rss:976kB
[28039564.893091] Out of memory (oom_kill_allocating_task): Kill process 32944 (java) score 0 or sacrifice child
[28039564.903560] Killed process 32944, UID 514, (java) total-vm:824068kB, anon-rss:279292kB, file-rss:776kB
[28039564.920034] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[28039564.960237]  [<ffffffff81118898>] ? oom_kill_process+0x68/0x140
[28039573.086197] Out of memory (oom_kill_allocating_task): Kill process 151443 (java) score 0 or sacrifice child
[28039573.096928] Killed process 151443, UID 640, (java) total-vm:9150608kB, anon-rss:6350316kB, file-rss:1268kB
[28039573.112782] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[28039573.112812]  [<ffffffff81118898>] ? oom_kill_process+0x68/0x140
[28039573.471853] Out of memory (oom_kill_allocating_task): Kill process 154317 (java) score 0 or sacrifice child
[28039573.471893] Killed process 154317, UID 521, (java) total-vm:1965748kB, anon-rss:1179020kB, file-rss:780kB
[28039573.486341] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[28039573.486384]  [<ffffffff81118898>] ? oom_kill_process+0x68/0x140
[28039573.866900] Out of memory (oom_kill_allocating_task): Kill process 89153 (java) score 0 or sacrifice child
[28039573.866996] Killed process 89153, UID 605, (java) total-vm:5698936kB, anon-rss:3969240kB, file-rss:1292kB
[28039573.871172] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[28039573.871203]  [<ffffffff81118898>] ? oom_kill_process+0x68/0x140
[28039573.871272]  [<ffffffff81191f0a>] ? d_kill+0x4a/0x60
[28039574.251116] Out of memory (oom_kill_allocating_task): Kill process 253281 (java) score 0 or sacrifice child
[28039574.251179] Killed process 253281, UID 632, (java) total-vm:6947368kB, anon-rss:5906500kB, file-rss:1480kB
[28039595.341866] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[28039595.341901]  [<ffffffff81118898>] ? oom_kill_process+0x68/0x140
[28039595.721345] Out of memory (oom_kill_allocating_task): Kill process 58574 (java) score 0 or sacrifice child
[28039595.721360] Killed process 58574, UID 564, (java) total-vm:2768428kB, anon-rss:142720kB, file-rss:1020kB
[28039602.712141] DragoonAgent invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
[28039602.769602]  [<ffffffff81118898>] ? oom_kill_process+0x68/0x140
[28039610.554094] Out of memory (oom_kill_allocating_task): Kill process 37272 (DragoonAgent) score 0 or sacrifice child
[28039610.565138] Killed process 37272, UID 0, (DragoonAgent) total-vm:203204kB, anon-rss:792kB, file-rss:740kB

目标机上留下了如此之多的痕迹,却让我熟手无策,之前曾经怀疑forest于01:30分前后的全量200MB+数据量的同步会触发上帝之手,亦或tt、crontab在凌晨的任务所致,但由于这里dmesg的时间不准确,无法说明问题。

dmesg时间转化,log实际时间=格林威治1970-01-01+(当前时间秒数-系统启动至今的秒数+dmesg打印的log时间)秒数: date -d "1970-01-01 UTC `echo "$(date +%s)-$(cat /proc/uptime|cut -f 1 -d ' ')+28039610.565138"|bc ` seconds"

这样得到的是 Fri Feb 12 04:33:27 CST 2016,果然是上帝之手,捕获的异常日志尽然是个未来时间。(经狄卢回复提示,这个dmesg的时间是宿主机的时间,当时上宿主机查了下日志时间已经是6点多了,dmesg的日志缓存被挤出了,查/var/log/message.*日志由于没有killer前后的进程信息,所以还是无从下手。)

再次没了头绪,To Be Continue...

等等,PE帮助重启了tomcat,在下一次上帝之手来临之前,将原来她留下的痕迹抹去,以便下次能抓到她新留下的痕迹。

sudo dmesg -c > dmesg-20150419.log

加上-c参数,输出系统log并清除缓存。

下次就可以来抓个正着了。 查看OOM Killer情况: pgm -A -b -p 5 -f temaihost 'dmesg|grep -i kill' > oomkiller.log

罪魁祸首之一--此为人为,非上面的问题

2015-04-22 正在愉快的写着代码的我,突然收到告警

10:56:57 [C] xxxhost[ET2] xxx041.n.et2 check_http (time:0 state:2)http健康检查不通过

小样~,抓到你了吧!

赶紧上机器: sudo dmesg > dmesg-20150422.log

在java进程被oom-killer时,前后的进程信息都打印出来如下:

java进程还在时:

[28325845.670019] 822171 total pagecache pages
[28325845.674606] 63824 pages in swap cache
[28325845.678955] Swap cache stats: add 2143064, delete 2079240, find 5919502321/5919637151
[28325845.687515] Free swap  = 0kB
[28325845.690894] Total swap = 1999864kB
[28325846.091380] 33554416 pages RAM
[28325846.094942] 487905 pages reserved
[28325846.098748] 1067270 pages shared
[28325846.102469] 32095323 pages non-shared
[28325846.106620] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[28325846.114817] [235863]     0 235863     2591      137  11       0             0 init
[28325846.123107] [235932]     0 235932    15662      156   9       0             0 sshd
[28325846.131398] [235955]     0 235955    18701      231  12       0             0 crond
[28325846.139772] [235965]     0 235965     5199       83   9       0             0 atd
[28325846.147976] [235988]     0 235988      952       99  22       0             0 mingetty
[28325846.156797] [39458]     0 39458    12836       63   7       0             0 syslog-ng
[28325846.165340] [39459]     0 39459    13752      731   9       0             0 syslog-ng
[28325846.173889] [40111]     0 40111     7666      188  12       0             0 DragoonAgent
[28325846.182708] [40112]     0 40112     9019      589   9       0             0 DragoonAgent
[28325846.191513] [40113]     0 40113    43101      302  15       0             0 DragoonAgent
[28325846.200310] [40114]     0 40114    53539      639  16       0             0 DragoonAgent
[28325846.209131] [24891]     0 24891    22534      368  30       0             0 sshd
[28325846.217240] [24894] 55218 24894    23163      278  30       0             0 sshd
[28325846.225356] [24895] 55218 24895    16323      132  29       0             0 grep
[28325846.233482] [116246]     0 116246   106058     7228  19       0             0 DragoonAgent
[28325846.242462] [247609]     0 247609    22534      544  28       0             0 sshd
[28325846.250758] [247618] 55218 247618    23339     1147  27       0             0 sshd
[28325846.259044] [247619] 55218 247619    16323      157  25       0             0 grep
[28325846.267329] [251546]     0 251546    22534      545  28       0             0 sshd
[28325846.275619] [251556] 55218 251556    23555     1369  25       0             0 sshd
[28325846.283908] [251557] 55218 251557    16323      158  25       0             0 grep
[28325846.292268] [181412]   514 181412   193295    31909  14       0             0 java
[28325846.305998] [185516]   541 185516  1737562  1402469  19       0             0 java
[28325846.314269] [196262]   541 196262    51229    16833  17       0             0 nginx-proxy
[28325846.323158] [196263]   541 196263     1997      137  11       0             0 cronolog
[28325846.331800] [196264]   541 196264     1997      123  11       0             0 cronolog
[28325846.340429] [196265]   541 196265     1997      137  12       0             0 cronolog
[28325846.349068] [196266]   541 196266    51229    19013  10       0             0 nginx-proxy
[28325846.357953] [196267]   541 196267    51229    19014   9       0             0 nginx-proxy
[28325846.366851] [196269]   541 196269    51261    19015  11       0             0 nginx-proxy
[28325846.375746] [196270]   541 196270    51261    19021  11       0             0 nginx-proxy
[28325846.384644] [196271]   541 196271    51229    19976  11       0             0 nginx-proxy
[28325846.393554] [ 9365]   514  9365    25502      361  10       0             0 crond
[28325846.401754] [ 9399]   514  9399     2686      169  10       0             0 sh
[28325846.409692] [ 9404]   514  9404     2687      170  10       0             0 runLogAgent.sh
[28325846.418686] [ 9424]   514  9424     2688      201  10       0             0 logAgent4.sh
[28325846.427499] [ 9460]   514  9460     1981      119   9       0             0 sleep
[28325846.435712] [17896]     0 17896    22534      816   9       0             0 sshd
[28325846.443817] [17906] 52539 17906    22534      431  12       0             0 sshd
[28325846.451930] [17907] 52539 17907    17022      262  12       0             0 bash
[28325846.460086] [30991] 52539 30991   622789   603903   9       0             0 view
[28325846.468200] Memory cgroup out of memory: Kill process 185516 (java) score 670 or sacrifice child
[28325846.477756] Killed process 185516, UID 541, (java) total-vm:6950248kB, anon-rss:5608252kB, file-rss:1624kB
[28325870.440923] rsync invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
[28325870.449785] rsync cpuset=qing010176013041.n.et2 mems_allowed=0
[28325870.456138] Pid: 33825, comm: rsync Not tainted 2.6.32-358.23.2.ali1172.el5.x86_64 #1

java进程被kill后:

[28325871.829088] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[28325871.837299] [235863]     0 235863     2591      137  11       0             0 init
[28325871.845595] [235932]     0 235932    15662      156   9       0             0 sshd
[28325871.853881] [235955]     0 235955    18701      231  12       0             0 crond
[28325871.862245] [235965]     0 235965     5199       83   9       0             0 atd
[28325871.870440] [235988]     0 235988      952       99  22       0             0 mingetty
[28325871.879256] [39458]     0 39458    12836       63   7       0             0 syslog-ng
[28325871.887809] [39459]     0 39459    13752      731  12       0             0 syslog-ng
[28325871.896366] [40111]     0 40111     7666      184  12       0             0 DragoonAgent
[28325871.905169] [40112]     0 40112     9019      589  11       0             0 DragoonAgent
[28325871.913996] [40113]     0 40113    43101      298  15       0             0 DragoonAgent
[28325871.923171] [40114]     0 40114    53539      636  16       0             0 DragoonAgent
[28325871.931994] [24891]     0 24891    22534      368  30       0             0 sshd
[28325871.940118] [24894] 55218 24894    23163      278  30       0             0 sshd
[28325871.948242] [24895] 55218 24895    16323      132  29       0             0 grep
[28325871.956388] [116246]     0 116246   106058     7227  19       0             0 DragoonAgent
[28325871.965368] [247609]     0 247609    22534      544  28       0             0 sshd
[28325871.973664] [247618] 55218 247618    23339     1147  27       0             0 sshd
[28325871.981954] [247619] 55218 247619    16323      157  25       0             0 grep
[28325871.990242] [251546]     0 251546    22534      545  28       0             0 sshd
[28325871.998530] [251556] 55218 251556    23555     1369  25       0             0 sshd
[28325872.006835] [251557] 55218 251557    16323      158  25       0             0 grep
[28325872.015186] [181412]   514 181412   193295    31891  14       0             0 java
[28325872.023490] [196262]   541 196262    51229    16833  17       0             0 nginx-proxy
[28325872.032380] [196263]   541 196263     1997      137  10       0             0 cronolog
[28325872.041012] [196264]   541 196264     1997      123  11       0             0 cronolog
[28325872.049646] [196265]   541 196265     1997      137  10       0             0 cronolog
[28325872.058285] [196266]   541 196266    51229    18961  10       0             0 nginx-proxy
[28325872.067181] [196267]   541 196267    51229    18961  11       0             0 nginx-proxy
[28325872.076074] [196269]   541 196269    51261    18963  10       0             0 nginx-proxy
[28325872.084986] [196270]   541 196270    51261    18963  10       0             0 nginx-proxy
[28325872.093884] [196271]   541 196271    51229    19972   9       0             0 nginx-proxy
[28325872.102807] [17896]     0 17896    22534      816   9       0             0 sshd
[28325872.110920] [17906] 52539 17906    22534      431  12       0             0 sshd
[28325872.119035] [17907] 52539 17907    17022      262  12       0             0 bash
[28325872.127178] [30991] 52539 30991  2024682  2005803  12       0             0 view
[28325872.135297] [33786]     0 33786    22534      809  11       0             0 sshd
[28325872.143412] [33789]   541 33789    22534      423  12       0             0 sshd
[28325872.151535] [33825]   541 33825    18391      345   9       0             0 rsync
[28325872.159731] [33851]   541 33851    22680      325  12       0             0 rsync
[28325872.167942] Memory cgroup out of memory: Kill process 30991 (view) score 958 or sacrifice child
[28325872.177364] Killed process 30991, UID 52539, (view) total-vm:8098728kB, anon-rss:8022516kB, file-rss:696kB

可以看到view命令在使用时吃了2G+的内存,查看进程被谁使用:

ps -ef|grep 52539

某同学在用view命令查看大日志。

罪魁祸首之二--元凶

时间:2015-04-30 02:23:47 [C] xxxhost[ET2] xxx040.n.et2 check_http (state:2)http健康检查不通过。 本次被oomkiller的时间点很接近上一次(一周前2015-04-19 02:57:58)。

Killer前dmesg:

[28988811.909208] [174621]   596 174621    50744    18795  30       0             0 nginx-proxy
[28988811.919601] [174622]   596 174622    50744    18794  30       0             0 nginx-proxy
[28988811.928551] [174623]   596 174623    50712    19823  30       0             0 nginx-proxy
[28988811.937460] [125332]   699 125332  1810146  1592616  13       0             0 java
[28988811.945792] [140303]   699 140303    16019      594  13       0             0 nginx-proxy
[28988811.954698] [140305]   699 140305     1997       27  12       0             0 cronolog
[28988811.963368] [140307]   699 140307     1998       29  13       0             0 cronolog
[28988811.971971] [140308]   699 140308    18293     2871  13       0             0 nginx-proxy
[28988811.980857] [140309]   699 140309    18262     2839  13       0             0 nginx-proxy
[28988811.989746] [140310]   699 140310    18295     2850  13       0             0 nginx-proxy
[28988811.998653] [140312]   699 140312    18295     2888  13       0             0 nginx-proxy
[28988812.007562] [73740]   540 73740  1736497  1334233  27       0             0 java
[28988812.015688] [84304]   540 84304    51227    16790  24       0             0 nginx-proxy
[28988812.024414] [84305]   540 84305     1997       35  25       0             0 cronolog
[28988812.038079] [84306]   540 84306     1996       26  25       0             0 cronolog
[28988812.046545] [84308]   540 84308     1997       39  25       0             0 cronolog
[28988812.055016] [84309]   540 84309    51227    18837  25       0             0 nginx-proxy
[28988812.063744] [84310]   540 84310    51259    18839  25       0             0 nginx-proxy
[28988812.072457] [84311]   540 84311    51227    18837  25       0             0 nginx-proxy
[28988812.081190] [84312]   540 84312    51227    18837  25       0             0 nginx-proxy
[28988812.089885] [84313]   540 84313    51227    19867  25       0             0 nginx-proxy
[28988812.098762] [211578]   658 211578    16517       50  22       0             0 run.sh
[28988812.107223] [211598]   658 211598  2088842  1754909  21       0             0 java
[28988812.115517] [223499]   658 223499    15505      593  23       0             0 nginx-proxy
[28988812.124389] [223500]   658 223500      967       23  21       0             0 cronolog
[28988812.133046] [223501]   658 223501      965       22  22       0             0 cronolog
[28988812.141732] [223503]   658 223503    17714     2765  22       0             0 nginx-proxy
[28988812.150646] [223504]   658 223504    17714     2770  22       0             0 nginx-proxy
[28988812.159514] [223505]   658 223505    17548     2691  22       0             0 nginx-proxy
[28988812.168409] [223506]   658 223506    17751     2862  22       0             0 nginx-proxy
[28988812.177280] [53645] 59844 53645   127346    27078  30       0             0 java
[28988812.185389] [64449]   610 64449  1381615   739559  18       0             0 java
[28988812.193480] [48158]   514 48158   202009    50494  29       0             0 java
[28988812.201641] [109088]   641 109088   147468    22034   8       0             0 java
[28988812.209905] [185611]   536 185611    16517       51  28       0             0 run.sh
[28988812.218387] [185622]   536 185622  1475070   969956  30       0             0 java
[28988812.226671] [199505]   536 199505    15513      589  29       0             0 nginx-proxy
[28988812.235596] [199506]   536 199506      967       21  31       0             0 cronolog
[28988812.244239] [199507]   536 199507      967       22  31       0             0 cronolog
[28988812.252900] [199508]   536 199508    17556     2631  28       0             0 nginx-proxy
[28988812.261791] [199509]   536 199509    17556     2630  29       0             0 nginx-proxy
[28988812.270698] [199510]   536 199510    17556     2631  28       0             0 nginx-proxy
[28988812.279600] [199511]   536 199511    17556     2631  28       0             0 nginx-proxy
[28988812.288506] [196694]   514 196694    25503      132  16       0             0 crond
[28988812.296921] [196709]   514 196709     2686       39  16       0             0 sh
[28988812.305185] [196711]   514 196711     2687       42  16       0             0 runLogAgent.sh
[28988812.314379] [196724]   514 196724     2688       70  18       0             0 logAgent4.sh
[28988812.323359] [204084]   514 204084    25501      133  31       0             0 crond
[28988812.331754] [204085]   514 204085     2686       40  28       0             0 sh
[28988812.339894] [204094]   514 204094     2687       42  29       0             0 runLogAgent.sh
[28988812.349051] [204132]   514 204132     2688       63  28       0             0 logAgent4.sh
[28988812.358005] [204156]   514 204156     1981       21  31       0             0 sleep
[28988812.366381] [204229]   514 204229    25503      135  13       0             0 crond
[28988812.374746] [204246]   514 204246     2171       37  13       0             0 sh
[28988812.382848] [204365]   514 204365     2172       40  13       0             0 runLogAgent.sh
[28988812.392000] [204426]   623 204426    25502       75  29       0             0 crond
[28988812.400433] [204459]   623 204459     2179       37  29       0             0 sh
[28988812.408570] [204466]   623 204466     2180       39  29       0             0 runLogAgent.sh
[28988812.417705] [205245]   514 205245    26016      133  14       0             0 crond
[28988812.426061] [205250]   514 205250     2686       39  13       0             0 sh
[28988812.434217] [205276]   514 205276     2687       40  13       0             0 runLogAgent.sh
[28988812.443370] [205351]   514 205351     2688       62  15       0             0 logAgent4.sh
[28988812.452354] [205406]   514 205406     1981       23  13       0             0 sleep
[28988812.460715] [208173]   514 208173     2172       66  12       0             0 logAgent4.sh
[28988812.469694] [208203]   514 208203      951       18  12       0             0 sleep
[28988812.478085] [211254]   514 211254    25503      134  13       0             0 crond
[28988812.486455] [211281]   514 211281     2179       36  14       0             0 sh
[28988812.494562] [211286]   514 211286     2180       39  15       0             0 runLogAgent.sh
[28988812.503695] [211340]   514 211340     2181       60  14       0             0 logAgent4.sh
[28988812.512803] [211378]   514 211378      951       18  13       0             0 sleep
[28988812.521197] [211623]   514 211623    25503       76  29       0             0 crond
[28988812.529576] [211666]   514 211666     2686       39  30       0             0 sh
[28988812.537710] [211676]   514 211676     2687       41  31       0             0 runLogAgent.sh
[28988812.546884] [211718]   641 211718    25503       75   9       0             0 crond
[28988812.555276] [211742]   514 211742     2688       64  28       0             0 logAgent4.sh
[28988812.564284] [211756]   641 211756     2171       38  11       0             0 sh
[28988812.572381] [211794]   641 211794     2172       39   9       0             0 runLogAgent.sh
[28988812.581528] [211859]   641 211859     2172       66  10       0             0 logAgent4.sh
[28988812.590490] [211890]   514 211890     1981       22  28       0             0 sleep
[28988812.598895] [211909]   514 211909    25503      127  10       0             0 crond
[28988812.607273] [211939]   514 211939     2686       40  11       0             0 sh
[28988812.615392] [211944]   514 211944     2687       40  11       0             0 runLogAgent.sh
[28988812.624527] [212045]   514 212045     2688       63  11       0             0 logAgent4.sh
[28988812.633559] [212108]   514 212108     1981       23   9       0             0 sleep
[28988812.641973] [212188]   514 212188    25503      134   9       0             0 crond
[28988812.650362] [212194]   514 212194    25503       76  30       0             0 crond
[28988812.658764] [212199]   514 212199     2171       37  11       0             0 sh
[28988812.666915] [212200]   514 212200     2172       40   9       0             0 runLogAgent.sh
[28988812.676083] [212232]   514 212232     2686       38  31       0             0 sh
[28988812.684200] [212235]   514 212235     2687       42  30       0             0 runLogAgent.sh
[28988812.693359] [212257]   514 212257    25503      134  13       0             0 crond
[28988812.701757] [212264]   667 212264    25503       75  21       0             0 crond
[28988812.710118] [212382]   667 212382     2179       37  21       0             0 sh
[28988812.718344] [212383]   667 212383     2180       38  21       0             0 runLogAgent.sh
[28988812.727505] [212620]   667 212620     2180       58  20       0             0 logAgent4.sh
[28988812.736474] [212650]   667 212650      951       17  23       0             0 sleep
[28988812.744919] [212673]   514 212673     2686       39  13       0             0 sh
[28988812.753086] [212681]   514 212681     2687       41  15       0             0 runLogAgent.sh
[28988812.762216] [212713]   514 212713     2688       64  30       0             0 logAgent4.sh
[28988812.771192] [212729]   514 212729     2688       63  15       0             0 logAgent4.sh
[28988812.780182] [212747]   514 212747     1981       22  30       0             0 sleep
[28988812.788534] [212816]   641 212816      951       18   8       0             0 sleep
[28988812.796902] [215245]   514 215245     2172       66   9       0             0 logAgent4.sh
[28988812.805908] [215276]   514 215276      951       17  11       0             0 sleep
[28988812.819553] [217980]   937 217980    25502       74  23       0             0 crond
[28988812.827902] [218003]   937 218003     2686       39  20       0             0 sh
[28988812.836014] [218008]   937 218008     2687       41  23       0             0 runLogAgent.sh
[28988812.845165] [218031]   937 218031     2687       69  20       0             0 logAgent4.sh
[28988812.854175] [218161]   937 218161     1981       22  21       0             0 sleep
[28988812.862635] [218222]   514 218222    25502      134   5       0             0 crond
[28988812.871017] [218248]   514 218248     2171       37   6       0             0 sh
[28988812.879123] [218254]   514 218254     2172       40   7       0             0 runLogAgent.sh
[28988812.888285] [218347]   514 218347     2173       64   4       0             0 logAgent4.sh
[28988812.897252] [218372]   514 218372    25503      134   0       0             0 crond
[28988812.905684] [218452]   514 218452     2171       38   1       0             0 sh
[28988812.913813] [218461]   514 218461     2172       41   2       0             0 runLogAgent.sh
[28988812.923076] [218525]   514 218525    26016      133  23       0             0 crond
[28988812.931432] [218573]   514 218573     2686       39  23       0             0 sh
[28988812.939539] [218585]   514 218585     2687       42  20       0             0 runLogAgent.sh
[28988812.948701] [218632]   681 218632    25502      132  20       0             0 crond
[28988812.957101] [218666]   681 218666     2686       39  20       0             0 sh
[28988812.965204] [218704]   681 218704     2687       41  23       0             0 runLogAgent.sh
[28988812.974378] [218707]   514 218707     2688       63  20       0             0 logAgent4.sh
[28988812.983399] [218859]   681 218859     2687       68  23       0             0 logAgent4.sh
[28988812.992376] [219068]   514 219068     1981       23  21       0             0 sleep
[28988813.000742] [219231]   681 219231     1981       22  22       0             0 sleep
[28988813.009129] [224917]    99 224917    32859      917  28       0             0 python
[28988813.017561] [224924]     0 224924    25502       79  20       0             0 crond
[28988813.025971] [224925]     0 224925    25502       79  20       0             0 crond
[28988813.034314] [224926]   667 224926    25502       79  20       0             0 crond
[28988813.042691] [224927]     0 224927    25501      139   6       0             0 crond
[28988813.051093] [224928]     0 224928    25501      139   5       0             0 crond
[28988813.059528] [224929]     0 224929    25501      138   5       0             0 crond
[28988813.067936] [224930]     0 224930    25501      138   6       0             0 crond
[28988813.076318] [224931]   514 224931    25501      138   7       0             0 crond
[28988813.084698] [224934]     0 224934    25501       78  23       0             0 crond
[28988813.093085] [224935]     0 224935    25501       78  21       0             0 crond
[28988813.101452] [224936]   937 224936    25501       78  21       0             0 crond
[28988813.109827] [224937]     0 224937    25502      114  31       0             0 crond
[28988813.118233] [224938]     0 224938    25502      114  31       0             0 crond
[28988813.126691] [224939]   514 224939    25502      114  31       0             0 crond
[28988813.135074] [224940]     0 224940    25501       79  10       0             0 crond
[28988813.143422] [224941]     0 224941    25501       79  10       0             0 crond
[28988813.151820] [224942]     0 224942    25501       79  10       0             0 crond
[28988813.160183] [224943]     0 224943    25501       79  10       0             0 crond
[28988813.168561] [224944]   514 224944    25501       79  10       0             0 crond
[28988813.176931] [224945]     0 224945    25502      138  13       0             0 crond
[28988813.185360] [224946]     0 224946    25502      138  13       0             0 crond
[28988813.193752] [224947]   514 224947    25502      138  13       0             0 crond
[28988813.202114] [224948]     0 224948    25502      138   1       0             0 crond
[28988813.210479] [224949]     0 224949    25502      138   1       0             0 crond
[28988813.218860] [224950]     0 224950    25502      138   3       0             0 crond
[28988813.227209] [224951]     0 224951    25502      138   3       0             0 crond
[28988813.235575] [224952]   514 224952    25502      138   3       0             0 crond
[28988813.243967] [224955]     0 224955    25501       78  24       0             0 crond
[28988813.252358] [224956]     0 224956    25501       78  24       0             0 crond
[28988813.260751] [224957]     0 224957    25502      139  15       0             0 crond
[28988813.269117] [224958]     0 224958    25502      139  14       0             0 crond
[28988813.277500] [224959]     0 224959    25502      139  15       0             0 crond
[28988813.285865] [224960]     0 224960    25502      139  15       0             0 crond
[28988813.294243] [224961]   514 224961    25502      139  15       0             0 crond
[28988813.302616] [224962]    99 224962    19979       93  13       0             0 tsar_tmd
[28988813.311311] [224963]     0 224963    25502      136  17       0             0 crond
[28988813.319681] [224964]     0 224964    25502      136  17       0             0 crond
[28988813.328136] [224965]     0 224965    25502      136  17       0             0 crond
[28988813.336561] [224966]     0 224966    25502      136  17       0             0 crond
[28988813.344937] [224967]   514 224967    25502      136  17       0             0 crond
[28988813.353329] [224968]     0 224968    26015      137  21       0             0 crond
[28988813.361712] [224969]     0 224969    26015      137  21       0             0 crond
[28988813.370080] [224970]     0 224970    26015      137  21       0             0 crond
[28988813.378458] [224971]     0 224971    26015      137  21       0             0 crond
[28988813.386846] [224972]   514 224972    26015      137  21       0             0 crond
[28988813.395219] [224974]    99 224974    32324      921  15       0             0 python
[28988813.403712] [224975]     0 224975    25503       80   0       0             0 crond
[28988813.412093] [224977]    99 224977    32345      911  15       0             0 python
[28988813.420593] [224979]     0 224979    25500      137  29       0             0 crond
[28988813.428990] [224980]     0 224980    25500      137  29       0             0 crond
[28988813.437351] [224981] 59844 224981    25500      137  29       0             0 crond
[28988813.445745] [224982] 59844 224982    25500      137  29       0             0 crond
[28988813.454134] [224983]   514 224983    25500      137  29       0             0 crond
[28988813.462533] [224986]     0 224986    25502       80  28       0             0 crond
[28988813.470884] [224987]     0 224987    25502       80  28       0             0 crond
[28988813.479230] [224988]     0 224988    25502       80  28       0             0 crond
[28988813.487609] [224989]     0 224989    25502       80  28       0             0 crond
[28988813.495967] [224990]   514 224990    25502       80  28       0             0 crond
[28988813.504340] [224991]     0 224991    25501       77  11       0             0 crond
[28988813.512738] [224992]     0 224992    25501       77  11       0             0 crond
[28988813.521106] [224993]     0 224993    25502       79  10       0             0 crond
[28988813.529560] [224994]     0 224994    25502       79  10       0             0 crond
[28988813.537961] [224995]   641 224995    25502       79  10       0             0 crond
[28988813.546365] [224997]     0 224997    25502      138  10       0             0 crond
[28988813.554735] [224998]     0 224998    25502      139  14       0             0 crond
[28988813.563132] [224999]     0 224999    25502      139  14       0             0 crond
[28988813.571539] [225000]     0 225000    25502      139  14       0             0 crond
[28988813.585083] [225001]     0 225001    25502      139  14       0             0 crond
[28988813.593479] [225002]   514 225002    25502      139  14       0             0 crond
[28988813.601842] [225003]     0 225003    25501      136  25       0             0 crond
[28988813.610232] [225004]     0 225004    25501      136  26       0             0 crond
[28988813.618656] [225005]   514 225005    25501      136  24       0             0 crond
[28988813.627051] [225006]     0 225006    25501      132  19       0             0 crond
[28988813.635434] [225007]     0 225007    25501      132  19       0             0 crond
[28988813.643831] [225008]     0 225008    25501      132  19       0             0 crond
[28988813.652221] [225009]     0 225009    25501      133  19       0             0 crond
[28988813.660619] [225011]   514 225011    25501      132  19       0             0 crond
[28988813.669038] [225012]     0 225012    25501      136  22       0             0 crond
[28988813.677387] [225013]     0 225013    25501      136  22       0             0 crond
[28988813.685755] [225014]   681 225014    25501      136  22       0             0 crond
[28988813.694117] [225015]     0 225015    25501      137  25       0             0 crond
[28988813.702486] [225016]     0 225016    25501      137  25       0             0 crond
[28988813.710861] [225017]     0 225017    25501      137  25       0             0 crond
[28988813.719240] [225018]     0 225018    25501      137  25       0             0 crond
[28988813.727622] [225019]   514 225019    25501      137  25       0             0 crond
[28988813.736081] [225020]     0 225020    26015      137  12       0             0 crond
[28988813.744457] [225021]     0 225021    26015      137  12       0             0 crond
[28988813.752856] [225022]     0 225022    26015      137  12       0             0 crond
[28988813.761207] [225023]     0 225023    26015      137  12       0             0 crond
[28988813.769596] [225024]     0 225024    26015      137  12       0             0 crond
[28988813.777991] [225025]   514 225025    26015      137  12       0             0 crond
[28988813.786365] [225026]     0 225026    25502      138  14       0             0 crond
[28988813.794751] [225027]     0 225027    25502      138  14       0             0 crond
[28988813.803185] [225028]     0 225028    25502      138  14       0             0 crond
[28988813.811606] [225029]     0 225029    25502      138  14       0             0 crond
[28988813.819986] [225031]   514 225031    25502      138  14       0             0 crond
[28988813.828378] [225032]     0 225032    25502       80  31       0             0 crond
[28988813.836756] [225033]     0 225033    25502       80  31       0             0 crond
[28988813.845120] [225034]     0 225034    25502       80  31       0             0 crond
[28988813.853528] [225035]     0 225035    25502       80  31       0             0 crond
[28988813.861892] [225036]   514 225036    25502       80  31       0             0 crond
[28988813.870302] [225037]    99 225037    32859      915  17       0             0 python
[28988813.878741] [225039]     0 225039    25502       80  23       0             0 crond
[28988813.884719] __ratelimit: 19122 callbacks suppressed
[28988813.884721] sshd: page allocation failure. order:0, mode:0x20
[28988813.884724] Pid: 73067, comm: sshd Tainted: G        W  ---------------    2.6.32-358.23.2.ali1172.el5.x86_64 #1

Killer后dmesg:

[28988816.512969] [225040]     0 225040    25502       80  23       0             0 crond
[28988816.526580] [225041]   514 225041    25502       80  23       0             0 crond
[28988816.534980] [225042]    99 225042    31685      767  24       0             0 python
[28988816.543461] [225047]     0 225047    25502       81  29       0             0 crond
[28988816.551859] [225053]     0 225053    16481       43   0       0             0 sh
[28988816.559968] [225054]    99 225054    30635      744  15       0             0 python
[28988816.568468] [225060] 59844 225060     2686       43  30       0             0 bash
[28988816.576747] [225068]     0 225068    16481       48  14       0             0 sh
[28988816.584859] [225069]     0 225069    16465       42  14       0             0 sh
[28988816.592963] [225093]    99 225093    16481       45  22       0             0 tsar_check_io
[28988816.602030] [225096] 120063 225096    16473       50   9       0             0 disk_rw
[28988816.610678] [225099]     0 225099    16481       45  22       0             0 sh
[28988816.618837] [225100]     0 225100    25503      132   9       0             0 crond
[28988816.627209] [225102]     0 225102    25502      131   9       0             0 crond
[28988816.635584] [225104]     0 225104    25502      131   9       0             0 crond
[28988816.643983] [225105]     0 225105    25503      132   9       0             0 crond
[28988816.652356] [225106]   514 225106    25502      131   9       0             0 crond
[28988816.660709] [225109] 59844 225109     2686       49  30       0             0 bash
[28988816.668990] [225131]    99 225131    30634      751   0       0             0 python
[28988816.682656] [225142]     0 225142     2179       40   0       0             0 sh
[28988816.690788] [225144]     0 225144     1995       30  11       0             0 sadc
[28988816.699098] [225145]     0 225145     1567      113   2       0             0 tsar
[28988816.707553] [225147]     0 225147     2686       47   9       0             0 sh
[28988816.715704] [225154]    99 225154    15950       49   4       0             0 tsar_check_io
[28988816.724828] [225155]   514 225155     2688       70  17       0             0 logAgent4.sh
[28988816.733853] [225156]    99 225156    16465       51  29       0             0 tsar_check_io
[28988816.742936] [225157]     0 225157     2179       46  16       0             0 sh
[28988816.751027] [225158]     0 225158     2079      117   9       0             0 tsar
[28988816.759322] [225160] 59844 225160     2686       43  30       0             0 bash
[28988816.767658] [225161]    99 225161     1566      116  20       0             0 tsar
[28988816.775955] [225166]     0 225166    15653      140  18       0             0 sshd
[28988816.784227] [225167]    99 225167     1565      115   5       0             0 tsar
[28988816.792484] [225168]    99 225168     4048       96   6       0             0 python
[28988816.800952] [225170]    99 225170    16481       45  15       0             0 tsar_check_io
[28988816.810004] [225172]     0 225172     1673       14   3       0             0 ls
[28988816.818106] [225173]     0 225173    69547      631   5       0             0 DragoonAgent
[28988816.827109] [225174]     0 225174    50395      337  27       0             0 DragoonAgent
[28988816.836075] [225175]    99 225175    16465       41  29       0             0 tsar_check_io
[28988816.845125] [225176]     0 225176    69316      291  29       0             0 DragoonAgent
[28988816.854168] [225177]     0 225177    16176       32  14       0             0 sshd
[28988816.862443] [225179]     0 225179    25501      131  17       0             0 crond
[28988816.870801] [225180]    99 225180     1565      115  21       0             0 tsar
[28988816.879100] [225181] 59844 225181     2686       43  29       0             0 bash
[28988816.887367] [225182]     0 225182    16481       46  13       0             0 sh
[28988816.895475] [225183]     0 225183    25501      137   5       0             0 crond
[28988816.903848] [225185]     0 225185    25501      136   4       0             0 crond
[28988816.912347] [225187]   514 225187    25502      135   2       0             0 crond
[28988816.920765] Out of memory (oom_kill_allocating_task): Kill process 194223 (java) score 0 or sacrifice child

Killer后top命令:

top - 09:44:49 up 278 days, 19:17,  2 users,  load average: 0.45, 0.69, 0.44
Tasks:  46 total,   1 running,  45 sleeping,   0 stopped,   0 zombie
Cpu(s): 14.4%us,  2.1%sy,  0.0%ni, 67.4%id,  0.0%wa,  0.0%hi,  0.3%si, 15.9%st
Mem:   8388608k total,  4081276k used,  4307332k free,        0k buffers
Swap:  1999864k total,    19536k used,  1980328k free,   630644k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                        
 32420 admin     20   0 6553m 3.1g  32m S 58.9 38.2   5:39.57 java                                                                                                                                                                           
 33090 admin     20   0  200m  74m 9140 S  2.7  0.9   0:06.08 nginx-proxy                                                                                                                                                                    
 33092 admin     20   0  200m  74m 9128 S  2.7  0.9   0:07.18 nginx-proxy                                                                                                                                                                    
 33093 admin     20   0  200m  74m 9184 S  1.3  0.9   0:06.80 nginx-proxy                                                                                                                                                                    
 33094 admin     20   0  200m  74m 9160 S  1.0  0.9   0:06.21 nginx-proxy                                                                                                                                                                    
222661 timetunn  20   0  735m 123m  12m S  1.0  1.5   1:14.58 java                                                                                                                                                                           
 33089 admin     20   0  7988  520  416 S  0.3  0.0   0:01.40 cronolog                                                                                                                                                                       
 34253 ben.xion  20   0 14780 1076  836 R  0.3  0.0   0:00.01 top                                                                                                                                                                            
146627 root      20   0  360m  34m  676 S  0.3  0.4 222:53.53 DragoonAgent                                                                                                                                                                   
259333 root      20   0  272m 3380 1040 S  0.3  0.0 129:38.90 DragoonAgent                                                                                                                                                                   
     1 root      20   0 10364  280  228 S  0.0  0.0   0:34.87 init                                                                                                                                                                           
    68 root      20   0 62648  300  200 S  0.0  0.0   1:11.84 sshd                                                                                                                                                                           
    91 root      20   0 74804  504  124 S  0.0  0.0   8:43.53 crond                                                                                                                                                                          
   101 root      20   0 20796  176  124 S  0.0  0.0   0:04.46 atd                                                                                                                                                                            
   124 root      20   0  3808    8    4 S  0.0  0.0   0:00.00 mingetty                                                                                                                                                                       
 11787 root      20   0 90136   16    4 S  0.0  0.0   0:00.00 sshd                                                                                                                                                                           
 11791 ben.xion  20   0 94252   96    0 S  0.0  0.0   0:00.85 sshd                                                                                                                                                                           
 11792 ben.xion  20   0 65292    8    4 S  0.0  0.0   0:00.19 grep                                                                                                                                                                           
 12041 root      20   0 90136   20    4 S  0.0  0.0   0:00.00 sshd                                                                                                                                                                           
 12043 ben.xion  20   0 94156   64    0 S  0.0  0.0   0:00.95 sshd                                                                                                                                                                           
 12044 ben.xion  20   0 65292   12    4 S  0.0  0.0   0:00.17 grep                                                                                                                                                                           
 32054 root      20   0 90136 3288 2568 S  0.0  0.0   0:00.00 sshd                                                                                                                                                                           
 32056 qicheng.  20   0 90136 1736 1008 S  0.0  0.0   0:00.01 sshd                                                                                                                                                                           
 32057 qicheng.  20   0 68088 1600 1252 S  0.0  0.0   0:00.00 bash                                                                                                                                                                           
 32183 root      20   0  100m 1348 1060 S  0.0  0.0   0:00.00 su                                                                                                                                                                             
 32192 root      20   0 90136 3284 2568 S  0.0  0.0   0:00.00 sshd                                                                                                                                                                           
 32194 ben.xion  20   0 90136 1748 1012 S  0.0  0.0   0:00.31 sshd                                                                                                                                                                           
 32195 ben.xion  20   0 68088 1604 1264 S  0.0  0.0   0:00.03 bash                                                                                                                                                                           
 32232 admin     20   0 68088 1628 1264 S  0.0  0.0   0:00.01 bash                                                                                                                                                                           
 33086 root      20   0  200m  65m  404 S  0.0  0.8   0:00.00 nginx-proxy                                                                                                                                                                    
 33087 admin     20   0  7988  520  416 S  0.0  0.0   0:00.00 cronolog                                                                                                                                                                       
 33088 admin     20   0  7992  516  416 S  0.0  0.0   0:00.00 cronolog                                                                                                                                                                       
 33095 admin     20   0  200m  78m  12m S  0.0  1.0   0:00.02 nginx-proxy                                                                                                                                                                    
 34158 timetunn  20   0 99.6m 1444  916 S  0.0  0.0   0:00.00 crond                                                                                                                                                                          
 34167 timetunn  20   0 10744 1028  872 S  0.0  0.0   0:00.00 sh                                                                                                                                                                             
 34169 timetunn  20   0 10748 1084  920 S  0.0  0.0   0:00.00 runLogAgent.sh                                                                                                                                                                 
 34180 timetunn  20   0 10752 1240  988 S  0.0  0.0   0:00.00 logAgent4.sh                                                                                                                                                                   
 34196 timetunn  20   0  7924  480  396 S  0.0  0.0   0:00.00 sleep                                                                                                                                                                          
 37052 root      20   0 90136    8    4 S  0.0  0.0   0:00.00 sshd                                                                                                                                                                           
 37054 ben.xion  20   0 91724    8    0 S  0.0  0.0   0:00.08 sshd                                                                                                                                                                           
 37055 ben.xion  20   0 65292    8    4 S  0.0  0.0   0:00.97 grep                                                                                                                                                                           
259019 root      20   0 51344   12    0 S  0.0  0.0   0:00.00 syslog-ng                                                                                                                                                                      
259020 root      20   0 55272 2180  448 S  0.0  0.0  31:57.87 syslog-ng                                                                                                                                                                      
259330 root      20   0 30664   72   36 S  0.0  0.0   3:56.32 DragoonAgent                                                                                                                                                                   
259331 root      20   0 36376 2348  828 S  0.0  0.0 211:47.28 DragoonAgent                                                                                                                                                                   
259332 root      20   0  168m  964  548 S  0.0  0.0  75:24.63 DragoonAgent   

可以看到不管是dmesg系统日志还是top命令,都有很多的crondtab定时任务。 咨询了一下PE QiC,为什么一个纯web应用的虚拟机里会有这么多定时任务呢?

答:本宿主机上的跑任务的虚拟机挂了,影响了这个web应用,猜测虚拟化隔离没做好导致进程漏到其他虚拟机。具体要联系内核组来看这个虚拟化的问题了。

总结

1.线上不要试图编辑大文件,缓冲区会把内存吃满,导致其他正常进程被oom killer,而且这个进程很可能是java,因为oom_score都是0的情况下,它内存占用最大,最可能被killer。

如同线上不能直接jmap dump内存一样,会触发stop the world。

2.有时候错不在你,但与荣俱荣、唇亡齿寒。 我们的应用虽然一天好几亿的请求量,但终归是“长尾”集群,之前以为虚个8个10个的差不多了,今日细看,32核128G的物理机,虚了活活30台机器,里面五花八门,各种任务、服务、web应用的机器,充分利用了CPU、内存、IO、网络等等资源,真叫人叹为观止。