Thursday, December 26, 2013

How to Analyze a Crash Dump



1
How to Analyze a Crash Dump

2
     1.Download the Solaris Crash Analysis Tool (SCAT) from the Sun Website.
  
I       2. Install the package onto a Solaris server.

3. Update the Paths on the server to include the new SCAT package
# PATH=$PATH:/opt/SUNWscat/bin; export PATH
# echo $PATH

4. Now find out the servers savecore/dump directory and change to that directory.
# dumpadm
Dump content: kernel pages
Dump device: /dev/dsk/c1t1d0s1 (swap)
Savecore directory: /var/crash/sparc10
Savecore enabled: yes
# cd /var/crash/sparc10

5. Copy across the dump file the customer has provided you into the savecore directory.

6. If you wanted to analyse a dump file which ended in .2 (E.g. unix.2 & vmcore.2) you would run
the following command.


# scat 2
Solaris[TM] CAT 5.0 for Solaris 9 64-bit SPARC(sun4u)
SV4622M, Jul 3 2008
Copyright © 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Feedback regarding the tool should be sent to
SolarisCAT_Feedback@Sun.com
Visit the Solaris CAT blog at http://blogs.sun.com/SolarisCAT
opening unix.2 vmcore.2 ...dumphdr...symtab...core...done
loading core data: modules...symbols...done
loading CTF...stabs...Unable to load any default stabs file
done
core file: /var/crash/sparc10/vmcore.2
user: Super-User (root:0)
release: 5.9 (64-bit)
version: Generic_118558-22
machine: sun4u
node name: margay
hw_provider: Sun_Microsystems
2
3
system type: SUNW,Ultra-4 (UltraSPARC-II)
hostid: 80b1418e
dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c0t0d0s1(2.92G)
time of crash: Thu Oct 1 13:11:35 EST 2009
age of system: 3 minutes 56.12 seconds
panic CPU: 0 (4 CPUs, 1.00G memory)
panic string: [AFT1] errID 0x00000056.8f3a1a1d UE Error(s)
See previous message(s) for details
sanity checks: settings...vmem...CPU...sysent...clock...misc...
done
SolarisCAT(vmcore.2/9U)>

7. You are now presented with an overview of the system where the dump file came from. From
here we can start to run a series of commands to find out excatly what happened. There are
a wide range of commands which can be used in SCAT; i will only be mentioning a few of the
main ones.
- analyze analyse the core dump
- proc Show which processes running when the system crashed
- thread <thread #> Investigate a particular thread which a
- help Show all the commands available within SCAT.


8. When you run the analyze command look for the below output as it will tell you which
command failed and on which CPU.
==== panic thread: 0x2a10001fd40 ==== CPU: 0 ====
==== panic kernel thread: 0x2a10001fd40 PID: 0 on CPU: 0 ====
cmd: sched
9. From the above output it appears the ’sched’ command causes the panic on CPU 0. From
here you can use some of the other commands to dive deeper into the dump file to find out
why exactly it failed. When analysing dump files it is a good practice to also review the system
log when the crash occured as it also contains valuable information.
For more information on analysing crash dumps see the below links:
http://blogs.sun.com/solariscat/
http://www.cuddletech.com/blog/pivot/entry.php?id=966
http://www.itworld.com/use-solaris-cat-crash-dump-nlsunix-080424
3
Examples:

scat 2

  Solaris[TM] CAT 5.0 for Solaris 11 64-bit x86
    SV4622M, Jul  3 2008

  Copyright © 2008 Sun Microsystems, Inc. All rights reserved.
  Use is subject to license terms.

  Feedback regarding the tool should be sent to SolarisCAT_Feedback@Sun.COM
  Visit the Solaris CAT blog at http://blogs.sun.com/SolarisCAT

opening unix.0 vmcore.0 ...dumphdr...symtab...core...done
loading core data: modules...symbols...ctftype: unknown type struct panic_trap_info
CTF...done

core file:      /var/crash/xxxxxxxx/vmcore.0
user:           Super-User (root:0)
release:        5.11 (64-bit)
version:        snv_67
machine:        i86pc
node name:      xxxxxxxxxxxxxxxxxx
system type:    i86pc
hostid:         xxxxxxxx
dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/dsk/c0t0d0s1(24.0G)
time of crash:  Mon Aug 25 07:41:00 GMT 2008 (core is 13 days old)
age of system:  91 days 22 hours 49 minutes 50.97 seconds
panic CPU:      1 (8 CPUs, 31.9G memory)
panic string:   page_free pp=ffffff0007243bd8, pfn=11228e, lckcnt=0, cowcnt=0 slckcnt = 0

sanity checks: settings...vmem...
WARNING: FSS thread 0xffffff097d1e3400 on CPU2 using 99%CPU
WARNING: FSS thread 0xffffff09fddbab40 on CPU3 using 99%CPU
sysent...clock...misc...
NOTE: system has 54 non-global zones
done
SolarisCAT(vmcore.0/11X)>

SolarisCAT(vmcore.0/11X)> analyze

core file:      /var/crash/xxxxxx/vmcore.0
user:           Super-User (root:0)
release:        5.11 (64-bit)
version:        snv_67
machine:        i86pc
node name:      xxxxxx
system type:    i86pc
hostid:         xxxxx
dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/dsk/c0t0d0s1(24.0G)
time of crash:  Mon Aug 25 07:41:00 GMT 2008 (core is 13 days old)
age of system:  91 days 22 hours 49 minutes 50.97 seconds
panic CPU:      1 (8 CPUs, 31.9G memory)
panic string:   page_free pp=ffffff0007243bd8, pfn=11228e, lckcnt=0, cowcnt=0 slckcnt = 0


==== panic thread: 0xfffffffef4ce5dc0 ==== CPU: 1 ====
==== panic user (LWP_SYS) thread: 0xfffffffef4ce5dc0  PID: 10156  on CPU: 1 ====
cmd: /opt/local/sbin/httpd -k start
t_procp: 0xffffffff06595e50
  p_as: 0xffffffff093490e0  size: 47374336  RSS: 3125248
  hat: 0xffffffff092a9480  cpuset: 1
  zone: address translation failed for zone_name addr: 8 bytes @ 0x3

t_stk: 0xffffff00486bcf10  sp: 0xffffff00486bc880  t_stkbase: 0xffffff00486b8000
t_pri: 3(FSS)  pctcpu: 0.380035
t_lwp: 0xfffffffefe61ab60  lwp_regs: 0xffffff00486bcf10
  mstate: LMS_SYSTEM  ms_prev: LMS_SYSTEM
  ms_state_start: 2 minutes 31.229022230 seconds earlier
  ms_start: 2 minutes 31.343582414 seconds earlier
psrset: 0  last CPU: 1 
idle: 0 ticks (0 seconds)
start: Mon Aug 25 07:41:00 2008
age: 0 seconds (0 seconds)
syscall: #131 memcntl(, 0x0) ()
tstate: TS_ONPROC - thread is being run on a processor
tflg:   T_PANIC - thread initiated a system panic
        T_DFLTSTK - stack is default size
tpflg:  TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
        TS_RUNQMATCH
pflag:  SMSACCT - process is keeping micro-state accounting
        SMSFORK - child inherits micro-state accounting

pc:      unix:vpanic_common+0x13b:  addq   $0xf0,%rsp

unix:vpanic_common+0x13b()
unix:panic+0x9c()
unix:page_free+0x22e()
unix:page_destroy+0x100()
genunix:fs_dispose+0x2e()
genunix:fop_dispose+0xdc()
genunix:pvn_getdirty+0x1f0()
zfs:zfs_putpage+0x129()
genunix:fop_putpage+0x65()
genunix:segvn_sync+0x39f()
genunix:as_ctl+0x1f2()
genunix:memcntl+0x709()
unix:_syscall32_save+0xbf()
-- switch to user thread's user stack --
This output provides a vast array of useful details, including:
  • System summary, including OS release and version, architecture, hostname, and hostid; as well as number of CPU's and memory
  • Time of crash and previous uptime ("age of system")
  • The panic string and CPU that it occurred on
  • The thread that caused the panic and its details, including the command (argc &argv), its memory footprint (size & rss), and zone
  • The threads state information, run time, start time, current syscall
  • The call stack
As noted in Part 1, what most people are really looking for when doing core analysis is to determine which application was responsable, and this output provides that data in great clarity. Lets dig into it a bit more explicitly... based on the above "analyze" output we can see that....
  • The system is an 8CPU X86 box running snv_67 (Solaris Nevada Build 67) in 64bit mode with 32GB of RAM.
  • System crashed on Aug 25th at 7:41AM GMT, it was previously up for 91 days
  • System paniced on "page_free" call, on CPU 1
  • The running thread was "httpd -k start"... an Apache worker process.
  • The process had the PID 10156, consumed 3.1MB of Physical Memory (RSS) and had a virtual size of 47MB
  • The process was using less than 1% (pctcpu) of CPU 1, was using the Fair Share Scheduler (FSS), on Processor Set (psrset) 0.
  • The process started on Aug 25th at 7:41AM GMT, it was 0 seconds old when it crashed... possibly a forked worker gone bad.
For many administrators this might be as much as you wanted to know, right there. But lets look at a couple more commands.
You'll recall that during the sanity checks at startup it noted 2 threads consuming full CPU's. We can feed the thread address to the "thread" command to get details on them:

SolarisCAT(vmcore.0/11X)> thread 0xffffff097d1e3400
==== user (LWP_SYS) thread: 0xffffff097d1e3400  PID: 27446  on CPU: 2 ====
cmd: nano svn-commit.tmp
t_procp: 0xffffffff2e908ab0
  p_as: 0xffffffff10402ee0  size: 2772992  RSS: 1642496
  hat: 0xffffffff102f6b48  cpuset: 2
  zone: address translation failed for zone_name addr: 8 bytes @ 0x2

t_stk: 0xffffff004e47ef10  sp: 0xffffff003d3fcf08  t_stkbase: 0xffffff004e47a000
t_pri: 26(FSS)  pctcpu: 99.306175
t_lwp: 0xffffffff202a78b0  lwp_regs: 0xffffff004e47ef10
  mstate: LMS_SYSTEM  ms_prev: LMS_USER
  ms_state_start: 2 minutes 31.228983791 seconds earlier
  ms_start: 39 days 19 hours 11 minutes 8.989252296 seconds earlier
psrset: 0  last CPU: 2 
idle: 9 ticks (0.09 seconds)
start: Wed Jul 16 12:30:07 2008
age: 3438653 seconds (39 days 19 hours 10 minutes 53 seconds)
syscall: #98 sigaction(, 0x0) ()
tstate: TS_ONPROC - thread is being run on a processor
tflg:   T_DFLTSTK - stack is default size
tpflg:  TP_TWAIT - wait to be freed by lwp_wait
        TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
        TS_RUNQMATCH
pflag:  SMSACCT - process is keeping micro-state accounting
        SMSFORK - child inherits micro-state accounting

pc:      unix:panic_idle+0x23:  jmp    -0x2     (unix:panic_idle+0x23)

unix:panic_idle+0x23()
0xffffff003d3fcf60()
-- error reading next frame @ 0x0 --
So using the "thread" command we can get full granularity on a given thread. In fact, using the "tlist" command you can dump this information for every thread on the system at the time of crash.
Another nifty command is "tunables". This will display the "current value" (at time of the dump) and the default value. If someone's been experimenting on the production systems this will clue you in.
SolarisCAT(vmcore.0/11X)> tunables  
    Tunable Name     Current   Default Value  Units      Description
                     Value                              
    physmem          8386375   *              pages      Physical memory
                                                         installed in system.
    freemem          376628    *              pages      Available memory.
    avefree          338943    *              pages      Average free memory
                                                         in the last 30 seconds
.........
Using the "dispq" command we can look at the dispatch queues (run queue). This answers "what other processes were running on CPU at the time of the crash", again, using the thread address we can dig into them with "thread":
SolarisCAT(vmcore.0/11X)> dispq
      CPU                  thread               pri        PID cmd
  0 @ 0xfffffffffbc26bb0   0xffffff003d005c80    -1            (idle)
               pri  60 -=> 0xffffff004337dc80    60          0 sched
  1 @ 0xfffffffec6634000 P 0xfffffffef4ce5dc0 P   3      10156 /opt/local/sbin/httpd -k start
  2 @ 0xfffffffec662f000   0xffffff097d1e3400    26      27446 nano svn-commit.tmp
  3 @ 0xfffffffec66f4800   0xffffff09fddbab40    25      21329 java -jar xxxxx.jar --ui=console
  4 @ 0xfffffffec66ea800   0xffffff003d414c80    -1            (idle)
               pri  60 -=> 0xffffff0048b12c80    60          0 sched
  5 @ 0xfffffffec6770800   0xffffff003d4b0c80    -1            (idle)
  6 @ 0xfffffffec6770000   0xffffff003d53bc80    -1            (idle)
  7 @ 0xfffffffec6762000   0xffffff003d58fc80    -1            (idle)

      part                 thread               pri        PID cmd
  0 @ 0xfffffffffbc4eef0
There are far too many to go through in a blog entry... but lets look at my personal favorite, "zfs". The "zfs" command can show us the pool(s), their configuration, read/write/checksum/error stats, and even ARC stats!
SolarisCAT(vmcore.0/11X)> zfs -e
ZFS spa @ 0xfffffffec6c21540
    Pool name: zones
    State: ACTIVE
       VDEV Address      State    Aux   Description
    0xfffffffec0a9e040  FAULTED    -       root

            READ   WRITE   FREE   CLAIM   IOCTL 
    OPS        0      0     0      0      0
    BYTES      0      0     0      0      0

    EREAD       0
    EWRITE      0
    ECKSUM      0

            VDEV Address      State    Aux     Description
         0xfffffffec0a9eac0  FAULTED    -    /dev/dsk/c0t1d0s0

                  READ      WRITE     FREE   CLAIM   IOCTL 
         OPS     74356305  578263155     0      0      0
         BYTES       757G      10.4T     0      0      0

         EREAD       0
         EWRITE      0
         ECKSUM      0
SolarisCAT(vmcore.0/11X)> zfs arc

ARC (Adaptive Replacement Cache) Stats:

    hits                       77708247444
    misses                         1930348
    demand_data_hits           74303514929
    demand_data_misses             1325511
    demand_metadata_hits         620388795
    demand_metadata_misses          160708
    prefetch_data_hits          1361651307
....
I hope this helps you get an idea of how easy it is to really dig deeply into your core dumps using Solaris CAT to hide the oddities of mdb from you. Its a powerful and robust tool, and I'm glad that we have it.
Happy dump divin'! You'll be amazed how much you'll learn about your system.

The proc command, for example, can tell you about the processes that were running at the time your system crashed. These processes are listed by default in reverse PID order.
SolarisCAT(vmcore.0)> proc

    addr       pid    ppid   uid      size      rss     swresv   time  command
------------- ------ ------ ------ ---------- -------- -------- ------ ---------
0x30003c8e040    283      1      0    3776512  1646592  1302528  90118 /usr/sbin/ssmon
0x30003c96a50    279      1      0    9306112  2514944  1769472     19 /usr/sbin/ssserver
0x30003bee030    256      1      0   27656192  2596864  1138688     57 /usr/sbin/nscd
0x30003c8ea58    243      1      0    2506752  1703936   466944      7 /usr/sbin/cron
0x30003c96038    240      1      0   18874368  2170880  2711552      7 /usr/sbin/syslogd
0x30000f60010    225      1      0    7217152  2400256  1146880    170 /usr/lib/autofs/automountd
0x300020c4a40    217      1      0    2260992  1572864   598016      3 /usr/lib/nfs/lockd
0x300020c5458    213      1      1    4677632  1974272   876544      2 /usr/lib/nfs/statd
0x300020c4028    201      1      0    2629632  2048000   835584     12 /usr/sbin/inetd -s
[...]



No comments:

Post a Comment