Solaris_blog: How to Analyze a Crash Dump

How to Analyze a Crash Dump

1. Download the Solaris Crash Analysis Tool (SCAT) from the Sun website.

2. Install the package onto a Solaris server.

3. Update the PATH on the server to include the new SCAT package:

# PATH=$PATH:/opt/SUNWscat/bin; export PATH

# echo $PATH
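If step 3 is run more than once in the same session, the SCAT directory ends up in PATH several times. A minimal sketch of a guard against that, assuming the /opt/SUNWscat/bin location used above (the function name path_append is made up for illustration):

```sh
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;       # not present, append it
    esac
    export PATH
}

# /opt/SUNWscat/bin is the SCAT install location from step 3.
path_append /opt/SUNWscat/bin
```

Calling path_append a second time with the same directory leaves PATH unchanged.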

4. Now find out the server's savecore/dump directory and change to that directory.

# dumpadm

Dump content: kernel pages

Dump device: /dev/dsk/c1t1d0s1 (swap)

Savecore directory: /var/crash/sparc10

Savecore enabled: yes

# cd /var/crash/sparc10
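Rather than reading the directory off the screen, the "Savecore directory" line can be picked out of the dumpadm output. A rough sketch, assuming the output format shown above (savecore_dir is a hypothetical helper name):

```sh
# Print the savecore directory from dumpadm-style output read on stdin.
# Leading spaces before the label are tolerated.
savecore_dir() {
    sed -n 's/^ *Savecore directory: *//p'
}

# On the Solaris host (as root) this could be used as:
#   cd "$(dumpadm | savecore_dir)"
```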

5. Copy the dump files the customer has provided into the savecore directory.

6. To analyse a dump whose files end in .2 (e.g. unix.2 and vmcore.2), run the following command.

# scat 2

Solaris[TM] CAT 5.0 for Solaris 9 64-bit SPARC(sun4u)

SV4622M, Jul 3 2008

Use is subject to license terms.

Feedback regarding the tool should be sent to

SolarisCAT_Feedback@Sun.com

Visit the Solaris CAT blog at http://blogs.sun.com/SolarisCAT

opening unix.2 vmcore.2 ...dumphdr...symtab...core...done

loading core data: modules...symbols...done

loading CTF...stabs...Unable to load any default stabs file

done

core file: /var/crash/sparc10/vmcore.2

user: Super-User (root:0)

release: 5.9 (64-bit)

version: Generic_118558-22

machine: sun4u

node name: margay

hw_provider: Sun_Microsystems

system type: SUNW,Ultra-4 (UltraSPARC-II)

hostid: 80b1418e

dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c0t0d0s1(2.92G)

time of crash: Thu Oct 1 13:11:35 EST 2009

age of system: 3 minutes 56.12 seconds

panic CPU: 0 (4 CPUs, 1.00G memory)

panic string: [AFT1] errID 0x00000056.8f3a1a1d UE Error(s)

See previous message(s) for details

sanity checks: settings...vmem...CPU...sysent...clock...misc...

done

SolarisCAT(vmcore.2/9U)>
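The number passed to scat in step 6 is the suffix shared by a unix.N / vmcore.N pair. When a savecore directory holds several dumps, a small helper can list the suffixes that have both files. This is only a sketch; list_dumps is a made-up name:

```sh
# List every dump number N for which both unix.N and vmcore.N exist
# in the given directory (default: the current directory).
list_dumps() {
    dir=${1:-.}
    for core in "$dir"/vmcore.*; do
        [ -f "$core" ] || continue          # glob matched nothing
        n=${core##*.}                       # text after the last dot
        if [ -f "$dir/unix.$n" ]; then
            echo "$n"
        fi
    done
}

# Example: list_dumps /var/crash/sparc10
```

Each number printed can be handed to scat as in step 6.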

7. You are now presented with an overview of the system the dump file came from. From here we can run a series of commands to find out exactly what happened. SCAT offers a wide range of commands; I will only mention a few of the main ones.

- analyze             Analyse the core dump
- proc                Show which processes were running when the system crashed
- thread <thread #>   Investigate a particular thread
- help                Show all the commands available within SCAT

8. When you run the analyze command, look for output like the following, as it tells you which command the panic thread belonged to and on which CPU it was running.

==== panic thread: 0x2a10001fd40 ==== CPU: 0 ====

==== panic kernel thread: 0x2a10001fd40 PID: 0 on CPU: 0 ====

cmd: sched

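9. From the above output, the panic thread belonged to 'sched' on CPU 0. Note that sched is PID 0, which represents the kernel itself, so this points at kernel context rather than a user application; together with the UE (uncorrectable error) panic string shown earlier, it suggests a hardware memory fault rather than a misbehaving program. From here you can use some of the other commands to dig deeper into the dump file and find out exactly why it failed. When analysing dump files it is good practice to also review the system log from the time of the crash, as it contains valuable information too.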
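The fields that steps 8 and 9 pick out by eye can also be extracted from saved output. The sketch below assumes you have pasted the SCAT banner and analyze output into a text file (analyze.out is a hypothetical name) and that the labels look like the ones shown above. On older Solaris releases the default /usr/bin/awk may lack sub(), so nawk may be needed there.

```sh
# Print the panic CPU, panic string and panicking command from saved
# SCAT output read on stdin.
panic_summary() {
    awk '
        /^ *panic string:/ { sub(/^ *panic string: */, ""); print "panic: " $0 }
        /^ *panic CPU:/    { print "cpu: " $3 }
        /^ *cmd:/ && !seen { sub(/^ *cmd: */, ""); print "cmd: " $0; seen = 1 }
    '
}

# Example: panic_summary < analyze.out
```

Lines are printed in the order they appear in the input, so the summary follows the layout of the original output.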

For more information on analysing crash dumps see the below links:

http://blogs.sun.com/solariscat/

http://www.cuddletech.com/blog/pivot/entry.php?id=966

http://www.itworld.com/use-solaris-cat-crash-dump-nlsunix-080424

Examples:

# scat 0

Solaris[TM] CAT 5.0 for Solaris 11 64-bit x86

SV4622M, Jul 3 2008

Use is subject to license terms.

Feedback regarding the tool should be sent to SolarisCAT_Feedback@Sun.COM

Visit the Solaris CAT blog at http://blogs.sun.com/SolarisCAT

opening unix.0 vmcore.0 ...dumphdr...symtab...core...done

loading core data: modules...symbols...ctftype: unknown type struct panic_trap_info

CTF...done

core file: /var/crash/xxxxxxxx/vmcore.0

user: Super-User (root:0)

release: 5.11 (64-bit)

version: snv_67

machine: i86pc

node name: xxxxxxxxxxxxxxxxxx

system type: i86pc

hostid: xxxxxxxx

dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c0t0d0s1(24.0G)

time of crash: Mon Aug 25 07:41:00 GMT 2008 (core is 13 days old)

age of system: 91 days 22 hours 49 minutes 50.97 seconds

panic CPU: 1 (8 CPUs, 31.9G memory)

panic string: page_free pp=ffffff0007243bd8, pfn=11228e, lckcnt=0, cowcnt=0 slckcnt = 0

sanity checks: settings...vmem...

WARNING: FSS thread 0xffffff097d1e3400 on CPU2 using 99%CPU

WARNING: FSS thread 0xffffff09fddbab40 on CPU3 using 99%CPU

sysent...clock...misc...

NOTE: system has 54 non-global zones

done

SolarisCAT(vmcore.0/11X)>

SolarisCAT(vmcore.0/11X)> analyze

core file: /var/crash/xxxxxx/vmcore.0

user: Super-User (root:0)

release: 5.11 (64-bit)

version: snv_67

machine: i86pc

node name: xxxxxx

system type: i86pc

hostid: xxxxx

dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/dsk/c0t0d0s1(24.0G)

time of crash: Mon Aug 25 07:41:00 GMT 2008 (core is 13 days old)

age of system: 91 days 22 hours 49 minutes 50.97 seconds

panic CPU: 1 (8 CPUs, 31.9G memory)

panic string: page_free pp=ffffff0007243bd8, pfn=11228e, lckcnt=0, cowcnt=0 slckcnt = 0

==== panic thread: 0xfffffffef4ce5dc0 ==== CPU: 1 ====

==== panic user (LWP_SYS) thread: 0xfffffffef4ce5dc0 PID: 10156 on CPU: 1 ====

cmd: /opt/local/sbin/httpd -k start

t_procp: 0xffffffff06595e50

p_as: 0xffffffff093490e0 size: 47374336 RSS: 3125248

hat: 0xffffffff092a9480 cpuset: 1

zone: address translation failed for zone_name addr: 8 bytes @ 0x3

t_stk: 0xffffff00486bcf10 sp: 0xffffff00486bc880 t_stkbase: 0xffffff00486b8000

t_pri: 3(FSS) pctcpu: 0.380035

t_lwp: 0xfffffffefe61ab60 lwp_regs: 0xffffff00486bcf10

mstate: LMS_SYSTEM ms_prev: LMS_SYSTEM

ms_state_start: 2 minutes 31.229022230 seconds earlier

ms_start: 2 minutes 31.343582414 seconds earlier

psrset: 0 last CPU: 1

idle: 0 ticks (0 seconds)

start: Mon Aug 25 07:41:00 2008

age: 0 seconds (0 seconds)

syscall: #131 memcntl(, 0x0) ()

tstate: TS_ONPROC - thread is being run on a processor

tflg: T_PANIC - thread initiated a system panic

T_DFLTSTK - stack is default size

tpflg: TP_MSACCT - collect micro-state accounting information

tsched: TS_LOAD - thread is in memory

TS_DONT_SWAP - thread/LWP should not be swapped

TS_RUNQMATCH

pflag: SMSACCT - process is keeping micro-state accounting

SMSFORK - child inherits micro-state accounting

pc: unix:vpanic_common+0x13b: addq $0xf0,%rsp

unix:vpanic_common+0x13b()

unix:panic+0x9c()

unix:page_free+0x22e()

unix:page_destroy+0x100()

genunix:fs_dispose+0x2e()

genunix:fop_dispose+0xdc()

genunix:pvn_getdirty+0x1f0()

zfs:zfs_putpage+0x129()

genunix:fop_putpage+0x65()

genunix:segvn_sync+0x39f()

genunix:as_ctl+0x1f2()

genunix:memcntl+0x709()

unix:_syscall32_save+0xbf()

-- switch to user thread's user stack --

This output provides a wide range of useful details, including:

- System summary, including OS release and version, architecture, hostname, and hostid, as well as the number of CPUs and the amount of memory
- Time of crash and previous uptime ("age of system")
- The panic string and the CPU it occurred on
- The thread that caused the panic and its details, including the command line (argv), its memory footprint (size and RSS), and zone
- The thread's state information, run time, start time, and current syscall
- The call stack

As noted in Part 1, what most people are really looking for when doing core analysis is which application was responsible, and this output provides that data with great clarity. Let's dig into it a bit more explicitly. Based on the above "analyze" output we can see that:

- The system is an 8-CPU x86 box running snv_67 (Solaris Nevada build 67) in 64-bit mode with 32GB of RAM.
- The system crashed on Aug 25th at 7:41AM GMT; it had previously been up for 91 days.
- The system panicked in a "page_free" call, on CPU 1.
- The running thread was "httpd -k start", an Apache worker process.
- The process had PID 10156, consumed 3.1MB of physical memory (RSS), and had a virtual size of 47MB.
- The thread was using less than 1% (pctcpu) of CPU 1 under the Fair Share Scheduler (FSS), in processor set (psrset) 0.
- The thread started on Aug 25th at 7:41AM GMT and was 0 seconds old when the system crashed... possibly a forked worker gone bad.
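The memory figures in that summary come from the "size" and "RSS" values on the p_as line, which SCAT prints in bytes. A minimal sketch of the conversion, assuming decimal megabytes (bytes_to_mb is a made-up helper name):

```sh
# Convert a byte count to decimal megabytes with one decimal place.
bytes_to_mb() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1000000 }'
}

bytes_to_mb 3125248     # RSS of the httpd process
bytes_to_mb 47374336    # virtual size of the httpd process
```

The two calls print 3.1 and 47.4, which matches the roughly 3.1MB of RSS and 47MB of virtual size quoted in the text.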

For many administrators this might be as much as you wanted to know, right there. But let's look at a couple more commands.

You'll recall that during the sanity checks at startup SCAT noted 2 threads each consuming a full CPU. We can feed a thread address to the "thread" command to get details on it:

SolarisCAT(vmcore.0/11X)> thread 0xffffff097d1e3400

==== user (LWP_SYS) thread: 0xffffff097d1e3400 PID: 27446 on CPU: 2 ====

cmd: nano svn-commit.tmp

t_procp: 0xffffffff2e908ab0

p_as: 0xffffffff10402ee0 size: 2772992 RSS: 1642496

hat: 0xffffffff102f6b48 cpuset: 2

zone: address translation failed for zone_name addr: 8 bytes @ 0x2

t_stk: 0xffffff004e47ef10 sp: 0xffffff003d3fcf08 t_stkbase: 0xffffff004e47a000

t_pri: 26(FSS) pctcpu: 99.306175

t_lwp: 0xffffffff202a78b0 lwp_regs: 0xffffff004e47ef10

mstate: LMS_SYSTEM ms_prev: LMS_USER

ms_state_start: 2 minutes 31.228983791 seconds earlier

ms_start: 39 days 19 hours 11 minutes 8.989252296 seconds earlier

psrset: 0 last CPU: 2

idle: 9 ticks (0.09 seconds)

start: Wed Jul 16 12:30:07 2008

age: 3438653 seconds (39 days 19 hours 10 minutes 53 seconds)

syscall: #98 sigaction(, 0x0) ()

tstate: TS_ONPROC - thread is being run on a processor

tflg: T_DFLTSTK - stack is default size

tpflg: TP_TWAIT - wait to be freed by lwp_wait

TP_MSACCT - collect micro-state accounting information

tsched: TS_LOAD - thread is in memory

TS_DONT_SWAP - thread/LWP should not be swapped

TS_RUNQMATCH

pflag: SMSACCT - process is keeping micro-state accounting

SMSFORK - child inherits micro-state accounting

pc: unix:panic_idle+0x23: jmp -0x2 (unix:panic_idle+0x23)

unix:panic_idle+0x23()

0xffffff003d3fcf60()

-- error reading next frame @ 0x0 --

So using the "thread" command we can get full granularity on a given thread. In fact, using the "tlist" command you can dump this information for every thread on the system at the time of crash.
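The thread addresses used above come from the WARNING lines printed during the sanity checks. If that startup output has been saved to a file, the matching "thread" commands can be generated rather than retyped. A sketch, assuming the WARNING format shown earlier (hot_threads is a hypothetical name):

```sh
# Turn sanity-check WARNING lines (read on stdin) into SCAT "thread"
# commands, one per flagged thread address.
hot_threads() {
    sed -n 's/^ *WARNING: [A-Za-z]* thread \(0x[0-9a-f]*\) on CPU.*/thread \1/p'
}
```

The resulting lines can be pasted at the SolarisCAT prompt one at a time.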

Another nifty command is "tunables". This will display the "current value" (at time of the dump) and the default value. If someone's been experimenting on the production systems this will clue you in.

SolarisCAT(vmcore.0/11X)> tunables

Tunable Name    Current     Default Value   Units   Description
                Value
physmem         8386375     *               pages   Physical memory installed in system.
freemem         376628      *               pages   Available memory.
avefree         338943      *               pages   Average free memory in the last 30 seconds

.........
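These memory tunables are reported in pages, not bytes. A small sketch of the conversion to GiB, assuming the 4096-byte base page size of x86 systems (SPARC sun4u systems use 8192); pages_to_gib is a made-up helper name:

```sh
# Convert a page count to GiB. The second argument is the page size in
# bytes and defaults to 4096 (an assumption that fits x86).
pages_to_gib() {
    LC_ALL=C awk -v p="$1" -v sz="${2:-4096}" \
        'BEGIN { printf "%.2f\n", p * sz / (1024 * 1024 * 1024) }'
}

pages_to_gib 8386375    # physmem from the tunables output
```

For physmem this prints 31.99, in line with the "31.9G memory" shown in the SCAT banner.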

Using the "dispq" command we can look at the dispatch queues (run queues). This answers the question "what other processes were running on CPU at the time of the crash?" Again, we can use the thread addresses to dig into them with "thread":

SolarisCAT(vmcore.0/11X)> dispq

CPU thread pri PID cmd

0 @ 0xfffffffffbc26bb0 0xffffff003d005c80 -1 (idle)

pri 60 -=> 0xffffff004337dc80 60 0 sched

1 @ 0xfffffffec6634000 P 0xfffffffef4ce5dc0 P 3 10156 /opt/local/sbin/httpd -k start

2 @ 0xfffffffec662f000 0xffffff097d1e3400 26 27446 nano svn-commit.tmp

3 @ 0xfffffffec66f4800 0xffffff09fddbab40 25 21329 java -jar xxxxx.jar --ui=console

4 @ 0xfffffffec66ea800 0xffffff003d414c80 -1 (idle)

pri 60 -=> 0xffffff0048b12c80 60 0 sched

5 @ 0xfffffffec6770800 0xffffff003d4b0c80 -1 (idle)

6 @ 0xfffffffec6770000 0xffffff003d53bc80 -1 (idle)

7 @ 0xfffffffec6762000 0xffffff003d58fc80 -1 (idle)

part thread pri PID cmd

0 @ 0xfffffffffbc4eef0

There are far too many commands to go through in a blog entry, but let's look at my personal favorite, "zfs". The "zfs" command can show us the pool(s), their configuration, read/write/checksum/error stats, and even ARC stats!

SolarisCAT(vmcore.0/11X)> zfs -e

ZFS spa @ 0xfffffffec6c21540

Pool name: zones

State: ACTIVE

VDEV Address State Aux Description

0xfffffffec0a9e040 FAULTED - root

READ WRITE FREE CLAIM IOCTL

OPS 0 0 0 0 0

BYTES 0 0 0 0 0

EREAD 0

EWRITE 0

ECKSUM 0

VDEV Address State Aux Description

0xfffffffec0a9eac0 FAULTED - /dev/dsk/c0t1d0s0

READ WRITE FREE CLAIM IOCTL

OPS 74356305 578263155 0 0 0

BYTES 757G 10.4T 0 0 0

EREAD 0

EWRITE 0

ECKSUM 0

SolarisCAT(vmcore.0/11X)> zfs arc

ARC (Adaptive Replacement Cache) Stats:

misses 1930348

demand_data_hits 74303514929

demand_data_misses 1325511

demand_metadata_hits 620388795

demand_metadata_misses 160708

prefetch_data_hits 1361651307

....
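The raw ARC counters are easier to judge as a hit ratio. A quick sketch of that arithmetic using the demand_data and demand_metadata pairs shown above (hit_ratio is a hypothetical helper, not a SCAT command):

```sh
# Print hits / (hits + misses) as a percentage with three decimals.
hit_ratio() {
    LC_ALL=C awk -v h="$1" -v m="$2" \
        'BEGIN { printf "%.3f\n", 100 * h / (h + m) }'
}

hit_ratio 74303514929 1325511    # demand data
hit_ratio 620388795 160708       # demand metadata
```

This prints 99.998 for demand data and 99.974 for demand metadata, so on this system nearly all demand reads were served from the cache.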

I hope this helps you get an idea of how easy it is to really dig deeply into your core dumps using Solaris CAT to hide the oddities of mdb from you. It's a powerful and robust tool, and I'm glad that we have it.

Happy dump divin'! You'll be amazed how much you'll learn about your system.

The proc command, for example, can tell you about the processes that were running at the time your system crashed. These processes are listed by default in reverse PID order.

SolarisCAT(vmcore.0)> proc

addr pid ppid uid size rss swresv time command

------------- ------ ------ ------ ---------- -------- -------- ------ ---------

0x30003c8e040 283 1 0 3776512 1646592 1302528 90118 /usr/sbin/ssmon

0x30003c96a50 279 1 0 9306112 2514944 1769472 19 /usr/sbin/ssserver

0x30003bee030 256 1 0 27656192 2596864 1138688 57 /usr/sbin/nscd

0x30003c8ea58 243 1 0 2506752 1703936 466944 7 /usr/sbin/cron

0x30003c96038 240 1 0 18874368 2170880 2711552 7 /usr/sbin/syslogd

0x30000f60010 225 1 0 7217152 2400256 1146880 170 /usr/lib/autofs/automountd

0x300020c4a40 217 1 0 2260992 1572864 598016 3 /usr/lib/nfs/lockd

0x300020c5458 213 1 1 4677632 1974272 876544 2 /usr/lib/nfs/statd

0x300020c4028 201 1 0 2629632 2048000 835584 12 /usr/sbin/inetd -s

[...]
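Since proc lists processes by PID, finding the largest memory consumers means re-sorting. A rough sketch that ranks saved proc output by the rss column, assuming the nine-column layout shown above (top_rss is a made-up name):

```sh
# Read saved "proc" output on stdin and print the N processes with the
# largest rss (default 3) as: rss pid command.
top_rss() {
    awk '$1 ~ /^0x/ { print $6, $2, $9 }' | sort -rn | head -n "${1:-3}"
}
```

Header and separator lines are skipped because only rows whose first field is an address (starting with 0x) are kept.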


Thursday, December 26, 2013
