Discussion:
[fedora-virt] [Qemu-devel] Fedora FC21 - Bug: 100% CPU and hangs in gettimeofday(&tp, NULL); forever
Paolo Bonzini
2015-01-13 20:48:43 UTC
It also happens with qemu-kvm 2.2.0 on another VM where PostgreSQL is also running:
(gdb) bt
#0 0x00007fff9a1feff4 in gettimeofday ()
#1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
qemu-kvm-2.2.0-1.fc21.x86_64 on host
Can no one else reproduce this, or has anyone seen similar problems?
Any further ideas?
Hmm, too bad. :(

Any chance you can try with the 1.7 and 2.0 releases? Cole, perhaps you
can help build some RPMs for Gerhard?

Paolo
Cole Robinson
2015-01-13 21:08:02 UTC
Post by Paolo Bonzini
It also happens with qemu-kvm 2.2.0 on another VM where PostgreSQL is also running:
(gdb) bt
#0 0x00007fff9a1feff4 in gettimeofday ()
#1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
qemu-kvm-2.2.0-1.fc21.x86_64 on host
Can no one else reproduce this, or has anyone seen similar problems?
Any further ideas?
Hmm, too bad. :(
Any chance you can try with the 1.7 and 2.0 releases? Cole, perhaps you
can help build some RPMs for Gerhard?
Paolo
There are F21 qemu-2.0 builds here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=538383
And 1.7 builds here: http://koji.fedoraproject.org/koji/buildinfo?buildID=499026

There might be dependency issues trying to install on latest f21, if they were
built against older libraries. Worth a shot though. Besides that, something
custom will need to be built, though at that point it's probably easier to
build qemu from source.

- Cole
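
For reference, a from-source build for bisecting is roughly (assuming the
qemu.git tree of that era; the exact commands are not from the thread): clone
qemu.git, check out the tag to test (e.g. v2.0.0), run
./configure --target-list=x86_64-softmmu && make, then start the resulting
x86_64-softmmu/qemu-system-x86_64 binary with the same command line libvirt
generates (visible via ps or /var/log/libvirt/qemu/<vm>.log).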
Paolo Bonzini
2015-01-13 21:16:08 UTC
http://lxr.free-electrons.com/source/kernel/time/timekeeping.c?v=3.17#L493
499         do {
500                 seq = read_seqcount_begin(&tk_core.seq);
501
502                 ts->tv_sec = tk->xtime_sec;
503                 nsecs = timekeeping_get_ns(&tk->tkr);
504
505         } while (read_seqcount_retry(&tk_core.seq, seq));
So it looks like the seqcount always changes and therefore it loops
forever here (as far as I dug into it, this is the only loop there).
Might something be wrong with the memory barriers in recent qemu-kvm
releases?
No, that's not possible. Unless you pause/resume or migrate the VM, all
of the handling of kvmclock is entirely in the kernel.

Paolo
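
For context, the do/while above is the kernel's seqcount reader pattern: a
writer makes the sequence odd while it updates the timekeeper, and a reader
retries whenever the sequence changed underneath it. A minimal user-space
sketch of the same idea (hypothetical names, not the kernel implementation,
with the memory-ordering details simplified):

#include <stdatomic.h>

struct snapshot {
    atomic_uint seq;      /* odd while an update is in progress */
    long long tv_sec;
    long long nsec;
};

/* Writer: seq becomes odd, data is updated, seq becomes even again. */
void write_snapshot(struct snapshot *s, long long sec, long long nsec)
{
    atomic_fetch_add(&s->seq, 1);   /* now odd: readers will retry */
    s->tv_sec = sec;
    s->nsec = nsec;
    atomic_fetch_add(&s->seq, 1);   /* now even: snapshot is consistent */
}

/* Reader: retry until an even, unchanged sequence is observed.
 * If the sequence "always changes", this loop never exits -- the
 * behaviour described above for gettimeofday()/clock_gettime(). */
void read_snapshot(struct snapshot *s, long long *sec, long long *nsec)
{
    unsigned begin;
    do {
        while ((begin = atomic_load(&s->seq)) & 1)
            ;                       /* update in progress, spin */
        *sec = s->tv_sec;
        *nsec = s->nsec;
    } while (atomic_load(&s->seq) != begin);
}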
Paolo Bonzini
2015-02-16 14:18:19 UTC
ppoll([{fd=98<socket:[4660317]>, events=POLLIN|POLLERR|POLLHUP},
{fd=89<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=88<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=87<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=86<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=85<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=84<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=83<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=82<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=81<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=80<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=79<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=78<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=77<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=76<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=75<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=74<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=73<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=72<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=71<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=70<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=69<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=68<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=67<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=66<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=65<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=64<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=63<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=62<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=61<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=60<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP},
{fd=59<anon_inode:[eventfd]>, events=POLLIN|POLLERR|POLLHUP}, ...], 71,
{0, 124112228}, NULL, 8) = 1 ([...], left {0, 124090614})
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
read(6<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 512) = 8
write(7<anon_inode:[eventfd]>, "\1\0\0\0\0\0\0\0", 8) = 8
Can you grab some random backtraces ("thread apply all bt full") with gdb?

What is the libvirt XML or qemu command line?

Paolo
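
Background on the trace above: each anon_inode:[eventfd] descriptor is an
eventfd, the primitive QEMU's main loop uses for cross-thread wakeups and for
ioeventfd/irqfd. Writing an 8-byte counter makes the fd readable and a read
drains it, which is exactly the write(7, ...)/read(6, ...) pairs in the
strace. A minimal sketch of the mechanism (assuming Linux):

#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* Counter starts at 0; the fd becomes readable once something is written. */
    int efd = eventfd(0, EFD_CLOEXEC);
    if (efd < 0) {
        perror("eventfd");
        return 1;
    }

    uint64_t one = 1;
    write(efd, &one, sizeof(one));  /* signal: cf. write(7, "\1\0\0\0\0\0\0\0", 8) */

    uint64_t val;
    read(efd, &val, sizeof(val));   /* drain:  cf. read(6, "\1\0\0\0\0\0\0\0", 8) */
    printf("counter was %llu\n", (unsigned long long)val);

    close(efd);
    return 0;
}

As for grabbing the backtraces non-interactively:
gdb -p <qemu-pid> -batch -ex 'thread apply all bt full' works without keeping
the process stopped for long.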
Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
qemu-kvm-2.2.0-5.fc21.x86_64
Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
is fixed (it didn't occur with the test program posted at
https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 within 30 min;
before, it was reproducible within 2 min; the test is still running).
So I guess there is another problem in the kernel with volatile and gcc
optimizations (or maybe in qemu-kvm).
No, this doesn't look like volatile.
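
The actual test program lives at the bugzilla link above; as a rough idea of
what such a reproducer looks like, here is a hypothetical sketch (not the
program from comment 28): one thread hammers clock_gettime() while a watchdog
reports when it stops making progress.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_ulong progress;   /* bumped after every successful call */

static void *hammer(void *arg)
{
    (void)arg;
    struct timespec ts;
    for (;;) {
        clock_gettime(CLOCK_REALTIME, &ts);  /* the call that hangs */
        atomic_fetch_add(&progress, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, hammer, NULL);

    unsigned long last = 0;
    for (;;) {
        sleep(2);
        unsigned long cur = atomic_load(&progress);
        if (cur == last)   /* no progress for 2s: likely stuck in the vDSO loop */
            fprintf(stderr, "clock_gettime() appears stuck at %lu calls\n", cur);
        last = cur;
    }
}

(Build with -pthread; when the bug hits, the hammer thread spins at 100% CPU
inside the vDSO and the watchdog keeps firing.)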
Paolo Bonzini
2015-02-16 15:29:25 UTC
Post by Paolo Bonzini
Post by Paolo Bonzini
Can you grab some random backtraces ("thread apply all bt full") with gdb?
Very low load on the machine, so I guess most threads will be sleeping and
it is hard to catch the non-sleeping ones. See below.
This likely means that the 100% threads are not the I/O (event loop)
threads, but the VCPU threads.
Post by Paolo Bonzini
Post by Paolo Bonzini
Kernel (host/guest): 3.18.6-200.fc21.x86_64 #1 SMP
qemu-kvm-2.2.0-5.fc21.x86_64
Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
https://bugzilla.redhat.com/show_bug.cgi?id=1178975
is fixed (it didn't occur with the test program posted at
https://bugzilla.redhat.com/show_bug.cgi?id=1178975#c28 within 30 min;
before, it was reproducible within 2 min; the test is still running).
So I guess there is another problem in the kernel with volatile and gcc
optimizations (or maybe in qemu-kvm).
No, this doesn't look like volatile.
But why 100% on 2 cores?
Because you have 2 VCPUs, I guess.

Paolo
Paolo Bonzini
2015-03-02 09:26:31 UTC
So far it has happened only on the PostgreSQL database VM. The kernel is
alive (ping works fine), but ssh is not working.
[1438.384864] Out of memory: Kill process 10115 (pg_dump) score 112 or
sacrifice child
[1438.384990] Killed process 10115 (pg_dump) total-vm: 340548kB,
anon-rss: 162712kB, file-rss: 220kB
Can you get a vmcore or at least sysrq-t output?

Paolo
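
For reference, sysrq-t dumps every task's state and stack to the kernel log
and works even with ssh dead, as long as a console is still usable:
echo t > /proc/sysrq-trigger inside the guest, or from the host via
virsh send-key <vm> KEY_LEFTALT KEY_SYSRQ KEY_T. A trivial sketch of the
trigger-file variant (needs root and kernel.sysrq enabled):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Same effect as `echo t > /proc/sysrq-trigger`; the task dump
     * ends up in the kernel ring buffer (dmesg / journalctl -k). */
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/sysrq-trigger");
        return 1;
    }
    if (write(fd, "t", 1) != 1)
        perror("write");
    close(fd);
    return 0;
}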
VM uptime is ~1 day, 2 cores at 100% CPU.
The VM is very stripped down; nevertheless it should have enough memory (and
also swap). Looks like it crashed during the nightly backup (pg_dumpall,
${DEST_SQL}).
free
              total    used    free  shared  buff/cache  available
Mem:         372264   28536  214780   20884      128948      303360
Swap:       1081340       0 1081340
Ciao,
Gerhard
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007feee62412c1 in ppoll () from /lib64/libc.so.6

#0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2 0x00007feef11d1cec in kvm_cpu_exec ()
#3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6

#0 0x00007feee6242977 in ioctl () at /lib64/libc.so.6
#1 0x00007feef11d1c35 in kvm_vcpu_ioctl ()
#2 0x00007feef11d1cec in kvm_cpu_exec ()
#3 0x00007feef11bfb02 in qemu_kvm_cpu_thread_fn ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6

/lib64/libpthread.so.0
#1 0x00007feef1466d79 in qemu_cond_wait ()
#2 0x00007feef13eadd3 in vnc_worker_thread_loop ()
#3 0x00007feef13eb1b8 in vnc_worker_thread ()
#4 0x00007feeefcb352a in start_thread () at /lib64/libpthread.so.0
#5 0x00007feee624c79d in clone () at /lib64/libc.so.6

#0 0x00007feee62412c1 in ppoll () at /lib64/libc.so.6
#1 0x00007feef13fc89c in qemu_poll_ns ()
#2 0x00007feef13fc034 in main_loop_wait ()
#3 0x00007feef1197cdd in main ()
Paolo Bonzini
2015-03-03 10:12:46 UTC
virsh dump vm file.core
error: Failed to core dump domain vm to file.core
error: internal error: unable to execute QEMU command 'migrate': State
blocked by non-migratable device '0000:00:09.0/ich9_ahci'
Removed the SATA controller, so the dump should work in the future.
Thanks.

Paolo
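
Background: plain virsh dump goes through QEMU's migrate-to-file path, which
is why a migration blocker such as the AHCI device stops it. On libvirt/QEMU
versions that support it, virsh dump --memory-only <vm> <file> uses the
separate dump-guest-memory command instead and is not subject to migration
blockers.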

Laine Stump
2015-01-14 05:34:46 UTC
Post by Paolo Bonzini
It also happens with qemu-kvm 2.2.0 on another VM where PostgreSQL is also running:
(gdb) bt
#0 0x00007fff9a1feff4 in gettimeofday ()
#1 0x00000000006d425e in GetCurrentTimestamp () at timestamp.c:1274
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host,
qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host,
qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host,
qemu-kvm-2.2.0-1.fc21.x86_64 on host
Can no one else reproduce this, or has anyone seen similar problems?
Any further ideas?
Hmm, too bad. :(
Any chance you can try with the 1.7 and 2.0 releases? Cole, perhaps you
can help build some RPMs for Gerhard?
(I responded to this thread on qemu-devel already, but the message
didn't get back to this list because I'm subscribed here with a
different address)

Take a look at the following kernel bug. It specifically deals with a
hang in gettimeofday() in a KVM guest:

https://bugzilla.redhat.com/show_bug.cgi?id=1178975

There is a link to a patched kernel you can try; installing that kernel in my guest fixed my problems (I
was repeatedly getting hangs in python-urlgrabber during yum updates on
F21).