[fedora-virt] I/O errors in guest when cache=none, qcow2 on Btrfs
Chris Murphy
2015-03-26 17:07:39 UTC
Could someone look at this bug and see whether it needs upstream
attention? It's currently filed against the kernel, but I have no idea
whether the libvirt or qemu upstreams should be made aware of it.

The gist is that on Fedora 21 and 22, when a guest uses virtio-blk with
cache=none and its qcow2 image is on Btrfs, the guest OS (regardless of
the file system it uses) starts to experience many I/O errors. If the
qcow2 image is on XFS, or if cache=writeback or cache=writethrough is
used, the problem doesn't happen.
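For reference, QEMU's -drive cache= option is documented as a shortcut for three lower-level settings (cache.writeback, cache.direct, cache.no-flush). The sketch below encodes that documented table as plain data to show that, of the three modes compared above, only cache=none sets cache.direct, which opens the image file with O_DIRECT and bypasses the host page cache. It is an illustrative summary, not QEMU code; the dictionary layout and the helper name are made up for this example.

```python
# Illustrative summary of QEMU's "cache=" shortcuts as described in the QEMU
# manual's -drive documentation; this is data for discussion, not QEMU source.
CACHE_MODES = {
    # keys: guest-visible write cache, O_DIRECT on the host, flushes ignored
    "writeback":    {"writeback": True,  "direct": False, "no_flush": False},
    "none":         {"writeback": True,  "direct": True,  "no_flush": False},
    "writethrough": {"writeback": False, "direct": False, "no_flush": False},
    "directsync":   {"writeback": False, "direct": True,  "no_flush": False},
    "unsafe":       {"writeback": True,  "direct": False, "no_flush": True},
}


def uses_o_direct(mode: str) -> bool:
    """True if the mode sets cache.direct=on, i.e. the host image file is
    opened with O_DIRECT and the host page cache is bypassed."""
    return CACHE_MODES[mode]["direct"]


if __name__ == "__main__":
    for mode in ("none", "writeback", "writethrough"):
        print(f"cache={mode}: O_DIRECT={uses_o_direct(mode)}")
```

This only narrows down which host code path differs between the working and failing setups; it says nothing about where the bug itself lives.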

Regression testing shows the problem does not happen with Fedora 20's
versions of libvirt and qemu, even with newer kernels. So maybe it's
not a kernel problem, or maybe it's a problem in the interaction between
the kernel and libvirt or qemu, hence the inquiry. The upstream Btrfs
developers are aware of this bug and are looking into it.

The fallout of the bug is that gnome-boxes experiences problems when
qcow2 is on Btrfs, since it currently uses cache=none by default (and
there's no way to change this in the GUI).

blk_update_request: I/O error, dev vda, sector XXXXXXXX when qcow2 is on Btrfs
https://bugzilla.redhat.com/show_bug.cgi?id=1204569
--
Chris Murphy
Kevin Wolf
2015-03-27 08:59:55 UTC
Post by Chris Murphy
> Could someone look at this bug and see whether it needs upstream
> attention? It's currently filed against the kernel, but I have no idea
> whether the libvirt or qemu upstreams should be made aware of it.
> The gist is that on Fedora 21 and 22, when a guest uses virtio-blk with
> cache=none and its qcow2 image is on Btrfs, the guest OS (regardless of
> the file system it uses) starts to experience many I/O errors. If the
> qcow2 image is on XFS, or if cache=writeback or cache=writethrough is
> used, the problem doesn't happen.
> Regression testing shows the problem does not happen with Fedora 20's
> versions of libvirt and qemu, even with newer kernels. So maybe it's
> not a kernel problem, or maybe it's a problem in the interaction between
> the kernel and libvirt or qemu, hence the inquiry. The upstream Btrfs
> developers are aware of this bug and are looking into it.
Could you try using new qemu with old libvirt and vice versa? This way
it should be possible to isolate the component that triggers the change
in behaviour.
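The suggestion above amounts to a small test matrix: keep the kernel and guest fixed, swap one host component at a time, and check which component's newer build tracks the failures. A minimal sketch, assuming exactly one of the two components is responsible and that each run reproduces deterministically; the "old"/"new" labels (for example Fedora 20 versus Fedora 21 packages) and the function names are hypothetical.

```python
from itertools import product

# Hypothetical labels: "old" = Fedora 20 package, "new" = Fedora 21/22 package.
COMPONENTS = ("libvirt", "qemu")
BUILDS = ("old", "new")


def build_matrix():
    """All combinations of component builds, e.g. {'libvirt': 'old', 'qemu': 'new'}."""
    return [dict(zip(COMPONENTS, combo))
            for combo in product(BUILDS, repeat=len(COMPONENTS))]


def implicated(results):
    """Given (config, failed) pairs, return the components whose "new" build
    predicts failure in every run. An empty list means no single component
    explains the results (e.g. the bug needs both new builds together)."""
    return [comp for comp in COMPONENTS
            if all((cfg[comp] == "new") == failed for cfg, failed in results)]


if __name__ == "__main__":
    # Made-up outcomes purely to show the call shape: only new qemu fails here.
    runs = [(cfg, cfg["qemu"] == "new") for cfg in build_matrix()]
    print(implicated(runs))
```

With two components this is only four runs, so exhaustive testing is cheaper than a real bisection; the guest OS could be added as a third dimension in the same way.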

To be honest, it sounds very much like a problem with the btrfs driver and
O_DIRECT to me. But if changing the libvirt and qemu versions is enough to
trigger it, we need to look at them, even if it's just to support the
btrfs investigation.
Post by Chris Murphy
> The fallout of the bug is that gnome-boxes experiences problems when
> qcow2 is on Btrfs, since it currently uses cache=none by default (and
> there's no way to change this in the GUI).
> blk_update_request: I/O error, dev vda, sector XXXXXXXX when qcow2 is on Btrfs
> https://bugzilla.redhat.com/show_bug.cgi?id=1204569
It might be helpful to get a trace of all write requests made by qemu to
the image file.

Can you please run qemu under strace while you reproduce the problem?
(You'll need -f because I/O is done in worker threads; restricting the
trace to pwrite and pwritev should also help to reduce the noise level.)

The other option would be using qemu's own tracing, but strace should be
more relevant at this point.
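Once such a trace exists, one way to sift it is to look for writes that break the usual Linux O_DIRECT rules, under which the file offset, transfer length and buffer address generally must be multiples of the device's logical block size. The sketch below is a hypothetical helper, not an existing tool: it parses strace -f output for pwrite64 (the name strace prints for pwrite on x86_64) and pwritev calls, and flags unaligned offsets or lengths and failed calls. The 512-byte default alignment is an assumption, buffer addresses are not checked because strace prints buffer contents rather than addresses, and pwritev lengths are read from the iov_len= labels that recent strace versions print. Calls split across "<unfinished ...>" and "resumed" lines are skipped; tracing with -ff -o writes one file per thread and should avoid that split.

```python
import re
import sys
from dataclasses import dataclass, field

# "pwrite64(13," or "pwritev(13," anywhere on the line, so both the
# "[pid 1234] " prefix of terminal output and the bare "1234 " prefix
# written with -o are accepted.
CALL_RE = re.compile(r"\b(?P<call>pwrite64|pwritev)\((?P<fd>\d+),")
# The last two integer arguments plus the return value. The greedy ".*"
# picks the rightmost match, so look-alike text inside the printed buffer
# is not mistaken for the real arguments.
# pwrite64 -> (count, offset); pwritev -> (iovcnt, offset).
TAIL_RE = re.compile(r".*, (?P<n>\d+), (?P<offset>\d+)\)\s+=\s+(?P<ret>-?\d+)")
IOV_LEN_RE = re.compile(r"iov_len=(\d+)")


@dataclass
class Write:
    call: str
    fd: int
    offset: int
    lengths: list = field(default_factory=list)
    ret: int = 0


def parse_line(line):
    """Return a Write for a completed pwrite64/pwritev line, else None."""
    call = CALL_RE.search(line)
    tail = TAIL_RE.match(line)
    if not call or not tail:
        return None  # other syscalls, or an <unfinished ...> fragment
    if call.group("call") == "pwrite64":
        lengths = [int(tail.group("n"))]
    else:
        lengths = [int(n) for n in IOV_LEN_RE.findall(line)]
    return Write(call.group("call"), int(call.group("fd")),
                 int(tail.group("offset")), lengths, int(tail.group("ret")))


def problems(w, align=512):
    """List reasons a write looks suspicious for O_DIRECT (assumed alignment)."""
    reasons = []
    if w.offset % align:
        reasons.append(f"offset {w.offset} not {align}-aligned")
    reasons += [f"length {n} not {align}-aligned" for n in w.lengths if n % align]
    if w.ret < 0:
        reasons.append(f"returned {w.ret}")
    return reasons


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as trace:
        for lineno, line in enumerate(trace, 1):
            w = parse_line(line)
            if w and problems(w):
                print(f"{lineno}: {w.call} fd={w.fd}: {'; '.join(problems(w))}")
```

A trace for it could be captured with something like strace -f -e trace=pwrite64,pwritev -o qemu.trace followed by the qemu command line, then passed to the script as its only argument (the file names here are placeholders). It prints one line per suspicious call with the reasons.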

Kevin
Chris Murphy
2015-03-27 18:05:16 UTC
Post by Kevin Wolf
> Could you try using new qemu with old libvirt and vice versa?
That should be possible later today or this weekend.
Post by Kevin Wolf
> To be honest, it sounds very much like a problem with the btrfs driver and
> O_DIRECT to me. But if changing the libvirt and qemu versions is enough to
> trigger it, we need to look at them, even if it's just to support the
> btrfs investigation.
I just discovered that Fedora 21 Live Workstation as a guest does not
reproduce the problem, whereas Fedora 22 Live Workstation Beta TC4
does.

So it seems the guest OS matters first of all, and then it depends on
the host version of libvirt or qemu (or both).
Post by Kevin Wolf
> Post by Chris Murphy
>> blk_update_request: I/O error, dev vda, sector XXXXXXXX when qcow2 is on Btrfs
>> https://bugzilla.redhat.com/show_bug.cgi?id=1204569
> It might be helpful to get a trace of all write requests made by qemu to
> the image file.
I've done that and updated the bug.
--
Chris Murphy