The hunt for a kernel bug, part 4: git bisect

Now that I have a way to compile kernels from source, I want to find the exact commit where my input devices stop working. That means doing a git checkout of a certain commit, build the kernel, install it, reboot, select the new kernel in Grub, and see if my keyboard works. I am quite sure that I need to search between 5.13.0-22 and 5.13.0-23, but that’s still 634 commits!

$ git rev-list Ubuntu-5.13.0-22.22..Ubuntu-5.13.0-23.23 | wc --lines
634

This is where git bisect comes in. It’s sort of a wizard that guides you to find a bad commit. You tell it on which commit your software was known to work ok, and a commit where it doesn’t. It then picks a commit somewhere in the middle, you build your software and do your tests, and then tell git bisect if the result was good or bad. It will then give you a new commit to inspect, each time narrowing the search.

git-bisect-sumo-logic
git bisect

Let’s do this!

$ git bisect start
$ git bisect good Ubuntu-5.13.0-22.22
$ git bisect bad Ubuntu-5.13.0-23.23
Bisecting: 316 revisions left to test after this (roughly 8 steps)
[398351230dab42d654036847a49a5839705abdcb] powerpc/bpf ppc32: Fix BPF_SUB when imm == 0x80000000
$ git describe --long 
Ubuntu-5.13.0-22.22-317-g398351230dab

In this first step, I get the 317th commit after 5.13.0-22. Let’s compile that commit:

$ time make clean olddefconfig bindeb-pkg \
    --jobs=$(getconf _NPROCESSORS_ONLN) \
    LOCALVERSION=-$(git describe --long | tr '[:upper:]' '[:lower:]')

This creates 3 .deb packages in the directory above:

$ ls -1 ../*$(git describe --long | tr '[:upper:]' '[:lower:]')*.deb
../linux-headers-5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab_5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab-10_amd64.deb
../linux-image-5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab_5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab-10_amd64.deb
../linux-libc-dev_5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab-10_amd64.deb

I only need to install the headers and the image, libc-dev isn’t needed.

$ sudo dpkg --install ../linux-{headers,image}-*$(git describe --long | tr '[:upper:]' '[:lower:]')*.deb

Verify that the kernel files are in the /boot directory:

$ ls -1 /boot/*$(git describe --long | tr '[:upper:]' '[:lower:]')*
/boot/config-5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab
/boot/initrd.img-5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab
/boot/System.map-5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab
/boot/vmlinuz-5.13.19-ubuntu-5.13.0-22.22-317-g398351230dab

Now I can reboot, select the new kernel in Grub, and test the keyboard. With commit 317, the keyboard worked, so the first bad commit has to be somewhere between commit 317 and 634:

$ git bisect good ; git describe --long
Bisecting: 158 revisions left to test after this (roughly 7 steps)
[79b62d0bba892e8367cb46ca09b623c885852c29] drm/msm/a4xx: fix error handling in a4xx_gpu_init()
Ubuntu-5.13.0-22.22-475-g79b62d0bba89

Now it’s time again for make clean olddefconfig bindeb-pkg, dpkg --install and reboot. Turns out that commit 475 was a “bad” commit (one where the keyboard didn’t work):

$ git bisect bad ; git describe --long
Bisecting: 78 revisions left to test after this (roughly 6 steps)
[c3d35f3acc3a11b726959c7b2c25ab9e46310273] USB: serial: option: add Telit LE910Cx composition 0x1204
Ubuntu-5.13.0-22.22-396-gc3d35f3acc3a

I’m not going to describe all the steps in full detail, by now you should get the gist of it. This was the sequence of steps that git bisect gave me:

  • 317: good
  • 475: bad
  • 396: bad
  • 356: good
  • 376: good
  • 386: good
  • 391: bad
  • 388: bad
  • 387: bad

And then we finally get the first bad commit, the 387th commit after 5.13.0-22, Ubuntu-5.13.0-22.22-387-g0fc979747dec:

$ git bisect bad ; git describe --long
0fc979747dece96c189bc29ef604e61afbddfa2a is the first bad commit
commit 0fc979747dece96c189bc29ef604e61afbddfa2a
Author: Pavankumar Kondeti <pkondeti@codeaurora.org>
Date:   Fri Oct 8 12:25:46 2021 +0300

    xhci: Fix command ring pointer corruption while aborting a command
    
    BugLink: https://bugs.launchpad.net/bugs/1951880
    
    commit ff0e50d3564f33b7f4b35cadeabd951d66cfc570 upstream.
    
    The command ring pointer is located at [6:63] bits of the command
    ring control register (CRCR). All the control bits like command stop,
    abort are located at [0:3] bits. While aborting a command, we read the
    CRCR and set the abort bit and write to the CRCR. The read will always
    give command ring pointer as all zeros. So we essentially write only
    the control bits. Since we split the 64 bit write into two 32 bit writes,
    there is a possibility of xHC command ring stopped before the upper
    dword (all zeros) is written. If that happens, xHC updates the upper
    dword of its internal command ring pointer with all zeros. Next time,
    when the command ring is restarted, we see xHC memory access failures.
    Fix this issue by only writing to the lower dword of CRCR where all
    control bits are located.
    
    Cc: stable@vger.kernel.org
    Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
    Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
    Link: https://lore.kernel.org/r/20211008092547.3996295-5-mathias.nyman@linux.intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    Signed-off-by: Stefan Bader <stefan.bader@canonical.com>

 drivers/usb/host/xhci-ring.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
Ubuntu-5.13.0-22.22-387-g0fc979747dec

At first sight the commit description is quite cryptic, and the actual code change doesn’t tell me a lot either. But it’s a change in drivers/usb/host/xhci-ring.c, and xhci stands for eXtensible Host Controller Interface, and interface specification for USB host controllers. If it’s an issue with the USB host controller, then it makes sense that if I use 2 keyboards from different brands, neither of them would work. It also suggests that other USB devices, like external hard drives, wouldn’t work either, but that’s a bit harder to test. A keyboard is easy. Just look at NumLock LED, if it doesn’t go on then there’s an issue.

The first link in the commit description is just a long list of patches that were taken from upstream and integrated in the Ubuntu kernel, so that doesn’t help me. The second link is a thread on the kernel.org mailing list, and there it gets interesting.

kernel.org mailing list thread

Some excerpts from the thread:

This patch cause suspend to disk resume usb not work, xhci_hcd 0000:00:14.0: Abort failed to stop command ring: -110.

youling257

Thanks for the report, this is odd.
Could you double check that by reverting this patch resume start working again.
If this is the case maybe we need to write all 64bits before this xHC hardware reacts to CRCR register changes.
Maybe following changes on top of current patch could help:

Mathias Nyman

Every time a developer says “this is odd”, my alarm bells go off. 😀

Further down in the thread there is a proposed update to the change. I’m going to try that patch, but that’s for another blog post.

One comment on “The hunt for a kernel bug, part 4: git bisect”

Leave a Reply