  1. Linux device driver unsafe FXSAVE/FXRSTOR bug -- any precedents?
    <p>There's a nasty problem that has temporarily stumped a number of engineers at my company who are trying to debug it.</p>
    <p>The C++ program is normally run on a cluster of multicore computers with MPI.</p>
    <p>It will run for a very long time -- perhaps days -- and then suddenly fail.</p>
    <p>Most of the engineers working on it have eliminated any reasonable possibility of a bug in the program itself, so they're starting to blame a possible hardware problem, but I suspect there must be a software problem in either a Linux kernel module or a device driver.</p>
    <p>What I <em>suspect</em> is that a kernel module or device driver, in order to do some floating-point calculations, is doing FXSAVE/FXRSTOR in a manner that is unsafe on SMP systems. It could be something as simple as doing the FXSAVE to a static buffer in a kernel routine that needs to be reentrant. That would create a race condition that would very rarely corrupt the floating-point context of a thread.</p>
    <p>At the application level, what appears to be happening is that one or more bits of the MXCSR -- which is part of the FXSAVE/FXRSTOR context -- are suddenly changed, but there is no application code that changes them.</p>
    <p>I encountered something similar many years ago on Windows, which ultimately turned out to be a bug in a video driver: when the application code was preempted by the operating system, some MXCSR bits in that thread's context were corrupted.</p>
    <p>I'm not an expert at Linux kernel hacking or device driver development, but I'm reading that the reentrancy rules have been changing a lot: between non-SMP and SMP (multi-core) systems, between kernel versions, and so on. So the possibility of a race-condition bug seems reasonable.</p>
    <p>So my question is: <strong>Are there any known Linux driver (or kernel) bugs that fit that description?</strong></p>
    <p>Any precedents that I could cite would be helpful, if they had similar symptoms. At this point, a lot of the people involved are (IMHO) wasting time thinking, "Well, there's no bug in my code, so it must be bad hardware," and I'd like to get them beyond that and looking for something more likely to be the true cause.</p>
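    <p>To make the MXCSR symptom concrete, a check along the following lines could be dropped into the application to catch the moment the control bits change. This is only a minimal sketch, assuming an x86-64 build with SSE and a compiler that provides <code>&lt;xmmintrin.h&gt;</code>; the names <code>MxcsrWatch</code> and <code>mxcsr_control_bits</code> are hypothetical and not taken from the program in question.</p>

```cpp
#include <xmmintrin.h>  // _mm_getcsr
#include <cstdint>
#include <cstdio>

// MXCSR bits 0-5 are sticky exception *status* flags that ordinary SSE
// arithmetic sets on its own, so they are masked out. What remains are the
// control bits: DAZ (6), exception masks (7-12), rounding mode (13-14), FTZ (15).
constexpr std::uint32_t kMxcsrStatusMask = 0x3Fu;

// Hypothetical helper: read the current thread's MXCSR control bits.
inline std::uint32_t mxcsr_control_bits() {
    return static_cast<std::uint32_t>(_mm_getcsr()) & ~kMxcsrStatusMask;
}

// Hypothetical watcher. MXCSR is per-thread state, so an instance must be
// created and checked on the same thread.
class MxcsrWatch {
public:
    MxcsrWatch() : expected_(mxcsr_control_bits()) {}

    // Returns true if the control bits still match the values captured at
    // construction; otherwise logs the difference and returns false.
    bool check(const char* where) const {
        const std::uint32_t now = mxcsr_control_bits();
        if (now == expected_) {
            return true;
        }
        std::fprintf(stderr,
                     "MXCSR control bits changed at %s: expected 0x%04x, got 0x%04x\n",
                     where, static_cast<unsigned>(expected_),
                     static_cast<unsigned>(now));
        return false;
    }

private:
    std::uint32_t expected_;
};
```

    <p>Calling <code>check()</code> at coarse intervals (for example once per iteration of the main compute loop on each thread) would not identify the culprit, but it would narrow down when the corruption happens and show which bits flipped, which is the kind of evidence that distinguishes a context save/restore bug from bad hardware.</p>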