
  1. Is this "should not happen" crash an AMD Fusion CPU bug?
    <p>My company has started having a number of customers call in because our program is crashing with an access violation on their systems.</p> <p>The crash happens in SQLite 3.6.23.1, which we ship as part of our application. (We ship a custom build, in order to use the same VC++ libraries as the rest of the app, but it's the stock SQLite code.)</p> <p>The crash happens when <code>pcache1Fetch</code> executes <code>call 00000000</code>, as shown by the WinDbg callstack:</p> <pre><code>0b50e5c4 719f9fad 06fe35f0 00000000 000079ad 0x0
0b50e5d8 719f9216 058d1628 000079ad 00000001 SQLite_Interop!pcache1Fetch+0x2d [sqlite3.c @ 31530]
0b50e5f4 719fd581 000079ad 00000001 0b50e63c SQLite_Interop!sqlite3PcacheFetch+0x76 [sqlite3.c @ 30651]
0b50e61c 719fff0c 000079ad 0b50e63c 00000000 SQLite_Interop!sqlite3PagerAcquire+0x51 [sqlite3.c @ 36026]
0b50e644 71a029ba 0b50e65c 00000001 00000e00 SQLite_Interop!getAndInitPage+0x1c [sqlite3.c @ 40158]
0b50e65c 71a030f8 000079ad 0aecd680 071ce030 SQLite_Interop!moveToChild+0x2a [sqlite3.c @ 42555]
0b50e690 71a0c637 0aecd6f0 00000000 0001edbe SQLite_Interop!sqlite3BtreeMovetoUnpacked+0x378 [sqlite3.c @ 43016]
0b50e6b8 71a109ed 06fd53e0 00000000 071ce030 SQLite_Interop!sqlite3VdbeCursorMoveto+0x27 [sqlite3.c @ 50624]
0b50e824 71a0db76 071ce030 0b50e880 071ce030 SQLite_Interop!sqlite3VdbeExec+0x14fd [sqlite3.c @ 55409]
0b50e850 71a0dcb5 0b50e880 21f9b4c0 00402540 SQLite_Interop!sqlite3Step+0x116 [sqlite3.c @ 51744]
0b50e870 00629a30 071ce030 76897ff4 70f24970 SQLite_Interop!sqlite3_step+0x75 [sqlite3.c @ 51806]
</code></pre> <p>The relevant line of C code is: </p> <pre><code>if( createFlag==1 ) sqlite3BeginBenignMalloc();
</code></pre> <p>The compiler inlines <code>sqlite3BeginBenignMalloc</code>, which is defined as:</p> <pre><code>typedef struct BenignMallocHooks BenignMallocHooks;
static SQLITE_WSD struct BenignMallocHooks {
  void (*xBenignBegin)(void);
  void (*xBenignEnd)(void);
} sqlite3Hooks = { 0, 0 };

# define wsdHooksInit
# define wsdHooks sqlite3Hooks

SQLITE_PRIVATE void sqlite3BeginBenignMalloc(void){
  wsdHooksInit;
  if( wsdHooks.xBenignBegin ){
    wsdHooks.xBenignBegin();
  }
}
</code></pre> <p>And the assembly for this is:</p> <pre><code>719f9f99 mov esi,dword ptr [esp+1Ch]
719f9f9d cmp esi,1
719f9fa0 jne SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fa2 mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]
719f9fa7 test eax,eax
719f9fa9 je SQLite_Interop!pcache1Fetch+0x2d (719f9fad)
719f9fab call eax ; *** CRASH HERE ***
719f9fad mov ebx,dword ptr [esp+14h]
</code></pre> <p>The registers are:</p> <pre><code>eax=00000000 ebx=00000001 ecx=000013f0 edx=fffffffe esi=00000001 edi=00000000
eip=00000000 esp=0b50e5c8 ebp=000079ad iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202
</code></pre> <p>If <code>eax</code> is 0 (which it is), the zero flag should be set by <code>test eax, eax</code>, but the register dump shows it clear (<code>nz</code>). Because the zero flag isn't set, <code>je</code> doesn't jump, and then the app crashes trying to execute <code>call eax (00000000)</code>.</p> <p><em>Update</em>: <code>eax</code> should always be 0 here because <code>sqlite3Hooks.xBenignBegin</code> is not set in our build of the code. I could rebuild SQLite with <code>SQLITE_OMIT_BUILTIN_TEST</code> defined, which would turn on <code>#define sqlite3BeginBenignMalloc()</code> in the code and omit this code path entirely.
That may solve the issue, but it doesn't feel like a "real" fix; what would stop it happening in some other code path?</p> <p>So far the common factor is that all customers are running "Windows 7 Home Premium 64-bit (6.1, Build 7601) Service Pack 1" and have one of the following CPUs (according to DxDiag):</p> <ul> <li>AMD A6-3400M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.4GHz</li> <li>AMD A8-3500M APU with Radeon(tm) HD Graphics (4 CPUs), ~1.5GHz</li> <li>AMD A8-3850 APU with Radeon(tm) HD Graphics (4 CPUs), ~2.9GHz</li> </ul> <p>According to Wikipedia's <a href="http://en.wikipedia.org/wiki/AMD_Fusion">AMD Fusion article</a>, these are all "Llano" model AMD Fusion chips based on the K10 core and were released in June 2011, which is when we first started getting reports.</p> <p>The most common customer system is the Toshiba Satellite L775D, but we also have crash reports from HP Pavilion dv6 &amp; dv7 and Gateway systems.</p> <p>Could this crash be caused by a CPU error (see <a href="http://support.amd.com/us/Processor_TechDocs/44739.pdf">Errata for AMD Family 12h Processors</a>), or is there some other possible explanation that I'm overlooking? (According to Raymond, it <a href="http://blogs.msdn.com/b/oldnewthing/archive/2005/04/12/407562.aspx">could be overclocking</a>, but it's odd that just this specific CPU model is affected, if so.)</p> <p>Honestly, it doesn't seem possible that it's really a CPU or OS error, because the customers aren't getting bluescreens or crashes in other applications. There must be some other, more likely, explanation--but what?</p> <p><em>Update 15 August:</em> I've acquired a Toshiba L745D notebook with an AMD A6-3400M processor and can reproduce the crash consistently when running the program. The crash is always on the same instruction; <code>.time</code> reports anywhere from 1m30s to 7m of user time before the crash. 
One fact (that may be pertinent to the issue) that I neglected to mention in the original post is that the application is multi-threaded and has both high CPU and I/O usage. The application spawns four worker threads by default and posts 80+% CPU usage (there is some blocking for I/O as well as for mutexes in the SQLite code) until it crashes. I modified the application to only use two threads, and it still crashed (although it took longer to happen). I'm now running a test with just one thread, and it hasn't crashed yet.</p> <p>Note also that it doesn't appear to be purely a CPU load problem; I can run Prime95 without errors on the system and it will boost the CPU temperature to >70°C, while my application barely gets the temperature above 50°C while it's running.</p> <p><em>Update 16 August:</em> Perturbing the instructions slightly makes the problem "go away". For example, replacing the memory load (<code>mov eax,dword ptr [SQLite_Interop!sqlite3Hooks (71a7813c)]</code>) with <code>xor eax, eax</code> prevents the crash. Modifying the original C code to add an extra check to the <code>if( createFlag==1 )</code> statement changes the relative offsets of various jumps in the compiled code (as well as the location of the <code>test eax, eax</code> and <code>call eax</code> statements) and also seems to prevent the problem.</p> <p>The strangest result I've found so far is that changing the <code>jne</code> at <code>719f9fa0</code> to two <code>nop</code> instructions (so that control <em>always</em> falls through to the <code>test eax, eax</code> instruction, no matter what the value of <code>createFlag</code>/<code>esi</code> is) allows the program to run without crashing.</p>
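    <p>As a sanity check on the register dump, the flag mnemonics (<code>nz</code>, <code>pl</code>, <code>po</code>, <code>nc</code>, <code>ei</code>) can be cross-checked against the raw <code>efl=00010202</code> value. The following is a minimal sketch, assuming the standard x86 EFLAGS bit layout (CF is bit 0, PF bit 2, ZF bit 6, SF bit 7, IF bit 9). The macro names and the helper <code>flag_is_set</code> are hypothetical and are not part of SQLite or WinDbg.</p>

```c
#include <stdint.h>

/* Standard x86 EFLAGS bit masks (names chosen here for illustration). */
#define EFLAGS_CF 0x0001u /* carry flag,     bit 0 */
#define EFLAGS_PF 0x0004u /* parity flag,    bit 2 */
#define EFLAGS_ZF 0x0040u /* zero flag,      bit 6 */
#define EFLAGS_SF 0x0080u /* sign flag,      bit 7 */
#define EFLAGS_IF 0x0200u /* interrupt flag, bit 9 */

/* The efl value copied from the register dump in the question. */
#define DUMP_EFL 0x00010202u

/* Hypothetical helper: returns 1 if the flag selected by `mask`
 * is set in the EFLAGS value `efl`, and 0 otherwise. */
static int flag_is_set(uint32_t efl, uint32_t mask) {
    return (efl & mask) != 0;
}
```

    <p>With these definitions, <code>flag_is_set(DUMP_EFL, EFLAGS_ZF)</code> evaluates to 0, which agrees with the <code>nz</code> shown by the debugger: the zero flag really is clear at the time of the crash, even though <code>eax</code> is 0.</p>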