    <p>Data loss after deadlock - SQL Server 2008, Ruby on Rails, Phusion Passenger, Linux, FreeTDS</p> <p>I am confronted with a mysterious problem that caused data loss in a Ruby on Rails intranet application that I am responsible for. Apologies if this is not, strictly speaking, a programming problem; at least I maintain the application's Ruby code. The problem has so far occurred three times in two years.</p> <p>The environment:</p> <ul> <li>Linux - RedHat Enterprise Server 5.2</li> <li>Apache 2 web server (httpd-2.2.3-11.el5_2.4.rpm)</li> <li>Phusion Passenger 2.2.15</li> <li>Ruby 1.8.7, Rails 2.3.8, with gems: <ul> <li>actionmailer (2.3.8)</li> <li>actionpack (2.3.8)</li> <li>activerecord (2.3.8)</li> <li>activerecord-sqlserver-adapter (2.3.8)</li> <li>activeresource (2.3.8)</li> <li>activesupport (2.3.8)</li> <li>akami (1.2.0)</li> <li>builder (3.0.0)</li> <li>exception_notification (2.3.3.0)</li> <li>fastthread (1.0.7)</li> <li>gyoku (0.4.6)</li> <li>httpi (1.1.1)</li> <li>mime-types (1.16)</li> <li>nokogiri (1.4.4)</li> <li>nori (1.1.3)</li> <li>passenger (2.2.15)</li> <li>rack (1.1.0)</li> <li>rails (2.3.8)</li> <li>rake (0.8.7)</li> <li>ruby-net-ldap (0.0.4)</li> <li>rubyjedi-actionwebservice (2.3.5.20100714122544)</li> <li>savon (1.1.0)</li> <li>wasabi (2.5.1)</li> <li>will_paginate (2.3.14)</li> </ul></li> <li>SQL Server 2008 database server</li> <li>Database access through ActiveRecord</li> <li>Database driver: freetds-0.82, unixODBC-2.3.0.tar.gz, ruby-odbc-0.99991.tar.gz</li> </ul> <p>Symptoms:</p> <ul> <li>User actions requesting locks on database resources were involved in a deadlock.</li> <li>SQL Server resolved the deadlock by killing process(es) involved in it, so that at least some of the involved processes could complete successfully. 
</li> <li>On the Rails application side, the deadlocks resulted in unhandled exceptions (of which I was notified through the exception_notification gem).</li> <li>After the deadlock, the number of active Rails processes kept increasing (which triggered another notification from our monitoring system); the processes seemed to be hanging.</li> <li>Why this happened is unknown. The processes seemed to be hanging in database operations (according to the Rails logs). Normally I would have expected that SQL Server's deadlock resolution does not leave blocking processes hanging around.</li> <li>In the first two cases, I restarted the web server in reaction to the exceptions and hanging processes. In the third case (I was on vacation), nobody reacted to the notifications, but a cron job running on the weekend apparently stopped the processes too (a soft restart through Passenger by touching "restart.txt", with the same effect).</li> <li>After the web server restarts, users reported data loss. Before the web server restart, data was processed as expected - from the users' point of view. Rails logs and data in other systems that communicate with ours seem to indicate that the transactions had been properly committed. After the web server restart, suddenly all of the database changes made since the deadlocks had occurred were missing. E.g., we have a "users" table with a "last_access" column that is updated on every user action. After the web server restart, the newest "last_access" value was one day old. All transactions seemed to be missing; only the @@IDENTITY values continued from the values that had been set before the data loss.</li> <li>I have received information from our IT department (who maintain the database server) that seems to indicate that all of the lost DB operations were part of one huge transaction, which was missing a final COMMIT. 
Of course, what I would expect is that every Rails user action runs one or more separate transactions, but the SQL Server transaction log shows all of the operations as part of that one huge transaction.</li> </ul> <p>It looks to me as if something like this happened:</p> <ul> <li>A bug in one of the involved components (e.g. Phusion Passenger, FreeTDS, SQL Server) caused the Rails processes that were running in parallel to share a database connection, and may also have caused the processes to hang.</li> <li>One of the involved processes was in a transaction and hanging somewhere before the COMMIT.</li> <li>Since the other processes shared the same connection (as I assume), they were also in the same transaction.</li> <li>Since the processes shared the connection, the users were able to see the data changes (before the web server restart), even though a COMMIT was pending.</li> <li>The web server restart forced the connection to abort and the transaction to be rolled back.</li> </ul> <p>Would that make sense? I'm wondering if anybody has had similar experiences or has hints on where I could look further. I suspected a bug in Passenger that may have let forked processes inherit the file descriptor of the database connection, but I cannot reproduce it. Passenger seems to properly create new DB connections on every fork.</p> <p>I am considering changing the database's isolation model to "read committed snapshot" to reduce the number of deadlocks, but I'm aware that this doesn't fix the root cause and that it might cause other problems for me.</p>
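<p>To make the connection-sharing hypothesis concrete, here is a minimal Ruby sketch of what descriptor inheritance across <code>fork</code> looks like in general. It is only an illustration under stated assumptions, not a reproduction of the incident: it assumes a Unix-like system, uses <code>UNIXSocket.pair</code> as a stand-in for the database connection, and <code>run_shared_connection_demo</code> is a hypothetical helper name. No database driver, Passenger, or FreeTDS code is involved.</p>

```ruby
require "socket"

# Illustration only: a socket pair stands in for a database connection.
# "client" plays the application's connection handle, "server" plays the
# single session that the database server would see for that connection.
# run_shared_connection_demo is a hypothetical name, not part of any library.
def run_shared_connection_demo(worker_count = 2)
  client, server = UNIXSocket.pair

  pids = worker_count.times.map do |i|
    fork do
      server.close
      # The child never opens a connection of its own; it writes to the
      # descriptor it inherited from the parent process.
      client.write("worker #{i} pid #{Process.pid}\n")
      client.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
  end

  client.close # the parent's copy; each child held its own copy
  pids.each { |pid| Process.wait(pid) }

  # The one "server session" received the output of every worker process.
  lines = server.read.lines.map(&:chomp)
  server.close
  lines
end

puts run_shared_connection_demo if __FILE__ == $0
```

<p>Every worker's line arrives on the same server-side session, which is the situation in which several Rails processes would end up inside one open transaction. Passenger documents a <code>starting_worker_process</code> event hook for re-establishing connections after a fork; whether it applies here depends on the spawn method in use, so treat that as something to check rather than a diagnosis.</p>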