Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>Do it in code. Open up a cursor, then: grab a row, run it through the 40 regular expressions, and if it changed, save it back. Repeat until the cursor doesn't give you any more rows.</p> <p>Whether you do it that way or come up with the magical SQL expression, it's still going to be a row scan of the entire table, but the code will be much simpler.</p> <p><strong>Experimental Results</strong></p> <p>In response to criticism, I ran an experiment. I inserted 10,000 lines from a documentation file into a table with a serial primary key and a varchar column. Then I tested two ways to do the update. Method 1:</p> <pre><code>in a transaction: opened up a cursor (select for update) while reading 100 rows from the cursor returns any rows: for each row: for each regular expression: do the gsub on the text column update the row </code></pre> <p>This takes 1.16 seconds with a locally connected database.</p> <p>Then the "big replace," a single mega-regex update:</p> <blockquote> <p>update foo set t = regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(t, E'\bcommit\b', E'COMMIT'), E'\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b', E'9ACF10762B5F3D3B1B33EA07792A936A25E45010'), E'\bAuthor:\b', E'AUTHOR:'), E'\bCarl\b', E'CARL'), E'\bWorth\b', E'WORTH'), E'\b\b', E''), E'\bDate:\b', E'DATE:'), E'\bMon\b', E'MON'), E'\bOct\b', E'OCT'), E'\b26\b', E'26'), E'\b04:53:13\b', E'04:53:13'), E'\b2009\b', E'2009'), E'\b-0700\b', E'-0700'), E'\bUpdate\b', E'UPDATE'), E'\bversion\b', E'VERSION'), E'\bto\b', E'TO'), E'\b2.9.1\b', E'2.9.1'), E'\bcommit\b', E'COMMIT'), E'\b61c89e56f361fa860f18985137d6bf53f48c16ac\b', E'61C89E56F361FA860F18985137D6BF53F48C16AC'), E'\bAuthor:\b', E'AUTHOR:'), E'\bCarl\b', E'CARL'), E'\bWorth\b', E'WORTH'), E'\b\b', E''), E'\bDate:\b', E'DATE:'), E'\bMon\b', E'MON'), E'\bOct\b', E'OCT'), E'\b26\b', E'26'), E'\b04:51:58\b', E'04:51:58'), E'\b2009\b', E'2009'), E'\b-0700\b', E'-0700'), E'\bNEWS:\b', E'NEWS:'), E'\bAdd\b', E'ADD'), E'\bnotes\b', E'NOTES'), E'\bfor\b', E'FOR'), E'\bthe\b', E'THE'), E'\b2.9.1\b', E'2.9.1'), E'\brelease.\b', E'RELEASE.'), E'\bThanks\b', E'THANKS'), E'\bto\b', E'TO'), E'\beveryone\b', E'EVERYONE'), E'\bfor\b', E'FOR')</p> </blockquote> <p>The mega-regex update takes 0.94 seconds to update.</p> <p>At 0.94 seconds compared to 1.16, it's true that the mega-regex update is faster, running in 81% of the time of doing it in code. It is not, however a lot faster. And ye Gods, look at that update statement. Do you want to write that, or try to figure out what went wrong when Postgres complains that you dropped a parenthesis somewhere?</p> <p><strong>Code</strong></p> <p>The code used was:</p> <pre><code> def stupid_regex_replace sql = Select.new sql.select('id') sql.select('t') sql.for_update sql.from(TABLE_NAME) Cursor.new('foo', sql, {}, @db) do |cursor| until (rows = cursor.fetch(100)).empty? for row in rows for regex, replacement in regexes row['t'] = row['t'].gsub(regex, replacement) end end sql = Update.new(TABLE_NAME, @db) sql.set('t', row['t']) sql.where(['id = %s', row['id']]) sql.exec end end end </code></pre> <p>I generated the regular expressions dynamically by taking words from the file; for each word "foo", its regular expression was "\bfoo\b" and its replacement string was "FOO" (the word uppercased). I used words from the file to make sure that replacements did happen. I made the test program spit out the regex's so you can see them. Each pair is a regex and the corresponding replacement string:</p> <pre><code>[[/\bcommit\b/, "COMMIT"], [/\b9acf10762b5f3d3b1b33ea07792a936a25e45010\b/, "9ACF10762B5F3D3B1B33EA07792A936A25E45010"], [/\bAuthor:\b/, "AUTHOR:"], [/\bCarl\b/, "CARL"], [/\bWorth\b/, "WORTH"], [/\b&lt;cworth@cworth.org&gt;\b/, "&lt;CWORTH@CWORTH.ORG&gt;"], [/\bDate:\b/, "DATE:"], [/\bMon\b/, "MON"], [/\bOct\b/, "OCT"], [/\b26\b/, "26"], [/\b04:53:13\b/, "04:53:13"], [/\b2009\b/, "2009"], [/\b-0700\b/, "-0700"], [/\bUpdate\b/, "UPDATE"], [/\bversion\b/, "VERSION"], [/\bto\b/, "TO"], [/\b2.9.1\b/, "2.9.1"], [/\bcommit\b/, "COMMIT"], [/\b61c89e56f361fa860f18985137d6bf53f48c16ac\b/, "61C89E56F361FA860F18985137D6BF53F48C16AC"], [/\bAuthor:\b/, "AUTHOR:"], [/\bCarl\b/, "CARL"], [/\bWorth\b/, "WORTH"], [/\b&lt;cworth@cworth.org&gt;\b/, "&lt;CWORTH@CWORTH.ORG&gt;"], [/\bDate:\b/, "DATE:"], [/\bMon\b/, "MON"], [/\bOct\b/, "OCT"], [/\b26\b/, "26"], [/\b04:51:58\b/, "04:51:58"], [/\b2009\b/, "2009"], [/\b-0700\b/, "-0700"], [/\bNEWS:\b/, "NEWS:"], [/\bAdd\b/, "ADD"], [/\bnotes\b/, "NOTES"], [/\bfor\b/, "FOR"], [/\bthe\b/, "THE"], [/\b2.9.1\b/, "2.9.1"], [/\brelease.\b/, "RELEASE."], [/\bThanks\b/, "THANKS"], [/\bto\b/, "TO"], [/\beveryone\b/, "EVERYONE"], [/\bfor\b/, "FOR"]] </code></pre> <p>If this were a hand-generated list of regex's, and not automatically generated, my question is still appropriate: Which would you rather have to create or maintain?</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload