<p>Overview:</p>
<p>A <strong>naïve</strong> algorithm keeps track of word frequencies on a per-column basis, assuming that each line can be split into columns by a delimiter.</p>
<p>Example input:</p>
<blockquote> <p>The dog jumped over the moon<br> The cat jumped over the moon<br> The moon jumped over the moon<br> The car jumped over the moon </p> </blockquote>
<p>Frequencies:</p>
<pre><code>Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}
</code></pre>
<p>We could partition these frequency lists further by grouping lines on their total number of fields, but in this simple and convenient example every line has the same number of fields (6).</p>
<p>The next step is to iterate through the lines that generated these frequency lists, so let's take the first line of the example.</p>
<ol>
<li><strong>The</strong>: meets some hand-wavy criteria (here it fills its entire column), so the algorithm decides it must be static.</li>
<li><strong>dog</strong>: doesn't appear to be static based on the rest of the frequency list for its column, so it must be dynamic rather than static text. We loop through a few pre-defined regular expressions and come up with <code>/[a-z]+/i</code>.</li>
<li><strong>jumped</strong>: same deal as #1; it's static, so leave as is.</li>
<li><strong>over</strong>: same deal as #1; it's static, so leave as is.</li>
<li><strong>the</strong>: same deal as #1; it's static, so leave as is.</li>
<li><strong>moon</strong>: same deal as #1; it's static, so leave as is.</li>
</ol>
<p>Thus, just from going over the first line, we can put together the following regular expression:</p>
<pre><code>/The ([a-z]+?) jumped over the moon/
</code></pre>
<p>Considerations:</p>
<ul>
<li><p>One can choose to scan part of the document or all of it for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.</p></li>
<li><p>False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to find the best threshold between static and dynamic fields, or up to some human post-processing.</p></li>
<li><p>The overall idea is probably a good one, but the actual implementation will determine the speed and efficiency of this algorithm.</p></li>
</ul>
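<p>The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the 0.9 threshold that stands in for the "hand-wavy criteria", and the short candidate list of pre-defined patterns are all hypothetical choices, not part of the original description.</p>

```python
import re
from collections import Counter

# Assumed list of pre-defined patterns, tried in order (hypothetical).
CANDIDATES = [r"\d+", r"[a-z]+"]


def column_frequencies(lines, delimiter=" "):
    """Count how often each token appears at each column position."""
    columns = []
    for line in lines:
        for index, token in enumerate(line.split(delimiter)):
            if index == len(columns):
                columns.append(Counter())
            columns[index][token] += 1
    return columns


def pick_candidate(tokens):
    """Return the first candidate pattern matching every token in a column."""
    for candidate in CANDIDATES:
        if all(re.fullmatch(candidate, t, re.IGNORECASE) for t in tokens):
            return candidate
    return r"\S+"  # assumed fallback: any run of non-whitespace


def infer_pattern(line, columns, total, threshold=0.9, delimiter=" "):
    """Build a regex source string from one sample line.

    A token counts as static when it fills at least `threshold` of its
    column (an assumed rule); otherwise it becomes a capture group.
    The delimiter is assumed to have no special meaning in a regex.
    """
    parts = []
    for index, token in enumerate(line.split(delimiter)):
        if columns[index][token] / total >= threshold:
            parts.append(re.escape(token))
        else:
            parts.append("(" + pick_candidate(columns[index]) + ")")
    return delimiter.join(parts)


lines = [
    "The dog jumped over the moon",
    "The cat jumped over the moon",
    "The moon jumped over the moon",
    "The car jumped over the moon",
]
columns = column_frequencies(lines)
pattern = infer_pattern(lines[0], columns, total=len(lines))
compiled = re.compile(pattern, re.IGNORECASE)
```

<p>With the example input, only the second column falls below the threshold, so it is the only field replaced by a capture group; the compiled pattern can then be run against every line with <code>compiled.fullmatch(line)</code> to extract the dynamic field.</p>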