<p>You should probably try something like this (I've now tested it on the sample data you provided):</p>
<pre><code>#!/usr/bin/ruby
class A_splitter
  Title      = /^[A-Z]+[^a-z]*$/
  Byline     = /^BY /
  Number     = /^\d*$/
  Blank_line = /^ *$/
  attr_accessor :recent_lines,:in_references,:source_glob,:destination_path,:seen_in_last_file

  def initialize(src_glob,dst_path=nil)
    @recent_lines = []
    @seen_in_last_file = {}
    @in_references = false
    @source_glob = src_glob
    @destination_path = dst_path
    @destination = STDOUT
    @buffer = []
    split_em
  end

  # Start a new output: close the current file, or emit a marker line on STDOUT.
  def split_here
    if destination_path
      @destination.close if @destination
      @destination = nil
    else
      print "------------SPLIT HERE------------\n"
    end
    print recent_lines.shift
    @in_references = false
  end

  # A page break is title / blank line / page number, in either order.
  def at_page_break
    ((recent_lines[0] =~ Title and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
     (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
  end

  # Write each string individually; on Ruby 1.9 and later, appending the
  # whole array to an IO would write its inspect form instead of the text.
  def print(*args)
    args.flatten.each { |text| (@destination || @buffer) &lt;&lt; text }
  end

  def split_em
    Dir.glob(source_glob).sort.each { |filename|
      if destination_path
        @destination.close if @destination
        @destination = File.open(File.join(@destination_path,filename),'w')
        print @buffer
        @buffer.clear
      end
      in_header = true
      File.foreach(filename) { |line|
        line.gsub!(/\f/,'')
        if in_header and seen_in_last_file[line]
          # skip it
        else
          seen_in_last_file.clear if in_header
          in_header = false
          recent_lines &lt;&lt; line
          seen_in_last_file[line] = true
        end
        3.times {recent_lines.shift} if at_page_break
        if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
          split_here
        elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
          split_here
        elsif recent_lines.length &gt; 4
          @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
          print recent_lines.shift
        end
      }
    }
    print recent_lines
    @destination.close if @destination
  end
end

A_splitter.new('bul_*_*_*.txt','test_dir')
</code></pre>
<p>Basically, it runs through the files in order and, within each file, through the lines in order, omitting from the start of each file the lines that were already present in the preceding file. The rest is printed to STDOUT (from which it can be piped) unless a destination directory is specified (called 'test_dir' in the example; see the last line), in which case files are created in that directory, each with the same name as the file that contained the bulk of its contents.</p>
<p>It also removes the page-break sections (journal title, author, and page number).</p>
<p>It does two split tests:</p>
<ul>
<li>a test on the title/byline pair </li>
<li>a test on the first title line after a reference section</li>
</ul>
<p>(It should be obvious how to add tests for additional split points.)</p>
<p>Retained for posterity:</p>
<p>If you don't specify a destination directory, it simply puts a split-here line in the output stream at each split point. This should make testing easier (you can just <code>less</code> the output), and when you want the pieces in individual files you can pipe the output to <a href="http://www.gnu.org/software/coreutils/manual/html_node/csplit-invocation.html" rel="nofollow noreferrer"><code>csplit</code></a>, e.g. with</p>
<pre><code>csplit -f abstracts - '/---SPLIT HERE---/' '{*}'
</code></pre>
<p>or something similar, to cut it up.</p>
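<p>To try the two split tests on their own, without any files, a minimal sketch along the following lines may help. The names <code>TITLE</code>, <code>BYLINE</code>, <code>split_points</code> and the sample lines are hypothetical and only for illustration; the regular expressions are the ones used in the script, and the sketch leaves out the page-break removal and the header de-duplication.</p>

```ruby
# Hypothetical stand-alone sketch of the two split tests.
# The regexes are copied from the script; the names and sample data are made up.
TITLE  = /^[A-Z]+[^a-z]*$/
BYLINE = /^BY /

# Returns the indexes of the lines at which a new article would start:
# either a title directly followed by a byline, or the first digit-free
# title line seen after a REFERENCES heading.
def split_points(lines)
  in_references = false
  points = []
  lines.each_with_index do |line, i|
    following = lines[i + 1]
    title = line =~ TITLE
    if title && following && following =~ BYLINE
      points << i
      in_references = false
    elsif in_references && title && line !~ /\d/
      points << i
      in_references = false
    elsif line =~ /^REFERENCES *$/
      in_references = true
    end
  end
  points
end

sample = [
  "FIRST ARTICLE\n",
  "BY A. AUTHOR\n",
  "Some body text.\n",
  "REFERENCES\n",
  "1. A cited work.\n",
  "SECOND ARTICLE\n",
  "More body text.\n",
]

p split_points(sample)  # => [0, 5]
```

<p>Here the first split comes from the title/byline pair and the second from the title line that follows the reference section; an extra split rule would be one more <code>elsif</code> branch.</p>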