  1. (Python) How to reduce runtime of big datasets parsing
    <p>I need to read, parse, and merge two huge text files and write the result to a new file. <br>A third file is also used for this parsing. <br>Briefly, each of the two text files has about 100 million rows and three columns. <br>The script reads both files and, for each key, writes the matching values from the two files to the new file. <br>If a key has no match in one of the input files, 0.0 is inserted for that file's values in the row. <br>To make the parsing more efficient, I created another input file that is the union of the first column (the key) of the two text files, shown below. <br>I tested this code with small input files (10,000 rows) and it worked well. I started running it on the full datasets two days ago and, unfortunately, it is still running. <br>How can I reduce the running time and parse the files efficiently?</p>
    <p>1st_infile.txt</p>
<pre><code>MARCH2_MARCH2 2.3 0.1
MARCH2_MARC2 -0.2 0
MARCH2_MARCH5 -0.3 0.3
MARCH2_MARCH6 -1.4 0
MARCH2_MARCH7 0.1 0
MARCH2_SEPT2 -1.0 0
MARCH2_SEPT4 0.8 0
</code></pre>
    <p>2nd_infile.txt</p>
<pre><code>MARCH2_MARCH2 2.2 0
MARCH2_MARCH2.1 0.2 0
MARCH2_MARCH3 -0.4 0
MARCH2_MARCH5 -0.3 0
MARCH2_MARCH6 -0.6 0
MARCH2_MARCH7 1.2 0
MARCH2_SEPT2 0.2 0
</code></pre>
    <p>union_file.txt</p>
<pre><code>MARCH2_MARCH2
MARCH2_MARCH2.1
MARCH2_MARC2
MARCH2_MARCH5
MARCH2_MARCH6
MARCH2_MARCH7
MARCH2_SEPT2
MARCH2_SEPT4
MARCH2_MARCH3
</code></pre>
    <p>Outfile.txt</p>
<pre><code>MARCH2_MARCH2 2.3 0.1 2.2 0
MARCH2_MARCH2.1 0.0 0.0 0.2 0
MARCH2_MARC2 -0.2 0 0.0 0.0
MARCH2_MARCH5 -0.3 0.3 -0.3 0
MARCH2_MARCH6 -1.4 0 -0.6 0
MARCH2_MARCH7 0.1 0 1.2 0
MARCH2_SEPT2 -1.0 0 0.2 0
MARCH2_SEPT4 0.8 0 0.0 0.0
MARCH2_MARCH3 0.0 0.0 -0.4 0
</code></pre>
    <p>Python.py</p>
<pre><code>def load(filename):
    # Map each key (1st column) to its two values; malformed lines are skipped.
    ret = {}
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            try:
                name, value1, value2 = line.split()
            except ValueError:
                print('Skip invalid line {}:{}: {!r}'.format(filename, lineno, line))
                continue
            ret[name] = value1, value2
    return ret

a, b = load('1st_infile.txt'), load('2nd_infile.txt')

# Filler for a key that is missing from one of the input files.
missing = ('0.0', '0.0')

with open('union_file.txt') as f:
    with open('Outfile.txt', 'w') as fout:
        for line in f:
            name = line.strip()
            fout.write('{0:&lt;20} {1[0]:&gt;5} {1[1]:&gt;5} {2[0]:&gt;5} {2[1]:&gt;5}\n'.format(
                name, a.get(name, missing), b.get(name, missing)
            ))
</code></pre>
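One way the merge described above might be restructured is sketched below. It is only an illustration, not a measured fix: it loads just the second file into a dictionary, streams the first file line by line, and then appends the keys that appear only in the second file, so the separate union file is not needed. The names `merge_files` and `MISSING` are hypothetical, keys are assumed to be unique within each file, and the output order (first-file order, then leftovers) differs from the union-file order shown in the question.

```python
MISSING = ("0.0", "0.0")  # filler for a key absent from one input (assumed format)


def load(filename):
    """Read 'key value1 value2' lines into a dict, skipping malformed lines."""
    table = {}
    with open(filename) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                table[parts[0]] = (parts[1], parts[2])
    return table


def merge_files(first_path, second_path, out_path):
    """Stream the first file, look keys up in the second, then append leftovers."""
    second = load(second_path)  # only one of the two inputs is held in memory
    with open(first_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip malformed lines, as load() does
            name, v1, v2 = parts
            # pop() removes matched keys, so only unmatched keys remain afterwards
            w1, w2 = second.pop(name, MISSING)
            fout.write(f"{name} {v1} {v2} {w1} {w2}\n")
        # Keys that occur only in the second file get the filler for the first file
        for name, (w1, w2) in second.items():
            fout.write(f"{name} {MISSING[0]} {MISSING[1]} {w1} {w2}\n")
```

Calling `merge_files('1st_infile.txt', '2nd_infile.txt', 'Outfile.txt')` would produce one row per key found in either file. Whether this is fast enough for 100 million rows depends on available memory, since one full dictionary still has to fit in RAM.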