
What is the fastest way in Python to convert a string of formatted numbers into a numpy array?
<p>I have a large ASCII file (~100 GB) consisting of roughly 1,000,000 lines of numbers in a known format, which I am trying to process with Python. The file is too large to read completely into memory, so I decided to process it line by line:</p> <pre><code>fp = open(file_name)
for count, line in enumerate(fp):
    data = np.array(line.split(), dtype=np.float)
    # do stuff
fp.close()
</code></pre> <p>It turns out that I spend most of my program's run time in the <code>data =</code> line. Is there any way to speed it up? The execution speed also seems much slower than what I could get from a native FORTRAN program with formatted reads (see this <a href="https://stackoverflow.com/questions/15834327/strange-accuracy-difference-between-ipython-and-ipython-notebook-then-using-fort">question</a>; I implemented a FORTRAN string processor and used it via f2py, but its run time was only comparable to that of the <code>data =</code> line. I guess the I/O handling and type conversions between Python and FORTRAN killed whatever I gained from FORTRAN).</p> <p>Since I know the formatting, shouldn't there be a better and faster way than <code>split()</code>? Something like:</p> <pre><code>data = readf(line, '(1000F20.10)')
</code></pre> <p>I tried the <a href="https://pypi.python.org/pypi/fortranformat" rel="nofollow noreferrer">fortranformat</a> package, which worked well, but in my case was three times slower than the <code>split()</code> approach.</p> <p>P.S. As suggested by ExP and root, I tried <code>np.fromstring</code> and made this quick and dirty benchmark:</p> <pre><code>t1 = time.time()
for i in range(500):
    data = np.array(line.split(), dtype=np.float)
t2 = time.time()
print (t2-t1)/500
print data.shape
print data[0]

0.00160977363586
(9002,)
0.0015162509
</code></pre> <p>and:</p> <pre><code>t1 = time.time()
for i in range(500):
    data = np.fromstring(line, sep=' ', dtype=np.float, count=9002)
t2 = time.time()
print (t2-t1)/500
print data.shape
print data[0]

0.00159792804718
(9002,)
0.0015162509
</code></pre> <p>so <code>fromstring</code> is in fact slightly slower in my case.</p>
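The two approaches being benchmarked can be sketched on a self-contained example. The sample line below is a made-up stand-in for one line of the asker's 100 GB file (1000 values in FORTRAN `F20.10` fixed-width format, not the actual data), and `np.float64` replaces the now-removed `np.float` alias:

```python
import numpy as np

# Hypothetical stand-in for one line of the file: 1000 floats,
# each printed in fixed-width F20.10 format.
line = "".join("%20.10f" % (0.1 * i) for i in range(1000))

# Approach 1: split on whitespace, then convert the list of tokens.
a = np.array(line.split(), dtype=np.float64)

# Approach 2: let NumPy parse the string directly; a known `count`
# lets it allocate the output array up front.
b = np.fromstring(line, dtype=np.float64, sep=" ", count=1000)
```

Both parse the same values; which one is faster depends on the NumPy version and the data, which is why a benchmark like the one above is worth running locally (for steadier timings than a manual `time.time()` loop, the standard-library `timeit` module repeats the measurement and reports the best run).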
 
