Parsing the same file in parallel threads does NOT add speed; it just costs extra resources.

A less problematic and more efficient text2db optimisation consists of:

- bulk read the file (rather than line by line: read 1 MB at once, process it, then read the next MB)

- bulk insert into the database - in MySQL like this:

  ```sql
  insert into urtable
  values ('val1','val2'),
         ('val1','val2');
  ```

  (example stolen from http://bytes.com/topic/sql-server/answers/585793-insert-into-using-select-values-inserting-multiple-rows - sorry for being too lazy to make one up by myself)

- try to prevent SQL back and forth (meaning: if SELECT output is needed from the database to enrich your dataset, read it upfront instead of over and over while walking through the file)

A Python sketch of bulk reading, batched inserts and the upfront lookup follows at the end of this answer.

UPDATE ----

From the comment I took that there might be a need to get data from the database while parsing the file. Well, if you have to, you have to. BUT: try not to have to.

First of all: reading specific data upfront can be seen as caching or not. In a narrow sense, caching just means moving disk data to memory by some heuristic (without knowing what is going on), and I personally try to avoid that because heuristics can play against you. In a wider sense, caching is what I described before PLUS moving data from disk to memory that you can pinpoint (e.g. by ID or by any filter criteria). So I still dislike the narrow-sense part, but I do like the behaviour of selecting well-defined data upfront.

Secondly: in my experience, IF you work on a fully normalized data model, database reads during file parsing very often boil down to "give me the primary key(s) of what I dumped into the database before". This looks tricky when you write multiple rows at once. However, especially in MySQL you can rely on each INSERT statement (even a multiple-row insert) being atomic; you get the ID of the first inserted row from last_insert_id() and from there you can work out the IDs of all the records you just wrote (see the second sketch below). I am pretty sure other database systems offer something similar.

Thirdly: parsing LARGE files is something I would run as a job, triggered by only ONE technical user, while ensuring that no more than one of these processes runs in parallel. Otherwise you have to work around all sorts of issues, starting with file locking and going on to session and read/write permission management. Running it as a job also justifies (at least in my personal policy) allocating LOTS of RAM - depending on cost and on how important speed is. With that much RAM available, loading even a 100 K row keyword-to-id table into memory upfront is not a problem.
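Here is the first sketch: the bullet points above turned into a small loader. This is only a minimal illustration, assuming the mysql-connector-python driver, a tab-separated input file, a hypothetical target table `urtable(val1, val2)` and a hypothetical `keywords(keyword, id)` lookup table; the connection details, file name, chunk size and batch size are placeholders, not recommendations.

```python
import mysql.connector  # assumption: mysql-connector-python is the driver in use

CHUNK_SIZE = 1024 * 1024   # read ~1 MB at a time instead of line by line
BATCH_SIZE = 1000          # rows per multi-row INSERT

def iter_lines(path, chunk_size=CHUNK_SIZE):
    """Yield complete lines while reading the file in large chunks."""
    remainder = ""
    with open(path, "r", encoding="utf-8") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            chunk = remainder + chunk
            lines = chunk.split("\n")
            remainder = lines.pop()   # keep a possibly incomplete last line
            for line in lines:
                yield line
    if remainder:
        yield remainder

conn = mysql.connector.connect(user="loader", password="secret", database="mydb")
cur = conn.cursor()

# read lookup data upfront instead of issuing one SELECT per parsed line
cur.execute("SELECT keyword, id FROM keywords")
keyword_to_id = {keyword: kw_id for keyword, kw_id in cur.fetchall()}

def flush(batch):
    # executemany lets the driver send the batch as one multi-row INSERT
    cur.executemany("INSERT INTO urtable (val1, val2) VALUES (%s, %s)", batch)
    conn.commit()

batch = []
for line in iter_lines("big_input.txt"):
    keyword, val2 = line.split("\t")   # assumed tab-separated input
    batch.append((keyword_to_id.get(keyword), val2))
    if len(batch) >= BATCH_SIZE:
        flush(batch)
        batch = []
if batch:
    flush(batch)

cur.close()
conn.close()
```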
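And the second sketch, the last_insert_id() trick from the "Secondly" point. Again only an illustration: it assumes `urtable` has an AUTO_INCREMENT primary key and that either the consecutive auto-increment lock mode is in effect or no other session inserts into the table at the same time (which the "Thirdly" point enforces anyway), so a single multi-row INSERT gets a contiguous block of IDs starting at LAST_INSERT_ID(). The example rows and connection details are hypothetical.

```python
import mysql.connector  # assumption: same driver and connection details as above

conn = mysql.connector.connect(user="loader", password="secret", database="mydb")
cur = conn.cursor()

rows = [("alpha", "a1"), ("beta", "b1"), ("gamma", "g1")]

# build ONE multi-row INSERT so the whole batch is a single atomic statement
placeholders = ", ".join(["(%s, %s)"] * len(rows))
params = [value for row in rows for value in row]
cur.execute("INSERT INTO urtable (val1, val2) VALUES " + placeholders, params)

# LAST_INSERT_ID() after a multi-row INSERT is the auto-increment id of the
# FIRST row of that statement; the remaining rows follow it consecutively
cur.execute("SELECT LAST_INSERT_ID()")
first_id = cur.fetchone()[0]
ids = list(range(first_id, first_id + len(rows)))

# map what we just wrote back to its primary keys, no extra SELECT needed
id_by_val1 = dict(zip((val1 for val1, _ in rows), ids))
print(id_by_val1)   # e.g. {'alpha': 101, 'beta': 102, 'gamma': 103}

conn.commit()
cur.close()
conn.close()
```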