Note that there are some explanatory texts on larger screens.

plurals
  1. POHow to handle new files to process in cron job
    text
    copied!<p>How can I check files that I already processed in a script so I don't process those again? and/or What is wrong with the way I am doing this now?</p> <p>Hello, I am running <a href="http://www.wireshark.org/docs/man-pages/tshark.html" rel="nofollow noreferrer">tshark</a> with the ring buffer option to dump to files after 5MB or 1 hour. I wrote a python script to read these files in XML and dump into a database, this works fine.</p> <p>My issue is that this is really process intensive, one of those 5MB can turn into a 200MB file when converted to XML, so I do not want to do any unnecessary processing.</p> <p>The script is running every 10 minutes and processes ~5 files per run, since is scanning the folder where the files are created for any new entries, I dump a hash of the file into the database and on the next run check the hash and if it isn't in the database I scan the file. The problem is that this does not appear to work every time, it ends up processing files that it has already <em>done</em>. When I check the hash of the file that it keeps trying to process it doesn't show up anywhere in the database, hence why is trying to process it over and over. </p> <p>I am printing out the filename + hash in the output of the script: <PRE> using file /var/ss01/SS01_00086_20100107100828.cap with hash: 982d664b574b84d6a8a5093889454e59 using file /var/ss02/SS02_00053_20100106125828.cap with hash: 8caceb6af7328c4aed2ea349062b74e9 using file /var/ss02/SS02_00075_20100106184519.cap with hash: 1b664b2e900d56ca9750d27ed1ec28fc using file /var/ss02/SS02_00098_20100107104437.cap with hash: e0d7f5b004016febe707e9823f339fce using file /var/ss02/SS02_00095_20100105132356.cap with hash: 41a3938150ec8e2d48ae9498c79a8d0c using file /var/ss02/SS02_00097_20100107103332.cap with hash: 4e08b6926c87f5967484add22a76f220 using file /var/ss02/SS02_00090_20100105122531.cap with hash: 470b378ee5a2f4a14ca28330c2009f56 using file /var/ss03/SS03_00089_20100107104530.cap with hash: 468a01753a97a6a5dfa60418064574cc using file /var/ss03/SS03_00086_20100105122537.cap with hash: 1fb8641f10f733384de01e94926e0853 using file /var/ss03/SS03_00090_20100107105832.cap with hash: d6209e65348029c3d211d1715301b9f8 using file /var/ss03/SS03_00088_20100107103248.cap with hash: 56a26b4e84b853e1f2128c831628c65e using file /var/ss03/SS03_00072_20100105093543.cap with hash: dca18deb04b7c08e206a3b6f62262465 using file /var/ss03/SS03_00050_20100106140218.cap with hash: 36761e3f67017c626563601eaf68a133 using file /var/ss04/SS04_00010_20100105105912.cap with hash: 5188dc70616fa2971d57d4bfe029ec46 using file /var/ss04/SS04_00071_20100107094806.cap with hash: ab72eaddd9f368e01f9a57471ccead1a using file /var/ss04/SS04_00072_20100107100234.cap with hash: 79dea347b04a05753cb4ff3576883494 using file /var/ss04/SS04_00070_20100107093350.cap with hash: 535920197129176c4d7a9891c71e0243 using file /var/ss04/SS04_00067_20100107084826.cap with hash: 64a88ecc1253e67d49e3cb68febb2e25 using file /var/ss04/SS04_00042_20100106144048.cap with hash: bb9bfa773f3bf94fd3af2514395d8d9e using file /var/ss04/SS04_00007_20100105101951.cap with hash: d949e673f6138af2d388884f4a6b0f08 </PRE></p> <p>The only files it should be doing are one per folder, so only 4 files. This causes unecessary processing and I have to deal with overlapping cron jobs + other services been affected.</p> <p>What I am hoping to get from this post is a better way to do this or hopefully someone can tell me why is happening, I know that the latter might be hard since it can be a bunch of reasons. </p> <p>Here is the <a href="http://python.pastebin.com/m744c2673" rel="nofollow noreferrer">code</a> (I am not a coder but a sys admin so be kind :P) line 30-32 handle the hash comparisons. Thanks in advance.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload