
It sounds like you're working with a set of issues more than a specific C# feeding-speed issue.

A few questions up front - is this 60GB of data to be consumed every weekend, or is it an initial backfill of the system? Does the data exist as items on the filesystem local to the ESP install, or elsewhere? Is this a single internal ESP deployment, or a solution you're looking to replicate in multiple places? Single-node install or multiple (or rather... how many - the single-node cap is 20 docprocs)?

ESP performance is usually limited by the number of documents to be handled more than the number of files. Assuming your data ranges between email-sized documents (35KB) and filesystem-sized documents (350KB), your 60GB equates to between 180k and 1.8 million docs, so to feed that over 48 hours you need to feed between 3,750 and 37,500 documents per hour. Not a very high target on modern hardware (if you installed this on a VM... well... all bets are off, it'd be better off on a laptop).

For feeding you have a choice between faster coding & more control: either manage the batches you feed yourself, or use the DocumentFeeder framework in the API, which abstracts a lot of the batch logic. If you're just going for 37.5k docs/hr I'd save the overhead and just use DocumentFeeder - though take care with its config params. DocumentFeeder allows you to treat your content on a per-document basis instead of creating the batches yourself, and it also allows for some measure of automatic retrying based on config. The general target should be a maximum of 50MB of content per batch or 100 docs, whichever comes first. Larger docs should be sent in smaller batches... so if you have a 50MB file, it should ideally be sent by itself, etc. (a sketch of this batching logic follows at the end of this answer). With DocumentFeeder you actually lose control of the batches it forms... so the logic there is a best effort on the part of your code.

Use the callbacks to monitor how well the content is making it into the system, and set limits on how many documents you have fed that you haven't yet received final callbacks for. The target should be X batches submitted at any given time -or- Y MB, pausing at whichever cutoff is hit first. X should be about 20 + the number of document processors; Y should be in the area of 500-1000MB (see the throttle sketch below). With DocumentFeeder it's just a pass/fail per doc; with the traditional system it's more detailed. Only wait for the 'secured' callback... that tells you the document has been processed & will be indexed... waiting for it to become searchable is pointless.

Set some limits on your content... in general ESP will break down with very large files; there's a hard limit at 2GB since the processes are still 32-bit, but in reality anything over 50MB should only have its metadata fed in. Also... avoid feeding log data, it'll crush the internal structures, killing perf if not erroring out. Things can be done in the pipeline to modify what's searchable to ease the pain of some log data.

You also need to make sure your index is configured well: at least 6 partitions, with a focus on keeping the lower-order ones fairly empty. Hard to go into the details of that one without knowing more about the deployment. The pipeline config can have a big impact as well... no document should ever take 5-8 hours. Make sure to replace any searchexport or htmlexport stages being used with custom instances with a sane timeout (30-60 sec) - the default is no timeout.

Last point... odds are that no matter how your feeding is configured, the pipeline will error out on some documents. You'll need to be prepared to either accept that or refeed just the metadata (there are other options, but kinda outside the scope here).

Good luck.
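To make the batching rule concrete, here is a minimal C# sketch of the 100-docs-or-50MB packing logic for the manage-the-batches-yourself route. It is not the ESP API; `Doc` and `Batcher` are made-up names for illustration, and DocumentFeeder would normally handle this for you.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for whatever your document handle looks like.
public record Doc(string Id, long SizeBytes);

public static class Batcher
{
    const int MaxDocsPerBatch = 100;                  // 100 docs per batch...
    const long MaxBytesPerBatch = 50L * 1024 * 1024;  // ...or 50 MB, whichever comes first

    // Greedily packs documents into batches, closing the current batch as
    // soon as adding the next document would break either cap. A document
    // at or over the byte cap therefore travels in a batch of its own.
    public static IEnumerable<List<Doc>> Batch(IEnumerable<Doc> docs)
    {
        var batch = new List<Doc>();
        long batchBytes = 0;

        foreach (var doc in docs)
        {
            bool wouldOverflow =
                batch.Count + 1 > MaxDocsPerBatch ||
                batchBytes + doc.SizeBytes > MaxBytesPerBatch;

            if (wouldOverflow && batch.Count > 0)
            {
                yield return batch;
                batch = new List<Doc>();
                batchBytes = 0;
            }

            batch.Add(doc);
            batchBytes += doc.SizeBytes;
        }

        if (batch.Count > 0)
            yield return batch;
    }
}
```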
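The in-flight window from the callback paragraph can be as simple as a pair of counters guarded by a lock: block submission once X batches or Y MB are outstanding, and release the window from the 'secured' callback. `FeedThrottle` and its method names are invented for this sketch; the real callback wiring depends on which API you use.

```csharp
using System.Threading;

// Sketch of the submit-side throttle: at most (20 + docprocs) batches
// and a configurable number of bytes in flight at any one time.
public class FeedThrottle
{
    readonly object _gate = new object();
    readonly int _maxBatchesInFlight;   // ~20 + number of document processors
    readonly long _maxBytesInFlight;    // ~500-1000 MB

    int _batchesInFlight;
    long _bytesInFlight;

    public FeedThrottle(int docProcs, long maxBytesInFlight)
    {
        _maxBatchesInFlight = 20 + docProcs;
        _maxBytesInFlight = maxBytesInFlight;
    }

    // Call before submitting a batch; blocks until the window has room.
    public void WaitToSubmit(long batchBytes)
    {
        lock (_gate)
        {
            while (_batchesInFlight >= _maxBatchesInFlight ||
                   _bytesInFlight + batchBytes > _maxBytesInFlight)
            {
                Monitor.Wait(_gate);
            }
            _batchesInFlight++;
            _bytesInFlight += batchBytes;
        }
    }

    // Call from the 'secured' (or failure) callback to free the window.
    public void OnBatchCompleted(long batchBytes)
    {
        lock (_gate)
        {
            _batchesInFlight--;
            _bytesInFlight -= batchBytes;
            Monitor.PulseAll(_gate);
        }
    }
}
```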
Comments:

1. I'm sorry it took me so long to get back to you. This is a single-node install. I have yet to find out the exact number of documents, but your general approach to this problem is fantastic. I suppose you have real-world experience with this? :) Thanks again for this.
2. Hi, no worries about the delay... I'm not exactly quick to reply myself. Yeah, plenty of real-world experience... I worked for FAST pre-Microsoft, and I work with it as my daily job. If it's just about getting the files in (no external metadata), you might want to see if you're licensed for the filetraverser; it's the best feeder FAST ever produced (ironically written in Python) and it handles all the batching & retries like a champ. With a single-node install things are a bit easier to handle... and if it's a one-time push of 60GB, even simpler (vs. 60GB/week). When it works well, FAST ESP is great... but it can get tripped up easily.
3. Hi, somehow Stack Overflow does not inform me of new comments, but anyway. Right now we are using a component hand-written in C# to slow down the dataflow to the document dispatcher/docproc, because the documents come in so fast that the index gets shut down (as far as I understand it, it needs to reindex the data, which takes about 8 hours). So they slow it down to a maximum number of docs or MB using a bunch of threads. Each thread then waits for an async callback from the FAST API to notify the caller. At this point we are still investigating whether this is actually necessary. Your thoughts?