StackOverflow2013

<p>These are my thoughts only, not a complete solution. They are untested in practice but may still touch on a number of interesting problems and potential solutions.</p>
<p>The times at which nodes fail and rejoin must be recorded and managed in a standardised way. To achieve this, the network does not calculate on a real-time basis but on an animation frame number basis. Have N front-end processors (FEPs) assign an FEP ID, a job ID and the network animation frame number to each incoming job. Some problems with real time are not fully addressed even by quantizing time; in some exceptional cases it's a bit like accounting, where events are posted to when they should be regarded as occurring rather than when any cash moves.</p>
<p>For high performance, the heartbeat packets must also contain details of jobs being performed and recently completed or abandoned, as well as the inventory of hosts in the network.</p>
<p>The network proceeds to process work items and publish their results to adjacent peers or FEPs. FEPs forward completed job details to clients, and can take over for failed FEPs because the only state in an FEP is the last serial number stamped on a request.</p>
<p>The network must have a quorum to continue. External monitors track connectivity and inform the nodes that experience changes in connectivity whether they are now within or outside the quorum.</p>
<p>Where a work item is not completed because its machine fails, or where a new node joins the network, a new work allocation policy must be established, based on work item ID, to allocate the work to the remaining nodes until the failed node comes back online.</p>
<p>Where multiple nodes perform the same job (duplication of effort, which is possible but minimised by designing the usual timeouts sensibly), the jobs must support rollback, and the conflict must be resolved using Markov chains.</p>
<p>To detect possible duplications reliably, jobs must auto-rollback in less time than the timeout for receiving job results that applies during a <em>crisis period</em>, i.e. when nodes are failing. A shorter timeout applies when nodes are not failing.</p>
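<p>As a rough illustration of the stamping, quorum and reallocation ideas above, here is a minimal Python sketch. Everything named in it (<code>JobStamp</code>, <code>FrontEndProcessor</code>, <code>has_quorum</code>, <code>owner</code>) is hypothetical. It assumes a quorum means a strict majority of known nodes, and it swaps in rendezvous (highest-random-weight) hashing as one possible allocation policy keyed on work item ID; neither choice is prescribed above.</p>

```python
import hashlib
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class JobStamp:
    """Stamp attached to an incoming job by a front-end processor (FEP)."""
    fep_id: int   # which FEP accepted the job
    serial: int   # per-FEP serial number, the only state an FEP keeps
    frame: int    # network animation frame number, not wall-clock time

    @property
    def job_id(self) -> str:
        # Assumption: FEP ID plus serial is enough to identify a job.
        return f"{self.fep_id}:{self.serial}"


class FrontEndProcessor:
    def __init__(self, fep_id: int, last_serial: int = 0):
        # A replacement FEP can take over given only the last serial issued.
        self.fep_id = fep_id
        self._serials = count(last_serial + 1)

    def stamp(self, frame: int) -> JobStamp:
        return JobStamp(self.fep_id, next(self._serials), frame)


def has_quorum(live_nodes, all_nodes) -> bool:
    # Assumption: quorum is a strict majority of all known nodes.
    return 2 * len(set(live_nodes)) > len(set(all_nodes))


def owner(job_id: str, live_nodes) -> str:
    """Pick the node responsible for a job via rendezvous hashing.

    Every node computes the same answer from the job ID and the set of
    live nodes, so no coordination is needed beyond agreeing on membership.
    """
    def weight(node: str) -> int:
        digest = hashlib.sha256(f"{node}|{job_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # sorted() makes the result independent of the input ordering.
    return max(sorted(live_nodes), key=weight)
```

<p>With this policy, when a node drops out only the jobs it owned move to other nodes, and they move back once it rejoins, because each job's ranking of the surviving nodes is unchanged. Duplicate detection, rollback and the two timeouts are not covered by this sketch.</p>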
