Note that there are some explanatory texts on larger screens.

plurals
  1. POCalculating the probability of system failure in a distributed network
    primarykey
    data
    text
    <p>I am trying to build a mathematical model of the availability of a file in a distributed file-system. I posted this question at MathOverflow but this might as well be classified as a CS-question so I give it a shot here as well. </p> <p>The system works like this: a node stores a file (encoed using erasure codes) at r*b remotes nodes, where r is the replication-factor and b is an integer constant. Erasure-coded files have the property that the file can be restored iff at least b of the remote nodes are available and return its part of the file.</p> <p>The simplest approach to this is to assume that all remote nodes are independent of each other and have the same availability p. With these assumptions the availability of a file follows the Binomial distribution, i.e. <a href="http://bit.ly/dyJwwE" rel="noreferrer">Binomial distribution http://bit.ly/dyJwwE</a></p> <p>Unfortunately these two assumptions can introduce a non-neligible error, as shown by this paper: <a href="http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf" rel="noreferrer">http://deim.urv.cat/~lluis.pamies/uploads/Main/icpp09-paper.pdf</a>.</p> <p>One way to overcome the assumption that all nodes have the same availability is to calculate the probability of each possible combination of availaible/non-available node and take the sum of all these outcomes (which is sort of what they suggest in the paper above, just more formally than what I just described). You can see this approach as a binary tree with depth r*b and each leave is one possible combination of available/not-available nodes. The file's availability is the same thing as the probablity that you arrive at a leave with >=b available nodes. This approach is more correct but has a computational cost of <a href="http://bit.ly/cEZcAP" rel="noreferrer">Ordo http://bit.ly/cEZcAP</a>. Also, it doesn't deal with the assumption of node independence.</p> <p>Do you guys have any ideas of a good approximation which introduces less error than the binomial distribution-aproximation but with better computational cost than <a href="http://bit.ly/cEZcAP" rel="noreferrer">http://bit.ly/d52MM9 http://bit.ly/cEZcAP</a>? </p> <p>You can assume that the availability-data of each node is a set of tuples consisting of <code>(measurement-date, node measuring, node being measured, succes/failure-bit)</code>. With this data you could for example calculate the correlation of the availability between the nodes and the availability variance.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. COWhat do you mean by "node independence"? Are you talking about a graph that represents the node network and failures of certain key nodes may partition the graph into topologically distinct subnetworks that cannot communicate among each other? Or do you assume the possibility that individual node failures may cause other nodes to fail, too (e.g., because they might be virtual machines located on the same physical machine)? Without clarifying the nature of the correlation it is impossible to suggest any models.
      singulars
    2. COAs a follow-up to the previous question it is important to specify if the reconstruction of the file is possible (read: meaningful) if a copy is available on a subnetwork of nodes able to communicate among each other. Or if you require access from a "root-node" to a subnetwork of at least b nodes in order to restore the file in question.
      singulars
    3. COThe last paragraph introduces the `measurement-date` as an additional property. This introduces a timescale into the system, where previously the system was assumed to be static. Previously, a node was either alive (probability `p`) or dead (probability `1-p`). With a time-scale the system may no longer be static and a certain mean-time-between-failures (for switching `alive` nodes to `dead`) and mean-time-between-repairs (the reverse) may become meaningful. If you have this situation, the probability for restoring a file becomes time-dependent.
      singulars
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload