
## Premises

- The mean of a set of data has the mathematical property that the sum of the deviations from the mean is 0 (a short numeric check follows this list).
  - This explains why both good and bad datasets always give almost 0.
  - When the result does differ from zero, it is essentially an accumulation of rounding errors in the diffs, which is why it unfortunately cannot carry useful information.
- The thing that most clearly defines what you're looking for is your image: you're looking for an **AREA**, and this is why the approaches you tried don't find it:
  - Looking at a metric in single points is too local to extract that information.
  - Looking at global accumulations or parameters (global standard deviation) is too global: the signal gets lost among too much information and too many sources of variation.
  - Kurtosis (you already said so, I know, but for completeness) is outside its field of application, since this is not a probability distribution.
  - In the end, the most suitable of the approaches you already tried is the "Homemade dip detector", because it thinks in a way that is local, but not too local.
- Last but not least:
  - Any algorithm you choose stands on some tacit assumptions.
    - On one side you can look for a super clever algorithm that, with no parametrization and tuning, automatically adapts to the problem and defines its thresholds and everything else by itself.
    - On the other side there is an algorithm that stands on the writer's knowledge of the typical data behavior (good and bad) and that is itself "stupid", in the sense that if a different, unexpected behavior shows up, the results are unpredictable.
    - The right way is one of these two, or somewhere in between, depending on the application. So if it works, the "Homemade dip detector" can also be a solution. There is no reason to call it crude; it may be insufficient for the application's needs, but that is a different matter.
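A quick numeric check of the first premise, with made-up illustration data: a clean straight line and the same line with a dip both give a deviations-from-the-mean sum of essentially zero, so that score cannot separate them.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
good = 2.0 * x + 1.0                                  # clean straight line
bad = good - 0.5 * np.exp(-((x - 0.5) / 0.05) ** 2)   # same line with a dip

for name, y in (("good", good), ("bad", bad)):
    # sum of deviations from the mean: zero by construction, for any shape
    print(name, np.sum(y - y.mean()))                 # both print ~0
```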
## How to find the area

- Once you have the data, the first thing is to clearly define the "theoretical straight line". I give some options:
  - Use the RANSAC algorithm (**formally** the best option IMHO):
    - It gives you the best fit to the aligned points, disregarding the non-aligned ones.
    - It is quite difficult and maybe oversized for this job (IMHO), although a library-based sketch at the end of this answer shows it can be little code.
  - Consider the line defined by the first and last points:
    - You said the dip is almost always in the same position and never near the boundaries, so the first and last points can be considered reliable.
    - Very easy to implement.
    - This is an example of using knowledge about the expected behavior, as I said before, so you need to decide if and how much confidence you give to this assumption.
  - Consider a linear fit to the first 10 points and the last 10 points:
    - This is just a more robust version of the previous option: by using more points you worry less that a measurement problem on just the first or the last point makes everything fail.
    - Also quite easy to implement.
    - If I were you I would use this, or something inspired by it (see the sketch right after this list).
- Calculate the Y value given by the straight line for each X.
- Calculate the area between the two curves (or, which is mathematically the same, the areas under the function `Y_dev = Y_data - Y_straight`) with this procedure, also implemented in the sketch below:
  - `PositiveMax = 0; NegativeMax = 0;`
  - Start from the first point (its value can be positive or negative) and put it in a temporary area accumulator `tmp_Area`.
  - For each next point:
    - If the sign is the same, accumulate the value.
    - If it is different:
      - Stop accumulating.
      - Check whether the accumulated value is greater than `PositiveMax` or below `NegativeMax`, and if so store it as the new `PositiveMax` or `NegativeMax`.
      - In any case reset the accumulator to the current value with `tmp_Area = Y_dev;`, starting a new accumulation.
- At the end you have the maximum contiguous area above the line and the maximum contiguous area below it, which I think are the scores you're looking for.
- If you want, you can track only `NegativeMax`, based on the observed and expected data behavior.
- You may find it useful to add a threshold, so that values of `Y_dev` smaller in magnitude than the threshold are not accumulated.
  - This prevents large accumulations built from many points close to the straight line, which could otherwise look like the accumulation of a few points far from the line.
  - Whether this is needed, and the proper threshold value, must be evaluated on some sample data.
- You then need an appropriate threshold on the contiguous area itself, and you can only get it from observation of sample data.
  - Again: it can be you observing the samples and deciding the threshold, or you can build a repository of good and bad samples and write a program that automatically learns which threshold to use. But that is not the algorithm; it is how you find its operating parameters, and there is nothing wrong with doing it by human brain. It only depends on whether you're looking for a method to separate bad from good, or for a self-adapting algorithm that does it on its own. You decide the target.
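Here is a minimal sketch of the whole procedure in Python/NumPy. The function names, the `n=10` end fit, and the optional `threshold` parameter are my own choices for illustration; adapt them to your data.

```python
import numpy as np

def baseline_from_ends(x, y, n=10):
    """Fit the 'theoretical straight line' to the first and last n points.
    This leans on the stated assumption that the dip never sits near the ends."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.r_[0:n, len(x) - n:len(x)]        # indices of both end segments
    coef = np.polyfit(x[idx], y[idx], 1)       # degree-1 least-squares fit
    return np.polyval(coef, x)                 # straight-line Y for every X

def max_contiguous_areas(y_data, y_straight, threshold=0.0):
    """Scan Y_dev = Y_data - Y_straight and return (PositiveMax, NegativeMax),
    the largest contiguous accumulations above and below the line."""
    y_dev = np.asarray(y_data, dtype=float) - np.asarray(y_straight, dtype=float)
    positive_max = negative_max = tmp_area = 0.0
    prev_sign = 0
    for dev in y_dev:
        if abs(dev) < threshold:
            continue                           # too close to the line: skip
        sign = 1 if dev > 0 else -1
        if sign == prev_sign or prev_sign == 0:
            tmp_area += dev                    # same sign: keep accumulating
        else:
            positive_max = max(positive_max, tmp_area)  # close the current run
            negative_max = min(negative_max, tmp_area)
            tmp_area = dev                     # restart from the current value
        prev_sign = sign
    positive_max = max(positive_max, tmp_area)  # flush the last run
    negative_max = min(negative_max, tmp_area)
    return positive_max, negative_max
```

Typical use, with `AREA_LIMIT` being the contiguous-area threshold you tune on sample data:

```python
y_straight = baseline_from_ends(x, y, n=10)
pos_max, neg_max = max_contiguous_areas(y, y_straight)
is_bad = neg_max < -AREA_LIMIT   # only NegativeMax matters if dips go downward
```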
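If you do want to try the RANSAC option, scikit-learn ships an implementation, so it costs less effort than writing it yourself. A sketch under that assumption (the wrapper function is mine); the returned baseline plugs directly into `max_contiguous_areas` above:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

def baseline_ransac(x, y):
    """Fit the straight line to the aligned points only: RANSAC marks the
    dip samples as outliers and leaves them out of the fit."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)   # sklearn expects 2-D inputs
    model = RANSACRegressor()                       # linear estimator by default
    model.fit(X, np.asarray(y, dtype=float))
    return model.predict(X)                         # straight-line Y for every X
```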
 
