Determining "Owner" of Text Edited by Multiple Users
<p>You may have noticed that we now show an edit summary on Community Wiki posts:</p>
<blockquote> <p>community wiki<br> 220 revisions, 48 users</p> </blockquote>
<p>I'd like to also show the user who "most owns" the final content displayed on the page, as a percentage of the remaining text:</p>
<blockquote> <p>community wiki<br> 220 revisions, 48 users<br> <strong>kronoz</strong> 87%</p> </blockquote>
<p>Yes, there could be top (n) "owners", but for now I want the top 1.</p>
<p>Assume you have this data structure, a list of user/text pairs ordered chronologically by the time of the post:</p>
<pre>
User Id     Post-Text
-------     ---------
12          The quick brown fox jumps over the lazy dog.
27          The quick brown fox jumps, sometimes.
30          I always see the speedy brown fox jumping over the lazy dog.
</pre>
<p><strong>Which of these users most "owns" the final text?</strong></p>
<p>I'm looking for a reasonable algorithm -- it can be an approximation, it doesn't have to be perfect -- to determine the owner. Ideally expressed as a percentage score.</p>
<p>Note that we need to factor in edits, deletions, and insertions, so the final result feels reasonable and right. You can use any Stack Overflow post with a decent revision history (not just retagging, but frequent post body changes) as a test corpus. Here's a good one, with 15 revisions from 14 different authors. Who is the "owner"?</p>
<p><a href="https://stackoverflow.com/revisions/327973/list">https://stackoverflow.com/revisions/327973/list</a></p>
<p>Click "view source" to get the raw text of each revision.</p>
<p>I should warn you that a pure algorithmic solution might end up being a form of the <a href="http://en.wikipedia.org/wiki/Longest_common_substring_problem" rel="nofollow noreferrer">Longest Common Substring Problem</a>. 
But as I mentioned, approximations and estimates are fine too if they work well.</p>
<p><strong>Solutions in any language are welcome</strong>, but I prefer solutions that are</p>
<ol> <li>Fairly easy to translate into C#.</li> <li>Free of dependencies.</li> <li>Simple first, efficient second.</li> </ol>
<p>It is extraordinarily rare for a post on SO to have more than 25 revisions. But it should "feel" accurate, so if you eyeballed the edits you'd agree with the final decision. I encourage you to <strong>test your algorithm out on Stack Overflow posts with revision histories</strong> and see if you agree with the final output.</p>
<hr>
<p>I have now deployed the following approximation, which you can see in action for every <em>new</em> saved revision on Community Wiki posts:</p>
<ul> <li>do a <a href="http://www.mathertel.de/Diff/" rel="nofollow noreferrer">line-based diff</a> of every revision where the body text changes</li> <li>sum the insertion and deletion lines for each revision as "editcount"</li> <li>each userid gets the sum of the "editcount" they contributed</li> <li>first revision author gets 2 &times; "editcount" as initial score, as a primary authorship bonus</li> <li>to determine final ownership percentage: each user's edited line count total divided by total number of edited lines in all revisions</li> </ul>
<p>(There are also some guard clauses for common simple conditions like 1 revision, only 1 author, et cetera. The line-based diff makes it fairly speedy to recalculate for all revisions; in a typical case of, say, 10 revisions it's ~50ms.)</p>
<p>This works fairly well in my testing. It does break down a little when you have small 1- or 2-line posts that several people edit, but I think that's unavoidable. I'm accepting Joel Neely's answer as closest in spirit to what I went with, and I upvoted everything else that seemed workable.</p>
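<p>For readers who want to experiment, here is a minimal sketch of the approximation listed above. It is written in Python rather than C#, and it uses the standard library's <code>difflib</code> in place of the linked diff implementation, so it is not the deployed code. The names <code>edit_count</code> and <code>ownership</code> are hypothetical. Two readings of the list are assumptions: the first revision's "editcount" is taken to be its line count (a diff against empty text) and then doubled, and the denominator includes that bonus so the percentages sum to 100.</p>

```python
import difflib
from collections import defaultdict


def edit_count(old_text, new_text):
    """Count inserted plus deleted lines between two revisions."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines removed from the old text plus lines added in the new text.
            count += (i2 - i1) + (j2 - j1)
    return count


def ownership(revisions):
    """Map each user id to an ownership percentage.

    revisions: list of (user_id, text) pairs, oldest first.
    """
    scores = defaultdict(int)
    previous = ""
    is_first = True
    for user_id, text in revisions:
        if text == previous:
            continue  # body unchanged (for example a retag): nothing to score
        count = edit_count(previous, text)
        if is_first:
            count *= 2  # assumed reading of the primary authorship bonus
            is_first = False
        scores[user_id] += count
        previous = text
    total = sum(scores.values())
    if total == 0:
        return {}
    return {user: 100.0 * score / total for user, score in scores.items()}


if __name__ == "__main__":
    sample = [
        (12, "The quick brown fox jumps over the lazy dog."),
        (27, "The quick brown fox jumps, sometimes."),
        (30, "I always see the speedy brown fox jumping over the lazy dog."),
    ]
    for user, percent in ownership(sample).items():
        print(user, round(percent, 1))
```

<p>Under these assumptions the three one-line revisions from the question each score 2, so every user ends up with a third of the ownership. That is the small-post weakness mentioned above: with one-line bodies a line-based diff cannot tell a light edit from a full rewrite.</p>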