Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>It sounds like you want a binary delta or perhaps an index derived from the application of a binary delta (like it's size). You could then compare this index to some baseline that you determine experimentally to decide if it's a "clone" or not.</p> <p>There are a lot of similarities between compression and delta creation, so I'd say you aren't far off with your current implementation.</p> <p>That being said, pairwise comparison of every binary file in your database is probably prohibitively expensive (O(n<sup>2</sup>), I think). I would try to find a simple hash for identifying possible candidates for comparison. Something conceptually similar to what spdenne and Eduard are suggesting. That is, find a hash which can be applied to every item once, sort that list and then use a finer grained comparison on items whose hashes are close together in the list.</p> <p>Constructing hashes useful for the general case has been an actively pursued research topic in CS for several years. The <a href="http://lshkit.sourceforge.net/" rel="noreferrer">LSHKit</a> software library implements some algorithms of this sort. The internet accessible paper <a href="http://www.usenix.org/publications/library/proceedings/sf94/full_papers/manber.finding" rel="noreferrer">FINDING SIMILAR FILES IN A LARGE FILE SYSTEM</a> seems like it might be targeted more at comparing text files but might be useful to you. The more recent paper <a href="http://www.sciencedirect.com/science?_ob=ArticleURL&amp;_udi=B7CW4-4P06CJD-7&amp;_user=1617289&amp;_rdoc=1&amp;_fmt=&amp;_orig=search&amp;_sort=d&amp;view=c&amp;_acct=C000053976&amp;_version=1&amp;_urlVersion=0&amp;_userid=1617289&amp;md5=335d7e5e9bf27f9f42bf6a3cf493e44f" rel="noreferrer">Multi-resolution similarity hashing</a> describes a more powerful algorithm. It doesn't appear to be accessible without a subscription, though. You probably want to keep the wikipedia article on <a href="http://en.wikipedia.org/wiki/Locality_sensitive_hashing" rel="noreferrer">Locality Sensitive Hashing</a> handy as you browse the other resources. They all get pretty technical and the wikipedia entry itself is pretty math heavy. As a more user-friendly alternative you might be able to apply some ideas (or even executables) from the field of <a href="http://en.wikipedia.org/wiki/Acoustic_fingerprint" rel="noreferrer">Acoustic Fingerprinting</a>.</p> <p>If you're willing to abandon the general case it's likely that you can find a much simpler (and faster) domain-specific hash function that works just for your ROMs. Possibly something involving the placement of standard, or common, byte sequences and the value of select bits near them. I don't really know much about your binary format but I'm imagining things that signal the start of sections in the file like regions for sound, images or text. Binary formats frequently store the addresses of these sorts of sections near the beginning of the file. Some also use a chaining mechanism that stores the address of the first section at a known location along with it's size. This allows you to move to the next section which also contains a size, etc. A little investigation will probably allow you to discover any relevant formatting, if you're not already aware of it, and should put you well on your way to constructing a useful hash.</p> <p>If the hash functions don't get you all the way (or they require input of some sort to define a metric/distance) then there are several binary delta algorithms and implementations available on the web. The one I'm most familiar with is used by the subversion version control system. It uses a binary delta algorithm called xdelta to efficiently store binary file revisions. Here's a link directly to the file in their repository that implements it: <a href="http://svn.collab.net/repos/svn/trunk/subversion/libsvn_delta/xdelta.c" rel="noreferrer">xdelta.c</a>. There's probably a tool on the web that makes this more accessible as well.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload