Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>There isn't one, but they can <a href="http://technosophos.com/2009/05/29/set-objects-php-arrays-vs-splobjectstorage.html" rel="nofollow">be emulated</a>.</p> <p>Here is a achieve copy before the link died.. all the contents</p> <p>A Set of Objects in <code>PHP: Arrays</code> vs. <code>SplObjectStorage</code></p> <p>One of my projects, QueryPath, performs many tasks that require maintaining a set of unique objects. In my quest to optimize QueryPath, I have been looking into various ways of efficiently storing sets of objects in a way that provides expedient containment checks. In other words, I want a data structure that keeps a list of unique objects, and can quickly tell me if some object is present in that list. The ability to loop through the contents of the list is also necessary.</p> <p>Recently I narrowed the list of candidates down to two methods:</p> <p>Use good old fashioned arrays to emulate a hash set. Use the <code>SPLObjectStorage</code> system present in PHP 5.2 and up. Before implementing anything directly in <code>QueryPath</code>, I first set out designing the two methods, and then ran some micro-benchmarks (with Crell's help) on the pair of methods. To say that the results were surprising is an understatement. The benchmarks will likely change the way I structure future code, both inside and outside of <code>Drupal</code>.</p> <p><strong>The Designs</strong></p> <p>Before presenting the benchmarks, I want to quickly explain the two designs that I settled on.</p> <p><strong>Arrays emulating a hash set</strong></p> <p>The first method I have been considering is using PHP's standard array() to emulate a set backed by a hash mapping (a "hash set"). A set is a data structure designed to keep a list of unique elements. In my case, I am interested in storing a unique set of DOM objects. A hash set is a set that is implemented using a hash table, where the key is a unique identifier for the stored value. While one would normally write a class to encapsulate this functionality, I decided to test the implementation as a bare array with no layers of indirection on top. In other words, what I am about to present are the internals of what would be a hash set implementation.</p> <p>The Goal: Store a (unique) set of objects in a way that makes them (a) easy to iterate, and (b) cheap to check membership. The Strategy: Create an associative array where the key is a hash ID and the value is the object.</p> <p>With a reasonably good hashing function, the strategy outlined above should work as desired.</p> <p>"Reasonably good hashing function" -- that was the first gotcha. How do you generate a good hashing function for an object like <code>DOMDocument</code>? One (bad) way would be to serialize the object and then, perhaps, take its MD5 hash. That, however, will not work on many objects -- specifically any object that cannot serialze. The <code>DOMDocument</code>, for example, is actually backed by a resource and cannot be serialized.</p> <p>One needed look far for a such a function, though. It turns out that there is an object hashing function in PHP 5. It's called <code>spl_object_hash()</code>, and it can take any object (even one that is not native PHP) and generate a hashcode for it.</p> <p>Using <code>spl_object_hash()</code> we can build a simple data structure that functions like a hash set. This structure looks something like this:</p> <pre><code>array( $hashcode =&gt; $object ); </code></pre> <p>For example, we an generate an entry like so:</p> <pre><code>$object = new stdClass(); $hashcode = spl_object_hash($object); $arr = array( $hashcode =&gt; $object ); </code></pre> <p>In the example above, then, the hashcode string is an array key, and the object itself is the array value. Note that since the hashcode will be the same each time an object is re-hashed, it serves not only as a comparison point ("if object a's hashkey == object b's hashkey, then a == b"), it also functions as a uniqueness constraint. Only one object with the specified hashcode can exist per array, so there is no possibility of two copies (actually, two references) to the same object being placed in the array.</p> <p>With a data structure like this, we have a host of readily available functions for manipulating the structure, since we have at our disposal all of the PHP array functions. So to some degree this is an attractive choice out of the box.</p> <p>The most import task, in our context at least, is that of determining whether an entry exists inside of the set. There are two possible candidates for this check, and both require supplying the hashcode:</p> <p>Check whether the key exists using array_key_exists(). Check whether the key is set using isset(). To cut to the chase, isset() is faster than array_key_exists(), and offers the same features in our context, so we will use that. (The fact that they handle null values differently makes no difference to us. No null values will ever be inserted into the set.)</p> <p>With this in mind, then, we would perform a containment check using something like this:</p> <pre><code>$object = new stdClass(); $hashkey = spl_object_hash($object); // ... // Check whether $arr has the $object. if (isset($arr[$hashkey])) { // Do something... } </code></pre> <p>Again, using an array that emulates a hash set allows us to use all of the existing array functions and also provides easy iterability. We can easily drop this into a foreach loop and iterate the contents. Before looking at how this performs, though, let's look at the other possible solution.</p> <p><strong>Using SplObjectStorage</strong></p> <p>The second method under consideration makes use of the new <code>SplObjectStorage</code> class from PHP 5.2+ (it might be in 5.1). This class, which is backed by a C implementation, provides a set-like storage mechanism for classes. It enforces uniqueness; only one of each object can be stored. It is also traversable, as it implements the Iterable interface. That means you can use it in loops such as foreach. (On the down side, the version in PHP 5.2 does not provide any method of random access or index-based access. The version in PHP 5.3 rectifies this shortcoming.)</p> <p>The Goal: Store a (unique) set of objects in a way that makes them (a) easy to iterate, and (b) cheap to check membership. The Strategy: Instantiate an object of class <code>SplObjectStorage</code> and store references to objects inside of this.</p> <p>Creating a new SplObjectStorage is simple:</p> <pre><code>$objectStore = new SplObjectStorage(); </code></pre> <p>An <code>SplObjectStorage</code> instance not only retains uniqueness information about objects, but objects are also stored in predictable order. <code>SplObjectStorage</code> is a FIFO -- First In, First Out.</p> <p>Adding objects is done with the <code>attach()</code> method:</p> <pre><code>$objectStore = new SplObjectStorage(); $object = new stdClass(); $objectStore-&gt;attach($object); </code></pre> <p>It should be noted that attach will only attach an object once. If the same object is passed to <code>attach()</code> twice, the second attempt will simply be ignored. For this reason it is unnecessary to perform a <code>contains()</code> call before an <code>attach()</code> call. Doing so is redundant and costly.</p> <p>Checking for the existence of an object within an SplObjectStorage instance is also straightforward:</p> <pre><code>$objectStore = new SplObjectStorage(); $object = new stdClass(); $objectStore-&gt;attach($object); // ... if ($objectStore-&gt;contains($object)) { // Do something... } </code></pre> <p>While <code>SplObjectStorage</code> has nowhere near the number of supporting methods that one has access to with arrays, it allows for iteration and somewhat limited access to the objects stored within. In many use cases (including the one I am investigating here), <code>SplObjectStorage</code> provides the requisite functionality.</p> <p>Now that we have taken a look at the two candidate data structures, let's see how they perform.</p> <p><strong>The Comparisons</strong></p> <p>Anyone who has seen Larry (Crell) Garfield's micro-benchmarks for arrays and <code>SPL ArrayAccess</code> objects will likely come into this set of benchmarks with the same set of expectations Larry and I had. We expected PHP's arrays to blow the SplObjectStorage out of the water. After all, arrays are a primitive type in PHP, and have enjoyed years of optimizations. However, the documentation for the <code>SplObjectStorage</code> indicates that the search time for an <code>SplObjectStorage</code> object is O(1), which would certainly make it competitive if the base speed is similar to that of an array.</p> <p><strong>My testing environments are:</strong></p> <p>An iMac (current generation) with a 3.06 Ghz Intel Core 2 Duo and 2G of 800mhz DDR2 RAM. MAMP 1.72 (PHP 5.2.5) provides the AMP stack. A MacBook Pro with a 2.4 Ghz Intel Core 2 Duo and 4G of 667mhz DDR2 RAM. MAMP 1.72 (PHP 5.2.5) provides the AMP stack. In both cases, the performance tests averaged about the same. Benchmarks in this article come from the second system.</p> <p>Our basic testing strategy was to build a simple test that captured information about three things:</p> <p>The amount of time it takes to load the data structure The amount of time it takes to seek the data structure The amount of memory the data structure uses We did our best to minimize the influence of other factors on the test. Here is our testing script:</p> <pre><code> &lt;?php /** * Object hashing tests. */ $sos = new SplObjectStorage(); $docs = array(); $iterations = 100000; for ($i = 0; $i &lt; $iterations; ++$i) { $doc = new DOMDocument(); //$doc = new stdClass(); $docs[] = $doc; } $start = $finis = 0; $mem_empty = memory_get_usage(); // Load the SplObjectStorage $start = microtime(TRUE); foreach ($docs as $d) { $sos-&gt;attach($d); } $finis = microtime(TRUE); $time_to_fill = $finis - $start; // Check membership on the object storage $start = microtime(FALSE); foreach ($docs as $d) { $sos-&gt;contains($d); } $finis = microtime(FALSE); $time_to_check = $finis - $start; $mem_spl = memory_get_usage(); $mem_used = $mem_spl - $mem_empty; printf("SplObjectStorage:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used); unset($sos); $mem_empty = memory_get_usage(); // Test arrays: $start = microtime(TRUE); $arr = array(); // Load the array foreach ($docs as $d) { $arr[spl_object_hash($d)] = $d; } $finis = microtime(TRUE); $time_to_fill = $finis - $start; // Check membership on the array $start = microtime(FALSE); foreach ($docs as $d) { //$arr[spl_object_hash($d)]; isset($arr[spl_object_hash($d)]); } $finis = microtime(FALSE); $time_to_check = $finis - $start; $mem_arr = memory_get_usage(); $mem_used = $mem_arr - $mem_empty; printf("Arrays:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used); ?&gt; </code></pre> <p>The test above is broken into four separate tests. The first two test how well the <code>SplObjectStorage</code> method handles loading and containment checking. The second two perform the same test on our improvised array data structure.</p> <p>There are two things worth noting about the test above.</p> <p>First, the object of choice for our test was a <code>DOMDocument</code>. There are a few reasons for this. The obvious reason is that this test was done with the intent of optimizing <code>QueryPath</code>, which works with elements from the DOM implementation. There are two other interesting reasons, though. One is that DOMDocuments are not lightweight. The other is that <code>DOMDocuments</code> are backed by a C implementation, making them one of the more difficult cases when storing objects. (They cannot, for example, be conveniently serialized.)</p> <p>That said, after observing the outcome, we repeated the test with basic <code>stdClass</code> objects and found the performance results to be nearly identical, and the memory usage to be proportional.</p> <p>The second thing worth mention is that we used 100,000 iterations to test. This was about the upper bound that my PHP configuration allowed before running out of memory. Other than that, though, the number was chosen arbitrarily. When I ran tests with lower iteration counts, the <code>SplObjectStorage</code> definitely scaled linearly. Array performance was less predictable (larger standard deviation) with smaller data sets, though it seemed to average around the same for lower sizes as it does (more predictably) for larger sized arrays.</p> <p><strong>The Results</strong></p> <p>So how did these two strategies fare in our micro-benchmarks? Here is a representative sample of the output generated when running the above:</p> <pre><code>SplObjectStorage: Time to fill: 0.085041999817. Time to check: 0.073099000000. Memory: 6124624 Arrays: Time to fill: 0.193022966385. Time to check: 0.153498000000. Memory: 8524352 </code></pre> <p>Averaging this over multiple runs, <code>SplObjectStorage</code> executed both fill and check functions twice as fast as the array method presented above. We tried various permutations of the tests above. Here, for example, are results for the same test with a <code>stdClass object</code>:</p> <pre><code>SplObjectStorage: Time to fill: 0.082209110260. Time to check: 0.070617000000. Memory: 6124624 Arrays: Time to fill: 0.189271926880. Time to check: 0.152644000000. Memory: 8524360 </code></pre> <p>Not much different. Even adding arbitrary data to the object we stored does not make a difference in the time it takes for the <code>SplObjectStorage</code> (though it does seem to raise the time ever so slightly for the array).</p> <p>Our conclusion is that <code>SplObjectStorage</code> is indeed a better solution for storing lots of objects in a set. Over the last week, I've ported <code>QueryPath</code> to <code>SplObjectStorage</code> (see the Quark branch at GitHub -- the existing <code>Drupal</code> <code>QueryPath</code> module can use this experimental branch without alteration), and will likely continue benchmarking. But preliminary results seem to provide a clear indication as to the best approach.</p> <p>As a result of these findings, I'm much less inclined to default to arrays as "the best choice" simply because they are basic data types. If the <code>SPL</code> library contains features that out-perform arrays, they should be used when appropriate. From QueryPath to my Drupal modules, I expect that my code will be impacted by these findings.</p> <p>Thanks to Crell for his help, and for Eddie at Frameweld for sparking my examination of these two methods in the first place.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload