Memory-efficient data structure for matrix in PHP

I run out of memory using an 80k × 20 matrix (array) of integer values in PHP. Is there a solution?

**Background**

I have a PHP application that collects data and stores it in a database. The data is collected in different domains (>20k). The number of variables varies across domains (in principle, it is unlimited), so I have to store comma-separated lists in my MySQL database (before version 5). This performs quite well.

At some point, the user needs to download the data. The download feature must perform some normalization, so it needs the median (not the average!) of each variable (actually the medians for a subset of the variables). Usually I can simply read the data from the database, explode() the comma-separated values, and store the median-relevant data in an array[var][row]. Then I can sort() the arrays and take the median.

However, there is one domain that does not have 100 or 1,000 data records (rows), but 80k. Given 20 median-relevant variables, this is 1.6M integer values, or about 6.4 MB of raw 32-bit integer data (probably twice as much, because I am working on a 64-bit Linux machine). So far, so good, but the array structure has considerable overhead, so it grows to far more than 128 MB. This is the point where my PHP runs out of memory.

**What I do not want to do**

Of course, I could just increase the memory limit per PHP script. For various reasons, I would like to avoid this.

There are also algorithms that do not need to store all n values to compute the median, but get by with n/2 (+x). However, reducing the memory load to 50% + x may not be enough to solve the problem.

I could also calculate the medians variable by variable. But that would require me to load the 80k rows of data from the database 20 times and to perform the explode() again and again. This would dramatically increase the script's run time.

[EDIT] The database is currently not normalized (each data row holds CSV data). This is intended and necessary for performance reasons. Therefore I do not want to normalize the database, as this would result in a table with 100M entries and a giant index.

**What I would like to do**

We are talking about no more than roughly 6.4 MB of raw 32-bit integer values. Is there any chance to reduce the overhead to a few percent, perhaps even on a 64-bit machine?

I know about the SPL extension that has been available since PHP 5.0.0, but I have not yet found out how to save memory with it. Could anyone please give me a hint, via SPL or another solution (ideally one available in PHP by default)?

**Sample Code**

```php
private function retrieveReferences() {
    $query = $this->getResultsQuery(true);

    $times = array();
    $tp    = -1; // Length of $times - 1

    while ($row = $query->fetchArray()) {
        $timeSrc = explode(',', $row['times']);

        // Store the times per page
        foreach ($timeSrc as $p => $s) {
            // Should be faster than checking isset($times[$p]) every time
            while ($p > $tp) {
                $times[] = array();
                $tp = count($times) - 1;
            }
            $times[$p][] = (int)$s;
        }
    }

    // Compute median for each $times[$p]
    // <snip>
}
```
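A minimal sketch of one possible direction (my assumption, not part of the original question): instead of nested PHP arrays, accumulate each column as a packed binary string via pack(), so the bulk data costs roughly 4 bytes per value, and expand only one column at a time with unpack() when its median is needed. The helper names addValue() and packedMedian() below are illustrative, not from the question.

```php
<?php
// Sketch only (my assumption): each of the ~20 columns is kept as a binary
// string of 32-bit signed ints ('l' = signed long, always 32 bit in pack()).
// The bulk of the 1.6M values then costs ~4 bytes each; only one column is
// expanded into a real PHP array at a time, when its median is computed.

// Accumulation phase: replaces $times[$p][] = (int)$s in the fetch loop.
function addValue(array &$packed, $p, $s) {
    if (!isset($packed[$p])) {
        $packed[$p] = '';
    }
    $packed[$p] .= pack('l', (int)$s); // append 4 bytes
}

// Median phase: expand one column, sort it, pick the middle element(s).
function packedMedian($str) {
    $values = unpack('l*', $str); // 1-based array of ints
    sort($values);                // re-indexes from 0
    $n   = count($values);
    $mid = (int)($n / 2);
    return ($n % 2) ? $values[$mid]
                    : ($values[$mid - 1] + $values[$mid]) / 2;
}

// Usage:
// $packed = array();
// addValue($packed, $p, $s);            // inside the fetch loop
// $median = packedMedian($packed[$p]);  // per column, one at a time
```

For the SPL route mentioned above: SplFixedArray (available since PHP 5.3.0) does cut per-element overhead compared to a plain array, but each element is still a full zval, so the saving is nowhere near the "few percent overhead" asked for; a packed string stays much closer to the raw 4 bytes per value.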