Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>I have a hard time believing that any script without any prior knowledge of the data (unlike MySql which has such info pre-loaded), would be faster than a SQL approach.</p> <p>Aside from the time spent parsing the input, the script needs to "keep" sorting the order by array etc...</p> <p>The following is a first guess at what should work decently fast in SQL, assuming a index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index</p> <p>(*) This index could be clustered or not, not affecting the following query. Choice of clustering or not, and of needing an "aa,bb,cc" separate index depends on use case, on the size of the rows in table etc. etc.</p> <pre><code>SELECT T1.aa, T1.bb, T1.cc , COUNT(*) FROM tblAbc T1 LEFT OUTER JOIN tblAbc T2 ON T1.aa = T2.aa AND (T1.bb &lt; T2.bb OR(T1.bb = T2.bb AND T1.cc &lt; T2.cc)) GROUP BY T1.aa, T1.bb, T1.cc HAVING COUNT(*) &lt; 5 -- trick, remember COUNT(*) goes 1,1,2,3,... ORDER BY T1.aa, T1.bb, T1.cc, COUNT(*) DESC </code></pre> <p>The idea is to get a count of how many records, within a given aa value are smaller than self. There is a small trick however: we need to use LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of left joining it, the COUNT(*) value counts 1, 1, 2, 3, 4 etc. and the HAVING test therefore is "&lt;5" to effectively pick the top 5.</p> <p>To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top 5 type of listing. Also, the COUNT() in the select list can be removed if so desired, this doesn't impact the logic of the query and the ability to properly sort.</p> <p>Also note that this query is deterministic in terms of the dealing with ties, i,e, when a given set of records have a same value for bb (within an aa group); I think the Python program may provide slightly different outputs when the order of the input data is changed, that is because of its occasional truncating of the sorting dictionary.</p> <p><strong>Real solution: A SQL-based procedural approach</strong></p> <p>The self-join approach described above demonstrates how declarative statements can be used to express the OP's requirement. However this approach is naive in a sense that its performance is roughly bound to the sum of the squares of record counts within each aa 'category'. (not O(n^2) but roughly O((n/a)^2) where a is the number of different values for the aa column) In other words it performs well with data such that on average the number of records associated with a given aa value doesn't exceed a few dozens. If the data is such that the aa column is not selective, the following approach is much -much!- better suited. It leverages SQL's efficient sorting framework, while implementing a simple algorithm that would be hard to express in declarative fashion. This approach could further be improved for datasets with particularly huge number of records each/most aa 'categories' by introducing a simple binary search of the next aa value, by looking ahead (and sometimes back...) in the cursor. For cases where the number of aa 'categories' relative to the overall row count in tblAbc is low, see yet another approach, after this next one.</p> <pre><code>DECLARE @aa AS VARCHAR(10), @bb AS INT, @cc AS VARCHAR(10) DECLARE @curAa AS VARCHAR(10) DECLARE @Ctr AS INT DROP TABLE tblResults; CREATE TABLE tblResults ( aa VARCHAR(10), bb INT, cc VARCHAR(10) ); DECLARE abcCursor CURSOR FOR SELECT aa, bb, cc FROM tblABC ORDER BY aa, bb DESC, cc FOR READ ONLY; OPEN abcCursor; SET @curAa = '' FETCH NEXT FROM abcCursor INTO @aa, @bb, @cc; WHILE @@FETCH_STATUS = 0 BEGIN IF @curAa &lt;&gt; @aa BEGIN SET @Ctr = 0 SET @curAa = @aa END IF @Ctr &lt; 5 BEGIN SET @Ctr = @Ctr + 1; INSERT tblResults VALUES(@aa, @bb, @cc); END FETCH NEXT FROM AbcCursor INTO @aa, @bb, @cc; END; CLOSE abcCursor; DEALLOCATE abcCursor; SELECT * from tblResults ORDER BY aa, bb, cc -- OR .. bb DESC ... for a more traditional order. </code></pre> <p><strong>Alternative to the above</strong> for cases when aa is very unselective. In other words, when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and to run a "LIMIT" (MySql) "TOP" (MSSQL) query for each of these values. For reference purposes, the following ran in 63 seconds for tblAbc of 61 Million records divided in 45 aa values, on MSSQL 8.0, on a relatively old/weak host.</p> <pre><code>DECLARE @aa AS VARCHAR(10) DECLARE @aaCount INT DROP TABLE tblResults; CREATE TABLE tblResults ( aa VARCHAR(10), bb INT, cc VARCHAR(10) ); DECLARE aaCountCursor CURSOR FOR SELECT aa, COUNT(*) FROM tblABC GROUP BY aa ORDER BY aa FOR READ ONLY; OPEN aaCountCursor; FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount WHILE @@FETCH_STATUS = 0 BEGIN INSERT tblResults SELECT TOP 5 aa, bb, cc FROM tblproh WHERE aa = @aa ORDER BY aa, bb DESC, cc FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount; END; CLOSE aaCountCursor DEALLOCATE aaCountCursor SELECT * from tblResults ORDER BY aa, bb, cc -- OR .. bb DESC ... for a more traditional order. </code></pre> <p><strong>On the question of needing an index or not</strong>. (cf OP's remark) When merely running a "SELECT * FROM myTable", a table scan is effectively the fastest appraoch, no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of things (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data), is that it can rely on indexes to avoid scanning. Many general purpose languages are far better suited to handle raw processing, but they are fighting an unfair battle with SQL because they need to rebuilt any prior knowledge of the data which SQL has gathered in the course of its data collection / import phase. Since sorting is a typically a time and sometimes space consuming task, SQL and its relatively slower processing power often ends up ahead of alternative solutions.</p> <p>Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And, because sorting is an intrinsic part of DDMS, the SQL servers are generally efficient in that area.</p> <p><strong>So... Is SQL better?</strong></p> <p>This said, if we are trying to compare SQL and other languages for pure ETL jobs, i.e. for dealing with <em>heaps</em> (unindexed tables) as its input to perform various transformations and filtering, it is likely that multi-thread-able utilities written in say C, and leveraging efficient sorting libaries, would likely be faster. The determining question to decide on a SQL vs. Non-SQL approach is where the data is located and where should it eventually reside. If we merely to convert a file to be supplied down "the chain" external programs are better suited. If we have or need the data in a SQL server, there are only rare cases that make it worthwhile exporting and processing externally.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload