Ad

Theoretical - Fastest Way To Compute Different Counts

- 1 answer

I would like to know how experienced sql users would compute a lot of counts with different conditions. I have a table [population] and a table [sql_rules].

My [population] table looks something like this with ~1.7M rows and ~30 columns.

eye_colorhair_colorageoriginincome...
bluebrown36US40000...
greenyellow17UK60000...
brownblack42DE20000...
blackblack28DK80000...
..................

My [sql_rules] table looks something like this, with ~800 rows and at maximum 8 rules (average ~5) :

row_numberrule_1rule_2rule_3rule_4...
1hair_color = 'brown'age < 27origin IN ('US', 'UK')income >= 40000...
2hair_color = 'black'origin IN ('DK', 'FR')age < 10income >= 40000...
3hair_color = 'yellow'origin IN ('TH', 'PE')age > 34
4hair_color = 'black'age > 99origin IN ('US', 'UK')income >= 40000...
5age < 27income >= 100000
..................

What I need to do is basically to 'iterate' by row and to compute the count when applying the rules one after the other to obtain one 'count' by cell of my sql_rules table. Basically, since it is hard to explain with words, here are the count I would like to obtain :

row_numberrule_1rule_2rule_3rule_4...
1SELECT COUNT(*) FROM population WHERE hair_color = 'brown'SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 AND origin IN ('US', 'UK')SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 AND origin IN ('US', 'UK') AND income >= 40000...
2SELECT COUNT(*) FROM population WHERE hair_color = 'black'SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR')SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') AND age < 10SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') AND age < 10 AND income >= 40000...
3...............
5SELECT COUNT(*) FROM population WHERE age < 27SELECT COUNT(*) FROM population WHERE age < 27 AND income >= 100000
..................

What I did for now was to use my table [sql_rules] to create all the SQL 'queries'. Since they can often be similar until the third or forth column, I avoided recomputing the same count multiple times and took the values if a particular 'count' had already been computed.

Can someone think of a fastest way to do this ? My solution is working fine but I can't think of a faster way to do this. I'm able to work with sql, Python, R.

Edit : Ideally, what I'm looking for is some nice ideas that could potentially accelerate the process, for the sake of 'interest'. Here are examples of ideas that are not meant to be great, but to exemplify what I am looking for :

  1. Using the sql_rules table, let's say with (~800 rows * ~5 rules in average), 4000 rules, create the 4000 sql queries and run them all one after the other until it's done.
  2. Create the 4000 'queries' as above but once it is done, keep only the unique/distinct rules which would maybe reduce the count to 1000 'SELECT COUNT(*) FROM... .
  3. loop over each line, create a duplicate of the population table and remove from the duplicate table the rows where the 'rules' are not met. That way, each condition/query/rules of the rest of the row will be computed on a smaller and smaller table.

Edit -------------------------------------

  • The table is unfortunately proprietary but I will create something equivalent using public data as soon as possible.
  • The sql_rules table is ~800rows and 8 columns (rules). Most of the time, one single row does not contain 8 rules and the a single row is thus looking like that :
row_numberrule_1rule_2rule_3rule_4rule_5rule_6rule_7rule_8
hair = 'blue'eyes IN ('green', 'brown')income BETWEEN 10000 AND 20000NANANANANANA

with the values I need to find for this particular row being equivalent to

row_numberrule_1rule_2rule_3rule_4rule_5rule_6rule_7rule_8
SELECT COUNT(*) FROM population WHERE hair = 'blue'SELECT COUNT(*) FROM population WHERE hair = 'blue' AND eyes IN ('green', 'brown')SELECT COUNT(*) FROM population WHERE hair = 'blue' AND eyes IN ('green', 'brown') AND income BETWEEN 10000 AND 20000000000
Ad

Answer

You can use conditional sum , still query must be build for each set of rules, but where part can be easily implemented.

select 
  sum(1) filter (where hair_color = 'brown') rule1,
  sum(1) filter (where age < 27) rule2,
  sum(1) filter (where origin in ('US','UK')) rule3
 from population

DB Fiddle

Ad
source: stackoverflow.com
Ad