google bigquery - Parallelizable OVER EACH BY -
i hitting obstacle again , again...
join each , group each clauses can't used on output of window functions
is there best practice or recommendations how use window functions (over()) large data sets cannot processed on single node?
fragmenting data , running same query different filters can work, limiting, takes lot of time (and manual labor) , costly (running same query on same data set 30 times instead of once).
referring jeremy's answer bellow... it's better, still doesn't work properly. if take original query sample:
select title,count (case when contributor_id<>leadcontributor 1 else null end) different, count (case when contributor_id=leadcontributor 1 else null end) same, count(*) total ( select title,contributor_id,lead(contributor_id)over(partition title order timestamp) leadcontributor [publicdata:samples.wikipedia] regexp_match(title,r'^[a,b]')=true ) group title
now works... but
select title,count (case when contributor_id<>leadcontributor 1 else null end) different, count (case when contributor_id=leadcontributor 1 else null end) same, count(*) total ( select title,contributor_id,lead(contributor_id)over(partition title order timestamp) leadcontributor [publicdata:samples.wikipedia] regexp_match(title,r'^[a-z]')=true ) group each title
gives again resources exceeded error...
window functions can executed in distributed fashion according partition clause given inside over. if supply partition window functions, data processed in parallel similar how join each , group each processed.
in addition, can use partition on output of join each or group each without serializing execution. using same keys partition join each or group each particularly efficient, because data not need reshuffled between join/aggregation , window function execution.
Comments
Post a Comment