sql - The fastest way to fill table using data from other tables -

I've been reading about the concepts behind functional programming, and it makes me reconsider my way of doing things.

For example, there is a table:

 - client, date, trial, full
 - client1, 14.11.2012, 1, 1
 - client1, 06.02.2013, null, 1
 - client1, 27.03.2013, null, 1
 - client1, 15.05.2013, null, 1

The table contains millions of records and half a million clients. The goal is to transform this data into the status of each client:

 - client, date, status
 - client1, 14.11.2012, 'mixed'
 - client1, 01.12.2012, 'unprocessed'
 - client1, 01.01.2013, 'unprocessed'
 - client1, 13.01.2013, 'slept'
 - client1, 01.02.2013, 'slept'
 - client1, 06.02.2013, 'processed'
 - client1, 01.03.2013, 'unprocessed'
 - client1, 27.03.2013, 'processed'
 - client1, 01.04.2013, 'unprocessed'
 - client1, 01.05.2013, 'unprocessed'
 - client1, 15.05.2013, 'processed'
 - client1, 01.06.2013, 'unprocessed'
 - client1, 01.07.2013, 'unprocessed'
 - client1, 23.07.2013, 'slept'
 - client1, 01.08.2013, 'slept'
 - client1, 01.09.2013, 'slept'
 - client1, 01.10.2013, 'slept'
 - client1, 01.11.2013, 'slept'
 - client1, 01.12.2013, 'slept'
 - client1, 01.01.2014, 'slept'
 - client1, 10.01.2014, 'left'

The short version of the transformation algorithm is:

if it's the first row and trial = 1 and full = 1, then status = 'mixed'
if there is no data on the first day of a month, then status = 'unprocessed'
if 60 days have passed and there are no records containing full = 1, then status = 'slept'
if 240 days have passed and there are no records containing full = 1, then status = 'left'
if it is the first day of a month and the previous status = 'slept', then status = 'slept'

A lot of cases are skipped here, because the algorithm isn't the issue; the tools are.

In order to transform the data within SQL, I used the following expressions:

row_number() over (partition by [client] order by [date] asc)
lag([date], 1) over (partition by [client] order by [date] desc)
dateadd(day, 1, eomonth([date]))
recursion
etc.
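As a minimal sketch of how the expressions listed above fit together, the batch below loads the four sample rows into a table variable and computes, per client, the row number, the previous date, the gap in days, and the first day of the following month. The table variable name @clientdata and the column aliases are assumptions made for illustration, and the sketch assumes SQL Server 2012 or later, where LAG and EOMONTH are available. It orders ascending inside LAG so that LAG returns the earlier row; with a descending order it would return the later one instead.

```sql
-- Sample rows from the question, held in a table variable (hypothetical name).
DECLARE @clientdata TABLE (
    client NVARCHAR(50) NOT NULL,
    [date] DATE         NOT NULL,
    trial  BIT          NULL,
    [full] BIT          NULL
);

INSERT INTO @clientdata (client, [date], trial, [full])
VALUES ('client1', '20121114', 1,    1),
       ('client1', '20130206', NULL, 1),
       ('client1', '20130327', NULL, 1),
       ('client1', '20130515', NULL, 1);

SELECT client,
       [date],
       -- position of the row within its client, oldest first
       ROW_NUMBER() OVER (PARTITION BY client ORDER BY [date] ASC) AS rownum,
       -- date of the previous row for the same client (NULL on the first row)
       LAG([date], 1) OVER (PARTITION BY client ORDER BY [date] ASC) AS prev_date,
       -- days elapsed since the previous row, usable for 60 / 240 day checks
       DATEDIFF(DAY,
                LAG([date], 1) OVER (PARTITION BY client ORDER BY [date] ASC),
                [date]) AS days_since_prev,
       -- first day of the month following [date]
       DATEADD(DAY, 1, EOMONTH([date])) AS next_month_start
FROM @clientdata
ORDER BY client, [date];
```

Because every window is partitioned by client, the same query works unchanged when the table holds many clients.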

I have a feeling this can't be the fastest way to transform the data. Multi-threading (putting every client in a separate thread) may be helpful, but I'm not sure how good SQL is at that. The execution plan also becomes huge after a large number of cases.

So, the question is: which tool is best for transforming data like that? Perhaps a programming language can handle it much better?

Update: I have prepared the requested SQL code. Feel free to find the issue: http://pastebin.com/3ncdfqug

The below assumes you'll process one client at a time, for example for a customer report.

I have used the dataset you provided, uploaded to a table called clientdata with the following index applied. This may be overkill, as it creates a duplicate of the data, but it makes things lightning quick:

create nonclustered index ix_cientdata_client_date on dbo.clientdata(client,date) include (trial,[full])

I have then created a dates table for the given client ID, running from the first date value to the lesser of the last date value + 240 days or today.

From this table, you can filter out the useless dates, then join the dataset to the previous clientdata row and process the status logic.

As you have not included the entire set of logical processes, I have completed what I can, leaving in error messages in case you start changing things. I find this useful in pinpointing why a case statement isn't quite doing what I want it to:

if object_id('tempdb..#clientjourney') is not null drop table #clientjourney

declare @client nvarchar(50) = '0x802b52540027e50211e24949c409c617'

declare @mindate date = (select min(date)
                         from clientdata
                         where client = @client
                         )
declare @maxdate date = (select case when dateadd(d,240,max(date)) > getdate()
                                     then getdate()
                                     else dateadd(d,240,max(date))
                                     end
                         from clientdata
                         where client = @client
                         )

--select max(date), @mindate, @maxdate, datediff(d,max(date),@maxdate) from clientdata where client = @client

-- create a table of dates between @mindate and @maxdate using a recursive cte
;with dates as
(
select @mindate as datevalue
        ,case when datepart(day,@mindate) = 1 then 1 else 0 end as monthstart

union all

select dateadd(d,1,datevalue)
        ,case when datepart(day,dateadd(d,1,datevalue)) = 1 then 1 else 0 end as monthstart
from dates
where datevalue < @maxdate
)
-- exclude all dates that aren't either the first of the month, in the clientdata table or the @maxdate value
select row_number() over (order by datevalue) as rownum
        ,d.datevalue
        ,d.monthstart
        ,c.trial
        ,c.[full]
into #clientjourney
from dates d
    left join clientdata c
        on(d.datevalue = c.date
            and c.client = @client
            )
where d.monthstart = 1
    or c.date is not null
    or d.datevalue = @maxdate
option (maxrecursion 0)

-- pull the data out, joining to the previous item of clientdata, and process the status
select j.rownum
        ,j.datevalue
        ,j.monthstart
        ,j.trial
        ,j.[full]

        -- handling of the first line in the dataset
        ,case when j.rownum = 1
            then case when j.trial is not null
                            and j.[full] is not null
                        then 'mixed'
                    when j.trial is null
                            and j.[full] is not null
                        then 'full'
                    when j.trial is not null
                            and j.[full] is null
                        then 'trial'
                    else 'error1'
                    end

            -- handling the rest of the dataset
            else case when j.monthstart = 1                                 -- first of the month
                        then case when j.trial is not null                  -- with client data
                                        or j.[full] is not null
                                    then 'processed'
                                when j.trial is null                        -- without client data
                                        and j.[full] is null
                                    then case when datediff(d,jp.datevalue,j.datevalue) < 60    -- less than 60 days
                                                then 'unprocessed'
                                            when datediff(d,jp.datevalue,j.datevalue) < 240     -- less than 240 days
                                                then 'slept'
                                            else 'left'
                                            end
                                else 'error2'
                                end
                    else                                                    -- rest of the month
                        case when j.[full] = 1                              -- with full flag
                                    then 'processed'
                                when j.[full] is null                       -- without full flag
                                    then case when datediff(d,jp.datevalue,j.datevalue) < 60    -- less than 60 days
                                                then 'unprocessed'
                                            when datediff(d,jp.datevalue,j.datevalue) < 240     -- less than 240 days
                                                then 'slept'
                                            else 'left'
                                            end
                                else 'error3'
                                end
                    end
            end as status
        ,jp.datevalue
        ,datediff(d,jp.datevalue,j.datevalue) as lastfull
from #clientjourney j
    outer apply (select top 1 datevalue     -- returns the most recent clientdata row that occurred before the one being selected
                    from #clientjourney j2
                    where j.rownum > j2.rownum
                        and j2.[full] is not null
                    order by datevalue desc
                ) jp

-- clean up
if object_id('tempdb..#clientjourney') is not null drop table #clientjourney
