data.table - R converting to long format, pattern -
i convert data.table 1 wide format long.
set.seed(1) dt <- data.table( id = c(1:5, na), name = c("bob","ana","smith","sam","big","lulu"), kind_2001 = factor(sample(c(letters[1:3], na), 6, true)), kind_2002 = factor(sample(c(letters[1:3], na), 6, true)), kind_2003 = factor(sample(c(letters[1:3], na), 6, true)), conc_2001 = sample(99,6), conc_2002 = sample(79,6), conc_2003 = sample(49,6) ) id name kind_2001 kind_2002 kind_2003 conc_2001 conc_2002 conc_2003 1 bob b na c 38 22 24 2 ana b c b 77 31 29 3 smith c c na 91 2 49 4 sam na b 21 30 9 5 big c 62 66 38 na lulu na na 12 26 30
and this:
id name year kind conc 1 bob 2001 b 38 1 bob 2002 na 22 1 bob 2003 c 24 2 ana 2001 b 77 2 ana 2002 c 31 2 ana 2003 b 29 ...
the real table has many more variables, i'm looking solution without explicitly saying every column name or number, detecting automatically pattern.
have 2 kind of columns, ending underscore , 4 digit year, such _2001, , other without ending.
can have underscore in middle of name (this kept untransformed).
transform columns ending year long format.
i've tried with
melt(dt, id=1:2, variable.name = "year") or melt(dt, id=1:2, measure=patterns("_2[0-9][0-9][0-9]$"))
but i'm not getting want.
maybe first need filter names gsub.
pd: i've found solution.
posi <- grep("_[0-9][0-9][0-9][0-9]$",colnames(dt)) work <- unique(gsub("_[0-9][0-9][0-9][0-9]$","",colnames(dt)[posi])) melt(dt, measure=patterns(paste0("^",work)), variable="year", value.name=work)
it works year column not populated properly. i'm missing or it's bug. , i'm sure written simpler.
id name year kind conc 1 bob 1 b 38 2 ana 1 b 77 3 smith 1 c 91 4 sam 1 na 21 5 big 1 62 na lulu 1 na 12 1 bob 2 na 22 2 ana 2 c 31 3 smith 2 c 2 4 sam 2 30 5 big 2 66 na lulu 2 26 1 bob 3 c 24 2 ana 3 b 29 3 smith 3 na 49 4 sam 3 b 9 5 big 3 c 38 na lulu 3 na 30
regards
i've tried eddi solution database , error:
"error: cannot allocate vector of size 756.5 mb" though have 16gb of memory.
here's option that's more robust respect order of columns, missing/extra years:
dcast(melt(dt, id.vars = c("id", "name")) [, .(id, name, sub('_.*', '', variable), sub('.*_', '', variable), value)], id + name + v4 ~ v3) # id name v4 conc kind # 1: 1 bob 2001 38 b # 2: 1 bob 2002 22 na # 3: 1 bob 2003 24 c # 4: 2 ana 2001 77 b # 5: 2 ana 2002 31 c # 6: 2 ana 2003 29 b # 7: 3 smith 2001 91 c # 8: 3 smith 2002 2 c # 9: 3 smith 2003 49 na #10: 4 sam 2001 21 na #11: 4 sam 2002 30 #12: 4 sam 2003 9 b #13: 5 big 2001 62 #14: 5 big 2002 66 #15: 5 big 2003 38 c #16: na lulu 2001 12 na #17: na lulu 2002 26 #18: na lulu 2003 30 na
edit many id columns:
idvars = grep("_", names(dt), invert = true) dcast(melt(dt, id.vars = idvars) [, `:=`(var = sub('_.*', '', variable), year = sub('.*_', '', variable), variable = null)], ... ~ var, value.var='value')
Comments
Post a Comment