Manipulating dates alongside consecutive results

Multi tool use


Manipulating dates alongside consecutive results
I need some help working with consecutive results.
Here is my sample data:
df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909,
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645,
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA,
-16L))
The 'idno' represents individual people who had a test with 'result' on a given date ('samp_date').
From each individual person, I need to find the earliest consecutive 'Negatives' and return the date of the first 'negative' result. To return this date, the consecutive negatives must span >30 days with no 'positive' results.
The example answer for idno == 1 would be 2013-10-29, and 2015-11-06 for idno == 2.
I have tried using rle(as.character(df$result))
but have struggled to understand how to apply this to grouped data.
rle(as.character(df$result))
I would prefer an approach that uses dplyr or data.table.
Thanks for any help.
3 Answers
3
A dplyr
based solution can be achieved by creating a group of consecutive occurrence of result
column and then finally taking 1st occurrence that meets criteria:
dplyr
result
library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>%
group_by(idno) %>%
arrange(samp_date) %>%
mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
group_by(idno, result_grp) %>%
filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
slice(1) %>%
ungroup() %>%
select(-result_grp)
# # A tibble: 2 x 3
# idno result samp_date
# <dbl> <ord> <date>
# 1 1.00 Negative 2013-10-29
# 2 2.00 Negative 2015-11-06
Similar to @MKR's answer, you can make a grouping variable and summarize in data.table:
library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]
# summarize by grouping var g = rleid(idno, result)
runDT = df[, .(
start = first(samp_date),
end = last(samp_date),
dur = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]
# idno result g start end dur
# 1: 1 Negative 1 2013-07-23 2013-07-23 0 days
# 2: 1 Positive 2 2013-08-21 2013-10-01 41 days
# 3: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 4: 2 Positive 4 2015-10-13 2015-10-13 0 days
# 5: 2 Negative 5 2015-11-06 2016-10-31 360 days
# 6: 2 Positive 6 2017-04-25 2017-09-20 148 days
# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30),
on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]
# filter
runDT[w]
# idno result g start end dur
# 1: 1 Negative 3 2013-10-29 2015-07-29 638 days
# 2: 2 Negative 5 2015-11-06 2016-10-31 360 days
w
dur
min_dur
library(dplyr)
df %>% group_by(idno) %>%
mutate(time_diff = ifelse(result=="Negative" & lead(result)=='Negative', samp_date - lead(samp_date),0),
ConsNegDate = min(samp_date[which(abs(time_diff)>30)]))
# A tibble: 16 x 5
# Groups: idno [2]
idno result samp_date time_diff ConsNegDate
<dbl> <ord> <date> <dbl> <date>
1 1 Negative 2013-07-23 0 2013-10-29
2 1 Positive 2013-08-21 0 2013-10-29
3 1 Positive 2013-10-01 0 2013-10-29
4 1 Negative 2013-10-29 -34 2013-10-29
5 1 Negative 2013-12-02 -39 2013-10-29
6 1 Negative 2014-01-10 -102 2013-10-29
7 1 Negative 2014-04-22 -322 2013-10-29
8 1 Negative 2015-03-10 -72 2013-10-29
9 1 Negative 2015-05-21 -69 2013-10-29
10 1 Negative 2015-07-29 NA 2013-10-29
11 2 Positive 2015-10-13 0 2015-11-06
12 2 Negative 2015-11-06 -360 2015-11-06
13 2 Negative 2016-10-31 0 2015-11-06
14 2 Positive 2017-04-25 0 2015-11-06
15 2 Positive 2017-09-07 0 2015-11-06
16 2 Positive 2017-09-20 0 2015-11-06
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
I'd prefer to just do the join rather than using
w
, but the contents ofdur
would be filled withmin_dur
, which is not ideal... Not sure which issue this is related to, maybe github.com/Rdatatable/data.table/issues/1615– Frank
3 mins ago