Aggregate Multiple Effect Sizes or Outcomes Within Studies

Function to aggregate multiple effect sizes or outcomes belonging to the same study (or to the same level of some other clustering variable) into a single combined effect size or outcome.

# S3 method for class 'escalc'
aggregate(x, cluster, time, obs, V, struct="CS", rho, phi,
          weighted=TRUE, checkpd=TRUE, fun, na.rm=TRUE,
          addk=FALSE, subset, select, digits, var.names, ...)

Arguments

x: an object of class "escalc".
cluster: vector to specify the clustering variable (e.g., study).
time: optional vector to specify the time points (only relevant when struct="CAR", "CS+CAR", or "CS*CAR").
obs: optional vector to distinguish different observed effect sizes or outcomes measured at the same time point (only relevant when struct="CS*CAR").
V: optional argument to specify the variance-covariance matrix of the sampling errors. If unspecified, argument struct is used to specify the variance-covariance structure.
struct: character string to specify the variance-covariance structure of the sampling errors within the same cluster (either "ID", "CS", "CAR", "CS+CAR", or "CS*CAR"). See ‘Details’.
rho: value of the correlation of the sampling errors within clusters (when struct="CS", "CS+CAR", or "CS*CAR"). Can also be a vector with the value of the correlation for each cluster.
phi: value of the autocorrelation of the sampling errors within clusters (when struct="CAR", "CS+CAR", or "CS*CAR"). Can also be a vector with the value of the autocorrelation for each cluster.
weighted: logical to specify whether estimates within clusters should be aggregated using inverse-variance weighting (the default is TRUE). If set to FALSE, unweighted averages are computed.
checkpd: logical to specify whether to check that the variance-covariance matrices of the sampling errors within clusters are positive definite (the default is TRUE).
fun: optional list with three functions for aggregating other variables besides the effect sizes or outcomes within clusters (for numeric/integer variables, for logicals, and for all other types, respectively).
na.rm: logical to specify whether NA values should be removed before aggregating values within clusters (the default is TRUE). Can also be a vector with two logicals (the first pertaining to the effect sizes or outcomes, the second to all other variables).
addk: logical to specify whether to add the cluster size as a new variable (called ki) to the dataset (the default is FALSE).
subset: optional (logical or numeric) vector to specify the subset of rows to include when aggregating the effect sizes or outcomes.
select: optional vector to specify the names of the variables to include in the aggregated dataset.
digits: optional integer to specify the number of decimal places to which the printed results should be rounded (the default is to take the value from the object).
var.names: optional character vector with two elements to specify the name of the variable that contains the observed effect sizes or outcomes and the name of the variable with the corresponding sampling variances (when unspecified, the function attempts to set these automatically based on the object).
...: other arguments.

Details

In many meta-analyses, multiple effect sizes or outcomes can be extracted from the same study. Ideally, such structures should be analyzed using an appropriate multilevel/multivariate model as can be fitted with the rma.mv function. However, there may occasionally be reasons for aggregating multiple effect sizes or outcomes belonging to the same study (or to the same level of some other clustering variable) into a single combined effect size or outcome. The present function can be used for this purpose.

The input must be an object of class "escalc". The error ‘Error in match.fun(FUN): argument "FUN" is missing, with no default’ indicates that a regular data frame was passed to the function, but this does not work. One can turn a regular data frame (containing the effect sizes or outcomes and the corresponding sampling variances) into an "escalc" object with the escalc function. See the ‘Examples’ below for an illustration of this.

The cluster variable is used to specify which estimates/outcomes belong to the same study/cluster.

In the simplest case, the estimates/outcomes within clusters (or, to be precise, their sampling errors) are assumed to be independent. This is usually a safe assumption as long as each study participant (or whatever the study units are) only contributes data to a single estimate/outcome. For example, if a study provides effect size estimates for male and female subjects separately, then the sampling errors can usually be assumed to be independent. In this case, one can set struct="ID" and multiple estimates/outcomes within the same cluster are combined using standard inverse-variance weighting (i.e., using weighted least squares) under the assumption of independence.

In other cases, the estimates/outcomes within clusters cannot be assumed to be independent. For example, if multiple effect size estimates are computed for the same group of subjects (e.g., based on different scales to measure some construct of interest), then the estimates are likely to be correlated. If the actual correlation between the estimates is unknown, one can often still make an educated guess and set argument rho to this value, which is then assumed to be the same for all pairs of estimates within clusters when struct="CS" (for a compound symmetric structure). Multiple estimates/outcomes within the same cluster are then combined using inverse-variance weighting taking their correlation into consideration (i.e., using generalized least squares). One can also specify a different value of rho for each cluster by passing a vector (of the same length as the number of clusters) to this argument.

If multiple effect size estimates are computed for the same group of subjects at different time points, then it may be more sensible to assume that the correlation between estimates decreases as a function of the distance between the time points. If so, one can specify struct="CAR" (for a continuous-time autoregressive structure), set phi to the autocorrelation (for two estimates one time-unit apart), and use argument time to specify the actual time points corresponding to the estimates. The correlation between two estimates, \(y_{it}\) and \(y_{it'}\), in the \(i\text{th}\) cluster, with time points \(\text{time}_{it}\) and \(\text{time}_{it'}\), is then given by \(\phi^{|\text{time}_{it} - \text{time}_{it'}|}\). One can also specify a different value of phi for each cluster by passing a vector (of the same length as the number of clusters) to this argument.

One can also combine the compound symmetric and autoregressive structures if there are multiple time points and multiple observed effect sizes or outcomes at these time points. One option is struct="CS+CAR". In this case, one must specify the time argument and both rho and phi. The correlation between two estimates, \(y_{it}\) and \(y_{it'}\), in the \(i\text{th}\) cluster, with time points \(\text{time}_{it}\) and \(\text{time}_{it'}\), is then given by \(\rho + (1 - \rho) \phi^{|\text{time}_{it} - \text{time}_{it'}|}\).

Alternatively, one can specify struct="CS*CAR". In this case, one must specify both the time and obs arguments and both rho and phi. The correlation between two estimates, \(y_{ijt}\) and \(y_{ijt'}\), with the same value for obs but different values for time, is then given by \(\phi^{|\text{time}_{ijt} - \text{time}_{ijt'}|}\), the correlation between two estimates, \(y_{ijt}\) and \(y_{ij't}\), with different values for obs but the same value for time, is then given by \(\rho\), and the correlation between two estimates, \(y_{ijt}\) and \(y_{ij't'}\), with different values for obs and different values for time, is then given by \(\rho \times \phi^{|\text{time}_{ijt} - \text{time}_{ijt'}|}\).

Finally, if one actually knows the correlation (and hence the covariance) between each pair of estimates (or has an approximation thereof), one can also specify the entire variance-covariance matrix of the estimates (or more precisely, their sampling errors) via the V argument (in this case, arguments struct, time, obs, rho, and phi are ignored). Note that the vcalc function can be used to construct such a V matrix and provides even more flexibility for specifying various types of dependencies. See the ‘Examples’ below for an illustration of this.

Instead of using inverse-variance weighting (i.e., weighted/generalized least squares) to combine the estimates within clusters, one can set weighted=FALSE in which case the estimates are averaged within clusters without any weighting (although the correlations between estimates as specified are still taken into consideration).

Other variables (besides the estimates) will also be aggregated to the cluster level. By default, numeric/integer type variables are averaged, logicals are also averaged (yielding the proportion of TRUE values), and for all other types of variables (e.g., character variables or factors) the most frequent category/level is returned. One can also specify a list of three functions via the fun argument for aggregating variables belonging to these three types.

Argument na.rm controls how missing values should be handled. By default, any missing estimates are first removed before aggregating the non-missing values within each cluster. The same applies when aggregating the other variables. One can also specify a vector with two logicals for the na.rm argument to control how missing values should be handled when aggregating the estimates and when aggregating all other variables.

Value

An object of class c("escalc","data.frame") that contains the (selected) variables aggregated to the cluster level.

The object is formatted and printed with the print function.

Author

Wolfgang Viechtbauer (wvb@metafor-project.org, https://www.metafor-project.org).

References

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03

Examples

### copy data into 'dat' and examine data
dat <- dat.konstantopoulos2011
head(dat, 11)
#> 
#>    district school study year     yi    vi 
#> 1        11      1     1 1976 -0.180 0.118 
#> 2        11      2     2 1976 -0.220 0.118 
#> 3        11      3     3 1976  0.230 0.144 
#> 4        11      4     4 1976 -0.300 0.144 
#> 5        12      1     5 1989  0.130 0.014 
#> 6        12      2     6 1989 -0.260 0.014 
#> 7        12      3     7 1989  0.190 0.015 
#> 8        12      4     8 1989  0.320 0.024 
#> 9        18      1     9 1994  0.450 0.023 
#> 10       18      2    10 1994  0.380 0.043 
#> 11       18      3    11 1994  0.290 0.012 
#> 

### aggregate estimates to the district level, assuming independent sampling
### errors for multiples studies/schools within the same district
agg <- aggregate(dat, cluster=district, struct="ID", addk=TRUE)
agg
#> 
#>    district school study   year     yi    vi ki 
#> 1        11    2.5   2.5 1976.0 -0.126 0.032  4 
#> 2        12    2.5   6.5 1989.0  0.067 0.004  4 
#> 3        18    2.0  10.0 1994.0  0.350 0.007  3 
#> 4        27    2.5  13.5 1976.0  0.500 0.001  4 
#> 5        56    2.5  17.5 1997.0  0.051 0.002  4 
#> 6        58    6.0  25.0 1976.0 -0.042 0.001 11 
#> 7        71    2.0  32.0 1997.0  0.886 0.004  3 
#> 8        86    4.5  37.5 1997.0 -0.029 0.000  8 
#> 9        91    3.5  44.5 2000.0  0.250 0.002  6 
#> 10      108    3.0  50.0 2000.0  0.015 0.006  5 
#> 11      644    2.5  54.5 1994.5  0.162 0.019  4 
#> 

### copy data into 'dat' and examine data
dat <- dat.assink2016
head(dat, 19)
#> 
#>    study esid id      yi     vi pubstatus year deltype 
#> 1      1    1  1  0.9066 0.0740         1  4.5 general 
#> 2      1    2  2  0.4295 0.0398         1  4.5 general 
#> 3      1    3  3  0.2679 0.0481         1  4.5 general 
#> 4      1    4  4  0.2078 0.0239         1  4.5 general 
#> 5      1    5  5  0.0526 0.0331         1  4.5 general 
#> 6      1    6  6 -0.0507 0.0886         1  4.5 general 
#> 7      2    1  7  0.5117 0.0115         1  1.5 general 
#> 8      2    2  8  0.4738 0.0076         1  1.5 general 
#> 9      2    3  9  0.3544 0.0065         1  1.5 general 
#> 10     3    1 10  2.2844 0.3325         1 -8.5 general 
#> 11     3    2 11  2.1771 0.3073         1 -8.5 general 
#> 12     3    3 12  1.7777 0.2697         1 -8.5 general 
#> 13     3    4 13  1.5480 0.4533         1 -8.5 general 
#> 14     3    5 14  1.4855 0.1167         1 -6.5 general 
#> 15     3    6 15  1.4836 0.1706         1 -6.5 general 
#> 16     3    7 16  1.2777 0.1538         1 -8.5 general 
#> 17     3    8 17  1.0311 0.3132         1 -8.5 general 
#> 18     3    9 18  0.9409 0.1487         1 -6.5 general 
#> 19     3   10 19  0.6263 0.2139         1 -6.5 general 
#> 

### note: 'dat' is an 'escalc' object
class(dat)
#> [1] "escalc"     "data.frame"

### turn 'dat' into a regular data frame
dat <- as.data.frame(dat)
class(dat)
#> [1] "data.frame"

### turn data frame into an 'escalc' object
dat <- escalc(measure="SMD", yi=yi, vi=vi, data=dat)
class(dat)
#> [1] "escalc"     "data.frame"

### aggregate the estimates to the study level, assuming a CS structure for
### the sampling errors within studies with a correlation of 0.6
agg <- aggregate(dat, cluster=study, rho=0.6)
agg
#> 
#>    study esid   id      yi     vi pubstatus  year deltype 
#> 1      1  3.5  3.5  0.1629 0.0197         1   4.5 general 
#> 2      2  2.0  8.0  0.4060 0.0056         1   1.5 general 
#> 3      3  5.5 14.5  1.0790 0.0832         1  -7.7 general 
#> 4      4  1.0 20.0 -0.0447 0.0331         1  -7.5 general 
#> 5      5  1.0 21.0  1.5490 0.1384         1 -11.5 general 
#> 6      6  3.0 24.0 -0.0550 0.0214         1   5.5 general 
#> 7      7  3.5 29.5  1.0072 0.0545         1   3.5 general 
#> 8      8  1.0 33.0  0.3695 0.0199         1   1.5 general 
#> 9      9  1.5 34.5  0.1379 0.0271         1   5.5 general 
#> 10    10  3.0 38.0  0.1167 0.0107         1   4.5 general 
#> 11    11 11.5 51.5  0.5258 0.0114         0  -6.5 general 
#> 12    12  3.0 65.0  0.2805 0.0028         0   4.5   overt 
#> 13    13  1.5 68.5  0.3018 0.0110         1  -3.5 general 
#> 14    14  3.5 72.5  0.0356 0.0014         1  -1.5 general 
#> 15    15  2.0 77.0  0.0908 0.1269         1 -13.5 general 
#> 16    16  8.5 86.5  0.0181 0.0169         1   2.5  covert 
#> 17    17  3.5 97.5 -0.0552 0.0072         1   5.5 general 
#> 

### use vcalc() and then the V argument
V <- vcalc(vi, cluster=study, obs=esid, data=dat, rho=0.6)
agg <- aggregate(dat, cluster=study, V=V)
agg
#> 
#>    study esid   id      yi     vi pubstatus  year deltype 
#> 1      1  3.5  3.5  0.1629 0.0197         1   4.5 general 
#> 2      2  2.0  8.0  0.4060 0.0056         1   1.5 general 
#> 3      3  5.5 14.5  1.0790 0.0832         1  -7.7 general 
#> 4      4  1.0 20.0 -0.0447 0.0331         1  -7.5 general 
#> 5      5  1.0 21.0  1.5490 0.1384         1 -11.5 general 
#> 6      6  3.0 24.0 -0.0550 0.0214         1   5.5 general 
#> 7      7  3.5 29.5  1.0072 0.0545         1   3.5 general 
#> 8      8  1.0 33.0  0.3695 0.0199         1   1.5 general 
#> 9      9  1.5 34.5  0.1379 0.0271         1   5.5 general 
#> 10    10  3.0 38.0  0.1167 0.0107         1   4.5 general 
#> 11    11 11.5 51.5  0.5258 0.0114         0  -6.5 general 
#> 12    12  3.0 65.0  0.2805 0.0028         0   4.5   overt 
#> 13    13  1.5 68.5  0.3018 0.0110         1  -3.5 general 
#> 14    14  3.5 72.5  0.0356 0.0014         1  -1.5 general 
#> 15    15  2.0 77.0  0.0908 0.1269         1 -13.5 general 
#> 16    16  8.5 86.5  0.0181 0.0169         1   2.5  covert 
#> 17    17  3.5 97.5 -0.0552 0.0072         1   5.5 general 
#> 

### use a correlation of 0.7 for effect sizes corresponding to the same type of
### delinquent behavior and a correlation of 0.5 for effect sizes corresponding
### to different types of delinquent behavior
V <- vcalc(vi, cluster=study, type=deltype, obs=esid, data=dat, rho=c(0.7, 0.5))
agg <- aggregate(dat, cluster=study, V=V)
agg
#> 
#>    study esid   id      yi     vi pubstatus  year deltype 
#> 1      1  3.5  3.5  0.1275 0.0190         1   4.5 general 
#> 2      2  2.0  8.0  0.3922 0.0059         1   1.5 general 
#> 3      3  5.5 14.5  0.9909 0.0783         1  -7.7 general 
#> 4      4  1.0 20.0 -0.0447 0.0331         1  -7.5 general 
#> 5      5  1.0 21.0  1.5490 0.1384         1 -11.5 general 
#> 6      6  3.0 24.0 -0.0838 0.0193         1   5.5 general 
#> 7      7  3.5 29.5  1.0821 0.0573         1   3.5 general 
#> 8      8  1.0 33.0  0.3695 0.0199         1   1.5 general 
#> 9      9  1.5 34.5  0.1372 0.0287         1   5.5 general 
#> 10    10  3.0 38.0  0.1025 0.0102         1   4.5 general 
#> 11    11 11.5 51.5  0.4522 0.0093         0  -6.5 general 
#> 12    12  3.0 65.0  0.2840 0.0028         0   4.5   overt 
#> 13    13  1.5 68.5  0.3016 0.0117         1  -3.5 general 
#> 14    14  3.5 72.5  0.0334 0.0012         1  -1.5 general 
#> 15    15  2.0 77.0  0.0827 0.1359         1 -13.5 general 
#> 16    16  8.5 86.5 -0.2268 0.0176         1   2.5  covert 
#> 17    17  3.5 97.5 -0.0914 0.0071         1   5.5 general 
#> 

### reshape 'dat.ishak2007' into long format
dat <- dat.ishak2007
dat <- reshape(dat.ishak2007, direction="long", idvar="study", v.names=c("yi","vi"),
               varying=list(c(2,4,6,8), c(3,5,7,9)))
dat <- dat[order(study, time),]
dat <- dat[!is.na(yi),]
rownames(dat) <- NULL
head(dat, 8)
#> 
#>               study mdur mbase time    yi    vi 
#> 1    Alegret (2001) 16.1  53.6    1 -33.4  14.3 
#> 2 Barichella (2003) 13.5  45.3    1 -20.0   7.3 
#> 3 Barichella (2003) 13.5  45.3    3 -30.0   5.7 
#> 4     Berney (2002) 13.6  45.6    1 -21.1   7.3 
#> 5   Burchiel (1999) 13.6  48.0    1 -20.0   8.0 
#> 6   Burchiel (1999) 13.6  48.0    2 -20.0   8.0 
#> 7   Burchiel (1999) 13.6  48.0    3 -18.0   5.0 
#> 8       Chen (2003) 12.1  65.7    2 -32.9 125.0 
#> 

### aggregate the estimates to the study level, assuming a CAR structure for
### the sampling errors within studies with an autocorrelation of 0.9
agg <- aggregate(dat, cluster=study, struct="CAR", time=time, phi=0.9)
head(agg, 5)
#> 
#>               study mdur mbase time    yi    vi 
#> 1    Alegret (2001) 16.1  53.6    1 -33.4  14.3 
#> 2 Barichella (2003) 13.5  45.3    2 -28.1   5.6 
#> 3     Berney (2002) 13.6  45.6    1 -21.1   7.3 
#> 4   Burchiel (1999) 13.6  48.0    2 -17.2   4.6 
#> 5       Chen (2003) 12.1  65.7    2 -32.9 125.0 
#>