Function to estimate means and standard deviations from five-number summary values.

conv.fivenum(min, q1, median, q3, max, n, data, include,
             method="default", dist="norm", transf=TRUE, test=TRUE,
             var.names=c("mean","sd"), append=TRUE, replace="ifna", ...)

Arguments

min

vector with the minimum values.

q1

vector with the lower/first quartile values.

median

vector with the median values.

q3

vector with the upper/third quartile values.

max

vector with the maximum values.

n

vector with the sample sizes.

data

optional data frame containing the variables given to the arguments above.

include

optional (logical or numeric) vector to specify the subset of studies for which means and standard deviations should be estimated.

method

character string indicating the method to use. Either "default" (same as "luo/wan/shi", which is the current default), "qe", "bc", "mln", or "blue". Can be abbreviated. See ‘Details’.

dist

character string indicating the distribution assumed for the underlying data (either "norm" for a normal distribution or "lnorm" for a log-normal distribution). Can also be a string vector if different distributions are assumed for different studies. Only relevant when method="default".

transf

logical to specify whether the estimated means and standard deviations of the log-transformed data should be back-transformed as described by Shi et al. (2020b) (the default is TRUE). Only relevant when dist="lnorm" and when method="default".

test

logical to specify whether a study should be excluded from the estimation if the test for skewness is significant (the default is TRUE, but whether this is applicable depends on the method; see ‘Details’).

var.names

character vector with two elements to specify the name of the variable for the estimated means and the name of the variable for the estimated standard deviations (the defaults are "mean" and "sd").

append

logical to specify whether the data frame provided via the data argument should be returned together with the estimated values (the default is TRUE).

replace

character string or logical to specify how values in var.names should be replaced (only relevant when using the data argument and if variables in var.names already exist in the data frame). See the ‘Value’ section for more details.

...

other arguments.

Details

Various effect size measures require means and standard deviations (SDs) as input (e.g., raw or standardized mean differences, ratios of means / response ratios; see escalc for further details). For some studies, authors may not report the means and SDs, but instead other statistics, such as the so-called ‘five-number summary’, consisting of the minimum, lower/first quartile, median, upper/third quartile, and maximum of the sample values (plus the sample sizes). Occasionally, only a subset of these values is reported.

The present function can be used to estimate means and standard deviations from five-number summary values based on various methods described in the literature (Bland, 2015; Cai et al., 2021; Hozo et al., 2005; Luo et al., 2016; McGrath et al., 2020; Shi et al., 2020a; Walter & Yao, 2007; Wan et al., 2014; Yang et al., 2022).

When method="default" (which is the same as "luo/wan/shi"), the following methods are used:

Case 1: Min, Median, Max

In case only the minimum, median, and maximum are available for a study (plus the sample size), then the function uses the method by Luo et al. (2016), equation (7), to estimate the mean and the method by Wan et al. (2014), equation (9), to estimate the SD.

Case 2: Q1, Median, Q3

In case only the lower/first quartile, median, and upper/third quartile are available for a study (plus the sample size), then the function uses the method by Luo et al. (2016), equation (11), to estimate the mean and the method by Wan et al. (2014), equation (16), to estimate the SD.

Case 3: Min, Q1, Median, Q3, Max

In case the full five-number summary is available for a study (plus the sample size), then the function uses the method by Luo et al. (2016), equation (15), to estimate the mean and the method by Shi et al. (2020a), equation (10), to estimate the SD.


The median is not actually needed in the methods by Wan et al. (2014) and Shi et al. (2020a), and hence it is possible to estimate the SD even when the median is unavailable (this can be useful when a study reports the mean directly but, instead of the SD, reports the minimum/maximum and/or first/third quartile values).

Note that the sample size must be at least 5 to apply these methods. Studies where the sample size is smaller are not included in the estimation. The function also checks that min <= q1 <= median <= q3 <= max and throws an error if any studies are found where this is not the case.
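Under the normality assumption, these estimators are simple closed-form expressions involving only the reported values, the sample size, and the standard normal quantile function. The following is a simplified sketch of the "luo/wan/shi" estimators (not the package code; it omits the n >= 5 check, the ordering check, and the test for skewness):

```r
# Simplified sketch of the "luo/wan/shi" estimators (no input checks,
# no test for skewness); qnorm() is the standard normal quantile function.
fivenum_est <- function(min = NA, q1 = NA, median, q3 = NA, max = NA, n) {
  xi  <- 2 * qnorm((n - 0.375) / (n + 0.25))         # Wan et al. (2014), eq. (9)
  eta <- 2 * qnorm((0.75 * n - 0.125) / (n + 0.25))  # Wan et al. (2014), eq. (16)
  if (is.na(q1)) {                # case 1: min, median, max
    w <- 4 / (4 + n^0.75)                            # Luo et al. (2016), eq. (7)
    c(mean = w * (min + max) / 2 + (1 - w) * median, sd = (max - min) / xi)
  } else if (is.na(min)) {        # case 2: q1, median, q3
    w <- 0.7 + 0.39 / n                              # Luo et al. (2016), eq. (11)
    c(mean = w * (q1 + q3) / 2 + (1 - w) * median, sd = (q3 - q1) / eta)
  } else {                        # case 3: full five-number summary
    w1 <- 2.2 / (2.2 + n^0.75)                       # Luo et al. (2016), eq. (15)
    w2 <- 0.7 - 0.72 / n^0.55
    w3 <- 1 / (1 + 0.07 * n^0.6)                     # Shi et al. (2020a), eq. (10)
    c(mean = w1 * (min + max) / 2 + w2 * (q1 + q3) / 2 + (1 - w1 - w2) * median,
      sd   = w3 * (max - min) / xi + (1 - w3) * (q3 - q1) / eta)
  }
}

round(fivenum_est(min = 2, q1 = 4, median = 6, q3 = 10, max = 14, n = 20), 6)
#>     mean       sd 
#> 6.938841 3.679435
```

The last line reproduces the estimates for study 3 in the examples below, which provides the full five-number summary.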

Test for Skewness

The methods described above were derived under the assumption that the data are normally distributed. Testing this assumption would require access to the raw data, but based on the three cases above, Shi et al. (2023) derived tests for skewness that only require the reported quantile values and the sample sizes. These tests are automatically carried out. When test=TRUE (the default), a study is automatically excluded from the estimation if the test is significant. To include all studies, set test=FALSE, but note that the methods will tend to be less accurate when the underlying data come from a skewed (and hence non-normal) distribution.

Log-Normal Distribution

When setting dist="lnorm", the raw data are assumed to follow a log-normal distribution. In this case, the methods as described by Shi et al. (2020b) are used to estimate the mean and SD of the log-transformed data for the three cases above. When transf=TRUE (the default), the estimated mean and SD of the log-transformed data are back-transformed to the estimated mean and SD of the raw data (using the bias-corrected back-transformation described by Shi et al., 2020b). Note that the test for skewness is also carried out when dist="lnorm", but now testing whether the log-transformed data exhibit skewness.
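As a point of reference, the naive (uncorrected) back-transformation follows from the standard log-normal moment formulas; the sketch below shows this mapping (the bias-corrected transformation of Shi et al., 2020b, yields slightly different values):

```r
# Naive log-normal back-transformation: given estimates mu and sigma of the
# mean and SD of the log-transformed data, the standard log-normal moment
# formulas recover the mean and SD of the raw data. Note: conv.fivenum uses
# the bias-corrected variant from Shi et al. (2020b), not this naive mapping.
lnorm_backtransf <- function(mu, sigma) {
  c(mean = exp(mu + sigma^2 / 2),
    sd   = exp(mu + sigma^2 / 2) * sqrt(exp(sigma^2) - 1))
}

lnorm_backtransf(mu = 0, sigma = 1)
#>     mean       sd 
#> 1.648721 2.161197
```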

Alternative Methods

As an alternative to the methods above, one can make use of the methods implemented in the estmeansd package to estimate means and SDs based on the three cases above. Available are the quantile estimation method (method="qe"; using the qe.mean.sd function; McGrath et al., 2020), the Box-Cox method (method="bc"; using the bc.mean.sd function; McGrath et al., 2020), and the method for unknown non-normal distributions (method="mln"; using the mln.mean.sd function; Cai et al., 2021). The advantage of these methods is that they do not assume that the data underlying the reported values are normally distributed (and hence the test argument is ignored), but they can only be used when the values are positive. The exception is the quantile estimation method, which can also be used when one or more of the values are negative; in this case, however, the method does assume that the data are normally distributed and hence the test for skewness is applied when test=TRUE. Note that all of these methods may struggle to provide sensible estimates when some of the values are equal to each other (which can happen when the data include many ties and/or the reported values are rounded). Also, the Box-Cox method and the method for unknown non-normal distributions involve simulated data and hence results will change slightly on repeated runs. Setting the seed of the random number generator (with set.seed) ensures reproducibility.

Finally, by setting method="blue", one can make use of the BLUE_s function from the metaBLUE package to estimate means and SDs based on the three cases above (Yang et al., 2022). The method assumes that the underlying data are normally distributed (and hence the test for skewness is applied when test=TRUE).

Value

If the data argument was not specified or append=FALSE, a data frame with two variables called var.names[1] (by default "mean") and var.names[2] (by default "sd") with the estimated means and SDs.

If data was specified and append=TRUE, then the original data frame is returned. If var.names[1] is a variable in data and replace="ifna" (or replace=FALSE), then only missing values in this variable are replaced with the estimated means (where possible) and otherwise a new variable called var.names[1] is added to the data frame. Similarly, if var.names[2] is a variable in data and replace="ifna" (or replace=FALSE), then only missing values in this variable are replaced with the estimated SDs (where possible) and otherwise a new variable called var.names[2] is added to the data frame.

If replace="all" (or replace=TRUE), then all values in var.names[1] and var.names[2] where an estimated mean and SD can be computed are replaced, even for cases where the value in var.names[1] and var.names[2] is not missing.
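The replacement rules can be pictured with a small base-R sketch (hypothetical vectors for illustration, not the package code):

```r
# Hypothetical illustration of the replacement rules (not the package code)
existing  <- c(5.0, NA, NA, 7.0)     # values already present in var.names[1]
estimated <- c(6.59, 6.72, 6.94, NA) # estimated means (NA = not estimable)

# replace="ifna" (or replace=FALSE): fill in missing values only
ifelse(is.na(existing) & !is.na(estimated), estimated, existing)
#> [1] 5.00 6.72 6.94 7.00

# replace="all" (or replace=TRUE): overwrite wherever an estimate exists
ifelse(!is.na(estimated), estimated, existing)
#> [1] 6.59 6.72 6.94 7.00
```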

When missing values in var.names[1] are replaced, an attribute called "est" is added to the variable, which is a logical vector that is TRUE for values that were estimated. The same is done when missing values in var.names[2] are replaced.

Attributes called "tval", "crit", "sig", and "dist" are also added to var.names[1] corresponding to the test statistic and critical value for the test for skewness, whether the test was significant, and the assumed distribution (for the quantile estimation method, this is the distribution that provides the best fit to the given values).

Note

A word of caution: Under the given distributional assumptions, the estimated means and SDs are approximately unbiased and hence so are any effect size measures computed based on them (assuming a measure is unbiased to begin with when computed with directly reported means and SDs). However, the estimated means and SDs are less precise (i.e., more variable) than directly reported means and SDs (especially under case 1). Computing the sampling variance of a measure with equations that assume directly reported means and SDs will therefore tend to underestimate the actual sampling variance of the measure, giving too much weight to estimates computed based on estimated means and SDs (see also McGrath et al., 2023). It would thus be prudent to treat effect size estimates computed from estimated means and SDs with caution (e.g., by examining in a moderator analysis whether there are systematic differences between studies directly reporting means and SDs and those where the means and SDs needed to be estimated and/or as part of a sensitivity analysis). McGrath et al. (2023) also suggest using bootstrapping to estimate the sampling variance of effect size measures computed based on estimated means and SDs. See also the metamedian package for this purpose.

Also note that the development of methods for estimating means and SDs based on five-number summary values is an active area of research. Currently, when method="default", then this is identical to method="luo/wan/shi", but this might change in the future. For reproducibility, it is therefore recommended to explicitly set method="luo/wan/shi" (or one of the other methods) when running this function.

References

Bland, M. (2015). Estimating mean and standard deviation from the sample size, three quartiles, minimum, and maximum. International Journal of Statistics in Medical Research, 4(1), 57–64. https://doi.org/10.6000/1929-6029.2015.04.01.6

Cai, S., Zhou, J., & Pan, J. (2021). Estimating the sample mean and standard deviation from order statistics and sample size in meta-analysis. Statistical Methods in Medical Research, 30(12), 2701–2719. https://doi.org/10.1177/09622802211047348

Hozo, S. P., Djulbegovic, B. & Hozo, I. (2005). Estimating the mean and variance from the median, range, and the size of a sample. BMC Medical Research Methodology, 5, 13. https://doi.org/10.1186/1471-2288-5-13

Luo, D., Wan, X., Liu, J. & Tong, T. (2016). Optimally estimating the sample mean from the sample size, median, mid-range, and/or mid-quartile range. Statistical Methods in Medical Research, 27(6), 1785–1805. https://doi.org/10.1177/0962280216669183

McGrath, S., Zhao, X., Steele, R., Thombs, B. D., Benedetti, A., & the DEPRESsion Screening Data (DEPRESSD) Collaboration (2020). Estimating the sample mean and standard deviation from commonly reported quantiles in meta-analysis. Statistical Methods in Medical Research, 29(9), 2520–2537. https://doi.org/10.1177/0962280219889080

McGrath, S., Katzenschlager, S., Zimmer, A. J., Seitel, A., Steele, R., & Benedetti, A. (2023). Standard error estimation in meta-analysis of studies reporting medians. Statistical Methods in Medical Research, 32(2), 373–388. https://doi.org/10.1177/09622802221139233

Shi, J., Luo, D., Weng, H., Zeng, X.-T., Lin, L., Chu, H. & Tong, T. (2020a). Optimally estimating the sample standard deviation from the five-number summary. Research Synthesis Methods, 11(5), 641–654. https://doi.org/10.1002/jrsm.1429

Shi, J., Tong, T., Wang, Y. & Genton, M. G. (2020b). Estimating the mean and variance from the five-number summary of a log-normal distribution. Statistics and Its Interface, 13(4), 519–531. https://doi.org/10.4310/sii.2020.v13.n4.a9

Shi, J., Luo, D., Wan, X., Liu, Y., Liu, J., Bian, Z. & Tong, T. (2023). Detecting the skewness of data from the five-number summary and its application in meta-analysis. Statistical Methods in Medical Research. https://doi.org/10.1177/09622802231172043

Walter, S. D. & Yao, X. (2007). Effect sizes can be calculated for studies reporting ranges for outcome variables in systematic reviews. Journal of Clinical Epidemiology, 60(8), 849–852. https://doi.org/10.1016/j.jclinepi.2006.11.003

Wan, X., Wang, W., Liu, J. & Tong, T. (2014). Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Medical Research Methodology, 14, 135. https://doi.org/10.1186/1471-2288-14-135

Yang, X., Hutson, A. D., & Wang, D. (2022). A generalized BLUE approach for combining location and scale information in a meta-analysis. Journal of Applied Statistics, 49(15), 3846–3867. https://doi.org/10.1080/02664763.2021.1967890

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03

See also

escalc for a function to compute various effect size measures based on means and standard deviations.

Examples

# example data frame
dat <- data.frame(case=c(1:3,NA), min=c(2,NA,2,NA), q1=c(NA,4,4,NA),
                  median=c(6,6,6,NA), q3=c(NA,10,10,NA), max=c(14,NA,14,NA),
                  mean=c(NA,NA,NA,7.0), sd=c(NA,NA,NA,4.2), n=c(20,20,20,20))
dat
#>   case min q1 median q3 max mean  sd  n
#> 1    1   2 NA      6 NA  14   NA  NA 20
#> 2    2  NA  4      6 10  NA   NA  NA 20
#> 3    3   2  4      6 10  14   NA  NA 20
#> 4   NA  NA NA     NA NA  NA    7 4.2 20

# note that study 4 provides the mean and SD directly, while studies 1-3 provide five-number
# summary values or a subset thereof (corresponding to cases 1-3 above)

# estimate means/SDs (note: existing values in 'mean' and 'sd' are not touched)
dat <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat)
dat
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.594468 3.211576 20
#> 2    2  NA  4      6 10  NA 6.719500 4.787076 20
#> 3    3   2  4      6 10  14 6.938841 3.679435 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20

# check attributes (none of the tests are significant, so means/SDs are estimated for studies 1-3)
dfround(as.data.frame(attributes(dat$mean)), digits=3)
#>     est  tval  crit   sig dist
#> 1  TRUE 0.333 0.416 FALSE norm
#> 2  TRUE 0.333 0.578 FALSE norm
#> 3  TRUE 0.491 0.666 FALSE norm
#> 4 FALSE    NA    NA    NA norm

# calculate the log transformed coefficient of variation and corresponding sampling variance
dat <- escalc(measure="CVLN", mi=mean, sdi=sd, ni=n, data=dat)
dat
#> 
#>   case min q1 median q3 max     mean       sd  n      yi     vi 
#> 1    1   2 NA      6 NA  14 6.594468 3.211576 20 -0.6932 0.0382 
#> 2    2  NA  4      6 10  NA 6.719500 4.787076 20 -0.3128 0.0517 
#> 3    3   2  4      6 10  14 6.938841 3.679435 20 -0.6081 0.0404 
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20 -0.4845 0.0443 
#> 

# fit equal-effects model to the estimates
res <- rma(yi, vi, data=dat, method="EE")
res
#> 
#> Equal-Effects Model (k = 4)
#> 
#> I^2 (total heterogeneity / total variability):   0.00%
#> H^2 (total variability / sampling variability):  0.60
#> 
#> Test for Heterogeneity:
#> Q(df = 3) = 1.7974, p-val = 0.6155
#> 
#> Model Results:
#> 
#> estimate      se     zval    pval    ci.lb    ci.ub      
#>  -0.5405  0.1038  -5.2092  <.0001  -0.7439  -0.3372  *** 
#> 
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 

# estimated coefficient of variation (with 95% CI)
predict(res, transf=exp, digits=2)
#> 
#>  pred ci.lb ci.ub 
#>  0.58  0.48  0.71 
#> 

############################################################################

# example data frame
dat <- data.frame(case=c(1:3,NA), min=c(2,NA,2,NA), q1=c(NA,4,4,NA),
                  median=c(6,6,6,NA), q3=c(NA,10,10,NA), max=c(14,NA,14,NA),
                  mean=c(NA,NA,NA,7.0), sd=c(NA,NA,NA,4.2), n=c(20,20,20,20))
dat
#>   case min q1 median q3 max mean  sd  n
#> 1    1   2 NA      6 NA  14   NA  NA 20
#> 2    2  NA  4      6 10  NA   NA  NA 20
#> 3    3   2  4      6 10  14   NA  NA 20
#> 4   NA  NA NA     NA NA  NA    7 4.2 20

# try out different methods
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat)
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.594468 3.211576 20
#> 2    2  NA  4      6 10  NA 6.719500 4.787076 20
#> 3    3   2  4      6 10  14 6.938841 3.679435 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20
set.seed(1234)
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="qe")
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.762910 3.803302 20
#> 2    2  NA  4      6 10  NA 7.922472 6.337109 20
#> 3    3   2  4      6 10  14 7.043956 3.852243 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="bc")
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.470861 3.262934 20
#> 2    2  NA  4      6 10  NA 8.135197 6.489805 20
#> 3    3   2  4      6 10  14 7.498873 4.978785 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="mln")
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.436352 3.468935 20
#> 2    2  NA  4      6 10  NA 7.195510 4.972998 20
#> 3    3   2  4      6 10  14 6.782542 4.098021 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="blue")
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.467044 3.187178 20
#> 2    2  NA  4      6 10  NA 6.379034 4.463030 20
#> 3    3   2  4      6 10  14 6.684728 3.571241 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20

############################################################################

# example data frame
dat <- data.frame(case=c(1:3,NA), min=c(2,NA,2,NA), q1=c(NA,4,4,NA),
                  median=c(6,6,6,NA), q3=c(NA,10,14,NA), max=c(14,NA,20,NA),
                  mean=c(NA,NA,NA,7.0), sd=c(NA,NA,NA,4.2), n=c(20,20,20,20))
dat
#>   case min q1 median q3 max mean  sd  n
#> 1    1   2 NA      6 NA  14   NA  NA 20
#> 2    2  NA  4      6 10  NA   NA  NA 20
#> 3    3   2  4      6 14  20   NA  NA 20
#> 4   NA  NA NA     NA NA  NA    7 4.2 20

# for study 3, the third quartile and maximum value suggest that the data have
# a right skewed distribution (they are much further away from the median than
# the minimum and first quartile)

# estimate means/SDs
dat <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat)
dat
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.594468 3.211576 20
#> 2    2  NA  4      6 10  NA 6.719500 4.787076 20
#> 3    3   2  4      6 14  20       NA       NA 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20

# note that the mean and SD are not estimated for study 3; this is because the
# test for skewness is significant for this study
dfround(as.data.frame(attributes(dat$mean)), digits=3)
#>     est  tval  crit   sig dist
#> 1  TRUE 0.333 0.416 FALSE norm
#> 2  TRUE 0.333 0.578 FALSE norm
#> 3 FALSE 0.818 0.666  TRUE norm
#> 4 FALSE    NA    NA    NA norm

# estimate means/SDs, but assume that the data for study 3 come from a log-normal distribution
# and back-transform the estimated mean/SD of the log-transformed data back to the raw data
dat <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat,
                    dist=c("norm","norm","lnorm","norm"), replace="all")
dat
#>   case min q1 median q3 max     mean       sd  n
#> 1    1   2 NA      6 NA  14 6.594468 3.211576 20
#> 2    2  NA  4      6 10  NA 6.719500 4.787076 20
#> 3    3   2  4      6 14  20 8.758702 6.740320 20
#> 4   NA  NA NA     NA NA  NA 7.000000 4.200000 20

# this works now because the test for skewness of the log-transformed data is not significant
dfround(as.data.frame(attributes(dat$mean)), digits=3)
#>     est  tval  crit   sig  dist
#> 1  TRUE 0.333 0.416 FALSE  norm
#> 2  TRUE 0.333 0.578 FALSE  norm
#> 3  TRUE 0.353 0.666 FALSE lnorm
#> 4 FALSE    NA    NA    NA  norm