`conv.fivenum.Rd`

Function to estimate means and standard deviations from five-number summary values.

```
conv.fivenum(min, q1, median, q3, max, n, data, include,
             method="default", dist="norm", transf=TRUE, test=TRUE,
             var.names=c("mean","sd"), append=TRUE, replace="ifna", ...)
```

- min
vector with the minimum values.

- q1
vector with the lower/first quartile values.

- median
vector with the median values.

- q3
vector with the upper/third quartile values.

- max
vector with the maximum values.

- n
vector with the sample sizes.

- data
optional data frame containing the variables given to the arguments above.

- include
optional (logical or numeric) vector to specify the subset of studies for which means and standard deviations should be estimated.

- method
character string indicating the method to use. Either `"default"` (same as `"luo/wan/shi"`, which is the current default), `"qe"`, `"bc"`, `"mln"`, or `"blue"`. Can be abbreviated. See ‘Details’.

- dist
character string indicating the distribution assumed for the underlying data (either `"norm"` for a normal distribution or `"lnorm"` for a log-normal distribution). Can also be a string vector if different distributions are assumed for different studies. Only relevant when `method="default"`.

- transf
logical to specify whether the estimated means and standard deviations of the log-transformed data should be back-transformed as described by Shi et al. (2020b) (the default is `TRUE`). Only relevant when `dist="lnorm"` and when `method="default"`.

- test
logical to specify whether a study should be excluded from the estimation if the test for skewness is significant (the default is `TRUE`, but whether this is applicable depends on the method; see ‘Details’).

- var.names
character vector with two elements to specify the name of the variable for the estimated means and the name of the variable for the estimated standard deviations (the defaults are `"mean"` and `"sd"`).

- append
logical to specify whether the data frame provided via the `data` argument should be returned together with the estimated values (the default is `TRUE`).

- replace
character string or logical to specify how values in `var.names` should be replaced (only relevant when using the `data` argument and if variables in `var.names` already exist in the data frame). See the ‘Value’ section for more details.

- ...
other arguments.

Various effect size measures require means and standard deviations (SDs) as input (e.g., raw or standardized mean differences, ratios of means / response ratios; see `escalc` for further details). For some studies, authors may not report means and SDs, but other statistics, such as the so-called ‘five-number summary’, consisting of the minimum, lower/first quartile, median, upper/third quartile, and the maximum of the sample values (plus the sample sizes). Occasionally, only a subset of these values is reported.

The present function can be used to estimate means and standard deviations from five-number summary values based on various methods described in the literature (Bland, 2015; Cai et al., 2021; Hozo et al., 2005; Luo et al., 2016; McGrath et al., 2020; Shi et al., 2020a; Walter & Yao, 2007; Wan et al., 2014; Yang et al., 2022).

When `method="default"` (which is the same as `"luo/wan/shi"`), the following methods are used:

1. If only the minimum, median, and maximum are available for a study (plus the sample size), the function uses the method by Luo et al. (2016), equation (7), to estimate the mean and the method by Wan et al. (2014), equation (9), to estimate the SD.

2. If only the lower/first quartile, median, and upper/third quartile are available for a study (plus the sample size), the function uses the method by Luo et al. (2016), equation (11), to estimate the mean and the method by Wan et al. (2014), equation (16), to estimate the SD.

3. If the full five-number summary is available for a study (plus the sample size), the function uses the method by Luo et al. (2016), equation (15), to estimate the mean and the method by Shi et al. (2020a), equation (10), to estimate the SD.
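To make the three mean estimators concrete, here is a minimal sketch in base R. The function names are hypothetical (they are not part of the package), and the weights are transcribed from Luo et al. (2016) as an assumption; they were checked only against the example output shown on this page, not against the package source.

```r
# Sketch of the three mean estimators (normal case); hypothetical helper names.

# case 1: minimum, median, maximum (Luo et al., 2016, equation 7)
mean_case1 <- function(min, median, max, n) {
   w <- 4 / (4 + n^0.75)                       # weight given to the mid-range
   w * (min + max) / 2 + (1 - w) * median
}

# case 2: first quartile, median, third quartile (equation 11)
mean_case2 <- function(q1, median, q3, n) {
   w <- 0.7 + 0.39 / n                         # weight given to the mid-quartile range
   w * (q1 + q3) / 2 + (1 - w) * median
}

# case 3: full five-number summary (equation 15)
mean_case3 <- function(min, q1, median, q3, max, n) {
   w1 <- 2.2 / (2.2 + n^0.75)                  # weight for the mid-range
   w2 <- 0.7 - 0.72 / n^0.55                   # weight for the mid-quartile range
   w1 * (min + max) / 2 + w2 * (q1 + q3) / 2 + (1 - w1 - w2) * median
}

# values of studies 1-3 from the example data frame (n = 20)
est1 <- mean_case1(2, 6, 14, n=20)
est2 <- mean_case2(4, 6, 10, n=20)
est3 <- mean_case3(2, 4, 6, 10, 14, n=20)
round(c(est1, est2, est3), 6)
```

With these inputs the sketch gives 6.594468, 6.719500, and 6.938841, which agree with the means estimated for studies 1-3 in the examples.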

The median is not actually needed in the methods by Wan et al. (2014) and Shi et al. (2020a) and hence it is possible to estimate the SD even if the median is unavailable (this can be useful if a study reports the mean directly, but instead of the SD, it reports the minimum/maximum and/or first/third quartile values).
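The SD estimators can be sketched in the same way, and the sketch makes it visible that the median does not enter any of them. Again, the helper names are hypothetical and the constants are transcribed from Wan et al. (2014) and Shi et al. (2020a) as an assumption, checked only against the example output on this page.

```r
# Sketch of the three SD estimators (normal case); hypothetical helper names.

# expected range and interquartile range of n standard normal values (approximations)
xi  <- function(n) 2 * qnorm((n - 0.375) / (n + 0.25))
eta <- function(n) 2 * qnorm((0.75 * n - 0.125) / (n + 0.25))

# case 1: range based (Wan et al., 2014, equation 9)
sd_case1 <- function(min, max, n) (max - min) / xi(n)

# case 2: interquartile range based (Wan et al., 2014, equation 16)
sd_case2 <- function(q1, q3, n) (q3 - q1) / eta(n)

# case 3: weighted combination of both (Shi et al., 2020a, equation 10)
sd_case3 <- function(min, q1, q3, max, n) {
   w <- 1 / (1 + 0.07 * n^0.6)
   w * (max - min) / xi(n) + (1 - w) * (q3 - q1) / eta(n)
}

# values of studies 1-3 from the example data frame (n = 20); no median needed
sd1 <- sd_case1(2, 14, n=20)
sd2 <- sd_case2(4, 10, n=20)
sd3 <- sd_case3(2, 4, 10, 14, n=20)
round(c(sd1, sd2, sd3), 6)
```

The results match the SDs estimated for studies 1-3 in the examples (3.211576, 4.787076, and 3.679435).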

Note that the sample size must be at least 5 to apply these methods. Studies where the sample size is smaller are not included in the estimation. The function also checks that `min <= q1 <= median <= q3 <= max` and throws an error if any studies are found where this is not the case.
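Such input checks can be run on a dataset before calling the function. The sketch below uses two hypothetical helpers (not part of the package); it assumes that missing values are simply skipped when checking the ordering, and it flags offending rows instead of throwing an error.

```r
# Hypothetical pre-checks; the package function performs its own checks internally.

# TRUE for rows whose non-missing summary values are in non-decreasing order
order_ok <- function(min, q1, median, q3, max) {
   vals <- cbind(min, q1, median, q3, max)
   apply(vals, 1, function(x) !is.unsorted(x[!is.na(x)]))
}

# TRUE for rows whose sample size is large enough for the default methods
size_ok <- function(n) !is.na(n) & n >= 5

# third row has q3 < median, fourth row has n < 5
ord <- order_ok(min=c(2,NA,2,1), q1=c(NA,4,4,2), median=c(6,6,6,3),
                q3=c(NA,10,3,4), max=c(14,NA,14,5))
siz <- size_ok(c(20,20,20,4))
which(!ord | !siz)   # rows that need attention before estimation
```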

The methods described above were derived under the assumption that the data are normally distributed. Testing this assumption would require access to the raw data, but based on the three cases above, Shi et al. (2023) derived tests for skewness that only require the reported quantile values and the sample sizes. These tests are automatically carried out. When `test=TRUE` (which is the default), a study is automatically excluded from the estimation if the test is significant. If all studies should be included, set `test=FALSE`, but note that the accuracy of the methods will tend to be poorer when the data come from an apparently skewed (and hence non-normal) distribution.

When setting `dist="lnorm"`, the raw data are assumed to follow a log-normal distribution. In this case, the methods as described by Shi et al. (2020b) are used to estimate the mean and SD of the log-transformed data for the three cases above. When `transf=TRUE` (the default), the estimated mean and SD of the log-transformed data are back-transformed to the estimated mean and SD of the raw data (using the bias-corrected back-transformation as described by Shi et al., 2020b). Note that the test for skewness is also carried out when `dist="lnorm"`, but now testing if the log-transformed data exhibit skewness.

As an alternative to the methods above, one can make use of the methods implemented in the estmeansd package to estimate means and SDs based on the three cases above. Available are the quantile estimation method (`method="qe"`; using the `qe.mean.sd` function; McGrath et al., 2020), the Box-Cox method (`method="bc"`; using the `bc.mean.sd` function; McGrath et al., 2020), and the method for unknown non-normal distributions (`method="mln"`; using the `mln.mean.sd` function; Cai et al., 2021). The advantage of these methods is that they do not assume that the data underlying the reported values are normally distributed (and hence the `test` argument is ignored), but they can only be used when the values are positive (except for the quantile estimation method, which can also be used when one or more of the values are negative, but in this case the method does assume that the data are normally distributed and hence the test for skewness is applied when `test=TRUE`). Note that all of these methods may struggle to provide sensible estimates when some of the values are equal to each other (which can happen when the data include a lot of ties and/or the reported values are rounded). Also, the Box-Cox method and the method for unknown non-normal distributions involve simulated data and hence results will slightly change on repeated runs. Setting the seed of the random number generator (with `set.seed`) ensures reproducibility.
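The effect of the seed can be illustrated with base R alone. In this sketch, `rnorm` merely stands in for the simulation step of the Box-Cox and MLN methods (an assumption made for illustration; the actual simulation code lives in the estmeansd package).

```r
# rnorm() is a stand-in for any computation that draws random numbers

set.seed(1234)
draw1 <- rnorm(3)

draw2 <- rnorm(3)      # no seed reset: continues the random number stream

set.seed(1234)
draw3 <- rnorm(3)      # same seed as for draw1: reproduces the same values

identical(draw1, draw3)   # TRUE
identical(draw1, draw2)   # FALSE
```

In practice, this means calling `set.seed` immediately before each call that uses `method="bc"` or `method="mln"`.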

Finally, by setting `method="blue"`, one can make use of the `BLUE_s` function from the metaBLUE package to estimate means and SDs based on the three cases above (Yang et al., 2022). The method assumes that the underlying data are normally distributed (and hence the test for skewness is applied when `test=TRUE`).

If the `data` argument was not specified or `append=FALSE`, the function returns a data frame with two variables called `var.names[1]` (by default `"mean"`) and `var.names[2]` (by default `"sd"`) containing the estimated means and SDs.

If `data` was specified and `append=TRUE`, then the original data frame is returned. If `var.names[1]` is a variable in `data` and `replace="ifna"` (or `replace=FALSE`), then only missing values in this variable are replaced with the estimated means (where possible) and otherwise a new variable called `var.names[1]` is added to the data frame. Similarly, if `var.names[2]` is a variable in `data` and `replace="ifna"` (or `replace=FALSE`), then only missing values in this variable are replaced with the estimated SDs (where possible) and otherwise a new variable called `var.names[2]` is added to the data frame.

If `replace="all"` (or `replace=TRUE`), then all values in `var.names[1]` and `var.names[2]` where an estimated mean and SD can be computed are replaced, even for cases where the value in `var.names[1]` and `var.names[2]` is not missing.
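The difference between the two replacement modes can be sketched with a small helper. The function `fill_estimates` is hypothetical and only mimics the behavior described above for a single variable; it is not the package's implementation.

```r
# Hypothetical helper mimicking the described replacement rules for one variable.
# old: values already in the data frame; est: newly estimated values (NA if not estimable)
fill_estimates <- function(old, est, replace = "ifna") {
   if (isTRUE(replace))  replace <- "all"    # replace=TRUE  is the same as "all"
   if (isFALSE(replace)) replace <- "ifna"   # replace=FALSE is the same as "ifna"
   if (replace == "all") {
      use_est <- !is.na(est)                 # overwrite wherever an estimate exists
   } else {
      use_est <- is.na(old) & !is.na(est)    # only fill in missing values
   }
   old[use_est] <- est[use_est]
   old
}

old <- c(NA, 5.0, 7.0)
est <- c(6.5, 6.7, NA)

res_ifna <- fill_estimates(old, est, replace="ifna")   # 6.5 5.0 7.0
res_all  <- fill_estimates(old, est, replace="all")    # 6.5 6.7 7.0
```

Under `"ifna"`, the directly reported value 5.0 is kept; under `"all"`, it is overwritten by the estimate 6.7. In both modes, the value 7.0 is kept because no estimate is available for it.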

When missing values in `var.names[1]` are replaced, an attribute called `"est"` is added to the variable, which is a logical vector that is `TRUE` for values that were estimated. The same is done when missing values in `var.names[2]` are replaced.

Attributes called `"tval"`, `"crit"`, `"sig"`, and `"dist"` are also added to `var.names[1]` corresponding to the test statistic and critical value for the test for skewness, whether the test was significant, and the assumed distribution (for the quantile estimation method, this is the distribution that provides the best fit to the given values).

**A word of caution:** Under the given distributional assumptions, the estimated means and SDs are approximately unbiased and hence so are any effect size measures computed based on them (assuming a measure is unbiased to begin with when computed with directly reported means and SDs). However, the estimated means and SDs are less precise (i.e., are more variable) than directly reported means and SDs (especially under case 1). Computing the sampling variance of a measure with equations that assume that directly reported means and SDs are available will therefore tend to underestimate the actual sampling variance of the measure, giving too much weight to estimates computed based on estimated means and SDs (see also McGrath et al., 2023). It would therefore be prudent to treat effect size estimates computed from estimated means and SDs with caution (e.g., by examining in a moderator analysis whether there are systematic differences between studies directly reporting means and SDs and those where the means and SDs needed to be estimated and/or as part of a sensitivity analysis). McGrath et al. (2023) also suggest using bootstrapping to estimate the sampling variance of effect size measures computed based on estimated means and SDs. See also the metamedian package for this purpose.

Also note that the development of methods for estimating means and SDs based on five-number summary values is an active area of research. Currently, `method="default"` is identical to `method="luo/wan/shi"`, but this might change in the future. For reproducibility, it is therefore recommended to explicitly set `method="luo/wan/shi"` (or one of the other methods) when running this function.

Bland, M. (2015). Estimating mean and standard deviation from the sample size, three quartiles, minimum, and maximum. *International Journal of Statistics in Medical Research*, **4**(1), 57–64. https://doi.org/10.6000/1929-6029.2015.04.01.6

Cai, S., Zhou, J., & Pan, J. (2021). Estimating the sample mean and standard deviation from order statistics and sample size in meta-analysis. *Statistical Methods in Medical Research*, **30**(12), 2701–2719. https://doi.org/10.1177/09622802211047348

Hozo, S. P., Djulbegovic, B. & Hozo, I. (2005). Estimating the mean and variance from the median, range, and the size of a sample. *BMC Medical Research Methodology*, **5**, 13. https://doi.org/10.1186/1471-2288-5-13

Luo, D., Wan, X., Liu, J. & Tong, T. (2016). Optimally estimating the sample mean from the sample size, median, mid-range, and/or mid-quartile range. *Statistical Methods in Medical Research*, **27**(6), 1785–1805. https://doi.org/10.1177/0962280216669183

McGrath, S., Zhao, X., Steele, R., Thombs, B. D., Benedetti, A., & the DEPRESsion Screening Data (DEPRESSD) Collaboration (2020). Estimating the sample mean and standard deviation from commonly reported quantiles in meta-analysis. *Statistical Methods in Medical Research*, **29**(9), 2520–2537. https://doi.org/10.1177/0962280219889080

McGrath, S., Katzenschlager, S., Zimmer, A. J., Seitel, A., Steele, R., & Benedetti, A. (2023). Standard error estimation in meta-analysis of studies reporting medians. *Statistical Methods in Medical Research*, **32**(2), 373–388. https://doi.org/10.1177/09622802221139233

Shi, J., Luo, D., Weng, H., Zeng, X.-T., Lin, L., Chu, H. & Tong, T. (2020a). Optimally estimating the sample standard deviation from the five-number summary. *Research Synthesis Methods*, **11**(5), 641–654. https://doi.org/10.1002/jrsm.1429

Shi, J., Tong, T., Wang, Y. & Genton, M. G. (2020b). Estimating the mean and variance from the five-number summary of a log-normal distribution. *Statistics and Its Interface*, **13**(4), 519–531. https://doi.org/10.4310/sii.2020.v13.n4.a9

Shi, J., Luo, D., Wan, X., Liu, Y., Liu, J., Bian, Z. & Tong, T. (2023). Detecting the skewness of data from the five-number summary and its application in meta-analysis. *Statistical Methods in Medical Research*. https://doi.org/10.1177/09622802231172043

Walter, S. D. & Yao, X. (2007). Effect sizes can be calculated for studies reporting ranges for outcome variables in systematic reviews. *Journal of Clinical Epidemiology*, **60**(8), 849–852. https://doi.org/10.1016/j.jclinepi.2006.11.003

Wan, X., Wang, W., Liu, J. & Tong, T. (2014). Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. *BMC Medical Research Methodology*, **14**, 135. https://doi.org/10.1186/1471-2288-14-135

Yang, X., Hutson, A. D., & Wang, D. (2022). A generalized BLUE approach for combining location and scale information in a meta-analysis. *Journal of Applied Statistics*, **49**(15), 3846–3867. https://doi.org/10.1080/02664763.2021.1967890

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. *Journal of Statistical Software*, **36**(3), 1–48. https://doi.org/10.18637/jss.v036.i03

`escalc` for a function to compute various effect size measures based on means and standard deviations.

```
# example data frame
dat <- data.frame(case=c(1:3,NA), min=c(2,NA,2,NA), q1=c(NA,4,4,NA),
                  median=c(6,6,6,NA), q3=c(NA,10,10,NA), max=c(14,NA,14,NA),
                  mean=c(NA,NA,NA,7.0), sd=c(NA,NA,NA,4.2), n=c(20,20,20,20))
dat
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 NA NA 20
#> 2 2 NA 4 6 10 NA NA NA 20
#> 3 3 2 4 6 10 14 NA NA 20
#> 4 NA NA NA NA NA NA 7 4.2 20
# note that study 4 provides the mean and SD directly, while studies 1-3 provide five-number
# summary values or a subset thereof (corresponding to cases 1-3 above)
# estimate means/SDs (note: existing values in 'mean' and 'sd' are not touched)
dat <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat)
dat
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.594468 3.211576 20
#> 2 2 NA 4 6 10 NA 6.719500 4.787076 20
#> 3 3 2 4 6 10 14 6.938841 3.679435 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
# check attributes (none of the tests are significant, so means/SDs are estimated for studies 1-3)
dfround(data.frame(attributes(dat$mean)), digits=3)
#> est tval crit sig dist
#> 1 TRUE 0.333 0.416 FALSE norm
#> 2 TRUE 0.333 0.578 FALSE norm
#> 3 TRUE 0.491 0.666 FALSE norm
#> 4 FALSE NA NA NA norm
# calculate the log transformed coefficient of variation and corresponding sampling variance
dat <- escalc(measure="CVLN", mi=mean, sdi=sd, ni=n, data=dat)
dat
#>
#> case min q1 median q3 max mean sd n yi vi
#> 1 1 2 NA 6 NA 14 6.594468 3.211576 20 -0.6932 0.0382
#> 2 2 NA 4 6 10 NA 6.719500 4.787076 20 -0.3128 0.0517
#> 3 3 2 4 6 10 14 6.938841 3.679435 20 -0.6081 0.0404
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20 -0.4845 0.0443
#>
# fit equal-effects model to the estimates
res <- rma(yi, vi, data=dat, method="EE")
res
#>
#> Equal-Effects Model (k = 4)
#>
#> I^2 (total heterogeneity / total variability): 0.00%
#> H^2 (total variability / sampling variability): 0.60
#>
#> Test for Heterogeneity:
#> Q(df = 3) = 1.7974, p-val = 0.6155
#>
#> Model Results:
#>
#> estimate se zval pval ci.lb ci.ub
#> -0.5405 0.1038 -5.2092 <.0001 -0.7439 -0.3372 ***
#>
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
# estimated coefficient of variation (with 95% CI)
predict(res, transf=exp, digits=2)
#>
#> pred ci.lb ci.ub
#> 0.58 0.48 0.71
#>
############################################################################
# example data frame
dat <- data.frame(case=c(1:3,NA), min=c(2,NA,2,NA), q1=c(NA,4,4,NA),
                  median=c(6,6,6,NA), q3=c(NA,10,10,NA), max=c(14,NA,14,NA),
                  mean=c(NA,NA,NA,7.0), sd=c(NA,NA,NA,4.2), n=c(20,20,20,20))
dat
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 NA NA 20
#> 2 2 NA 4 6 10 NA NA NA 20
#> 3 3 2 4 6 10 14 NA NA 20
#> 4 NA NA NA NA NA NA 7 4.2 20
# try out different methods
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat)
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.594468 3.211576 20
#> 2 2 NA 4 6 10 NA 6.719500 4.787076 20
#> 3 3 2 4 6 10 14 6.938841 3.679435 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
set.seed(1234)
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="qe")
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.762910 3.803302 20
#> 2 2 NA 4 6 10 NA 7.922472 6.337109 20
#> 3 3 2 4 6 10 14 7.043956 3.852243 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="bc")
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.470861 3.262934 20
#> 2 2 NA 4 6 10 NA 8.135197 6.489805 20
#> 3 3 2 4 6 10 14 7.498873 4.978785 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="mln")
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.436352 3.468935 20
#> 2 2 NA 4 6 10 NA 7.195510 4.972998 20
#> 3 3 2 4 6 10 14 6.782542 4.098021 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat, method="blue")
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.467044 3.187178 20
#> 2 2 NA 4 6 10 NA 6.379034 4.463030 20
#> 3 3 2 4 6 10 14 6.684728 3.571241 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
############################################################################
# example data frame
dat <- data.frame(case=c(1:3,NA), min=c(2,NA,2,NA), q1=c(NA,4,4,NA),
                  median=c(6,6,6,NA), q3=c(NA,10,14,NA), max=c(14,NA,20,NA),
                  mean=c(NA,NA,NA,7.0), sd=c(NA,NA,NA,4.2), n=c(20,20,20,20))
dat
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 NA NA 20
#> 2 2 NA 4 6 10 NA NA NA 20
#> 3 3 2 4 6 14 20 NA NA 20
#> 4 NA NA NA NA NA NA 7 4.2 20
# for study 3, the third quartile and maximum value suggest that the data have
# a right skewed distribution (they are much further away from the median than
# the minimum and first quartile)
# estimate means/SDs
dat <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat)
dat
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.594468 3.211576 20
#> 2 2 NA 4 6 10 NA 6.719500 4.787076 20
#> 3 3 2 4 6 14 20 NA NA 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
# note that the mean and SD are not estimated for study 3; this is because the
# test for skewness is significant for this study
dfround(data.frame(attributes(dat$mean)), digits=3)
#> est tval crit sig dist
#> 1 TRUE 0.333 0.416 FALSE norm
#> 2 TRUE 0.333 0.578 FALSE norm
#> 3 FALSE 0.818 0.666 TRUE norm
#> 4 FALSE NA NA NA norm
# estimate means/SDs, but assume that the data for study 3 come from a log-normal distribution
# and back-transform the estimated mean/SD of the log-transformed data back to the raw data
dat <- conv.fivenum(min=min, q1=q1, median=median, q3=q3, max=max, n=n, data=dat,
                    dist=c("norm","norm","lnorm","norm"), replace="all")
dat
#> case min q1 median q3 max mean sd n
#> 1 1 2 NA 6 NA 14 6.594468 3.211576 20
#> 2 2 NA 4 6 10 NA 6.719500 4.787076 20
#> 3 3 2 4 6 14 20 8.758702 6.740320 20
#> 4 NA NA NA NA NA NA 7.000000 4.200000 20
# this works now because the test for skewness of the log-transformed data is not significant
dfround(data.frame(attributes(dat$mean)), digits=3)
#> est tval crit sig dist
#> 1 TRUE 0.333 0.416 FALSE norm
#> 2 TRUE 0.333 0.578 FALSE norm
#> 3 TRUE 0.353 0.666 FALSE lnorm
#> 4 FALSE NA NA NA norm
```