Function to reconstruct the cell frequencies of \(2 \times 2\) tables based on other summary statistics.

conv.2x2(ori, ri, x2i, ni, n1i, n2i, sens, spec, ppv, npv, correct=TRUE, drop01=TRUE, data,
         include, var.names=c("ai","bi","ci","di"), append=TRUE, replace="ifna", ...)

Arguments

ori

optional vector with the odds ratios corresponding to the tables.

ri

optional vector with the phi coefficients corresponding to the tables.

x2i

optional vector with the (signed) chi-square statistics corresponding to the tables.

ni

vector with the total sample sizes.

n1i

vector with the marginal counts for the outcome of interest on the first variable.

n2i

vector with the marginal counts for the outcome of interest on the second variable.

sens

optional vector with the sensitivities corresponding to the tables.

spec

optional vector with the specificities corresponding to the tables.

ppv

optional vector with the positive predictive values corresponding to the tables.

npv

optional vector with the negative predictive values corresponding to the tables.

correct

optional logical (or vector thereof) to specify whether chi-square statistics were computed using Yates's correction for continuity (the default is TRUE).

drop01

logical to specify whether studies where sens, spec, ppv, and/or npv is equal to 0 or 1 should be dropped (the default is TRUE).

data

optional data frame containing the variables given to the arguments above.

include

optional (logical or numeric) vector to specify the subset of studies for which the cell frequencies should be reconstructed.

var.names

character vector with four elements to specify the names of the variables for the reconstructed cell frequencies (the default is c("ai","bi","ci","di")).

append

logical to specify whether the data frame provided via the data argument should be returned together with the reconstructed values (the default is TRUE).

replace

character string or logical to specify how values in var.names should be replaced (only relevant when using the data argument and if variables in var.names already exist in the data frame). See the ‘Value’ section for more details.

...

other arguments.

Details

For meta-analyses based on \(2 \times 2\) table data, the problem often arises that some studies do not directly report the cell frequencies. The present function allows the reconstruction of such tables based on other summary statistics.

In particular, assume that the data of interest for a particular study are of the form:

                        variable 2, outcome +   variable 2, outcome -   total
variable 1, outcome +   ai                      bi                      n1i
variable 1, outcome -   ci                      di
total                   n2i                                             ni

where ai, bi, ci, and di denote the cell frequencies (i.e., the number of individuals falling into a particular category), n1i (i.e., ai+bi) and n2i (i.e., ai+ci) are the marginal totals for the outcome of interest on the first and second variable, respectively, and ni is the total sample size (i.e., ai+bi+ci+di) of the study.

For example, if variable 1 denotes two different groups (e.g., treated versus control) and variable 2 indicates whether a particular outcome of interest has occurred or not (e.g., death, complications, failure to improve under the treatment), then n1i denotes the number of individuals in the treatment group. However, n2i is not the number of individuals in the control group, but the total number of individuals who experienced the outcome of interest on variable 2. Note that the meaning of n2i here therefore differs from its meaning in the escalc function (where n2i denotes ci+di).
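The difference in naming can be made concrete with a few lines of base R. This is only an illustrative sketch using the counts from the first table in the Examples section; the name n2i.escalc is made up for this illustration and is not an argument of any function.

```r
# cell frequencies of a single table (same values as in the Examples section)
ai <- 36; bi <- 86; ci <- 20; di <- 98

ni  <- ai + bi + ci + di   # total sample size
n1i <- ai + bi             # marginal count on variable 1 (first row total)
n2i <- ai + ci             # marginal count on variable 2 (first column total), as used by conv.2x2

# hypothetical name: the quantity that escalc calls n2i (second row total)
n2i.escalc <- ci + di

c(ni=ni, n1i=n1i, n2i=n2i, n2i.escalc=n2i.escalc)
```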

If a study does not report the cell frequencies, but it reports the total sample size (which can be specified via the ni argument), the two marginal counts (which can be specified via the n1i and n2i arguments), and some other statistic corresponding to the table, then it may be possible to reconstruct the cell frequencies. The present function currently allows this for three different cases:

  1. If the odds ratio \[OR = \frac{a_i d_i}{b_i c_i}\] is known, then the cell frequencies can be reconstructed (Bonett, 2007). Odds ratios can be specified via the ori argument.

  2. If the phi coefficient \[\phi = \frac{a_i d_i - b_i c_i}{\sqrt{n_{1i}(n_i-n_{1i})n_{2i}(n_i-n_{2i})}}\] is known, then the cell frequencies can again be reconstructed (own derivation). Phi coefficients can be specified via the ri argument.

  3. If the chi-square statistic from Pearson's chi-square test of independence is known (which can be specified via the x2i argument), then it can be used to recalculate the phi coefficient and hence again the cell frequencies can be reconstructed. However, the chi-square statistic does not carry information about the sign of the phi coefficient. Therefore, values specified via the x2i argument can be positive or negative, which allows the specification of the correct sign. Also, when using a chi-square statistic as input, it is assumed that it was computed using Yates's correction for continuity (unless correct=FALSE). If only the p-value of the chi-square test is reported and not the statistic itself, one can first back-calculate the chi-square statistic using qchisq(<p-value>, df=1, lower.tail=FALSE).
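To see why the total sample size, the two marginal counts, and one such statistic can pin down the table, note that for fixed margins all four cells are functions of ai alone (bi = n1i - ai, ci = n2i - ai, di = ni - n1i - n2i + ai). The following base R sketch works this out from the definitions given above. It is not necessarily how conv.2x2 carries out the computations internally; the helper names fill.table, cells.from.phi, phi.from.chisq, and cells.from.or are hypothetical, and the chi-square helper only covers the case without continuity correction.

```r
# hypothetical helper: fill in the table once ai and the margins are known
fill.table <- function(ai, ni, n1i, n2i) {
   ai <- round(ai)
   c(ai=ai, bi=n1i-ai, ci=n2i-ai, di=ni-n1i-n2i+ai)
}

# phi coefficient: ai*di - bi*ci simplifies to ai*ni - n1i*n2i for fixed margins
cells.from.phi <- function(phi, ni, n1i, n2i) {
   ai <- (n1i*n2i + phi * sqrt(n1i*(ni-n1i)*n2i*(ni-n2i))) / ni
   fill.table(ai, ni, n1i, n2i)
}

# signed chi-square statistic WITHOUT continuity correction: x2 = ni * phi^2
phi.from.chisq <- function(x2, ni)
   sign(x2) * sqrt(abs(x2) / ni)

# odds ratio: or = ai*di / (bi*ci) is a quadratic equation in ai
cells.from.or <- function(or, ni, n1i, n2i) {
   if (or == 1) {
      ai <- n1i*n2i / ni   # independence
   } else {
      A <- 1 - or
      B <- ni - n1i - n2i + or*(n1i + n2i)
      C <- -or*n1i*n2i
      roots <- (-B + c(-1,1) * sqrt(B^2 - 4*A*C)) / (2*A)
      # keep the root for which all four cells are non-negative
      ok <- roots >= max(0, n1i+n2i-ni) & roots <= min(n1i, n2i)
      ai <- roots[ok][1]
   }
   fill.table(ai, ni, n1i, n2i)
}

cells.from.or(2.05, ni=240, n1i=122, n2i=56)
cells.from.phi(0.54, ni=400, n1i=200, n2i=248)
cells.from.phi(phi.from.chisq(5.04, ni=72), ni=72, n1i=42, n2i=40)
```

With these inputs, the three calls give the tables 36, 86, 20, 98; 176, 24, 72, 128; and 28, 14, 12, 18, respectively (5.04 is the chi-square statistic without continuity correction for the last table). Rounding ai to the nearest integer is a choice made in this sketch.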

Typically, the odds ratio, phi coefficient, or chi-square statistic (or its p-value) that can be extracted from a study will be rounded to a certain degree. The calculations underlying the function are exact only for unrounded values. Rounding can therefore introduce some discrepancies between the actual cell frequencies and the reconstructed ones.

If a marginal total is unknown, then external information needs to be used to ‘guestimate’ the number of individuals that experienced the outcome of interest on this variable. Depending on the accuracy of such an estimate, the reconstructed cell frequencies will be more or less accurate and need to be treated with due caution.

The true marginal counts also put constraints on the possible values for the odds ratio, phi coefficient, and chi-square statistic. If a marginal count is replaced by a guestimate which is not compatible with the given statistic, one or more reconstructed cell frequencies may be negative. The function issues a warning if this happens and sets the cell frequencies to NA for such a study.
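The constraint mentioned above can be checked before attempting a reconstruction. For fixed margins, ai must lie between max(0, n1i+n2i-ni) and min(n1i, n2i) for all four cells to be non-negative, which in turn bounds the phi coefficient. The sketch below computes this range in base R; phi.range is a hypothetical helper written for this illustration and not part of the package.

```r
# hypothetical helper: range of phi coefficients attainable for given margins
phi.range <- function(ni, n1i, n2i) {
   ai.min <- max(0, n1i + n2i - ni)   # smallest ai that keeps di >= 0
   ai.max <- min(n1i, n2i)            # largest ai that keeps bi >= 0 and ci >= 0
   denom  <- sqrt(n1i*(ni-n1i)*n2i*(ni-n2i))
   (c(min=ai.min, max=ai.max)*ni - n1i*n2i) / denom
}

phi.range(ni=400, n1i=200, n2i=248)  # a phi of 0.54 falls inside this range
phi.range(ni=400, n1i=200, n2i=50)   # a phi of 0.54 falls outside this range
```

The second call corresponds to the incompatible guestimate in the Examples section, where the reconstruction yields negative cell frequencies.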

If only one of the two marginal counts is unknown but a 95% CI for the odds ratio is also available, then the estimraw package can also be used to reconstruct the corresponding cell frequencies (Di Pietrantonj, 2006; but see Veroniki et al., 2013, for some cautions).

Diagnostic Studies

The present function can also be used to reconstruct \(2 \times 2\) table data for diagnostic studies. Here, the table is assumed to be of the form:

        case   control
test+   ai     bi
test-   ci     di

where ai denotes the number of true positives, bi the number of false positives, ci the number of false negatives, and di the number of true negatives.

If the total sample size (ni = ai + bi + ci + di) and the marginal totals are known (i.e., the number of positive tests, n1i, and the number of cases, n2i), then the diagnostic odds ratio would be sufficient to reconstruct the table and the present function can be used as described above.

On the other hand, if the total sample size of the study (i.e., ni = ai + bi + ci + di), the sensitivity (i.e., sens = ai / (ai + ci)), specificity (i.e., spec = di / (bi + di)), positive predictive value (i.e., ppv = ai / (ai + bi)), and negative predictive value (i.e., npv = di / (ci + di)) are known, then this is also sufficient information to recreate the table. Actually, only three of the four diagnostic accuracy measures are needed to reconstruct the table.
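That three measures suffice can be seen algebraically for unrounded values. For instance, writing ppv in terms of sens, spec, and the proportion of cases gives ppv = sens*p / (sens*p + (1-spec)*(1-p)), which can be solved for p. The sketch below does this in base R for the combination sens, spec, and ppv. The helper table.from.accuracy is hypothetical, assumes exact (unrounded) inputs with a non-zero denominator, and is not the method used by conv.2x2.

```r
# hypothetical helper: exact reconstruction from sens, spec, ppv, and ni
table.from.accuracy <- function(sens, spec, ppv, ni) {
   # proportion of cases implied by the three measures
   p <- ppv*(1-spec) / (sens*(1-ppv) + ppv*(1-spec))
   cases    <- p * ni
   controls <- ni - cases
   round(c(ai = sens*cases,     bi = (1-spec)*controls,
           ci = (1-sens)*cases, di = spec*controls))
}

# values computed from the table with ai=28, bi=5, ci=7, di=18
table.from.accuracy(sens=28/35, spec=18/23, ppv=28/33, ni=58)
```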

In practice, when such accuracy measures are reported, the values are typically rounded to some extent. This introduces inaccuracies into the reconstruction. The present function uses optimization methods to reconstruct the table counts so that the discrepancy between the reported measures and the reconstructed ones is minimized. This is not guaranteed to reconstruct the actual table exactly, but should usually yield a close match, especially if all four measures are available. In some rare cases, the reconstruction may also fail even if all four measures are reported. This often happens if at least one of the reported accuracy measures is equal to 0 or 1. By default (i.e., when drop01=TRUE), such studies are automatically dropped (i.e., the reconstructed cell frequencies are set to NA).
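The idea of minimizing the discrepancy between reported and reconstructed measures can be illustrated with a simple exhaustive search over all tables with the given total. This is a sketch of the general idea only; the optimization method and the discrepancy measure actually used by conv.2x2 are not described here and may differ. The helper search.table and the choice of a sum of squared differences are assumptions made for this illustration.

```r
# hypothetical helper: exhaustive search over all 2x2 tables with total ni
search.table <- function(sens, spec, ppv, npv, ni) {
   grid <- expand.grid(ai=0:ni, bi=0:ni, ci=0:ni)
   grid$di <- ni - grid$ai - grid$bi - grid$ci
   grid <- grid[grid$di >= 0,]
   # sum of squared differences between implied and reported measures
   loss <- with(grid, (ai/(ai+ci) - sens)^2 + (di/(bi+di) - spec)^2 +
                      (ai/(ai+bi) - ppv)^2  + (di/(ci+di) - npv)^2)
   # which.min() skips tables where a measure is undefined (0/0)
   unlist(grid[which.min(loss), c("ai","bi","ci","di")])
}

# with unrounded inputs, the table with ai=28, bi=5, ci=7, di=18 is recovered
search.table(sens=28/35, spec=18/23, ppv=28/33, npv=18/25, ni=58)
```

Such a search is only practical for small ni, since the number of candidate tables grows roughly with the cube of the sample size.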

Value

If the data argument was not specified or append=FALSE, a data frame with four variables (named as specified via var.names) containing the reconstructed cell frequencies.

If data was specified and append=TRUE, then the original data frame is returned with the reconstructed cell frequencies included. If var.names[j] (for \(\text{j} \in \{1, \ldots, 4\}\)) is already a variable in data and replace="ifna" (or replace=FALSE), then only missing values in this variable are replaced with the estimated frequencies (where possible). If var.names[j] is not a variable in data, then a new variable called var.names[j] is added to the data frame.

If replace="all" (or replace=TRUE), then all values in var.names[j] where a reconstructed cell frequency can be computed are replaced, even for cases where the value in var.names[j] is not missing.
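The two replacement rules can be mimicked on plain vectors. The following base R lines only illustrate the rules as described in this section and are not the package's code; the vectors existing and reconstructed contain made-up values.

```r
existing      <- c(36, NA, 40)   # values already stored in var.names[j]
reconstructed <- c(35, 36, NA)   # reconstructed values (NA = reconstruction not possible)

# replace="ifna": only fill in values that are missing
res.ifna <- ifelse(is.na(existing), reconstructed, existing)

# replace="all": overwrite wherever a reconstructed value is available
res.all <- ifelse(is.na(reconstructed), existing, reconstructed)

rbind(res.ifna, res.all)
```

In this illustration, the first element differs between the two rules (36 is kept under "ifna" but overwritten with 35 under "all"), while the second and third elements are the same under both.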

References

Bonett, D. G. (2007). Transforming odds ratios into correlations for meta-analytic research. American Psychologist, 62(3), 254–255. https://doi.org/10.1037/0003-066x.62.3.254

Di Pietrantonj, C. (2006). Four-fold table cell frequencies imputation in meta analysis. Statistics in Medicine, 25(13), 2299–2322. https://doi.org/10.1002/sim.2287

Veroniki, A. A., Pavlides, M., Patsopoulos, N. A., & Salanti, G. (2013). Reconstructing 2 x 2 contingency tables from odds ratios using the Di Pietrantonj method: Difficulties, constraints and impact in meta-analysis results. Research Synthesis Methods, 4(1), 78–94. https://doi.org/10.1002/jrsm.1061

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://doi.org/10.18637/jss.v036.i03

See also

escalc for a function to compute various effect size measures based on \(2 \times 2\) table data.

Examples

############################################################################

### demonstration that the reconstruction of the 2x2 table works
### (note: the values in rows 2, 3, and 4 correspond to those in row 1)
dat <- data.frame(ai=c(36,NA,NA,NA), bi=c(86,NA,NA,NA), ci=c(20,NA,NA,NA), di=c(98,NA,NA,NA),
                  oddsratio=NA, phi=NA, chisq=NA, ni=NA, n1i=NA, n2i=NA)
dat$oddsratio[2] <- round(exp(escalc(measure="OR", ai=ai, bi=bi, ci=ci, di=di, data=dat)$yi[1]), 2)
dat$phi[3] <- round(escalc(measure="PHI", ai=ai, bi=bi, ci=ci, di=di, data=dat)$yi[1], 2)
dat$chisq[4] <- round(chisq.test(matrix(c(t(dat[1,1:4])), nrow=2, byrow=TRUE))$statistic, 2)
dat$ni[2:4]  <- with(dat, ai[1] + bi[1] + ci[1] + di[1])
dat$n1i[2:4] <- with(dat, ai[1] + bi[1])
dat$n2i[2:4] <- with(dat, ai[1] + ci[1])
dat
#>   ai bi ci di oddsratio  phi chisq  ni n1i n2i
#> 1 36 86 20 98        NA   NA    NA  NA  NA  NA
#> 2 NA NA NA NA      2.05   NA    NA 240 122  56
#> 3 NA NA NA NA        NA 0.15    NA 240 122  56
#> 4 NA NA NA NA        NA   NA  4.61 240 122  56

### reconstruct cell frequencies for rows 2, 3, and 4
dat <- conv.2x2(ri=phi, ori=oddsratio, x2i=chisq, ni=ni, n1i=n1i, n2i=n2i, data=dat)
dat
#>   ai bi ci di oddsratio  phi chisq  ni n1i n2i
#> 1 36 86 20 98        NA   NA    NA  NA  NA  NA
#> 2 36 86 20 98      2.05   NA    NA 240 122  56
#> 3 36 86 20 98        NA 0.15    NA 240 122  56
#> 4 36 86 20 98        NA   NA  4.61 240 122  56

### same example but with cell frequencies that are 10 times as large
dat <- data.frame(ai=c(360,NA,NA,NA), bi=c(860,NA,NA,NA), ci=c(200,NA,NA,NA), di=c(980,NA,NA,NA),
                  oddsratio=NA, phi=NA, chisq=NA, ni=NA, n1i=NA, n2i=NA)
dat$oddsratio[2] <- round(exp(escalc(measure="OR", ai=ai, bi=bi, ci=ci, di=di, data=dat)$yi[1]), 2)
dat$phi[3] <- round(escalc(measure="PHI", ai=ai, bi=bi, ci=ci, di=di, data=dat)$yi[1], 2)
dat$chisq[4] <- round(chisq.test(matrix(c(t(dat[1,1:4])), nrow=2, byrow=TRUE))$statistic, 2)
dat$ni[2:4]  <- with(dat, ai[1] + bi[1] + ci[1] + di[1])
dat$n1i[2:4] <- with(dat, ai[1] + bi[1])
dat$n2i[2:4] <- with(dat, ai[1] + ci[1])
dat <- conv.2x2(ri=phi, ori=oddsratio, x2i=chisq, ni=ni, n1i=n1i, n2i=n2i, data=dat)
dat # slight inaccuracy in row 3 due to rounding
#>    ai  bi  ci  di oddsratio  phi chisq   ni  n1i n2i
#> 1 360 860 200 980        NA   NA    NA   NA   NA  NA
#> 2 360 860 200 980      2.05   NA    NA 2400 1220 560
#> 3 361 859 199 981        NA 0.15    NA 2400 1220 560
#> 4 360 860 200 980        NA   NA 52.19 2400 1220 560

### demonstrate what happens when a true marginal count is guestimated
escalc(measure="PHI", ai=176, bi=24, ci=72, di=128)
#> 
#>       yi     vi 
#> 1 0.5357 0.0017 
#> 
conv.2x2(ri=0.54, ni=400, n1i=200, n2i=248) # using the true marginal counts
#>    ai bi ci  di
#> 1 176 24 72 128
conv.2x2(ri=0.54, ni=400, n1i=200, n2i=200) # marginal count for variable 2 is guestimated
#>    ai bi ci  di
#> 1 154 46 46 154
conv.2x2(ri=0.54, ni=400, n1i=200, n2i=50)  # marginal count for variable 2 is incompatible
#> Warning: There are negative cell frequencies in table 1.
#>   ai bi ci di
#> 1 NA NA NA NA

### demonstrate that using the correct sign for the chi-square statistic is important
chisq <- round(chisq.test(matrix(c(40,60,60,40), nrow=2, byrow=TRUE))$statistic, 2)
conv.2x2(x2i=-chisq, ni=200, n1i=100, n2i=100) # correct reconstruction
#>   ai bi ci di
#> 1 40 60 60 40
conv.2x2(x2i=chisq, ni=200, n1i=100, n2i=100) # incorrect reconstruction
#>   ai bi ci di
#> 1 60 40 40 60

### demonstrate use of the 'correct' argument
tab <- matrix(c(28,14,12,18), nrow=2, byrow=TRUE)
chisq <- round(chisq.test(tab)$statistic, 2) # chi-square test with Yates' continuity correction
conv.2x2(x2i=chisq, ni=72, n1i=42, n2i=40) # correct reconstruction
#>   ai bi ci di
#> 1 28 14 12 18
chisq <- round(chisq.test(tab, correct=FALSE)$statistic, 2) # without Yates' continuity correction
conv.2x2(x2i=chisq, ni=72, n1i=42, n2i=40) # incorrect reconstruction
#>   ai bi ci di
#> 1 29 13 11 19
conv.2x2(x2i=chisq, ni=72, n1i=42, n2i=40, correct=FALSE) # correct reconstruction
#>   ai bi ci di
#> 1 28 14 12 18

### recalculate chi-square statistic based on p-value
pval <- round(chisq.test(tab)$p.value, 2)
chisq <- qchisq(pval, df=1, lower.tail=FALSE)
conv.2x2(x2i=chisq, ni=72, n1i=42, n2i=40)
#>   ai bi ci di
#> 1 28 14 12 18

############################################################################

### reconstruct the 2x2 table counts for a diagnostic study
tab <- matrix(c(28,5,7,18), nrow=2, byrow=TRUE)
tab
#>      [,1] [,2]
#> [1,]   28    5
#> [2,]    7   18

### reconstruct from the diagnostic odds ratio and the marginals
dor   <- tab[1,1] * tab[2,2] / (tab[1,2] * tab[2,1])
cases <- tab[1,1] + tab[2,1]
pos   <- tab[1,1] + tab[1,2]
n <- sum(tab)
conv.2x2(ori=dor, n1i=pos, n2i=cases, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18

### reconstruct from the diagnostic accuracy measures
sens <- tab[1,1] / sum(tab[,1])
spec <- tab[2,2] / sum(tab[,2])
ppv  <- tab[1,1] / sum(tab[1,])
npv  <- tab[2,2] / sum(tab[2,])
n    <- sum(tab)
conv.2x2(sens=sens, spec=spec, ppv=ppv, npv=npv, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18

### show that only three out of the four diagnostic statistics are needed
conv.2x2(sens=sens, spec=spec, ppv=ppv, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18
conv.2x2(sens=sens, spec=spec, npv=npv, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18
conv.2x2(sens=sens, ppv=ppv, npv=npv, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18
conv.2x2(spec=spec, ppv=ppv, npv=npv, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18

### reconstruct the 2x2 table counts from rounded statistics
dor  <- round(dor,  digits=2)
sens <- round(sens, digits=2)
spec <- round(spec, digits=2)
ppv  <- round(ppv,  digits=2)
npv  <- round(npv,  digits=2)
conv.2x2(ori=dor, n1i=pos, n2i=cases, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18
conv.2x2(sens=sens, spec=spec, ppv=ppv, npv=npv, ni=n)
#>   ai bi ci di
#> 1 28  5  7 18

############################################################################