G Statistic and Chi-Squared Statistic

Gstat computes the G statistic.

chi2stat computes the Pearson chi-squared statistic.

Gstatindep computes the G statistic between the empirical observed joint distribution and the product distribution obtained from its marginals.

chi2statindep computes the Pearson chi-squared statistic of independence.

Gstat(y, freqs, unit=c("log", "log2", "log10"))
chi2stat(y, freqs, unit=c("log", "log2", "log10"))
Gstatindep(y2d, unit=c("log", "log2", "log10"))
chi2statindep(y2d, unit=c("log", "log2", "log10"))

Arguments

y: observed vector of counts.
freqs: vector of expected frequencies (probability mass function). Alternatively, expected counts may be provided.
y2d: matrix of counts.
unit: the unit in which entropy is measured. The default, unit="log", measures entropy in "nats" (natural units). For computing entropy in "bits" set unit="log2".

Details

The observed counts in y and y2d are used to determine the total sample size.

The G statistic equals two times the sample size times the KL divergence between empirical observed frequencies and expected frequencies.
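As a minimal sketch of this relationship, the statistic can be reproduced in base R without the package. The variable names are illustrative, and the sketch assumes natural logarithms and strictly positive counts.

```r
# observed counts and expected frequencies (illustrative values)
y = c(4, 2, 3, 1, 6, 4)
freqs.expected = c(0.10, 0.15, 0.35, 0.05, 0.20, 0.15)

n = sum(y)            # total sample size
freqs.observed = y/n  # empirical observed frequencies

# KL divergence between observed and expected frequencies (in nats)
KL = sum(freqs.observed * log(freqs.observed/freqs.expected))

# G statistic: two times the sample size times the KL divergence
G = 2*n*KL
G
```

For these data the result should agree with the value of stat returned by Gstat(y, freqs.expected).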

The Pearson chi-squared statistic equals sample size times chi-squared divergence between empirical observed frequencies and expected frequencies. It is a quadratic approximation of the G statistic.
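The same kind of hand computation works for the Pearson statistic. The sketch below uses illustrative variable names and shows that the divergence form and the familiar form based on observed and expected counts give the same number.

```r
# observed counts and expected frequencies (illustrative values)
y = c(4, 2, 3, 1, 6, 4)
freqs.expected = c(0.10, 0.15, 0.35, 0.05, 0.20, 0.15)

n = sum(y)            # total sample size
freqs.observed = y/n  # empirical observed frequencies

# chi-squared divergence between observed and expected frequencies
chi2div = sum((freqs.observed - freqs.expected)^2 / freqs.expected)

# Pearson chi-squared statistic: sample size times the divergence
chi2 = n*chi2div

# equivalent form based on observed and expected counts
y.expected = n*freqs.expected
chi2.counts = sum((y - y.expected)^2 / y.expected)

c(chi2, chi2.counts)
```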

The G statistic between the empirical observed joint distribution and the product distribution obtained from its marginals is equal to two times the sample size times mutual information.
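A base R sketch of this identity follows. The names are illustrative, and the code assumes a table without zero cells so that all logarithms are finite.

```r
# contingency table with counts (illustrative values)
y2d = matrix(c(4, 5, 1, 2, 4, 4), ncol = 2)

n = sum(y2d)     # total sample size
freqs2d = y2d/n  # empirical observed joint distribution

# product distribution obtained from the row and column marginals
freqs.indep = outer(rowSums(freqs2d), colSums(freqs2d))

# mutual information: KL divergence between joint and product distribution
MI = sum(freqs2d * log(freqs2d/freqs.indep))

# G statistic of independence
G.indep = 2*n*MI
G.indep
```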

The Pearson chi-squared statistic of independence equals the Pearson chi-squared statistic between the empirical observed joint distribution and the product distribution obtained from its marginals. It is a quadratic approximation of the corresponding G statistic.
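The statistic of independence can likewise be sketched in base R from the expected counts under independence. The names are illustrative, and the degrees of freedom use the standard (rows - 1) * (columns - 1) rule for a two-way table, which is assumed here rather than taken from the package source.

```r
# contingency table with counts (illustrative values)
y2d = matrix(c(4, 5, 1, 2, 4, 4), ncol = 2)

n = sum(y2d)  # total sample size

# expected counts under independence: row total times column total over n
y.indep = outer(rowSums(y2d), colSums(y2d))/n

# Pearson chi-squared statistic of independence
chi2.indep = sum((y2d - y.indep)^2 / y.indep)

# degrees of freedom for a two-way table (standard rule, assumed here)
df = (nrow(y2d) - 1)*(ncol(y2d) - 1)

c(chi2.indep, df)
```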

The G statistic and the Pearson chi-squared statistic are asymptotically chi-squared distributed, which allows the corresponding p-values to be computed.
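A p-value of this kind is the upper tail probability of the chi-squared distribution. The sketch below uses the base R function pchisq with a statistic and degrees of freedom copied from the first example on this page; whether the package computes it in exactly this way is an assumption.

```r
# test statistic and degrees of freedom (values copied from the first example)
stat = 6.006568
df = 5

# upper tail probability of the chi-squared distribution
pval = pchisq(stat, df = df, lower.tail = FALSE)
pval
```

Because the distribution is only asymptotic, the p-value may be unreliable for small expected counts, which is what the chisq.test warning in the examples refers to.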

Value

A list containing the test statistic stat, the degrees of freedom df, and the p-value pval computed from them.

Author

Korbinian Strimmer (https://strimmerlab.github.io).

Examples

# load entropy package
library("entropy")

## one discrete random variable

# observed counts in each class
y = c(4, 2, 3, 1, 6, 4)
n = sum(y) # 20

# expected frequencies and counts
freqs.expected = c(0.10, 0.15, 0.35, 0.05, 0.20, 0.15)
y.expected = n*freqs.expected


# G statistic (with p-value) 
Gstat(y, freqs.expected) # from expected frequencies
#> $stat
#> [1] 6.006568
#> 
#> $df
#> [1] 5
#> 
#> $pval
#> [1] 0.3055804
#> 
Gstat(y, y.expected) # alternatively from expected counts
#> $stat
#> [1] 6.006568
#> 
#> $df
#> [1] 5
#> 
#> $pval
#> [1] 0.3055804
#> 

# G statistic computed from empirical KL divergence
2*n*KL.empirical(y, y.expected)
#> [1] 6.006568


# Pearson chi-squared statistic (with p-value)
# this can be viewed as an approximation of the G statistic
chi2stat(y, freqs.expected) # from expected frequencies
#> $stat
#> [1] 5.952381
#> 
#> $df
#> [1] 5
#> 
#> $pval
#> [1] 0.3108801
#> 
chi2stat(y, y.expected) # alternatively from expected counts
#> $stat
#> [1] 5.952381
#> 
#> $df
#> [1] 5
#> 
#> $pval
#> [1] 0.3108801
#> 

# computed from empirical chi-squared divergence
n*chi2.empirical(y, y.expected)
#> [1] 5.952381

# compare with built-in function
chisq.test(y, p = freqs.expected) 
#> Warning: Chi-squared approximation may be incorrect
#> 
#> 	Chi-squared test for given probabilities
#> 
#> data:  y
#> X-squared = 5.9524, df = 5, p-value = 0.3109
#> 


## joint distribution of two discrete random variables

# contingency table with counts
y.mat = matrix(c(4, 5, 1, 2, 4, 4), ncol = 2)  # 3x2 example matrix of counts
n.mat = sum(y.mat) # 20


# G statistic between empirical observed joint distribution and product distribution
Gstatindep( y.mat )
#> $stat
#> [1] 2.718385
#> 
#> $df
#> [1] 2
#> 
#> $pval
#> [1] 0.2568682
#> 

# computed from empirical mutual information
2*n.mat*mi.empirical(y.mat)
#> [1] 2.718385


# Pearson chi-squared statistic of independence
chi2statindep( y.mat )
#> $stat
#> [1] 2.577778
#> 
#> $df
#> [1] 2
#> 
#> $pval
#> [1] 0.2755768
#> 

# computed from empirical chi-squared divergence of independence
n.mat*chi2indep.empirical(y.mat)
#> [1] 2.577778

# compare with built-in function
chisq.test(y.mat) 
#> Warning: Chi-squared approximation may be incorrect
#> 
#> 	Pearson's Chi-squared test
#> 
#> data:  y.mat
#> X-squared = 2.5778, df = 2, p-value = 0.2756
#>