This function calculates various association metrics (PMI, Dice's Coefficient, G-score) for bigrams in a given corpus.
calc_assoc_metrics(
data,
doc_index,
token_index,
type,
association = "all",
verbose = FALSE
)
data: A data frame containing the corpus.
doc_index: Column in 'data' which represents the document index.
token_index: Column in 'data' which represents the token index.
type: Column in 'data' which represents the tokens or terms.
association: A character vector specifying which metrics to calculate. Can be any combination of 'pmi', 'dice_coeff', 'g_score', or 'all'. Default is 'all'.
verbose: A logical value indicating whether to keep the intermediate probability columns. Default is FALSE.
A data frame with one row per bigram and columns for each calculated metric.
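To make the metrics concrete, the following is a minimal base-R sketch of the textbook definitions of PMI and Dice's coefficient on a toy token vector. It is not the package's implementation: the `tokens` vector and all variable names are hypothetical, probabilities are estimated from bigram-position counts (an assumption that may differ from how `calc_assoc_metrics()` estimates them), and the G-score is left out because its definition varies between sources.

```r
# Toy corpus: a single document, tokens in reading order (hypothetical data).
tokens <- c("a", "b", "a", "b", "c")

# Adjacent pairs: x is the first token of each bigram, y the second.
x <- head(tokens, -1)
y <- tail(tokens, -1)

# Count each observed bigram.
bigrams <- as.data.frame(table(x = x, y = y), stringsAsFactors = FALSE)
bigrams <- bigrams[bigrams$Freq > 0, ]
names(bigrams)[names(bigrams) == "Freq"] <- "n"

# Marginal counts: how often each token occurs in first / second position.
n_total <- sum(bigrams$n)
f_x <- tapply(bigrams$n, bigrams$x, sum)
f_y <- tapply(bigrams$n, bigrams$y, sum)

# Probability estimates (assumption: relative frequencies over bigrams).
p_xy <- bigrams$n / n_total
p_x <- as.numeric(f_x[bigrams$x]) / n_total
p_y <- as.numeric(f_y[bigrams$y]) / n_total

# Textbook PMI and Dice's coefficient.
bigrams$pmi <- log(p_xy / (p_x * p_y))
bigrams$dice_coeff <- 2 * p_xy / (p_x + p_y)

bigrams
```

Higher values of either metric indicate that the two tokens co-occur more often than their individual frequencies alone would suggest.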
data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_assoc_metrics(data, doc_index, token_index, type)
#>       y     x n       pmi dice_coeff    g_score
#> 1 word2 word1 1 1.7917595  0.6857143 -0.6061358
#> 2 word2 word3 1 0.6931472  0.4210526 -1.7047481
#> 3 word3 word2 2 1.3862944  0.8275862 -0.3184537
#> 4 word3 word4 1 0.2876821  0.3478261 -2.1102132
#> 5 word4 word3 2 0.9808293  0.6956522 -0.7239188
#> 6 word4 word5 1 0.6931472  0.4137931 -1.7047481
#> 7 word5 word4 2 1.3862944  0.8421053 -0.3184537
#> 8 word6 word5 1 1.7917595  0.7058824 -0.6061358
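As a reading aid for the output above, the sketch below checks one relationship that the printed values appear to satisfy: each `g_score` equals `pmi` plus the log of the bigram's relative frequency, `n / sum(n)`. This is an observation about these eight rows only, not a statement of how the package defines the G-score; the vectors are copied by hand from the output.

```r
# Values copied from the printed example output.
n <- c(1, 1, 2, 1, 2, 1, 2, 1)
pmi <- c(1.7917595, 0.6931472, 1.3862944, 0.2876821,
         0.9808293, 0.6931472, 1.3862944, 1.7917595)
g_score <- c(-0.6061358, -1.7047481, -0.3184537, -2.1102132,
             -0.7239188, -1.7047481, -0.3184537, -0.6061358)

# Relative frequency of each bigram among all bigram occurrences.
p_xy <- n / sum(n)

# Compare pmi + log(p_xy) with the printed g_score, allowing for rounding.
check <- all.equal(pmi + log(p_xy), g_score, tolerance = 1e-6)
check
```

If the relationship holds beyond this example, bigrams with the same PMI but a higher count `n` will receive a higher (less negative) `g_score`, as rows 2 and 5 illustrate when compared with rows of equal PMI.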