Calculate Similarities — calc_similarity • kabrutils

This function computes similarities between documents in a document-feature matrix (created with dfm()) and returns a data frame with distinct id combinations. It relies heavily on the quanteda package.

calc_similarity(data, method, min_sim)

Arguments

data: a document-feature matrix created with dfm(), with doc_id set for each document
method: character; the method identifying the similarity or distance measure to be used, see ?quanteda::textstat_simil
min_sim: numeric; similarity threshold. Pairs whose similarity falls below this value are not returned; a value of 0.75-0.8 seems reasonable
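
To make the roles of min_sim and the "distinct id combinations" concrete, here is a minimal base R sketch of the general idea. It is not the package's implementation: the toy matrix counts, the column names id1, id2 and similarity, and the choice of cosine similarity are all assumptions made for illustration.

```r
# Toy count matrix standing in for a dfm (hypothetical data):
# rows are documents, columns are features.
counts <- rbind(
  d1 = c(the = 2, cat = 1, sat = 1, mat = 1, stocks = 0),
  d2 = c(the = 2, cat = 1, sat = 1, mat = 0, stocks = 0),
  d3 = c(the = 0, cat = 0, sat = 0, mat = 0, stocks = 3)
)

# Cosine similarity between all pairs of rows.
norms <- sqrt(rowSums(counts^2))
sim <- (counts %*% t(counts)) / (norms %o% norms)

# Keep each unordered pair once: upper triangle only, no self-pairs.
idx <- which(upper.tri(sim), arr.ind = TRUE)
pairs <- data.frame(
  id1 = rownames(sim)[idx[, "row"]],
  id2 = colnames(sim)[idx[, "col"]],
  similarity = sim[idx],
  stringsAsFactors = FALSE
)

# Drop pairs below the threshold, mirroring the min_sim argument.
min_sim <- 0.75
pairs <- pairs[pairs$similarity >= min_sim, ]
pairs
```

With this toy data only the pair d1/d2 survives the threshold, because d3 shares no features with the other two documents.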

Value

A data frame containing the two ids and their similarity value
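
Examples

The call below is a hedged usage sketch based only on the signature documented on this page. It assumes quanteda and kabrutils are installed; the toy texts are hypothetical, and the exact column names of the returned data frame are not stated here, so none are assumed.

```r
# Sketch only: assumes the quanteda and kabrutils packages are installed.
library(quanteda)
library(kabrutils)

# Hypothetical toy texts; the vector names become the doc_id of each document.
txt <- c(
  d1 = "the cat sat on the mat",
  d2 = "the cat sat on the red mat",
  d3 = "stock markets fell sharply today"
)

# Build the document-feature matrix expected by the data argument.
toy_dfm <- dfm(tokens(txt))

# Cosine similarity; pairs with a similarity below 0.75 are not returned.
sims <- calc_similarity(data = toy_dfm, method = "cosine", min_sim = 0.75)
sims
```

"cosine" is one of the methods listed in ?quanteda::textstat_simil; other measures documented there could presumably be passed the same way.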