This function computes similarities between documents in a document-feature matrix (as created by quanteda's dfm()) and returns a data frame of distinct id combinations. It relies heavily on the quanteda package.

calc_similarity(data, method, min_sim)

Arguments

data

a document-feature matrix, as created by dfm(), with doc_id set

method

character; the similarity or distance measure to be used, see ?quanteda::textstat_simil (in quanteda 3.0 and later this function lives in the quanteda.textstats package)

min_sim

numeric; the similarity threshold. Pairs with a similarity below this value are not returned. A value of 0.75-0.8 seems reasonable.
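
To make the method and min_sim arguments more concrete, here is a minimal sketch of the underlying quanteda call that the description points to. It is not the source of calc_similarity(); it assumes the quanteda and quanteda.textstats packages are installed, and the toy texts and the object names txt, x, sim and pairs are made up for illustration.

```r
# Sketch only: assumes quanteda and quanteda.textstats are installed.
# The texts below are invented for illustration.
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("quanteda.textstats", quietly = TRUE)) {
  txt <- c(a = "the cat sat on the mat",
           b = "the cat sat on the mat today",
           c = "stock prices fell sharply")

  # Build a document-feature matrix; document ids come from the vector names
  x <- quanteda::dfm(quanteda::tokens(txt))

  # Pairwise document similarity; values below min_simil are dropped
  sim <- quanteda.textstats::textstat_simil(
    x, method = "cosine", margin = "documents", min_simil = 0.75
  )

  # Long format: one row per retained document pair
  pairs <- as.data.frame(sim)
  print(pairs)
}
```

Other measures accepted by textstat_simil (for example "jaccard") can be passed through method in the same way; how calc_similarity() forwards them is not shown here.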

Value

A data frame containing the two ids and the similarity value of each pair.
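
The shape of the returned value can be illustrated with a small base-R sketch. The function pairwise_cosine() and the column names id1, id2 and similarity are hypothetical stand-ins, not the package's implementation or its actual column names; the sketch only shows how cosine similarity, distinct id pairs and a minimum threshold fit together.

```r
# Hypothetical stand-in for illustration; not the package's calc_similarity().
pairwise_cosine <- function(m, min_sim = 0.75) {
  # m: numeric matrix, one row per document, rownames used as ids
  norms <- sqrt(rowSums(m^2))
  sim <- (m %*% t(m)) / (norms %o% norms)

  # Keep each unordered pair once by using the upper triangle only
  idx <- which(upper.tri(sim), arr.ind = TRUE)
  out <- data.frame(
    id1 = rownames(m)[idx[, "row"]],
    id2 = rownames(m)[idx[, "col"]],
    similarity = sim[idx],
    stringsAsFactors = FALSE
  )

  # Drop pairs below the threshold (and undefined values from all-zero rows)
  keep <- !is.na(out$similarity) & out$similarity >= min_sim
  out[keep, , drop = FALSE]
}

# Toy feature counts for three documents (invented data)
m <- rbind(a = c(2, 1, 1, 0),
           b = c(2, 1, 1, 1),
           c = c(0, 0, 0, 3))
res <- pairwise_cosine(m, min_sim = 0.75)
print(res)
```

With these toy counts only the pair a and b clears the 0.75 threshold; each pair appears once rather than in both orders.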