This function creates a fingerprint of a string, which can be used for de-duplication or for calculating string similarity or string distance. It is based on normalised tokens and implements OpenRefine's clustering algorithm, specifically the fingerprint key collision method. See https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth

create_fingerprint(string, tokens = "word", n = NULL)

Arguments

string

The input string.

tokens

How to generate tokens: "word" for whitespace-separated tokens, "ngram" for n-grams/shingles.

n

The number of characters in each shingle. If tokens = "ngram", n must be provided.

Value

A character string (the fingerprint).

Examples

create_fingerprint("Max Spohr Verlag", tokens = "word")
#> [1] "maxspohrverlag"
create_fingerprint("Max Spohr Verlag", tokens = "ngram", n = 2)
#> [1] "agaxerhrlamaohporlrvspvexs"
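The fingerprinting steps (lowercase, strip punctuation, tokenise, de-duplicate, sort, re-join) can be sketched in plain R as follows. This is an illustrative re-implementation, not the package's internal code; the function name fingerprint_sketch is made up for this example.

```r
# Rough sketch of the fingerprint key collision method.
# With n = NULL, tokens are whitespace-separated words;
# with an integer n, tokens are character n-grams (shingles).
fingerprint_sketch <- function(string, n = NULL) {
  x <- tolower(trimws(string))            # normalise case, trim ends
  x <- gsub("[[:punct:]]", "", x)         # drop punctuation
  if (is.null(n)) {
    toks <- unlist(strsplit(x, "[[:space:]]+"))   # word tokens
  } else {
    x <- gsub("[[:space:]]", "", x)       # n-grams ignore whitespace
    starts <- seq_len(nchar(x) - n + 1)
    toks <- substring(x, starts, starts + n - 1)  # overlapping shingles
  }
  paste(sort(unique(toks)), collapse = "")  # de-duplicate, sort, re-join
}

fingerprint_sketch("Max Spohr Verlag")
#> [1] "maxspohrverlag"
fingerprint_sketch("Max Spohr Verlag", n = 2)
#> [1] "agaxerhrlamaohporlrvspvexs"
```

Because the tokens are sorted and de-duplicated, reordered or repeated words ("Verlag Max Spohr Max") collapse to the same key, which is what makes key collision clustering work.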