create_fingerprint.Rd
This function creates a fingerprint of a string, which can be used for de-duplication or for calculating string similarity or string distance. It is based on normalised tokens and implements OpenRefine's clustering algorithm, specifically the Fingerprint Key Collision method. See https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
create_fingerprint(string, tokens = "word", n = NULL)
string: Input string.
tokens: How to generate tokens: "word" for whitespace-separated tokens, "ngram" for n-grams/shingles.
n: The number of characters in each shingle. Must be provided if tokens = "ngram".
Value: a character string.
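The two token modes can be illustrated with a minimal base R sketch. `sketch_fingerprint()` is a hypothetical helper written for this illustration, not the package's implementation: it assumes the key is built by lower-casing, stripping punctuation, then sorting and de-duplicating the tokens before joining them without a separator. It skips the ASCII normalisation step that OpenRefine describes.

```r
# Hypothetical illustration only; not the package's own code.
sketch_fingerprint <- function(string, tokens = c("word", "ngram"), n = NULL) {
  tokens <- match.arg(tokens)

  # Normalise: trim, lower-case, drop punctuation and control characters.
  s <- tolower(trimws(string))
  s <- gsub("[[:punct:][:cntrl:]]", "", s)

  if (tokens == "word") {
    # Whitespace-separated tokens.
    parts <- strsplit(s, "[[:space:]]+")[[1]]
  } else {
    if (is.null(n)) stop("`n` must be provided when tokens = \"ngram\"")
    # Shingles are taken over the string with all whitespace removed.
    s <- gsub("[[:space:]]", "", s)
    # Assumption of this sketch: strings no longer than n are returned as-is.
    if (nchar(s) <= n) return(s)
    starts <- seq_len(nchar(s) - n + 1)
    parts <- substring(s, starts, starts + n - 1)
  }

  # Sort (locale-independent), de-duplicate, and join into one key.
  paste(sort(unique(parts), method = "radix"), collapse = "")
}

sketch_fingerprint("Max Spohr Verlag", tokens = "word")
sketch_fingerprint("Max Spohr Verlag", tokens = "ngram", n = 2)
```

Because tokens are sorted before joining, word order in the input does not affect the key: under these assumptions "Verlag, Max Spohr" yields the same word fingerprint as "Max Spohr Verlag".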
create_fingerprint("Max Spohr Verlag", tokens = "word")
#> [1] "maxspohrverlag"
create_fingerprint("Max Spohr Verlag", tokens = "ngram", n = 2)
#> [1] "agaxerhrlamaohporlrvspvexs"
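For de-duplication, strings whose fingerprints collide can be grouped together. The following is a rough sketch of that idea; `cluster_by_key()` and `word_key()` are hypothetical names introduced here, and `word_key()` is only a simplified stand-in for a word fingerprint so that the example runs without the package.

```r
# Group strings that share a key. `key_fun` maps one string to one key.
cluster_by_key <- function(x, key_fun) {
  keys <- vapply(x, key_fun, character(1), USE.NAMES = FALSE)
  groups <- split(x, keys)
  groups[lengths(groups) > 1]  # keep only keys shared by two or more strings
}

# Simplified stand-in for a word fingerprint (assumption, not the package's code).
word_key <- function(s) {
  s <- tolower(gsub("[[:punct:]]", "", trimws(s)))
  parts <- strsplit(s, "[[:space:]]+")[[1]]
  paste(sort(unique(parts), method = "radix"), collapse = "")
}

publishers <- c("Max Spohr Verlag", "Verlag Max Spohr",
                "max spohr verlag.", "Insel Verlag")
clusters <- cluster_by_key(publishers, word_key)
clusters
```

Here the three spellings of the same publisher fall under one key, while "Insel Verlag" has a unique key and is dropped from the result. With the package loaded, a wrapper such as `function(s) create_fingerprint(s, tokens = "word")` could presumably be passed as `key_fun` instead.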