string-similarity()
This function is deprecated.
This function performs a fuzzy text comparison: it searches for similar texts rather than exact matches. Similarity is measured as a double value between 0.0 and 1.0. The function is designed to find similar texts, not to rate dissimilar ones. Therefore, the comparison is cancelled as soon as it becomes apparent that the similarity value will be smaller than the threshold specified in parameter e. In that case, the similarity value 0.0 is returned.
This function has three operating modes (methods), which are selected with a keyword in parameter c. The first letter of the keyword is sufficient.
Method | Behaviour of the function if this method is chosen |
single | String a is compared to string b. The result is a double value between 0.0 and 1.0 that describes the degree of similarity. |
multi | String a is compared to a list of strings. Parameter b expects the name of the list. The result is the text from the list that is most similar to string a, or an empty string ("") with the Empty Flag set if no text reaches the threshold e. The texts in list b are sorted by their similarity to a in descending order. Texts that do not reach the threshold e are removed from the list. Note: The list is modified by the function! |
collection | Like method multi, but a map is created in addition. The map has the same name as the list (the value of parameter b) and contains pairs of the texts in the list and their assigned similarity values. Texts that do not reach the similarity threshold e are removed as well. After the function has been executed, the list with name b only contains the similar texts in descending order of their similarity to a. The map with the same name contains these texts and their assigned similarity (double) values. |
Parameters for method "single"
Parameter | Description |
a | Text to be compared. |
b | Text with which the comparison is made. |
c | (optional) The method, here single. The first letter s is enough. Default: single. |
d | (optional) Flags that control the details of the comparison. Default: skip. |
e | (optional) Similarity threshold between 0.1 and 1.0. The comparison is cancelled if a similarity value below the threshold becomes apparent. Default: 0.3. Parameter values <= 0 are replaced by the default value. |
Parameters for method "multi" and "collection"
Parameter | Description |
a | Text to be compared. |
b | Name of the list that contains the other texts for the comparison. One text in each list entry. |
c | The method, here multi or collection. The first letters m or c are enough. Default: single. |
d | Flags that control the details of the comparison. Default: skip. |
e | Similarity threshold between 0.1 and 1.0. The comparison is cancelled if a similarity value below the threshold becomes apparent. Default: 0.3. Parameter values <= 0 are replaced by the default value. |
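The threshold handling described for parameter e can be sketched in a few lines of Python. This is only an illustration of the documented rules (default 0.3, values <= 0 replaced by the default, result 0.0 below the threshold). The names effective_threshold and apply_threshold are hypothetical, and modelling an omitted parameter as None is an assumption.

```python
DEFAULT_THRESHOLD = 0.3  # documented default for parameter e


def effective_threshold(e=None):
    """Return the threshold that applies for a given value of parameter e."""
    # Assumption: an omitted parameter is modelled as None.
    if e is None or e <= 0:
        return DEFAULT_THRESHOLD
    return e


def apply_threshold(similarity, e=None):
    """Return the similarity, or 0.0 if it is smaller than the threshold."""
    return similarity if similarity >= effective_threshold(e) else 0.0


print(apply_threshold(7 / 12))       # default threshold 0.3, value is kept
print(apply_threshold(7 / 12, 0.8))  # smaller than 0.8, so 0.0
print(apply_threshold(7 / 12, -1))   # e <= 0, so the default 0.3 applies
```

The sketch only covers the final result. The early cancellation of the comparison is an optimisation that does not change the returned value.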
Flags to control the details of the comparison
Currently, two flags are supported: "ignorecase" and "skip".
Since the function can be extended by additional rule classes that are declared with complete class names in file ./etc/admin/datawizard/NameSimilarityRules.properties, additional custom flags can be defined.
A flag is active if parameter d contains the flag keyword. The default value for parameter d is skip. If parameter d is given any value, flag skip is no longer active unless the value contains it (e.g. ignorecase+skip).
Flag | Comparison rule |
ignorecase | Lower case letters will be seen as a match to the corresponding uppercase characters with a similarity value of 0.99. |
skip | The character-wise comparison from left to right can skip individual characters if those characters are seen as incomparable by preceding rules (if they do not match and no rule defines a similarity value for that case). The comparison resumes after the skipped character. If the text to be compared contains 10 characters, 2 characters are skipped and the remaining 8 characters match, the similarity value is 8/10 = 0.8. |
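How the flags parameter (d in the parameter tables) activates the two built-in flags can be sketched as follows. This Python fragment is an illustration only: the function name active_flags is hypothetical, an omitted parameter is modelled as None or an empty string (an assumption), and custom flags from additional rule classes are not covered.

```python
DEFAULT_FLAGS = "skip"                # documented default of the flags parameter
KNOWN_FLAGS = ("ignorecase", "skip")  # the two built-in flags


def active_flags(d=None):
    """Return the set of built-in flags that are active for a flags value."""
    # Assumption: an omitted parameter is modelled as None or "".
    value = d if d else DEFAULT_FLAGS
    # A flag is active if the parameter value contains its keyword.
    return {flag for flag in KNOWN_FLAGS if flag in value}


print(sorted(active_flags()))                   # ['skip']
print(sorted(active_flags("ignorecase")))       # ['ignorecase']
print(sorted(active_flags("ignorecase+skip")))  # ['ignorecase', 'skip']
```

The second call shows the point that is easy to overlook: as soon as a value is given, skip is only active if the value names it explicitly.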
Example: The comparison of TEXT8 and text8 with active flag ignorecase returns the result 0.96059601 (0.99*0.99*0.99*0.99*1.0). Without flag ignorecase, but with flag skip, the first 4 characters are skipped and only the last character, 8, provides a match. The result would be 1/5 = 0.2. If the threshold value e is greater than 0.2, the comparison is cancelled and the result is 0.0. Without flag skip, the comparison is cancelled after the first character and the result is 0.0 as well.
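The arithmetic of this example can be checked with a short Python calculation. It only reproduces the numbers stated above; the variable names are illustrative.

```python
CASE_FACTOR = 0.99  # documented factor for a lowercase/uppercase match

# TEXT8 vs text8 with ignorecase: four case-only matches, then "8" == "8".
with_ignorecase = CASE_FACTOR ** 4 * 1.0

# TEXT8 vs text8 with skip only: four characters skipped, one of five matches.
with_skip_only = 1 / 5

print(round(with_ignorecase, 8))  # 0.96059601
print(with_skip_only)             # 0.2

# With the default threshold 0.3, the skip-only value is too small,
# so the function would return 0.0.
threshold = 0.3
print(with_skip_only if with_skip_only >= threshold else 0.0)  # 0.0
```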
Examples
Parameter a | Parameter b | Parameter c | Parameter d | Parameter e | Result | Explanation |
Lobster_test | Lobster | single | skip | | 7/12 = 0.5833 | 7 of 12 characters match. After that, no further matches can be found. The result does not fall below the default threshold of 0.3. |
Lobster_test | Lobster | single | skip | 0.8 | 0.0 | As above, but the result falls below the threshold 0.8 and therefore the comparison is terminated with result 0.0. |
Lobster_test | LOBSTER_test | single | ignorecase | 0.612345 | (0.99) to the power of 7 = 0.932065 | Each of the seven comparisons of lowercase and uppercase letters delivers factor 0.99. The rest delivers factor 1.0. The result does not fall below the threshold 0.612345. |
Lobster_test | Lobster-test | single | ignorecase | 0.2 | 7/12 = 0.5833 | 7 of 12 characters match. Because flag skip is not set, the comparison cannot continue after the first mismatch. |
Lobster_test | Lobster-test | single | ignorecase+skip | 0.2 | 11/12 = 0.9167 | 11 of 12 characters match. |
Lobster_test | Lobstertest | single | ignorecase+skip | 0.2 | 11/12 = 0.9167 | 11 of 12 characters match. |
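To make the 7/12 and 11/12 rows easier to follow, here is a toy model of the character-wise comparison in Python. It is a sketch under stated assumptions, not the implementation of the function: the name toy_similarity is hypothetical, the one-character lookahead that decides what to skip is an invented heuristic, the denominator is assumed to be the length of a, and the early cancellation is replaced by a final threshold check.

```python
def toy_similarity(a, b, flags="skip", threshold=0.3):
    """Toy model of the comparison; NOT the product's actual algorithm."""
    ignorecase = "ignorecase" in flags
    skip = "skip" in flags

    def char_factor(x, y):
        # 1.0 for identical characters, 0.99 for a case-only difference
        # when ignorecase is active, None if no rule applies.
        if x == y:
            return 1.0
        if ignorecase and x.lower() == y.lower():
            return 0.99
        return None

    i = j = matched = 0
    factor = 1.0
    while i < len(a) and j < len(b):
        f = char_factor(a[i], b[j])
        if f is not None:
            matched += 1
            factor *= f
            i += 1
            j += 1
        elif not skip:
            break  # without skip, the first mismatch ends the comparison
        elif i + 1 < len(a) and char_factor(a[i + 1], b[j]) is not None:
            i += 1  # assumed heuristic: skip an extra character in a only
        else:
            i += 1  # assumed heuristic: skip the mismatching pair
            j += 1

    # Assumption: the denominator is the length of a.
    similarity = factor * matched / len(a) if a else 0.0
    return similarity if similarity >= threshold else 0.0


print(toy_similarity("Lobster_test", "Lobster", "skip"))                   # 7/12
print(toy_similarity("Lobster_test", "Lobster", "skip", 0.8))              # 0.0
print(toy_similarity("Lobster_test", "Lobster-test", "ignorecase", 0.2))   # 7/12
print(toy_similarity("Lobster_test", "Lobster-test", "ignorecase+skip", 0.2))  # 11/12
print(toy_similarity("Lobster_test", "Lobstertest", "ignorecase+skip", 0.2))   # 11/12
```

The last two calls differ in what is skipped: for Lobster-test the mismatching pair "_" and "-" is skipped, for Lobstertest only the "_" in a. In both cases 11 of the 12 characters of a match.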
In methods multi and collection, the comparison is performed as above for every entry of the list specified in parameter b. If the similarity of an entry falls below the threshold e, the entry is removed. The remaining list entries are then sorted by their similarity value in descending order (most similar first). The function returns the text of the first list entry.
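The filter, sort and return steps described for multi and collection can be sketched in Python as follows. This is an illustration only: rank_candidates is a hypothetical name, the scoring function is passed in as a parameter rather than modelled, the similarity values in the usage example are fixed illustrative numbers, and keeping the original order for equal scores is an assumption.

```python
def rank_candidates(a, candidates, score, threshold=0.3):
    """Filter and sort `candidates` in place; return (best text, similarity map)."""
    scored = [(text, score(a, text)) for text in candidates]
    # Entries whose similarity is below the threshold are removed.
    kept = [(text, s) for text, s in scored if s >= threshold]
    # Most similar first; Python's sort is stable, so ties keep their order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # The list is modified by the call, as noted for method multi.
    candidates[:] = [text for text, _ in kept]
    # Method collection additionally provides text -> similarity pairs.
    similarity_map = dict(kept)
    best = candidates[0] if candidates else ""
    return best, similarity_map


# Illustrative fixed scores instead of a real comparison.
fixed_scores = {"Lobster": 0.58, "Crab": 0.0, "Lobster-test": 0.92}
names = ["Lobster", "Crab", "Lobster-test"]
best, sim_map = rank_candidates("Lobster_test", names, lambda a, b: fixed_scores[b])

print(best)     # Lobster-test
print(names)    # ['Lobster-test', 'Lobster']
print(sim_map)  # {'Lobster-test': 0.92, 'Lobster': 0.58}
```

If no entry reaches the threshold, the list ends up empty and the sketch returns an empty string, which corresponds to the empty result described for method multi.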
Hints for debugging in the mapping
Especially with methods multi and collection, the function is complex, so its actual behaviour may differ from what the profile developer expects. To analyse the sorting of the list before and after the function call, function "dump list()" can be used to write the list content to the log. Additionally, method collection can be used and the content of the map can be written to the log with function "dump map()".