From 08516232aa30a89aa6fe4880761f40fb75aa1404 Mon Sep 17 00:00:00 2001
From: Hartmut Holzgraefe
Date: Wed, 27 Sep 2000 20:41:17 +0000
Subject: [PATCH] more on Levenshtein ...

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@33019 c90b9560-bf6c-de11-be94-00142212c4b1
---
 functions/strings.xml | 109 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 98 insertions(+), 11 deletions(-)

diff --git a/functions/strings.xml b/functions/strings.xml
index b22e9ae44a..7fdabd0429 100644
--- a/functions/strings.xml
+++ b/functions/strings.xml
@@ -930,15 +930,31 @@ $colon_separated = implode (":", $array);
 Description
- int levenshtein
- string str1
- string str2
+ int levenshtein
+ string str1
+ string str2
+
+
+ int levenshtein
+ string str1
+ string str2
+ int cost_ins
+ int cost_rep
+ int cost_del
+
+
+ int levenshtein
+ string str1
+ string str2
+ function cost
- This function return the Levenshtein-Distance between the two
- argument strings or -1, if one of the argument strings is longer
- than the limit of 255 characters.
+ This function returns the Levenshtein-Distance between the
+ two argument strings or -1, if one of the argument strings
+ is longer than the limit of 255 characters (255 should be
+ more than enough for name or dictionary comparison, and
+ nobody serious would be doing genetic analysis with PHP).
 The Levenshtein distance is defined as the minimal number of
@@ -948,13 +964,84 @@ $colon_separated = implode (":", $array);
 where n and m are the length of str1 and str2 (rather good when compared to
- similar_text, which is O(max(n,m)**3), but
- still expensive).
+ similar_text, which is O(max(n,m)**3),
+ but still expensive).
+
+ In its simplest form the function will take only the two
+ strings as parameters and will calculate just the number of
+ insert, replace and delete operations needed to transform
+ str1 into str2.
+
+
+ A second variant will take three additional parameters that
+ define the cost of insert, replace and delete operations.
+ This is more general and adaptive than variant one, but not
+ as efficient.
+
+
+ The third variant (which is not implemented yet) will be
+ the most general and adaptive, but also the slowest alternative.
+ It will call a user-supplied function that will determine the
+ cost for every possible operation.
+
+
+ The user-supplied function will be called with the following
+ arguments:
+
+
+
+ operation to apply: 'I', 'R' or 'D'
+
+
+
+
+ actual character in string 1
+
+
+
+
+ actual character in string 2
+
+
+
+
+ position in string 1
+
+
+
+
+ position in string 2
+
+
+
+
+ remaining characters in string 1
+
+
+
+
+ remaining characters in string 2
+
+
+
+ The user-supplied function has to return a positive integer
+ describing the cost for this particular operation, but it
+ may decide to use only some of the supplied arguments.
+
+
+ The user-supplied function approach offers the possibility to
+ take into account the relevance of and/or difference between
+ certain symbols (characters) or even the context those symbols
+ appear in to determine the cost of insert, replace and delete
+ operations, but at the cost of losing all optimizations done
+ regarding CPU register utilization and cache misses that have
+ been worked into the other two variants.
+
- See also soundex,
- similar_text and
- metaphone.
+ See also soundex,
+ similar_text
+ and metaphone.
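
The cost-weighted and callback-driven variants described in the added text can be sketched roughly as follows. This is a minimal Python illustration of the general dynamic-programming approach, not the C implementation behind PHP's levenshtein(). The names weighted_levenshtein, fixed_costs and vowel_aware are hypothetical. Two further assumptions are made where the text leaves details open: an empty string is passed for the side that an insert or delete does not touch, and the callback is not consulted when the two characters already match.

```python
from typing import Callable

# Hypothetical callback shape, following the argument list in the patch:
# (operation, char1, char2, pos1, pos2, remaining1, remaining2) -> cost
CostFn = Callable[[str, str, str, int, int, int, int], int]


def weighted_levenshtein(s1: str, s2: str, cost: CostFn) -> int:
    """Cheapest sequence of inserts, replaces and deletes turning s1 into s2."""
    n, m = len(s1), len(s2)
    # dist[i][j]: minimal cost of turning the first i characters of s1
    # into the first j characters of s2.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                # Delete s1[i-1] (assumption: char2 is passed as "").
                candidates.append(
                    dist[i - 1][j]
                    + cost("D", s1[i - 1], "", i - 1, j, n - i + 1, m - j)
                )
            if j > 0:
                # Insert s2[j-1] (assumption: char1 is passed as "").
                candidates.append(
                    dist[i][j - 1]
                    + cost("I", "", s2[j - 1], i, j - 1, n - i, m - j + 1)
                )
            if i > 0 and j > 0:
                if s1[i - 1] == s2[j - 1]:
                    # Matching characters are free (assumption: no callback).
                    candidates.append(dist[i - 1][j - 1])
                else:
                    candidates.append(
                        dist[i - 1][j - 1]
                        + cost("R", s1[i - 1], s2[j - 1],
                               i - 1, j - 1, n - i + 1, m - j + 1)
                    )
            dist[i][j] = min(candidates)
    return dist[n][m]


def fixed_costs(cost_ins: int, cost_rep: int, cost_del: int) -> CostFn:
    """Second variant: one constant cost per operation type."""
    table = {"I": cost_ins, "R": cost_rep, "D": cost_del}
    return lambda op, *_: table[op]


def vowel_aware(op: str, c1: str, c2: str, *_: int) -> int:
    """Third variant (illustrative): swapping one vowel for another is cheap."""
    if op == "R" and c1 in "aeiou" and c2 in "aeiou":
        return 1
    return 2


print(weighted_levenshtein("kitten", "sitting", fixed_costs(1, 1, 1)))  # 3
print(weighted_levenshtein("kitten", "sitting", fixed_costs(1, 5, 1)))  # 5
print(weighted_levenshtein("bat", "bit", vowel_aware))
```

With unit costs the sketch reduces to the plain two-argument form. Raising the replace cost above insert plus delete makes the algorithm prefer a delete followed by an insert, which is why the second call yields 5 rather than 3. The full table used here takes O(n*m) time and memory and calls the cost function for almost every cell, which hints at why the text expects the callback variant to be the slowest of the three.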