Substring Matching Algorithms

Tuesday 29 July 2014
Substring matching is an important part of any algorithmic problem. There are different complexities to consider about like the size of the string, the size of the pattern, the running time etc. Considering all these we list down some of the approaches which is used in solving the substring problem.

Naive Brute Force approach :

In the naive approach we iterate over all the elements of the text and start matching for each character of the pattern.If it doesnot match we advance the start of the match of the text by one character and again check for the presence of the pattern. Since for each character of the text we match the next characters for the pattern the worst case time complexity if n is the length of the pattern and m is the length of the text is O(mn). The code is mentioned below :

void naiveBruteForce(string a,string b){
 bool isMatched = true;
 int i=0;
 for(i=0;i<=a.length() - b.length();i++){
  isMatched = true;
   for(int j=0;j<b.length();j++){
    if(a[i+j] != b[j])
     isMatched = false;
   }
  if(isMatched)
   break;
 }
 cout << isMatched << " offset :" <<  i << endl;
}

Rabin Karp Algorithm :

In the Rabin Karp method we try to eliminate the two inner loop computation of the naive algorithm in which we try to match two strings. Instead of matching two strings we assign a value by using a hash of the characters. We precompute the hash of the pattern and we only check the if the hash of the substring of the text matches the pattern's hash value. Again problem will arise if the number of characters is big or the length of the string is big in which case we can use a modular hash function. But in this two strings may have the same value even if they do not match, thus the modulus M should be taken a large number. A sample code in which we have taken squares of the ascii value of the characters as a hash value is given below:

void rabinKarp(string a,string b){
 int hashsub = 0,tmp,tmp1,hash = 0;
 for(int i=0;i&ltb.length();i++){
  tmp = b[i];
  tmp1 = a[i];
  hashsub += tmp * tmp;
  hash += tmp1 * tmp1;
 }
 for(int i=b.length();i&lta.length();i++){
  if(hash==hashsub)
   cout << "offset :" << i - b.length() << endl;
  tmp = a[i - b.length()];
  tmp1 = a[i];
  hash -= (tmp * tmp);
  hash += (tmp1 * tmp1);
 }
}

Knuth-Morris-Pratt Algorithm:

In KMP algorithm we use the fact that if a substring match fails then we know by precomputing the pattern that we can certainly shift by a certain amount in which the substring wont match. The precomputing is done by finding the longest substring of the pattern that is a prefix as well as a suffix. This can be done in the following way- Consider a string in the pattern b1b2...bk...bj. Sup pose we know that the longest substring that is a prefix as well as a suffix is the substring upto k. Then F[j] = k, and if k+1th character is equal to the j+1th character then F[j+1] = F[j] + 1. Else there would be some other substring smaller than bk that would be initial and terminal substring then it would also be a border string of b1...bk. Hence if the above condition is false then we check for the string F[F[j]] for the initial and terminal and this check continues until we find a border or there is no border at all in which case we inialize it to  . The pseudocode is as follows:
Prefix-Compute-Function(Pattern):
F[1] = 0;
for j from 1 to m-1 do:
    i = F[j];
    while(bi+1 != bj+1 and i>0) i=F[i]; //reduce the search space
    if(i==0 and bi+1 != bj+1) F[j+1] = 0;
    else
    F[j+1] = i+1;

The C implimentation is given below :

int* KMP_Preprocess(string pattern){
 int *pre = (int *)malloc(sizeof(int)*pattern.length());
 memset(pre,0,pattern.length());
 for(int i=1;i<pattern.length();i++){
  pre[i] = pre[i-1];
  while(pattern[i] != pattern[pre[i]]){
   if(pre[i] == 0){
    pre[i] = -1;
    break;
   }else{
    pre[i] = pre[pre[i] - 1];
   }
  }
  pre[i] = pre[i] + 1;
 }
 return pre;
}

void KMP_Match(string a, string b){
 int *pre = KMP_Preprocess(b);
 for(int i=0,j=0;i<a.length();){
  while(a[i+j] == b[j]){
   ++j;
   if(j==b.length()){
    cout << "Match offset :" << i << endl;
    break;
   }
  }
  if(j>0){
   i+=(j-pre[j-1] > 1)?(j-pre[j-1]):1;
   j = pre[j-1];
  }else{
   ++i;
  }
 }
}

Copyright @ 2013 code-craft. Designed by Templateism | MyBloggerLab