Probability of Finding Same Substring Again

Preliminary Definitions

A cord is a sequence of characters. In our model we are going to correspond a string equally a 0-indexed assortment. Then a string S = "Galois" is indeed an array ['M', 'a', 'fifty', 'o', 'i', 's']. The number of characters of a string is chosen its length and is denoted past |S|. If we desire to reference the character of the cord at position i, nosotros will use S[i].

A substring is a sequence of consecutive contiguous elements of a string, we will denote the substring starting at i and ending at j of string S by S[i...j].

A prefix of a cord Due south is a substring that starts at position 0, and a suffix a substring that ends at |S|-1. A proper prefix of a S is a prefix that is different to South. Similarly, a proper suffix of S is a suffix that is different to South. The + operator will stand for string concatenation.

Case:

          Due south="Galois"     |S|=six     Southward[0]='G', Due south[1]='a', S[2]='50',...,S[5]='s'     S[1...four]="aloi"     W="Evariste"     W+South="EvaristeGalois"

A Needle in the haystack (KMP algorithm)

Given a text T we are interested in calculating all the occurrences of a pattern P.

This simple problem has a lot of applications. For example, the text can be the nucleotide sequence of the man Deoxyribonucleic acid and the pattern a genomic sequence of some genetic disease, or the text tin can be all the cyberspace, and the blueprint a query (the Google problem), etc.

We are going to study the exact string matching problem, that is given two strings T and P nosotros want to find all substrings of T that are equal to P. Formally, we want to calculate all indices i such that T[i+s] = P[s] for each 0 ≤ due south ≤ |P|-ane.

In the following example, y'all are given a string T and a pattern P, and all the occurrences of P in T are highlighted in scarlet.

String matching on DNA sequence

One easy way to solve the trouble, is to iterate over all i from 0to |T| - |P|, and check if there is a substring of T that starts at i and matches with P:

          def find_occurrences(t,p):    lt,lp=len(t),len(p)    for i in range(lt-lp+one):       lucifer=True       for fifty in range(lp):          if t[i+l]!=p[l]:             lucifer=Simulated             interruption       if match: print i

Unfortunately this solution is very slow, its time complexity is O(|T|·|P|), nevertheless it gives us some insights.

Suppose that we are finding a lucifer starting at position i on the text, and so there are 2 possibilities:

We don't find a match. Then there exists at to the lowest degree one index in which the text is non equal to the blueprint. Let i+j be the smallest of such indices: T[i...i+j-one] = P[0...j-1] and T[i+j] ≠ P[j]

enter image description here

Since there is no friction match at position i, nosotros should start finding a match from another position, but the question is from where? Our previous algorithm e'er selects as next position i+i, but if we starting time from i+ane, it is probable that we could stop up finding a mismatch in a position even before than i+j-1. At least, we could want to start from a position that guarantees that the strings matches until i+j-one. Therefore, nosotros should start finding for a match starting from the smallest i+k such that T[i+k...i+j-1] matches with some prefix of P (look at the moving-picture show in a higher place).

Since we already know that T matches with P from i to i+j-1, and so T[i+k...i+j-i] is a suffix of P[0...j-1].

That means that if we discover a mismatch at position j, we should offset from the smallest 1000 such that P[k...j-1] is a prefix of P. grand is the smallest, and so P[k...j-1] is the largest proper suffix that is besides a proper prefix. From now on we will telephone call "border" to the proper prefixes that are as well proper suffixes (e.thousand the string ABCDABCDAB accept two borders ABCDAB and AB).

Nosotros find a friction match. Using the same argument, it's easy to meet that we have to start finding for a match from the smallest thou such that P[1000...j-1]is a proper prefix of P

Allow's encounter the above two cases with an instance:

String matching sample

In the picture above we are finding occurrences starting from position iv, just in that location is a mismatch at position 12. Before getting a mismatch we take already matched the string HACKHACK, and the largest proper prefix that is also a proper suffix (border) of HACKHACK is HACK, so we should start finding occurrences again from position 8 and taking into account that all characters from eight to 11 are already matched. It turns out that in that location is an occurrence starting at position 8, so all characters of the blueprint have matched, since HACKHACKIT does not have any edge, and then we start finding occurrences over again starting from position 18.

According to the previous analysis we demand to know for each i, the the largest border of P[0...i]. Let f[i] be the length of the largest border of P[0...i]. Function f is known every bit failure function, because it says from where commencement if we find a mismatch.

How can nosotros calculate f[i] efficiently? One approach is to think in an inductive way: suppose that we have already calculated the office f for for all indices less than i. Using that information how can nosotros calculate f[i]?

A border is a border of a border

Annotation that if P[0...j] is a edge of P[0...i], then P[0...j-1] is a border of P[0...i-1] and P[j] = P[i].

The previous statement suggest this algorithm for finding f[i]: Iterate over all borders of P[0...i-1], in decreasing order of length, until find a edge P[0...j-1] such that P[j] = P[i]. How can we iterate over all the borders of a cord? That can be easily washed using this ascertainment: if B is a edge of P, so a border of B is likewise a edge of P, that implies that the borders of P[0...i] are prefixes of P that ends at f[i]-1, f[f[i]-1]-1, f[f[...f[i]-1...]-1]-1, ... (at nearly how many borders can have a string?).

What is the complexity of our algorithm? Let P[j...i-1] be the largest edge of P[0...i-ane]. If P[k...i] is the largest edge of P[0...i], so k ≥ j (Why?). So when we iterate over the borders of P[0...i-i], we are moving the index j to k. Since j is moving e'er to the right, in the worst example it volition touch all the elements of the array. That means that we are computing office f in O(|P|).

Using function f it is easy to search for the occurrences of a pattern on a text in O(|T|+|P|):

          def failure_function(p):    l=len(p)    f=[0]*l    j=0     for i in range(ane,l):       while j>=0 and p[j]!=p[i]:          if j-i>=0: j=f[j-ane]          else: j=-one        j+=1       f[i]=j    return f  def find_occurrences(t,p):    f=failure_function(p)    lt,lp=len(t),len(p)    j=0    for i in range(lt):       while j>=0 and t[i]!=p[j]:          if j-1>=0: j=f[j-1]          else: j=-1       j+=1         if j==lp:          j=f[lp-1]          print i-lp+1

As you tin can note from the pseudo lawmaking (it is python code indeed), find_occurrences is almost equal to failure_function, that is because in some sense failure_function is like matching a string with itself.

The algorithm described in a higher place is known equally Knut-Morris-Pratt (or KMP for short). Note that with KMP algorithm we don't demand to have all the string T in memory, we can read information technology graphic symbol by character, and determine all the occurrences of a blueprint P in an online fashion.

The Z office

Given a string Due south, permit z[i] be the longest substring of S that starts at i and is as well a prefix of S.

Case:

z function sample

z[ 3 ] = iv because starting from position three, the largest string that is too a prefix of S is ABRA.

If we can calculate the function Z efficiently, then how can nosotros find all the occurrences of a pattern P on a text T? Since z[i] gives matches with a prefix, a good thought is to detect how the z function behaves when P is concatenated with T. So let'due south calculate the function z on S = P+T. Information technology turns out that if for certain i, z[i] ≥ |P| and then there is an occurrence of P starting at i.

Now only remains to observe a manner of summate that powerful z part.

The naive approach for calculate z in every i is to iterate over all j ≥ i until we find a mismatch ( z[i...j] ≠ z[0...j-i] ), then z[i] = j-i. This algorithm is of quadratic time, so we need a improve solution.

z-boxes

The z part applied at position i of a string determines an interval [i, i+z[i]-i] known every bit z-box, that interval is special, considering past the definition of z, S[ i...i+z[i]-1 ] = Due south[0...z[i]-1].

Now let's call back again in a inductive way, and summate z[i] given that we already know z[0],...,z[i-ane].

Permit [L,R] exist the z-box with largest R that we accept seen until at present.

If i is within [L,R], then S[i...R] = S[i-L...R-L] (await at the pic above), so we can employ the value of z[i-L] that nosotros take already calculated. If z[i-Fifty] Is less than R-i+1, then z[i] = z[i-50], otherwise since nosotros already know that S[i...R] is a prefix of South, information technology remains to check if nosotros can expand S[i...R] starting from R+1.

On the other hand If i is outside [Fifty...R], and so we can calculate z[i] using the naive approach. This algorithm is linear because the pointer R traverses the array at near in one case.

          def zeta(s):    n=len(s)    z=[0]*northward    L,R=-1,-1    for i in range(i,n):       j,k=0,i       if L<=i<=R:          ii = i-L          j=min(ii+z[two], R-L+i)-ii          k=i+j       while k<n and south[j]==s[k]:           j+=1          k+=1       z[i]=1000-i       if z[i]>0 and i+z[i]-i>R:           Fifty,R=i,i+z[i]-one    return z  def find_occurrences(p,t):    lp,lt=len(p),len(t)    z=zeta(p+t)    for i in range(lp,lp+lt):       if z[i]>=lp: impress i-lp

Hashing strikes back (Rabin-Karp algorithm)

Pattern P matches with text T at position i, if and but if there is a substring of T that starts at i and is equal to P. And then if we can compare speedily two strings ( T[i...i +|P|-ane] with P ), then nosotros can employ our naive algorithm (iterate over all i, and cheque if there is a match). One manner of checking if ii strings are not equal is to find a property that is different in those strings. Permit h be a black-box that take as input a string, and outputs sure holding of the string, that property is defined in such a way that if 2 strings are different, the value of that holding is also different: if S1 ≠ S2 then h(S1) ≠ h(S2). Notation that with this definition, we can't assert if 2 strings are equal, because two dissimilar strings can have the aforementioned value of the chosen property.

Since numbers are like shooting fish in a barrel to compare it will be dainty if h outputs a number. And then the question is how to represent a string equally a number. Note that strings are like numbers, considering nosotros can consider its characters as digits. Since the alphabet accept 26 letters, nosotros could say that a string is a number in a numeration system of base 27. The problem is that strings can be very large, and we tin can stand for integers in a very limited range (from 0 to 18446744073709551615 using a 64 fleck unsigned integer), in society to continue the output of h in a pocket-size range, allow's apply the modulo operation (is because of this last necessary operation that two unlike strings can map to the same integer i.e there is a hash standoff).

Case: if S="hack", so h(South) = (8 · 283 + 1 · 282 + 3 · 281 + 11 · 280) %One thousand.

Notation that we are considering the value of the digit "a" as 1 instead of 0, that is to prevent cases with leading zeroes (e.thousand "aab" and "b" are different, only if a = 0 they are equal).

A office similar h, that converts elements from a very large range (like strings) to elements in a small range (64 bit integers) are called hash functions The value of the hash role applied to certain element is called it's hash.

In order to reduce hash collisions, Thou is usually chosen as a big prime number. However if we utilize an unsigned 32 bits integer we could just let the value overflow (in this case the modulo is 232).

It remains to solve this problem: if we know the hash of the substring (of length |P|) that starts at i, how to summate the hash of the substring that starts at i+ane? (encounter figure below)

enter image description here

Let h be the hash of the substring of length |P| that starts at i. We can calculate the hash of the substring that starts at i+1 in ii steps:

Remove the graphic symbol at position i:

rem pos i

Add together the character at position i+|P|:

add pos i+p

The idea of summate the hash that starts at position i+one using the hash at i is called rolling hash.

          def val(ch): #maps a grapheme to a digit e.chiliad a = 1, b = two,...    return ord(ch)-ord("a")+ane  def find_occurrences(p,t):    lp,lt=len(p),len(t)    thousand=x**9+seven #modulo (a "big" prime)    b=xxx      #numeration organization base    hp=0      #hash of the blueprint    ht=0      #hash of a substring of length |P| of the text    for i in range(lp):       hp=(hp*b+val(p[i]))%thou       ht=(ht*b+val(t[i]))%m    pwr=1    for i in range(lp-1):        pwr=pwr*b%one thousand     if hp==ht: print 0    for i in range(1,lt-lp+i):       #rolling hash       #remove character i-one:       ht=(ht-val(t[i-1])*pwr)%m       ht=(ht+m)%m       #add character i+|P|-one       ht=(ht*b+val(t[i+lp-one]))%m       if ht==hp: impress i

Annotation that in the code above, we are finding matches with high probability (because of hash standoff). It is possible to increase the probability using ii modulos, but in programming contests usually one modulo is plenty (given a modulo how to generate two different strings with the same hash?).

Related Online Judges Bug

A Needle in the Haystack
Tavas and Malekas
Monk and Match Making
Matching

Boosted Resources

Introduction to Cord Searching Algorithms
Z Algorithm

hicksprou1944.blogspot.com

Source: https://www.hackerearth.com/practice/notes/exact-string-matching-algorithms/

Probability of Finding Same Substring Again

0 Response to "Probability of Finding Same Substring Again"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel