Discussion:
worst-case example for the Boyer-Moore algorithm
(too old to reply)
Arash Farzan
2009-03-24 21:56:18 UTC
Permalink
Some students observed that the example given in class does not produce
the worst-case behavior of the Boyer-Moore algorithm. In fact the
example given beats the suffix shift heuristic but the character jump
heuristic compensates for it.

It is indeed not easy to beat both heuristics at the same time: this
tells us about the power of the Boyer-Moore algorithm in practice.


However, I have this construction which theoretically beats both
heuristics and achieves the worst-case O(nm) running time:

T = (bb(ab)^(k-1))*
P = (ab)^k


for instance: T = bbababab bbababab bbababab bbababab ... (white spaces
are put for ease of read)
P = abababab


You can observe that the majority of shifts are shift by twos which
gives a O(nm) running time.
Robert Xiao
2009-03-25 04:02:41 UTC
Permalink
Post by Arash Farzan
Some students observed that the example given in class does not produce
the worst-case behavior of the Boyer-Moore algorithm. In fact the
example given beats the suffix shift heuristic but the character jump
heuristic compensates for it.
It is indeed not easy to beat both heuristics at the same time: this
tells us about the power of the Boyer-Moore algorithm in practice.
However, I have this construction which theoretically beats both
T = (bb(ab)^(k-1))*
P = (ab)^k
for instance: T = bbababab bbababab bbababab bbababab ... (white spaces
are put for ease of read)
P = abababab
You can observe that the majority of shifts are shift by twos which
gives a O(nm) running time.
I'm sort of confused (as usual).

T=b*
P=a(b^k)

...bbbbbbbbbbbbbbb...
.....abbbbbbbb...
.....X^^^^^^^^...
The mismatch causes the character shift to attempt to shift "to line up
with the last occurrence of _b_ in P" where b is the character
mismatched in T (which happens to be a "b"). But this would require
shifting P *left* k positions. So, instead, it shifts by one position
right -- this means that a pure character shift performs poorly.

Suffix shift, though, says that we should line up the characters already
matched (b^k) with an occurrence of (b^k) further left; this is
impossible, so the suffix shift will shift the whole way through.

...or did I miss something?


There's another version of Boyer-Moore which is sometimes cited, and
which (supposedly) has worst-case linear run-time. In this version, the
suffix shift shifts to the left to align the already-matched suffix, but
also specifies that the character immediately to the left of that suffix
(i.e. the one which mismatched in T) must not be equal to the character
in P in the current position -- this ensures that repeats, like that in
the second example, are skipped over.

Using that example:
bbabababbbababab
..abababab..
........X^

Ordinary suffix shift would look for the next "b" in P, which is a shift
of two. The improved suffix shift would look for the next "(not-a)b" in
P, which doesn't occur at all, so it would shift right past.

This slight modification, as it turns out, was suggested by the diagram
on slide 167 -- the "ordinary" suffix shift is described in words, but
the "improved" suffix shift is what appears to be demonstrated (if you
assume that a!=c).

Robert
Arash Farzan
2009-03-25 14:55:31 UTC
Permalink
Post by Robert Xiao
Post by Arash Farzan
Some students observed that the example given in class does not
produce the worst-case behavior of the Boyer-Moore algorithm. In fact
the example given beats the suffix shift heuristic but the character
jump heuristic compensates for it.
It is indeed not easy to beat both heuristics at the same time: this
tells us about the power of the Boyer-Moore algorithm in practice.
However, I have this construction which theoretically beats both
T = (bb(ab)^(k-1))*
P = (ab)^k
for instance: T = bbababab bbababab bbababab bbababab ... (white
spaces are put for ease of read)
P = abababab
You can observe that the majority of shifts are shift by twos which
gives a O(nm) running time.
I'm sort of confused (as usual).
T=b*
P=a(b^k)
...bbbbbbbbbbbbbbb...
.....abbbbbbbb...
.....X^^^^^^^^...
The mismatch causes the character shift to attempt to shift "to line up
with the last occurrence of _b_ in P" where b is the character
mismatched in T (which happens to be a "b"). But this would require
shifting P *left* k positions. So, instead, it shifts by one position
right -- this means that a pure character shift performs poorly.
Suffix shift, though, says that we should line up the characters already
matched (b^k) with an occurrence of (b^k) further left; this is
impossible, so the suffix shift will shift the whole way through.
right.
Post by Robert Xiao
...or did I miss something?
no.
Post by Robert Xiao
There's another version of Boyer-Moore which is sometimes cited, and
which (supposedly) has worst-case linear run-time. In this version, the
suffix shift shifts to the left to align the already-matched suffix, but
also specifies that the character immediately to the left of that suffix
(i.e. the one which mismatched in T) must not be equal to the character
in P in the current position -- this ensures that repeats, like that in
the second example, are skipped over.
bbabababbbababab
..abababab..
........X^
Ordinary suffix shift would look for the next "b" in P, which is a shift
of two. The improved suffix shift would look for the next "(not-a)b" in
P, which doesn't occur at all, so it would shift right past.
This slight modification, as it turns out, was suggested by the diagram
on slide 167 -- the "ordinary" suffix shift is described in words, but
the "improved" suffix shift is what appears to be demonstrated (if you
assume that a!=c).
Robert
That slight modification (looking for a no non-fit character in front of
a good suffix) makes the algorithm to perform linearly. The version
presented in class (which is the original version) does not require a
different character in front and therefore performs in O(nm). And yes,
there is no implicit assumption on character 'c' being the same or
different from 'a' in that slide).
ricky fan
2021-04-15 18:58:46 UTC
Permalink
Post by Arash Farzan
Post by Robert Xiao
Post by Arash Farzan
Some students observed that the example given in class does not
produce the worst-case behavior of the Boyer-Moore algorithm. In fact
the example given beats the suffix shift heuristic but the character
jump heuristic compensates for it.
It is indeed not easy to beat both heuristics at the same time: this
tells us about the power of the Boyer-Moore algorithm in practice.
However, I have this construction which theoretically beats both
T = (bb(ab)^(k-1))*
P = (ab)^k
for instance: T = bbababab bbababab bbababab bbababab ... (white
spaces are put for ease of read)
P = abababab
You can observe that the majority of shifts are shift by twos which
gives a O(nm) running time.
I'm sort of confused (as usual).
T=b*
P=a(b^k)
...bbbbbbbbbbbbbbb...
.....abbbbbbbb...
.....X^^^^^^^^...
The mismatch causes the character shift to attempt to shift "to line up
with the last occurrence of _b_ in P" where b is the character
mismatched in T (which happens to be a "b"). But this would require
shifting P *left* k positions. So, instead, it shifts by one position
right -- this means that a pure character shift performs poorly.
Suffix shift, though, says that we should line up the characters already
matched (b^k) with an occurrence of (b^k) further left; this is
impossible, so the suffix shift will shift the whole way through.
right.
Post by Robert Xiao
...or did I miss something?
no.
Post by Robert Xiao
There's another version of Boyer-Moore which is sometimes cited, and
which (supposedly) has worst-case linear run-time. In this version, the
suffix shift shifts to the left to align the already-matched suffix, but
also specifies that the character immediately to the left of that suffix
(i.e. the one which mismatched in T) must not be equal to the character
in P in the current position -- this ensures that repeats, like that in
the second example, are skipped over.
bbabababbbababab
..abababab..
........X^
Ordinary suffix shift would look for the next "b" in P, which is a shift
of two. The improved suffix shift would look for the next "(not-a)b" in
P, which doesn't occur at all, so it would shift right past.
This slight modification, as it turns out, was suggested by the diagram
on slide 167 -- the "ordinary" suffix shift is described in words, but
the "improved" suffix shift is what appears to be demonstrated (if you
assume that a!=c).
Robert
That slight modification (looking for a no non-fit character in front of
a good suffix) makes the algorithm to perform linearly. The version
presented in class (which is the original version) does not require a
different character in front and therefore performs in O(nm). And yes,
there is no implicit assumption on character 'c' being the same or
different from 'a' in that slide).
lkkl

Loading...