Link to home
Start Free TrialLog in
Avatar of zhshqzyc
zhshqzyc

asked on

Is there a c#/c++ code for suffix tree?

It is very hard to understand the concept of "suffix tree", I just want to get an existing c#/c++ program to run it directly. To save time and my brain cells.


Thanks.
Avatar of zhshqzyc
zhshqzyc

ASKER

I have a "Longest common substring problem".  Found a code here but it may not be proper.
I need to input two strings to compare.
Avatar of phoffric
>> I have a "Longest common substring problem"
See if this helps.
    http://marknelson.us/1996/08/01/suffix-trees/

I thought you may find also find this Longest Common Subsequence video interesting (not substring, although the example is a substring).

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-15-dynamic-programming-longest-common-subsequence/

In this link you also get a verbatim transcript as well as the slides (in PDF). It is an excellent presentation of Longest Common Subsequence using dynamic programming with recursive pseudo-code that runs very fast with memoization added in.
Longest Common Subsequence is not equivalent to "Longest Common Subsubstring".
It is helpful. But sometimes the url has virus and is blocked by antivirus. I am using trend office scan.
Sorry about that. I have Norton SEP. It did catch something yesterday. Maybe it was this. Only effect was that I lost sound (reboot fixed). There was a DEP catch. Virus scan revealed nothing. Do you want me to post the contents here or did you see enough of it?
Sounds good.
I mean can you post the contents? Thanks.
ok, but I'll just give the links to the code - unless you have a pretty printer that can handle newlines properly, the code is a mess.
ASKER CERTIFIED SOLUTION
Avatar of phoffric
phoffric

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you very much.
In another question on LCSubstrings, I posted the following some of which you may find useful.
========================================================
India Institute of Technology:
Lecture - 17 Case Study: Searching for Patterns
    http://www.youtube.com/watch?v=Zj_er99KMb8&feature=related

Lecture - 18 Tries
    http://www.youtube.com/watch?v=uhAUk63tLRM&p=BF3763AF2E1C572F

Lecture 18 on Tries gives good background on tries, compressed tries, and decent introduction to suffix trees (although some parts were not intelligible - either speaking too fast, or stopped speaking English). But that lecture should help in reading the article posted in the other Question.

Here are slides on Edward McCreight's 1976 algorithm
       http://www.math.tau.ac.il/~haimk/seminar00/iddo-McCreight_slides.ppt

I found this video lecture given by Esko Ukkonen, University of Helsinki. His algorithm builds the suffix tree in O(n) time.
    http://videolectures.net/aop05_ukkonen_sthmt/

Before considering longest common substring, consider string searching times. KMP finds a string in O(m+n) time, but if n is huge (e.g., complete works of Shakespeare), then O(m+n) is too long especially if there are multiple queries.

Using suffix tree, the search time for each query is O(m) after the tree is built. But the limiting factor is how long does it take to build the suffix tree. It was Esko Ukkonen who succeeded in building the tree in O(n) time, even though the number of substrings is O(n^2).

So, using the Ukkonen algorithm, then the first query to find a matching string takes O(m+n) time, but subsequent queries takes only O(m) times once the suffix tree representing the string of n symbols has been built.

I haven't worked out the fine details for the longest common substring when using a suffix tree, but consider..
You noticed that when building the suffix tree that there may be identical substrings, in which case you record the starting position of each occurrence. After building the first suffix tree for str1, you could then build suffix tree for str2. In fact, you could build it continuing on the first tree. And when you find an identical substring, you could keep track of its length; and if that length is the current max length, then set max length and the positions in the two input strings.