Page 1 of 2

Stack and Heap

Posted: Mon 05 Oct, 2009 9:42 pm
by Wesley
I won't go into too much detail of what I'm doing right now for the sake of time.

I was wondering if anyone has any good ideas on memory usage regarding arrays. I am implementing the Needleman-Wunsch algorithm for comparing two strings. The first and most easy idea to implement is that of a 2D array storing values, then travelling back to recover the best string comparison.

My problem is that I have to create an array of size string1.length+1 by string2.length+1; which is massive and requires a lot of memory.

I was thinking of travelling with a 2*2 array to find the edit distance between two strings and then create a "Cell" object for each cell and have a pointer to where the "Cell" object can travel back to recover the best string comparison.

Any ideas? Any questions?

Re: Stack and Heap

Posted: Tue 06 Oct, 2009 2:56 am
by polarBody
Wesley wrote:I was thinking of travelling with a 2*2 array to find the edit distance between two strings and then create a "Cell" object for each cell and have a pointer to where the "Cell" object can travel back to recover the best string comparison.
Could you elaborate on this a little more? I'm not sure I understand your plan of attack :)

Re: Stack and Heap

Posted: Tue 06 Oct, 2009 12:34 pm
by King Harold
How long are the strings?
What language will this be in?
And where/how do stacks an heaps come into this - do you mean The Heap and The Stack? Since your array is dynamically allocated (dynamic size, so.., well you could use alloca.. but The Stack is typically only about 1MB) it would be on the heap but that shouldn't be a problem..

How about a sparse matrix?

Re: Stack and Heap

Posted: Tue 06 Oct, 2009 1:19 pm
by Wesley
Okay, a more detailed description of the goal is here.

1) The strings can be as long as 500,000 characters long (or longer). My goal is to make my program work for 500,000 character-long strings.
2) I am coding in Java. That was a good question, since Java handles memory quite differently from other languages (such as C++ and C).
3) I am unsure what you mean by
"stacks an heaps come into this - do you mean The Heap and The Stack?"

You see, my problem is this. Whenever I have to compare two strings I need a grid of integers: opt[N+1][M+1] where N and M are the sizes of the two strings being compared. Do you see where this can get really ugly?

When I want to compare a string of size N or M (assuming the two are the same) that is even 100,000 characters long, I need a 100,000*100,000 grid of int values. That means I need 100,000*100,000*(4 bytes) = 40,000,000,000 bytes! That's roughly 40,000 MB.

So, my goal is NOT to use a grid in this manner. So, I'm thinking of using "Cell" objects which have references to their previous cell. I'm not sure if this will work, because I would need to have separate "Cell" objects so they don't get collected by the garbage collector.

Re: Stack and Heap

Posted: Tue 06 Oct, 2009 9:25 pm
by darkstone knight
cant you just lz77 compress the entire.. thing?

modified lz77 seems like the perfect algorithm for comparing strings strings :D

(then again, im far too lazy to find out what proper string alignement is :P )

Re: Stack and Heap

Posted: Tue 06 Oct, 2009 10:39 pm
by King Harold
Well I don't see how lz77 would come in.. You'd still need to store those grid entries "somewhere", and that "somewhere" had better not get too big..

Anyway this is DP, and with DP you can always cheat - for example, instead of just filling a grid, you could do it the "old fashioned" way: recursion with memoization (and then treat the memoization "database" as a limited-size cache)
That's possible..

Anyway, this algorithm uses 2 passes, 1 to fill a matrix and 1 to go back and "follow the optimal path". But that doesn't mean that you actually have to do that, what's wrong with building the optimal path while you're filling that matrix? Nothing, right? And if you do that, you may find that you only need to store row i-1 and row i itself (and a little something extra to remember which "way you went" at every "crossroad")

Then again it is rather late, so I may just be spewing nonsense.

This trick applies to many DP algorithms though

Re: Stack and Heap

Posted: Wed 07 Oct, 2009 12:27 am
by Wesley
King Harold, you are now on the same track as me. Wow. We are thinking the exact same thing!

I am trying to implement dynamic programming. That's where my idea of the "Cell" objects comes in. Where each cell progressively has a path back to the beginning (or in this case the end of the strings).

One of my questions lies in this cell communication. When does the Java garbage collector actually scoop up everything (it's after)? Is it after a method is finished? If that is so, this plan won't be too difficult.

Re: Stack and Heap

Posted: Wed 07 Oct, 2009 1:49 am
by polarBody
The Java garbage collector will only reclaim allocated data if there is no longer any reference to the data. For example:

Code: Select all

String s1 = new String("blah");
String s2 = new String("blah blah");
s1 = s2;
After the third statement, both s1 and s2 are pointing to the "blah blah" string, and nothing is pointing to the "blah" string, so the garbage collector will free up the "blah" string since it is unreachable to the programmer after the third line. Now imagine that the following code is executed after the code above:

Code: Select all

s2 = new String("flippy floppy");
Remember that s1 still points to "blah blah", so the "blah blah" string will not be deallocated.

As for whether the garbage collector will reclaim an object after a method ends, again it depends on whether or not a reference variable still points to the object at that point in the execution of the program.

Re: Stack and Heap

Posted: Wed 07 Oct, 2009 3:13 am
by Wesley
Hmm... From what I'm finding, the Needleman-Wunsch algorithm seems to only compute in O(MN) (M and N being the size of the two strings being compared) time and takes up at least MN memory (depending on the type of objects being used).

I'm not sure what more I can do.

I guess I could use shorts or even bytes instead of integers. That would reduce the size some, but not enough for a 500,000 character string.

I think the cell idea is the only way, but I'm quite sketchy on how I'd go about implementing it.

Re: Stack and Heap

Posted: Wed 07 Oct, 2009 4:55 am
by polarBody
Perhaps Hirschberg will come to the rescue.

Re: Stack and Heap

Posted: Wed 07 Oct, 2009 1:09 pm
by Wesley
That's awesome and crazy all at the same time!

I originally had the idea presented in the Hirschberg algorithm. I didn't think it would work. It's basically implementing the same thing as my second idea, the "Cell" objects, but with only the arrays and no "Cell"s.

Great! I will get working on research again. Thanks again!

Re: Stack and Heap

Posted: Fri 09 Oct, 2009 3:00 am
by Wesley
So much for working ahead! My professor just posted Hirschberg's paper on his algorithm for computing the LCS. LCS is somewhat different from the Edit Distance problem, but the algorithm can be used in the same fashion.

Why, oh why, must the diligent suffer!?

See the article here. :cry:

I'm kind of confused on how he slices the grid in twos...

Re: Stack and Heap

Posted: Fri 09 Oct, 2009 10:20 am
by King Harold
So.. you go to a Chinese college? :?

Re: Stack and Heap

Posted: Fri 09 Oct, 2009 1:07 pm
by Wesley
King Harold wrote:So.. you go to a Chinese college? :?
I'm not sure what you mean... I was simply looking around for how Hirschberg solved the problem. I spent an awful long time reading other articles and trying to figure out the algorithm, but to no avail. Until, I found Hirschberg's paper. But, by the time I found it, my professor announced that he'd post a paper about the Edit Distance problem, and he posted the very same article!

lol. I see what you mean now. That link isn't from my school. My professor used Blackboard to post the paper. I only posted a link to a public copy. I go to Boise State University. It's not saying much though. I'm finding more and more how uncompetitive BSU is... We are so behind in our computer science classes compared to other universities. I should've been learning about data structures and algorithms my freshman year, not my sophomore year.

EDIT: At least I have my own side web page! I just haven't had much time to add to it... I've never done anything with web development, but I like what I've done to it so far; at least it looks better than my CS professor's page.

Re: Stack and Heap

Posted: Fri 09 Oct, 2009 3:36 pm
by King Harold
Oh I see, sorry

btw, you're allowed to learn in your own time
That's what I did, in fact all of the stuff I've posted about here is stuff I learned on my own :)
And of course it's the only way to get to be actually better than the rest..
Good luck

Oh and edit distance is nice, it easily allows you to do the "2 rows"-trick, so that's cool :)