Hey I have a small app that constantly searchers through a dataBuffer looking for words and sentenses. One operation would be to get the location of the 20th occurance of the word ‘the’. I was wondering if there is a better structure to use rather than a stringBuffer. Maybe some kind of Hash structure? Since the buffer is large, the indexOf() operations takes time to execute, and I am trying to improve the performance. Anyone have any suggestions?
StringBuilder in Java 5 is an unsynchronized version.
If you are searching the same string multiple times there may be some optimizations to do. How large is the databuffer? You might consider getting the array of chars and writing your own search method.
When searching for long strings, like sentences in your case, there are some good optimization techniques to speed up the linear search.
Eg. lets say the data buffer contains the string:
“The quick brown fox jumps over the lazy dog.”
and you are searching for the string “over”
You can do more than just a linear search for the ‘o’ and then check for the ‘ver’…
you can check for a match on the LAST character and based on the character that you find in the buffer at the position where you will compare with the last character of your search string, you can decide to advance your search window by more than one for the next compare.
e.g.
"The "
“over”
the ‘r’ does not match the space (’ '), but not only that, you know that there is no space in your search string so you canjump your search window ahead by 4 characters right away.
“quic”
“over”
There is no ‘c’ in “over” so you can jump ahead by 4 again…
“k bro”
“over”
This time you compare the ‘r’ from “over” with the ‘o’ from “brown”… there IS an ‘o’ in over, but it is at the first position only… so you can advance your search window by 3.
"own "
“over”
again there is no space in “over”… advance by 4… etc.
When you find a character that is in your search phrase more than once, you advance your search window to align that character from the data buffer with the LAST occurance of the character in your search phrase.
As far as data structures, there are probably much better approaches but to understand which to use you need to analyze the expected behavior of your program.
If the data is large but not enormous, and you are going to be doing a grat many such searches, you might consider scanning the entire text once at building a hash table based index to all the words in it.
If the data is going to be too big, ofcourse, this will run into memory problems and you might need to do more sophisticated disk-based indexing… Simialrly if your only going to do one or two searchs then the cost of scanning the data and building the index might be more then the cost of your brute force search.
JK
Hum Jeff I am interested in trying the approach you are talking about. Lets say I store the following paragraph into a hash table (paragraph out of random cnn article)
“Iran’s nuclear programs are a source of contention with the West. Iran’s hard-line conservative government insists the programs are aimed at peaceful purposes, and that it has the right to restart nuclear facilities and enrich uranium for the production of nuclear energy.”
How exaclty should I store it in a hash table? I am guessing the id of each word is just the word number in the paragraph.
How would I search for a sentence in a hash table? I am guessing I get all the id’s for all the words and check to see if they are next to each other? Is that really going to be faster than an indexOf operation?
Hmm. i missed that you are lookign for more then one word at a time.
What I think I would do as a first cut is this:
Create a hash table of the form Map<String,List>.
The key is a word, the entry is a list of all the starting locations of that word in your text data.
I’d use StringTokenizer to scan through the String finding the words and building the table.
When you want to find a sentance, I’d use the table to find the first word in the sentance, then do a comparison between the string you are trying to match and the substring of the entire text that begins at that location.
ah ok, so I have a hash map with all of the locations of all the words. That way I dont have to do an indexOf operation. Then just use the StringBuffer object to compare against my seach string, checking at each location of the first word. Got it, thanks.