Very slow ASCII text parser

A 2.89MB text file takes around a minute to parse on my A64 3000+, 1GB RAM PC.
Yes, I’m using regex to parse the files with a BufferedReader, line by line.
Please don’t tell me I have to parse the file byte by byte?
Tips, tricks?

Here is my list already:
Get rid of regex and parse byte by byte.

Are there any alternatives?

What’s the size of the buffer on your BufferedReader? I would think disk access would be the bottleneck initially.

Could always just read it all into a buffer yourself and then stream from memory instead?

Kev

Do you use an explicit regexp Pattern object, or recreate it (implicitly) on each line?

Sample code would help.

Like whome says… make sure you compile the pattern once and reuse it, if possible.

I do something like this:


	// Read the whole file into memory, decode it once, then run a single
	// compiled pattern over the full character buffer instead of going line by line.
	public static void main(String[] args) throws Exception {
		File f = new File(args[0]);
		FileInputStream fis = new FileInputStream(f);
		ByteBuffer bb = ByteBuffer.allocate((int) f.length());
		fis.getChannel().read(bb);     // slurp the entire file in one read
		bb.flip();                     // prepare the buffer for reading
		CharBuffer cb = Charset.forName("UTF-8").decode(bb);   // bytes -> chars in one go

		Pattern myPat = Pattern.compile(your_pattern_here);    // compile the pattern once
		Matcher mat = myPat.matcher(cb);
		while (mat.find()) {
			do_something_with(mat.group());
		}
		fis.close();
	}

Depending on what you are trying to do, it may also be a lot faster to parse with a real recursive-descent parser than with regexp…

Regexp parsing is a huge performance killer. As Jeff said, use a proper recursive descent parser; that will make things much easier and faster - particularly if you can find one that does it without resorting to creating Strings for each token. JavaCC is normally the standard choice, but for big files that String creation can be quite a performance issue, as we’ve found in Xj3D’s VRML parsing.
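
As a rough illustration of the “no Strings per token” point (this is not the Xj3D code, and the class name is invented): a float can be scanned straight out of a char[] without any substring() or Float.parseFloat() calls, assuming plain sign/digits/fraction numbers with no exponent.

    // Sketch only: scans floats directly from a char[] so no String is created per token.
    // Less precise than Float.parseFloat() for long mantissas, and no exponent support.
    class FloatScanner {
        private final char[] buf;
        private int pos;

        FloatScanner(char[] buf) { this.buf = buf; }

        // Assumes the buffer actually holds a number after the next run of whitespace.
        float nextFloat() {
            while (Character.isWhitespace(buf[pos])) pos++;          // skip whitespace
            boolean neg = buf[pos] == '-';
            if (buf[pos] == '-' || buf[pos] == '+') pos++;
            float value = 0f;
            while (pos < buf.length && buf[pos] >= '0' && buf[pos] <= '9') {
                value = value * 10 + (buf[pos++] - '0');             // integer part
            }
            if (pos < buf.length && buf[pos] == '.') {
                pos++;
                float scale = 0.1f;
                while (pos < buf.length && buf[pos] >= '0' && buf[pos] <= '9') {
                    value += (buf[pos++] - '0') * scale;             // fractional part
                    scale *= 0.1f;
                }
            }
            return neg ? -value : value;
        }
    }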

Thanks guys.
Fixed the problem.

I’ve reverted to a recursive tree-parsing algorithm.
Very fast.

I load 2048 characters of data each loop and go through it, sorting and parsing.
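
A minimal sketch of what reading in 2048-character chunks can look like (not KILER’s actual loader; processChunk is a made-up placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ChunkedRead {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            char[] chunk = new char[2048];
            int n;
            // Pull 2048 chars per iteration; the real parser has to cope with
            // tokens that straddle a chunk boundary.
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                processChunk(chunk, n);   // hand the n valid chars to the parser
            }
            in.close();
        }

        static void processChunk(char[] buf, int len) {
            // placeholder for the actual sort/parse step
        }
    }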

So how come TextPad does it so bloody fast then? 2MB files in the blink of an eye.

Cas :slight_smile:

Because it uses C? :stuck_out_tongue:
j/k

I doubt that using C leads to huge performance gains in regex processing. The Sun regex implementation has fairly good performance, so I suspect you weren’t using precompiled patterns, but something like line.matches("");
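
For illustration (the number pattern here is just an example, not what KILER used): String.matches() compiles its regex again on every call, while a precompiled Pattern is built once and reused for every line.

    import java.util.regex.Pattern;

    public class PrecompiledExample {
        private static final Pattern NUMBER = Pattern.compile("-?\\d+(\\.\\d+)?");

        static boolean slow(String line) {
            return line.matches("-?\\d+(\\.\\d+)?");    // compiles the pattern on every call
        }

        static boolean fast(String line) {
            return NUMBER.matcher(line).matches();      // reuses the pattern compiled once above
        }
    }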

I was using my own pre-compiled regex stuff.

I assume this is a one-regexp pass across the file?

Big difference between that and trying to recognize EVERY token in the file by way of a group of regexps.

Do the same simple regex stuff TextPad does, in Java, and you’ll see it isn’t any slower.
I know KILER’s remark about C was probably tongue in cheek, but to give an example: I made some translation software in Java for a client which depended quite heavily on regex to parse the input files (and had to translate huge numbers of large text files to massive XML files), and it performed way, way better than their proof of concept written in C (which also used regex for parsing). And the finished product took me, working alone, less than half the time to write than the C-based, half-functional proof of concept had taken. The regexp part in my program didn’t even rank high in the profiler output.
I guess it must have been a special case for KILER which required a specialized parser for good performance, because my experience is that there’s nothing wrong with the performance of Java’s regexp implementation.

I’m very sure you guys would be interested in this. :slight_smile:

AbstractModelLoader:
http://members.optusnet.com.au/ksaho/show/utils/AbstractModelLoader.java

PlyLoader:
http://members.optusnet.com.au/ksaho/show/utils/PlyLoader.java

Some of you will probably wonder, “Why now?”. Well, my design was completely different before I scrapped it; now I’ve come back to the issue I was having and done further analysis of the situation. Previously I really didn’t have much time to spend on it due to other work.
I had tested this same code against two different patterns for floating point numbers.

My current floating point number pattern in the code, and the basic digit pattern “(\d+?)”.
When my class was tested against the floating point pattern it was dead slow; against the basic digit pattern it was practically instant.

Originally I had taken both individual and average timings of each method in that class, and they all came down to performance in the n×10⁻⁶ and n×10⁻⁴ range respectively.
I have come to the conclusion that my pattern is the problem here; the question is, can it be improved?
I’ve looked into the pattern compile flags and I’m going to test them further, however it takes 30 seconds to run this code.
Try it. :slight_smile:
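
KILER’s actual pattern isn’t posted, so purely as a hedged example: a precompiled floating-point pattern using possessive quantifiers (++, ?+), so the engine can’t backtrack inside a number. It assumes a leading digit, so “.5” would not match.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FloatPatternDemo {
        // Sign, digits, optional fraction, optional exponent - all possessive.
        private static final Pattern FLOAT =
                Pattern.compile("[-+]?+\\d++(\\.\\d++)?+([eE][-+]?+\\d++)?+");

        public static void main(String[] args) {
            Matcher m = FLOAT.matcher("0.5 -3.14 1e-6 42");
            while (m.find()) {
                System.out.println(Float.parseFloat(m.group()));
            }
        }
    }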

Using regexp for any kind of parsing is asking for major issues, performance-wise.

For ply files, even a StreamTokenizer is enough to parse simple cases. Below is some code I wrote a long time ago (it works with only one specific format; I think there are at least 2 or 3 variations, and doing a full loader is a bit more complicated).

For something with acceptable performance that is also fast to develop, use JavaCC. For really high performance, you will need to write everything yourself from scratch. (I used JavaCC to load NWN ASCII files, then rewrote the loader by hand and got a 3x improvement in speed…)

Here you can check out my handcrafted parser for nwn files
http://cvs.sourceforge.net/viewcvs.py/nwn-j3d/nwn/src/net/sf/nwn/loader/ManualParser.java?rev=1.16&view=markup

and below is some crappy code for ply parsing I found in my archives


        // Tokenize the .ply file with a StreamTokenizer: treat every character as a
        // word character and split only on spaces, tabs, CR and LF.
        BufferedReader br = new BufferedReader(new FileReader(filename));

        StreamTokenizer st = new StreamTokenizer(br);
        st.resetSyntax();
        st.eolIsSignificant(false);
        st.wordChars(0, 255);
        st.whitespaceChars(' ', ' ');
        st.whitespaceChars('\n', '\n');
        st.whitespaceChars('\r', '\r');
        st.whitespaceChars('\t', '\t');
        float[] vertices = null;
        int[] faces = null;

        // Header pass: grab the vertex and face counts, stop at end_header.
        // (Assumes the header declares both "element vertex" and "element face".)
        while (true) {
            int token = st.nextToken();
            if (token == StreamTokenizer.TT_EOF)
                break;
            if (token != StreamTokenizer.TT_WORD)
                continue;

            if (st.sval.equalsIgnoreCase("element")) {
                st.nextToken();
                if (st.sval.equalsIgnoreCase("vertex")) {
                    st.nextToken();
                    vertices = new float[3 * Integer.parseInt(st.sval)];
                } else if (st.sval.equalsIgnoreCase("face")) {
                    st.nextToken();
                    faces = new int[3 * Integer.parseInt(st.sval)];
                }
            } else if (st.sval.equalsIgnoreCase("end_header")) {
                break;
            }
        }

        // Vertex data: x, y, z, then two extra per-vertex properties that are skipped.
        for (int i = 0; i < vertices.length; i += 3) {
            st.nextToken();
            vertices[i] = Float.parseFloat(st.sval);
            st.nextToken();
            vertices[i + 1] = Float.parseFloat(st.sval);
            st.nextToken();
            vertices[i + 2] = Float.parseFloat(st.sval);
            st.nextToken();
            st.nextToken();
        }

        // Face data: skip the leading vertices-per-face count (triangles assumed),
        // then read the three indices.
        for (int i = 0; i < faces.length; i += 3) {
            st.nextToken();
            st.nextToken();
            faces[i] = Integer.parseInt(st.sval);
            st.nextToken();
            faces[i + 1] = Integer.parseInt(st.sval);
            st.nextToken();
            faces[i + 2] = Integer.parseInt(st.sval);
        }


I rewrote my code once again.
It parses the 3MB file instantly, as far as I can tell. :slight_smile:
I haven’t even optimised it yet, not that I need to at this point.
I am still using regex, however I have limited its usage purely to gathering header data.

After I have refined my design (again) I will then write a parser that parses a 508MB model.
Performance is imperative but correct values come first, followed by a unit test. :slight_smile:
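
As a rough sketch of that split (not KILER’s actual loader; the header pattern is made up, and the header format follows the PLY files discussed above): regex handles the handful of header lines, while the bulk of the data goes through StringTokenizer and Float.parseFloat().

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.StringTokenizer;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HeaderOnlyRegex {
        private static final Pattern ELEMENT = Pattern.compile("element\\s+(\\w+)\\s+(\\d+)");

        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            int vertexCount = 0;
            String line;

            // Header: only a handful of lines, so the regex cost is irrelevant here.
            while ((line = in.readLine()) != null && !line.startsWith("end_header")) {
                Matcher m = ELEMENT.matcher(line);
                if (m.find() && m.group(1).equals("vertex")) {
                    vertexCount = Integer.parseInt(m.group(2));
                }
            }

            // Body: no regex at all, just tokenize each line and parse the numbers.
            float[] vertices = new float[vertexCount * 3];
            for (int i = 0; i < vertexCount; i++) {
                StringTokenizer t = new StringTokenizer(in.readLine());
                vertices[i * 3]     = Float.parseFloat(t.nextToken());
                vertices[i * 3 + 1] = Float.parseFloat(t.nextToken());
                vertices[i * 3 + 2] = Float.parseFloat(t.nextToken());
            }
            in.close();
            System.out.println("Read " + vertexCount + " vertices");
        }
    }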

For anyone interested, this topic was discussed at great length over at this thread:

http://www.javalobby.org/java/forums/m91813535.html#91813535