Regex help (50th time already, sorry)

		VERTEX("^" + ".*?" + "v" +
				"(\\s-{0}?-{0,1}?\\d{1,}?){3,4}?" + 
//				"\\s(-{0,1}?\\d+?)" + 
//				"\\s(-{0,1}?\\d+?)" + 
//				"\\s{0,}?"+"(-{0,1}?\\d+?){0,1}?" + 
				".*?" + "$"),

Is supposed to find:
v x y z w

Where ‘w’ is optional.
Initially I got it working; however, I was trying to simplify it greatly by doing this:

“(\s-{0}?-{0,1}?\d{1,}?){3,4}?”

Here is the problem.
I break up the original string into:

(\s-{0}?-{0,1}?\d{1,}?) + {3,4}?

1st half is supposed to look for something like this:
" -133"

The second part of the string is supposed to find that pattern a minimum of 3 times and a max of 4.
However the output is all wrong.
Why?

Sample output:

1.0, 0.0, 0.0, 0.0, 
-1.0, 0.0, 0.0, 0.0, 
1.0, 0.0, 0.0, 0.0, 
-1.0, 0.0, 0.0, 0.0, 
1.0, 0.0, 0.0, 0.0, 
-1.0, 0.0, 0.0, 0.0, 
1.0, 0.0, 0.0, 0.0, 
-1.0, 0.0, 0.0, 0.0, 

How it should look:

1 1 1
1 1 -1
1 -1 1
1 -1 -1
-1 1 1
-1 1 -1
-1 -1 1
-1 -1 -1

Every time I need regex it tears my heart out.
It’s the one thing that really cracks me.

“v x y z w?”

that’s it, no?

Really. That’s why regexps are great: they’re really easy for handling lots of literal text.

Unfortunately not great for handling dynamic amounts of text.

I understand where he is trying to go from the regex examples… it seems that the “v” is literal, but x,y,z, and w are numbers in the real data. (Thus the \d in the regex for “digit”) Right?

Yes.

They can be negative numbers too.

There are several constructs that look suspicious in your regex :stuck_out_tongue:


(\\s-{0}?-{0,1}?\\d{1,}?) + {3,4}?

The last ‘?’ seems to make no sense (at least to me). I think you wanted to use reluctant matches to avoid getting the greatest possible match, but I don’t see the application here. Also ‘-{0}?’ seems a bit odd…

Could you post an excerpt of the data you’d like to parse?

Regex! Yep, it’s cool but it’s (1) hard to write and (2) hard to read. Why not make it easy on
yourself and others? You have to show me first some people that really understand it to
convince me it is great.

How about something easier:


String line = ...
StringTokenizer st = new StringTokenizer(line, " ");
st.nextToken(); // skip the leading "v"
int x = Integer.parseInt(st.nextToken());
int y = Integer.parseInt(st.nextToken());
int z = Integer.parseInt(st.nextToken());
if (st.hasMoreTokens()) { // an optional fourth value is present
   w = ...
}

I would say that’s easier to read and understand?

Sometimes the data can be 4 dimensional.
x, y, z, w.

v 1 1 1
v 1 1 -1
v 1 -1 1
v 1 -1 -1
v -1 1 1
v -1 1 -1
v -1 -1 1
v -1 -1 -1
f 1 3 4 2
f 5 7 8 6
f 1 5 6 2
f 3 7 8 4
f 1 5 7 3
f 2 6 8 4

Hmm:


(\\s-?[^\\s]+){3,4}

But I have to say kingaschi might be right, since you must have code to process the regex anyway… :wink:

My code processes it nicely and I understand the basics of how regex works:
"^" + ".*?" + "v\s(\d+?)\s(\d+?)\s(\d+?)\s(\d+?)" + ".*?" + "$"

However when it comes down to simplifying the above string to cater for a number of variables then it gets difficult.

cylab:
(\s-?[^\s]+){3,4} = doesn’t work, gives me the same result as I already have.

Do you mind posting a complete example with the corresponding Java code?

I have not really thought about the whole context, but I am afraid the problem lies in referencing the match brackets. If you use a single bracket with {m,n}, you won’t (to my knowledge) be able to reference the individual matches; only the last match will be returned. So I think you are stuck with your working “expanded” version of the regex.

Edit:
s/match bracket/capturing group/g :wink: so I am essentially saying the same as pepijnve below :slight_smile:

I think your simplified version doesn’t work as expected because the second version defines fewer capturing groups than the first one.
The original regex you posted

"\\s(-{0,1}?\\d+?)\\s(-{0,1}?\\d+?)\\s(-{0,1}?\\d+?)\\s{0,}?(-{0,1}?\\d+?){0,1}?"

defines four capturing groups, one for x, y, z and w respectively

The simplified one

"(\\s-{0}?-{0,1}?\\d{1,}?){3,4}?"

defines only one capturing group that is matched 3 or 4 times. The contents of this capturing group will be the last successful match.
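To make that concrete, here is a minimal sketch of the behaviour, assuming plain integers as in the sample data. The class name RepeatedGroupDemo, the helper lastGroup and the cleaned-up pattern are made up for illustration; they are not the original poster’s code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // A single capturing group that is repeated 3 or 4 times.
    static final Pattern REPEATED = Pattern.compile("^v(\\s-?\\d+){3,4}$");

    // Returns the contents of group 1 after a full match, or null if the line does not match.
    static String lastGroup(String line) {
        Matcher m = REPEATED.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Brackets make the leading whitespace inside the group visible.
        System.out.println("[" + lastGroup("v 1 -1 1") + "]");   // prints "[ 1]"
        System.out.println("[" + lastGroup("v 1 -1 1 7") + "]"); // prints "[ 7]"
    }
}
```

Whatever the earlier repetitions matched is gone by the time the match finishes: group 1 only holds the final repetition, which is why x and y cannot be read back from the simplified pattern.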

Also I’m not sure why you are using all those {m,n} notations and reluctant quantifiers. Personally I would write it as

Pattern p = Pattern.compile(
    "v"
    + "\\s(-?\\d*(?:,\\d*)?)"
    + "\\s(-?\\d*(?:,\\d*)?)"
    + "\\s(-?\\d*(?:,\\d*)?)"
    + "(?:\\s(-?\\d*(?:,\\d*)?))?"
);

which matches stuff of the form “v x y z w” where x, y, z and w can be any number and w is optional.
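For what it’s worth, here is a rough sketch of how four explicit groups could be read back. It assumes integer-only values as in the sample data, so the number part is narrowed to -?\d+ (a variant, not the exact pattern above), and the names VertexLineParser and parse are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VertexLineParser {
    // One capturing group per component; the fourth (w) sits in an optional non-capturing wrapper.
    static final Pattern VERTEX = Pattern.compile(
        "v"
        + "\\s+(-?\\d+)"
        + "\\s+(-?\\d+)"
        + "\\s+(-?\\d+)"
        + "(?:\\s+(-?\\d+))?"
        + "\\s*");

    // Returns {x, y, z} or {x, y, z, w}, or null if the line is not a vertex line.
    static int[] parse(String line) {
        Matcher m = VERTEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // group(4) is null when the optional w part did not take part in the match.
        boolean hasW = m.group(4) != null;
        int[] out = new int[hasW ? 4 : 3];
        for (int i = 0; i < out.length; i++) {
            out[i] = Integer.parseInt(m.group(i + 1));
        }
        return out;
    }
}
```

Checking group(4) for null is what distinguishes the three-value case from the four-value one.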

So there is absolutely no way to simplify the case?

I am afraid not. Just out of curiosity, why are you so obsessed with regexes in this context? kingaschi’s proposal of using a StringTokenizer seems a valid solution…

You keep using that term. I do not think it means what you think it means.

I think you may be overgeneralizing the problem. Correct me if I’m wrong, but all you want to do is split each line into each “word”, right? Then you need to test if you have three or four columns on that line. So something like this should do the trick:


String line = in.readLine();
String[] values = line.trim().split("\\s+"); // split on runs of whitespace; values[0] is the "v" or "f" tag

if (values.length == 4) {
    // tag plus three values: do something
} else {
    // do something else
}

I assume at some point you will want to convert some of the Strings to integers, so you’ll need to remember to use Integer.parseInt(values[i]).
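A slightly fuller sketch of that idea, assuming every line is a one-word tag followed by integers as in the posted sample. The names SplitParser and parseValues are invented here for illustration.

```java
class SplitParser {
    // Splits a line such as "v 1 -1 1" on whitespace and converts everything after the tag to ints.
    static int[] parseValues(String line) {
        String[] tokens = line.trim().split("\\s+");
        // tokens[0] is the tag ("v" or "f"); the remaining tokens are the numbers.
        int[] values = new int[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            values[i - 1] = Integer.parseInt(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] v = parseValues("v 1 -1 1");
        System.out.println(v.length == 3 ? "three values" : "four values"); // prints "three values"
    }
}
```

The length of the returned array says whether the line carried three or four values, so no separate case is needed for the optional w.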

Does that answer your question?

Let me start off by saying that I am partial to StringTokenizers! BUT, if you read a line and pass that to a tokenizer, then getting the first token will tell you whether the data will have 3 or 4 parts to it. Unless the sample you posted is not representative! For example:


StringTokenizer tokenizer = new StringTokenizer(line);
String temp = tokenizer.nextToken();
if(temp.equals("v"))
{
    // 3D Stuff
}
else if(temp.equals("f"))
{
    // 4D Stuff
} 

I was hoping regex would be more powerful.
I don’t like string tokenizers because regex is more powerful and, IMO, much more natural to work with.
I know how to use regex normally; however, I wanted to simplify my regex further so that it could handle a variable number of values.
Now that I know that it isn’t possible, I guess it’s time to lobby Sun for added regex functionality. :slight_smile:

Thanks guys.

Regex might be more powerful, but it’s a waste of resources in this case.

Your OBJ loader (as that’s what you’re parsing) will be much faster when you roll your own plain simple algorithm.

It matters when loading large model files on the fly.
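As a rough idea of what “rolling your own” could look like, here is a sketch of a hand-written integer scanner. It assumes values are separated by single or multiple spaces and fit in an int (no overflow checks), and the names ManualIntScanner and parseInts are made up. Whether it is actually faster for a given model file would need measuring.

```java
class ManualIntScanner {
    // Reads space-separated ints from line, starting at index start, into out.
    // Stops when out is full or the line ends; returns how many values were stored.
    static int parseInts(String line, int start, int[] out) {
        int n = line.length();
        int i = start;
        int count = 0;
        while (i < n && count < out.length) {
            // Skip the spaces in front of the next number.
            while (i < n && line.charAt(i) == ' ') {
                i++;
            }
            if (i >= n) {
                break; // only trailing spaces were left
            }
            boolean negative = false;
            if (line.charAt(i) == '-') {
                negative = true;
                i++;
            }
            int digitsStart = i;
            int value = 0;
            while (i < n && line.charAt(i) >= '0' && line.charAt(i) <= '9') {
                value = value * 10 + (line.charAt(i) - '0');
                i++;
            }
            // Reject a missing digit run or a number glued to a non-space character.
            if (i == digitsStart || (i < n && line.charAt(i) != ' ')) {
                throw new NumberFormatException("bad number at index " + i + " in: " + line);
            }
            out[count++] = negative ? -value : value;
        }
        return count;
    }
}
```

A caller would check the first character of the line for the tag and then call, for example, parseInts(line, 1, buffer) with a four-element buffer; the return value tells it whether three or four values were present.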