Does Java have any text parsing capabilities?

I would like to know if there are any text parsing capabilities built into Java?

StringTokenizer, String’s startsWith/endsWith, compareToIgnoreCase, regex stuff… :slight_smile:
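A quick sketch of those basics, in case it helps. The class name, helper method, and sample strings are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringBasicsDemo {
    // Splits a line on whitespace using StringTokenizer (hypothetical helper).
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("move north 10"));                    // [move, north, 10]
        System.out.println("report.doc".endsWith(".doc"));              // true
        System.out.println("#comment".startsWith("#"));                 // true
        System.out.println("PRINT".compareToIgnoreCase("print") == 0);  // true
    }
}
```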

If you’re asking for something else, be more specific.

There are also methods to parse all the primitive types from text (e.g. Integer.parseInt()).
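For example, something along these lines. The `parseIntOr` helper is my own invention to show the error handling, not a standard method:

```java
class ParsePrimitivesDemo {
    // Returns the parsed int, or the fallback if the text is not a valid integer.
    // (Hypothetical helper; Integer.parseInt itself throws NumberFormatException.)
    static int parseIntOr(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("42"));        // 42
        System.out.println(Double.parseDouble("3.5"));     // 3.5
        System.out.println(Boolean.parseBoolean("TRUE"));  // true
        System.out.println(parseIntOr("12abc", -1));       // -1
    }
}
```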

Java 5 has a new class java.util.Scanner which is “A simple text scanner which can parse primitive types and strings using regular expressions.”
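A minimal sketch of Scanner pulling numbers out of mixed text. The class name, method, and input string are just illustrative:

```java
import java.util.Scanner;

class ScannerDemo {
    // Sums every integer token in the input, skipping anything else.
    static int sumInts(String input) {
        int sum = 0;
        Scanner sc = new Scanner(input);
        while (sc.hasNext()) {
            if (sc.hasNextInt()) {
                sum += sc.nextInt();
            } else {
                sc.next(); // not an int: consume and ignore the token
            }
        }
        sc.close();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("3 apples and 4 pears")); // 7
    }
}
```

By default Scanner splits on whitespace; `useDelimiter` lets you supply a regular expression instead.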

Java also ships with XML-parsing capabilities.
Libraries like Apache Commons Digester can make it really easy (once you understand how it works) to read XML into an object hierarchy.
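For the built-in XML support (this is the plain DOM parser from the JDK, not Digester), a rough sketch could look like this. The XML snippet and the class and method names are made up:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlDemo {
    // Parses an XML string into a DOM Document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts the elements with the given tag name anywhere in the document.
    static int countElements(String xml, String tagName) {
        return parse(xml).getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        String xml = "<script name=\"intro\"><step/><step/></script>";
        System.out.println(countElements(xml, "step"));                           // 2
        System.out.println(parse(xml).getDocumentElement().getAttribute("name")); // intro
    }
}
```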

Thanks.
Now this is more like it. :slight_smile:

I can’t exactly be more specific because I just want to know the caps.

Regex, what is that?

Regex = Regular Expressions

It’s a complex but powerful way of specifying patterns to search for in strings, and it has other uses too, such as validating and replacing text. Regular expressions have been used for a long time in Unix, and are quite a favorite among the people who’ve used them a lot.

They take some learning, but they’re very cool, if you’re a fan of the power of command-line-like tools, and the ability to specify something very complex and specific in just a few keystrokes. That’s the best way I can describe them.

Anyone else want to chime in on that one?

Regex: some examples will probably help.

Basically, you use it to find stuff within a string.
(Not sure if it works on streams?)

Say I wanted to see if a string ended in “doc”, I would use the regex:

doc$

The ‘$’ here means “end of string”, or “end of line”.
Correspondingly, a caret (’^’) means start of string/line.

Regular expressions have lots of these special characters.
e.g. ‘.’ (a period) means “any single character”
Putting an asterisk (*) after anything means “zero or more of” whatever it followed.
Using a + means “one or more”. So:

^i+

matches anything that starts with (because of the ^) one or more letter i’s.
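In Java, trying out the two patterns above might look like the sketch below. The `found` helper is hypothetical. Note that it uses `Matcher.find()`, which looks for the pattern anywhere in the input, whereas `String.matches()` requires the whole string to match:

```java
import java.util.regex.Pattern;

class RegexAnchorsDemo {
    // True if the regex matches somewhere in the input (hypothetical helper).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("doc$", "report.doc")); // true: ends in "doc"
        System.out.println(found("doc$", "doctor"));     // false: "doc" is not at the end
        System.out.println(found("^i+", "iii think"));   // true: starts with one or more i's
        System.out.println(found("^i+", "hi"));          // false: the i is not at the start
    }
}
```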

It gets more powerful as you put things together.
For example, the following expression will (kind of) tell you whether a given string is a valid email address:

^[a-zA-Z0-9_-]+@([a-zA-Z0-9_-]\.)+[a-zA-Z]$

They’ll also let you find all instances of something in a string, e.g. if you used the pattern:

this

on the string “Check this if you wish to be notified of replies to this topic.”, you’d get two matches.
You can also use a regex to replace the strings that you find with other text.
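A small sketch of both ideas, counting the matches and replacing them, using the sentence above. The `countMatches` helper and the replacement word are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAndReplaceDemo {
    // Counts the non-overlapping matches of the regex in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Check this if you wish to be notified of replies to this topic.";
        System.out.println(countMatches("this", text)); // 2
        // String.replaceAll takes a regex and a replacement string.
        System.out.println(text.replaceAll("this", "that"));
    }
}
```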

Just note: this is all the power of regular expressions, not Java - Java just includes an implementation of regular expressions. Many other languages have them, Perl probably being the most important.

You can get further info on regexes (in Java) at this tutorial and in the java.util.regex.Pattern javadoc.

Do you have a specific task or type of task you need to achieve?

Text parser for scripts.

Scripts as in like a script in some programming language?

In that case, you may want to have a look at JavaCC, which is a parser generator.
So far as I can tell (I’ve never used it), you write the grammar for the language you need to parse and JavaCC generates a parser that will parse the grammar for you… into what I don’t know.
I also don’t know if it generates the parser pre-compile time (i.e. as .java files) or at runtime (as a hierarchy of objects).

They do supply grammars for a lot of existing languages, though, including Python (a scripting language).

If you have an assignment to write a parser, however (?), using JavaCC would probably be cheating. :-[

I haven’t even enrolled into my course yet. That’s next week. :lol:

Thanks for the tip but I prefer to write my own things.

[quote]Scripts as in like a script in some programming language?

In that case, you may want to have a look at JavaCC, which is a parser generator.
So far as I can tell (I’ve never used it), you write the grammar for the language you need to parse and JavaCC generates a parser that will parse the grammar for you… into what I don’t know.
I also don’t know if it generates the parser pre-compile time (i.e. as .java files) or at runtime (as a hierarchy of objects).

They do supply grammars for a lot of existing languages, though, including Python (a scripting language).

If you have an assignment to write a parser, however (?), using JavaCC would probably be cheating. :-[
[/quote]
Writing a parser is a complex task. Using a parser generator like JavaCC is much easier.
A parser generated by JavaCC does not, by itself, compile the parsed files or even build a tree representation of them. It only guarantees that the input is lexically and syntactically valid. The interpretation of each syntax construct is something you write in Java and embed in the grammar file. A parser built this way interprets one instruction after another as it reads them, so if it has to run the same file 100 times, it parses that file 100 times. Since parsing is fairly CPU-intensive, that is not very efficient. A better choice is to parse the file once and build a tree representation of it. You can do this with JJTree, which is part of the JavaCC framework. If you want to go further and compile your script into a .class file, you will need a bytecode manipulation library such as ASM or BCEL.
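To illustrate the "parse once, evaluate many times" idea, here is a hand-written sketch. It is not JavaCC or JJTree output, just a tiny recursive-descent parser I made up for expressions with + and * on non-negative integers. All names in it are hypothetical:

```java
class ParseOnceDemo {
    // A node in the expression tree; evaluating it needs no re-parsing.
    interface Node {
        int eval();
    }

    // Grammar: expr := term ('+' term)* ; term := number ('*' number)*
    static final class Parser {
        private final String src;
        private int pos = 0;

        Parser(String src) {
            this.src = src.replace(" ", ""); // ignore spaces
        }

        Node parseExpr() {
            Node left = parseTerm();
            while (pos < src.length() && src.charAt(pos) == '+') {
                pos++;
                final Node l = left, r = parseTerm();
                left = () -> l.eval() + r.eval();
            }
            return left;
        }

        Node parseTerm() {
            Node left = parseNumber();
            while (pos < src.length() && src.charAt(pos) == '*') {
                pos++;
                final Node l = left, r = parseNumber();
                left = () -> l.eval() * r.eval();
            }
            return left;
        }

        Node parseNumber() {
            int start = pos;
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
                pos++;
            }
            if (start == pos) {
                throw new IllegalArgumentException("number expected at position " + pos);
            }
            final int value = Integer.parseInt(src.substring(start, pos));
            return () -> value;
        }
    }

    // Parses the source text once and returns the root of the tree.
    static Node parse(String source) {
        Parser p = new Parser(source);
        Node root = p.parseExpr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return root;
    }

    public static void main(String[] args) {
        Node tree = parse("5 * 6");          // parsed once
        for (int i = 0; i < 3; i++) {
            System.out.println(tree.eval()); // 30 each time, with no re-parsing
        }
        System.out.println(parse("2 + 3 * 4").eval()); // 14
    }
}
```

A real script language would need variables, statements, and better error reporting, which is where a generated parser starts to pay off.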

I wrote a parser not long ago.

Although it was horrible it worked on the most basic things.
Storing and calculating simple formulas such as i = 5 * 6.

You are right though, to get something generic working it would be extremely time consuming.
I wouldn’t say difficult though.

The reason I asked my original question is in hopes of simplifying my task.
I will take the offer of JavaCC. :slight_smile:

If you’re doing really minor scripting, you might want to look into BeanShell.

Edit: if you need a quick tutorial about Regular Expressions in general (without all the Java in between), a good one is at regular-expressions.info. It also offers a quick tutorial on how to use it in Java.

[quote]For example, the following expression will (kind of) tell you whether a given string is a valid email address:

^[a-zA-Z0-9_-]+@([a-zA-Z0-9_-]\.)+[a-zA-Z]$
[/quote]
Just thought I would point out that this expression is broken.
It doesn’t allow dots in the name part of the email address, and it limits each part of the domain name to a single character, or for the final part just a single letter (.a to .Z). So this valid address would not be matched: joe.shmoe@blah.com. It fails because of the dot between joe and shmoe, and because the final part of the domain is more than one letter long.

Basically there are some missing ‘+’ characters in there.

Oh, the perils of posting code without testing it. :stuck_out_tongue:
Thanks for pointing that out.

For completeness, an expression containing swpalmer’s corrections would look like this:

^[a-zA-Z0-9_\-\.]+@([a-zA-Z0-9_\-]+\.)+[a-zA-Z]+$

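Notice that, to allow ‘.’ in a character class, I escaped it with a backslash (\) because ‘.’ is a special character. Same goes for plus, asterisk, caret, dollar, etc. (Strictly, most of these are already literal inside a character class, where the escape is harmless; outside a class the backslash is required.)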
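Here is a rough Java check of the broken and corrected expressions, assuming the backslash-escaped forms were what was intended. In a Java string literal each regex backslash has to be doubled. The class and constant names are made up, and this is still only a "kind of" email check:

```java
import java.util.regex.Pattern;

class EmailRegexDemo {
    // The original (broken) pattern: one character per domain part, no dots in the name.
    static final Pattern ORIGINAL =
            Pattern.compile("^[a-zA-Z0-9_-]+@([a-zA-Z0-9_-]\\.)+[a-zA-Z]$");

    // The corrected pattern, with the missing '+' quantifiers and the escaped dot.
    static final Pattern CORRECTED =
            Pattern.compile("^[a-zA-Z0-9_\\-\\.]+@([a-zA-Z0-9_\\-]+\\.)+[a-zA-Z]+$");

    static boolean looksLikeEmail(Pattern pattern, String candidate) {
        return pattern.matcher(candidate).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail(ORIGINAL, "joe.shmoe@blah.com"));  // false
        System.out.println(looksLikeEmail(CORRECTED, "joe.shmoe@blah.com")); // true
        System.out.println(looksLikeEmail(CORRECTED, "joe@blah"));           // false: no dot in the domain
    }
}
```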

While we are talking about regexes, it is worth mentioning that many of the interesting patterns (email addresses, domain names, URLs, etc.) have already been done… no need to rack your brain trying to figure them out. Google is your friend.

e.g. http://regexlib.com/