Lots of websites publish very useful information. Simple things, like this week’s lottery numbers, for example at http://www.yle.fi/tekstitv/txt/P471_01.html. The point is that every week this web page gets updated with the newest lottery numbers.
Now, how can these numbers be used in my application? Websites all over have useful information you might want to use, but they don’t necessarily expose it in a form that is easy for applications to consume; it is usually formatted for human readers instead.
What is the best way of using this information in applications? I was thinking about downloading the HTML page from the URL and splitting it into lines, then going through each line until I find the one I want, based on a matching string that sits close to the information I’m really after. In the example above this would be the string “OIKEAT NUMEROT:”, which directly precedes the winning lottery numbers. However, the file is riddled with HTML tags, so you’d have to work your way through all the nonsense to get to the information you want (the lottery numbers).
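The line-by-line idea described above could be sketched roughly like this. It is only a sketch: the class and method names are made up for illustration, fetching the page is left out, and the sample markup in `main` is hypothetical, not the real page source.

```java
import java.util.Optional;

class MarkerLineFinder {

    /** Returns the first line of the page source that contains the marker, if any. */
    static Optional<String> findLine(String html, String marker) {
        // "\\R" matches any line break (\n, \r\n, ...), available since Java 8.
        for (String line : html.split("\\R")) {
            if (line.contains(marker)) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical markup; the real page's HTML may look quite different.
        String html = "<pre>\n<font color=\"yellow\">OIKEAT NUMEROT:</font> 1 2 3\n</pre>";
        System.out.println(findLine(html, "OIKEAT NUMEROT:").orElse("marker not found"));
    }
}
```

The line that comes back still contains all of its tags, and the approach only helps if the marker and the numbers happen to share a line in the source.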
Next week the layout may look the same, but if something has changed slightly your code might stop working. Is there a better way?
For instance, in the browser the winning lottery numbers are directly preceded by the string “OIKEAT NUMEROT:” - grabbing the numbers from a layout like that would be much easier than an .html file.
Is there a way to convert an .html file into plain text, as it appears in the browser? Is this the right way?
yeah you just strip all tags
not sure how exactly you do this in Java; in PHP there are built-in functions for it, but I guess Java has them too
also it’s just a big string, so all string stuff works there too
jsoup is also good for pulling out tags and content, leaving the rest for regex matching. HtmlUnit is a test framework, but also useful for general web-scraping tasks. If you need to handle JavaScript, you’re probably stuck with Selenium, which is a massive pill to swallow, but ultimately it’ll do damn near anything you can imagine (just not fast).
sproingie: I wish that you’d have more thoroughly highlighted the importance of regex (regular expression) since I now realize that regular expressions were the ultimate answer to my question. You really don’t need the html parser at all for what I was trying to do.
For those familiar with Unix, regex is basically a super-powerful version of grep.
So the answer to my original question, which was how to find and grab this week’s lottery numbers from http://www.yle.fi/tekstitv/txt/P471_01.html (the 7 numbers that come right after “OIKEAT NUMEROT:”), would have been something like this, with the help of regex:
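A minimal sketch of that kind of regex approach, not the poster’s original code. It assumes the seven numbers are one- or two-digit values separated by whitespace and/or commas with no tags in between; the class name, the pattern, and the sample numbers are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LotteryRegex {

    // Marker, then seven 1-2 digit numbers separated by whitespace/commas (assumed layout).
    private static final Pattern ROW =
            Pattern.compile("OIKEAT NUMEROT:\\s*(\\d{1,2}(?:[\\s,]+\\d{1,2}){6})");

    /** Returns the seven numbers after the marker, or an empty list if the pattern does not match. */
    static List<Integer> extractNumbers(String text) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = ROW.matcher(text);
        if (m.find()) {
            for (String token : m.group(1).split("[\\s,]+")) {
                numbers.add(Integer.parseInt(token));
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Made-up sample text, not real lottery results.
        System.out.println(extractNumbers("OIKEAT NUMEROT: 3, 11, 17, 22, 28, 30, 39"));
    }
}
```

If the page wraps each number in its own tag, this pattern will not match the raw source and would have to run on the tag-free text instead.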
A slightly more serious reply: my job involves some seriously intimate knowledge of regular expressions, and I’ve worked on regex engines, and I can tell you that they’re really not the appropriate tool for the job. For any given page where you already know the exact html it produces, you can scrape with regexes, yes, but for general purpose parsing, it’s impossible to parse html with normal regexes, and really quite tricky, error-prone, and very very slow to do it with the extended dialects.
Hmm. What would be the better way then? Parsing the HTML as a pure text file (i.e. removing all tags etc.) and regexing that?
Any HTML parser, like jsoup, can return the text inside tags. Then you run a regex on that, yes.
Realistically, if you know for sure they don’t put any intervening tags between the numbers you’re looking for, you can just go ahead and regex match the HTML source. It’s more brittle as solutions go, but all web scraping is a bit hacky. It’s just not usable as a general-purpose solution.
Yeah it’s a bit messy with the tags, but a simple function like the one below should (:P) remove most cluttering tags, in which case the regex becomes much easier to handle.
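A rough sketch of such a tag-stripping helper, using only the standard library. The class and method names are illustrative, and it is deliberately naive: it does not handle comments, script contents, or a `>` inside an attribute value, and it only knows the `&nbsp;` entity.

```java
class TagStripper {

    /** Crudely removes anything that looks like a tag and collapses whitespace. */
    static String stripTags(String html) {
        return html
                .replaceAll("<[^>]*>", " ")  // replace each <...> with a space so words don't fuse
                .replace("&nbsp;", " ")      // the only entity handled in this sketch
                .replaceAll("\\s+", " ")     // collapse runs of whitespace, including newlines
                .trim();
    }

    public static void main(String[] args) {
        // Hypothetical markup, not the real page source.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Because every tag becomes a space, punctuation that followed a closing tag ends up with a space in front of it, so a regex run on the result should be tolerant about separators.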
The thing that I was wondering about the HTML parser: how do you find the exact tag you’re looking for? Since the information is separated by font tags and whatnot, how do you find the correct one?