[SOLVED] Website info Grabbing? (Answer: Regex)

jonjava · February 22, 2012, 7:33pm

Good evening!

Lots of websites share very useful information. Simple things, like this week's lottery numbers, for example at http://www.yle.fi/tekstitv/txt/P471_01.html. Every week this web page gets updated with the newest lottery numbers.

Now, how can these numbers be used in my application? Websites often have information you would want to use, but they don’t necessarily output it in a form that is easy for applications to consume, only in a format that is easy for human beings to read.

What is the best way of using this information in applications? I was thinking about downloading the .html page from the URL and then dividing it into lines, then going through each line until I find the one I want, based on a matching string that sits close to the information I'm after. In the example above this would be the string “OIKEAT NUMEROT:”, which directly precedes the winning lottery numbers. However, the page is riddled with HTML tags, so you’d have to work your way through all the nonsense to get to the information you want (the lottery numbers).

Next week the layout is the same but something might have changed slightly and your code might not work. Is there a better way?

For instance, in the browser the winning lottery numbers are directly preceded by the string “OIKEAT NUMEROT:” - grabbing the numbers from a layout like that would be much easier than an .html file.

Is there a way to convert an .html page into plain text as it appears in the browser? Is this the right way?

Input appreciated!

Thank you,
jon

[EDIT]

Here’s the final program
http://lotto.pastebay.net/366193

Cero · February 22, 2012, 8:04pm

yeah you just strip all tags
not sure how exactly you do this in Java, in PHP there are built-in functions - but I guess Java has them too
also it's just a big string, so all string stuff works there too

quick search of “strip html tags java” brought this up: http://htmlcleaner.sourceforge.net/
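
To illustrate the "it's just a big string" point, here is a minimal sketch. It assumes the tags have already been stripped, so the text contains the label and numbers on one line as shown in the browser. The class name, method name and sample text are made up for the example; only the “OIKEAT NUMEROT:” label comes from the page.

```java
import java.util.Arrays;

public class MarkerSearch {

    // Returns the comma-separated numbers that follow the marker,
    // or an empty array if the marker is not in the text.
    static int[] numbersAfter(String text, String marker) {
        int start = text.indexOf(marker);
        if (start < 0) {
            return new int[0];
        }
        start += marker.length();

        // The numbers run to the end of the line (or of the text).
        int end = text.indexOf('\n', start);
        if (end < 0) {
            end = text.length();
        }

        String[] parts = text.substring(start, end).trim().split(",");
        int[] numbers = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i].trim());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Made-up sample of already tag-free text
        String text = "LOTTO\nOIKEAT NUMEROT: 5,8,12,17,25,35,38\nLISANUMEROT: 22,36\n";
        System.out.println(Arrays.toString(numbersAfter(text, "OIKEAT NUMEROT:")));
    }
}
```

This breaks with a NumberFormatException if anything other than numbers and commas follows the label on that line, which is where a regex starts to pay off.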

sproingie · February 22, 2012, 8:16pm

jsoup is also good for pulling out tags and content, leaving the rest for regex matching. HtmlUnit is a test framework, but also useful for general web-scraping tasks. If you need to handle JavaScript, you’re probably stuck with Selenium, which is a massive pill to swallow, but ultimately it’ll do damn near anything you can imagine (just not fast).

jonjava · April 7, 2012, 12:56am

sproingie: I wish you had highlighted the importance of regex (regular expressions) more thoroughly, since I now realize that regular expressions were the ultimate answer to my question. You really don’t need the HTML parser at all for what I was trying to do.

I have found a video http://www.youtube.com/watch?v=kWyoYtvJpe4 that explains regex quite well. It is in Python, but you shouldn’t have much trouble translating the ideas into Java: http://docs.oracle.com/javase/tutorial/essential/regex/.

For those familiar with Unix, regex is basically the pattern matching you know from grep, only more powerful.

So, the answer to my original question, how to find and grab this week's lottery numbers from this website http://www.yle.fi/tekstitv/txt/P471_01.html (the 7 numbers after “OIKEAT NUMEROT:”), would have been something like this with the help of regex:

OIKEAT NUMEROT: 5,8,12,17,25,35,38

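// Pseudocode: grab() is a hypothetical helper that returns the captured groups
// Param1: regex (Python-style raw string); each (\d{1,2}) captures a one- or two-digit number
// Param2: website URL
obj = grab( r'OIKEAT NUMEROT:\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2}),\s*(\d{1,2})' , url );
int[] num = new int[7];
for(int i=0; i<7; i++)
num[i] = obj.get(i);

// or something like that, I haven’t explored how exactly regex are used in Java yet.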
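
For reference, a sketch of how the pseudocode above could look with java.util.regex, assuming the input is the plain line as it appears in the browser. The class and method names are invented for the example. Note the doubled backslashes needed inside a Java string literal.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LottoRegex {

    // Builds "label, number, then six more comma-separated numbers",
    // each number being a capture group of one or two digits.
    static Pattern buildPattern() {
        StringBuilder regex = new StringBuilder("OIKEAT NUMEROT:\\s*(\\d{1,2})");
        for (int i = 0; i < 6; i++) {
            regex.append("\\s*,\\s*(\\d{1,2})");
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the seven numbers, or null if the text does not match.
    static int[] parse(String text) {
        Matcher m = buildPattern().matcher(text);
        if (!m.find()) {
            return null;
        }
        int[] numbers = new int[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            numbers[i - 1] = Integer.parseInt(m.group(i)); // groups are 1-based
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("OIKEAT NUMEROT: 5,8,12,17,25,35,38")));
    }
}
```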

sproingie · April 7, 2012, 2:34am

You really shouldn’t parse HTML with regular expressions or bad things happen

jonjava · April 7, 2012, 2:36am

I made a sample program that uses proper Java regex to find this week's lottery numbers and additional numbers from the website.

The only thing that differed from the Python video was basically that you have to use double backslashes in Java string literals.

The working Java regular expressions that find the wanted information are the regex strings in findOikeatNumerot and findLisaNumerot. Here’s the program (1 class):

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

import java.util.regex.*;

public class Lotto {
	
	private int[] numerot = new int[7];
	private int[] lisanumerot = new int[2];
	
	
	
	public Lotto(String urlstr){
		String text = null;
		try{
			text = saveUrl(urlstr);
		}
		catch (Exception e){
			// Report the failure instead of silently ignoring it
			System.err.println("Could not download " + urlstr + ": " + e);
		}
		
		if( text == null ) return;
		
		numerot = findOikeatNumerot(text); // Find this week's winning lottery numbers (7 numbers)
		lisanumerot = findLisaNumerot(text); // Find this week's additional numbers (2 numbers)
	}
	
	private int[] findLisaNumerot(String text) {
		// 
		// Template: LISÄNUMEROT: 22,36 
		// Actual Text to filter with regex:    LIS&Auml;NUMEROT:</font> <font color="#000000">22,36     
		//
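		// ".+?" is reluctant: it stops at the first '>' followed by two comma-separated numbers.
		// A greedy ".+" would run to the last such place on the page, which may be unrelated digits.
		String regex = "LIS&Auml;NUMEROT:.+?>\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})";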
		Pattern myPattern =
				Pattern.compile(regex);
		Matcher matcher =
				myPattern.matcher(text);

		if(!matcher.find()){
			System.out.println("Not found!");
			int[] num = new int[1];
			return num;
		} else {
			System.out.println("Found!");
		}
		int[] num = new int[matcher.groupCount()];
		for(int i=1; i < matcher.groupCount()+1; i++){
			num[i-1] = Integer.parseInt( matcher.group(i) );
		}
		
		
		return num;
	}

	private int[] findOikeatNumerot(String text ) {
		//
		// Template: OIKEAT NUMEROT: 5,8,12,17,25,35,38
		// Actual Text to filter with regex: 00">OIKEAT NUMEROT:</font> <font color="#000000">5,8,12,17,25,35,38     </f
		//
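		// ".+?" is reluctant: it stops at the first '>' followed by seven comma-separated numbers.
		// A greedy ".+" would run to the last such place on the page.
		String regex = "OIKEAT NUMEROT:.+?>\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})";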
		Pattern myPattern =
				Pattern.compile(regex);
		Matcher matcher =
				myPattern.matcher(text);

		if(!matcher.find()){
			System.out.println("Not found!");
			int[] num = new int[1];
			return num;
		} else {
			System.out.println("Found!");
		}
		int[] num = new int[matcher.groupCount()];
		for(int i=1; i < matcher.groupCount()+1; i++){
			num[i-1] = Integer.parseInt( matcher.group(i) );
		}
		
		
		return num;
	}

	public static void main(String[] args){
		Lotto obj = new Lotto("http://www.yle.fi/tekstitv/txt/P471_01.html");
		System.out.println(obj);
	}
	
	@Override
	public String toString(){
		
		// Print out OIKEAT NUMEROT
		String str = "";
		str += "OIKEAT NUMEROT:";
		for (int num : numerot){
			str += " " + num;
		}
		
		// Print out LISANUMEROT
		str += "\n"; // new line
		str += "LISÄNUMEROT:";
		for (int num : lisanumerot){
			str += " " + num;
		}
		
		return str;
	}
	
	public String saveUrl(String urlString) throws MalformedURLException, IOException
    {
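    	BufferedReader in = null;
    	// StringBuilder avoids re-copying the whole page for every appended line
    	StringBuilder text = new StringBuilder();
    	try
    	{
    		in = new BufferedReader( new InputStreamReader( new URL(urlString).openStream() ) );

    		String line;
    		
    		// readLine() strips the line terminators, so the page ends up as one long line
    		while((line = in.readLine()) != null){
    			text.append(line);
    		}
    	}
    	finally
    	{
    		if (in != null)
    			in.close();
    	}
    	
    	// Debug output: dump the downloaded page
    	System.out.println(text);
    	
    	return text.toString();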
    }
}

sproingie · April 7, 2012, 2:39am

A slightly more serious reply: my job involves some seriously intimate knowledge of regular expressions, and I’ve worked on regex engines, and I can tell you that they’re really not the appropriate tool for the job. For any given page where you already know the exact html it produces, you can scrape with regexes, yes, but for general purpose parsing, it’s impossible to parse html with normal regexes, and really quite tricky, error-prone, and very very slow to do it with the extended dialects.

jonjava · April 7, 2012, 2:40am

[quote]A slightly more serious reply: my job involves some seriously intimate knowledge of regular expressions, and I’ve worked on regex engines, and I can tell you that they’re really not the appropriate tool for the job. For any given page where you already know the exact html it produces, you can scrape with regexes, yes, but for general purpose parsing, it’s impossible to parse html with normal regexes, and really quite tricky, error-prone, and very very slow to do it with the extended dialects.
[/quote]
Hmm. What would be the better way then? Converting the HTML into pure text (i.e. removing all tags etc.) and running the regex on that?

sproingie · April 7, 2012, 2:47am

Any HTML parser, like jsoup, can return the text inside tags. Then you run a regex on that, yes.

Realistically, if you know for sure they don’t put any intervening tags in between the numbers you’re looking for, you can just go ahead and regex match the HTML source. It’s more brittle as solutions go, but all web scraping is a bit hacky. It’s just not usable as a general-purpose solution.
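
A small sketch of that brittleness, using made-up HTML fragments: a regex written against the raw source matches the markup it was written for, but stops matching as soon as a tag shows up between the numbers, while the same data is still found after reducing the fragment to text. The tag stripping here is a crude replaceAll stand-in for what a real parser such as jsoup would do, and all names are hypothetical.

```java
import java.util.regex.Pattern;

public class BrittleDemo {

    // Written against the raw source: expects exactly "label</font> <font ...>numbers".
    static final Pattern RAW = Pattern.compile(
            "LIS&Auml;NUMEROT:</font> <font[^>]*>(\\d{1,2}),(\\d{1,2})");

    // Written against the text only.
    static final Pattern TEXT = Pattern.compile(
            "LIS&Auml;NUMEROT:\\s*(\\d{1,2})\\s*,\\s*(\\d{1,2})");

    // Crude stand-in for a parser's text extraction: drops anything between < and >.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    static boolean matchesRaw(String html) {
        return RAW.matcher(html).find();
    }

    static boolean matchesText(String html) {
        return TEXT.matcher(stripTags(html)).find();
    }

    public static void main(String[] args) {
        String today = "LIS&Auml;NUMEROT:</font> <font color=\"#000000\">22,36</font>";
        String nextWeek = "LIS&Auml;NUMEROT:</font> <font color=\"#000000\"><b>22</b>,<b>36</b></font>";

        System.out.println(matchesRaw(today));     // true
        System.out.println(matchesRaw(nextWeek));  // false: the <b> tags break the raw pattern
        System.out.println(matchesText(today));    // true
        System.out.println(matchesText(nextWeek)); // true
    }
}
```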

jonjava · April 7, 2012, 3:04am

Yeah, it’s a bit messy with the tags, but a simple function like the one below should (:P) remove most of the cluttering tags, in which case the regex becomes much easier to handle.

private String removeTags(String text){
		StringBuilder str = new StringBuilder();
		boolean insideTag = false;
		// Copies every character that is outside a <...> tag,
		// including any text after the last tag
		for(int i=0; i < text.length(); i++){
			char c = text.charAt(i);
			if(c == '<'){
				insideTag = true;
			} else
			if(c == '>'){
				insideTag = false;
			} else
			if(!insideTag) str.append(c);
		}
		
		return str.toString();
	}
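
As an illustration of how much simpler the regex can get once the tags are gone, here is a sketch that does not hard-code seven groups: it takes the text after the label up to the end of the line and collects every number in it with repeated find() calls. It assumes the tag-free text still contains line breaks (the saveUrl method above drops them), and the sample string and all names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberList {

    // Captures the rest of the line after the label ('.' does not match line breaks by default).
    private static final Pattern LINE = Pattern.compile("OIKEAT NUMEROT:(.*)");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Returns all numbers on the label's line, or an empty list if the label is missing.
    static List<Integer> numbers(String plainText) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher line = LINE.matcher(plainText);
        if (line.find()) {
            Matcher number = NUMBER.matcher(line.group(1));
            while (number.find()) {
                result.add(Integer.parseInt(number.group()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String plainText = "OIKEAT NUMEROT: 5,8,12,17,25,35,38\nLISANUMEROT: 22,36";
        System.out.println(numbers(plainText));
    }
}
```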

The thing that I was wondering about the HTML parser: How do you find the exact tag you’re looking for? Since the information is separated by font tags and what not, how do you find the correct one?