Parsing URLs from text

Wildern · August 7, 2009, 2:01pm

I know this isn’t really game related, but I am hoping there is enough Java expertise here to point me in the right direction.

I need to scan many blocks of text and identify all the URLs present, whether they are in an anchor tag, an image tag, or plain text. I believe I can handle the URLs within valid HTML markup just fine. It is the plain-text portion that is going to give me trouble, specifically attempts to hide the URL within other HTML markup such as size or color. I can’t really show an example of the color trick, but, basically, they make the surrounding text close enough to the background color to appear invisible.

aaaaawww.foo.orgaaaa
aaaaawww.foo.orgaaaaa

Regular expressions are not really an option, as speed is critical (I need to process 300-500 blocks of text a second, where the smallest block would be roughly the size of this post).

Pointers to open source projects already handling something similar would be perfect, but, if I need to roll my own solution, parsing advice is welcome as well.

Thanks in advance.

Riven · August 7, 2009, 2:50pm

Something like:


public class UrlFinder
{
   // Callback invoked for every URL-like run found in the input.
   static class Feedback
   {
      public void found(String input, int off, int end)
      {
         System.out.println("found: " + input.substring(off, end));
      }
   }

   public static void findUrlSimilarPattern(String input, Feedback feedback)
   {
      int off = -1;

      int periods = 0;
      for (int end = 0; end < input.length(); end++)
      {
         char c = input.charAt(end);

         if ((c >= 'a' && c <= 'z') ||
         /**/(c >= 'A' && c <= 'Z') ||
         /**/(c >= '0' && c <= '9') ||
         /**/(c == '%' || c == '_' || c == '/') ||
         /**/(c == '?' || c == '&' || c == '=') ||
         /**/(c == '-' || c == '.' || c == ':'))
         {
            if (off == -1)
               off = end;

            if (c == '.')
               periods++;

            if (end == input.length() - 1)
               end++; // ensure last match (if any) is found
            else
               continue;
         }

         if (periods != 0 && off != -1)
         {
            feedback.found(input, off, end);
         }

         periods = 0;
         off = -1;
      }
   }


   public static void main(String[] args)
   {
      Feedback feedback = new Feedback();
      findUrlSimilarPattern("hello world www google com bye world", feedback);
      findUrlSimilarPattern("hello world www.google.com bye world", feedback);
      findUrlSimilarPattern("<font>hello world</font>http://www.google.com/search?q=oh%20no<font>www.bye.world.com</font>really! no.com", feedback);
   }
}

Output:


found: www.google.com
found: http://www.google.com/search?q=oh%20no
found: www.bye.world.com
found: no.com

Wildern · August 7, 2009, 3:51pm

Yes, something very similar to that, thanks!

The cases that are proving troublesome are similar to the following:

<font size=1>foo</font><font size=5>w<font>ww.go</font>ogle.com</font><font size=1>bar</font>

Simply stripping out the HTML would result in the string “foowww.google.combar”, which is not the URL a person would see when presented with the rendered HTML.

Making the prefix/postfix text nearly invisible through color changing is also something I would like to be able to overcome, but I realize that will be a bit more difficult.

I may have to write a document parser that has heuristics to determine if two separate tokens should be merged or not based on their associated font/color attributes and what the intervening punctuation was.
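A minimal sketch of that merge-or-split heuristic, assuming only <font size=N> attributes matter: it tracks the effective font size while stripping tags and inserts a space wherever the size changes between two visible characters. The class and method names are hypothetical, the default size of 3 is an assumption, and color, relative sizes ("+1"), CSS, and entities are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;

final class FontRunSplitter
{
   // Strips tags; separates visible runs whose effective <font size> differs.
   static String visibleText(String html)
   {
      StringBuilder out = new StringBuilder();
      Deque<Integer> sizes = new ArrayDeque<Integer>();
      int defaultSize = 3; // assumed size when no <font> tag is open
      int lastSize = -1;   // size of the last emitted character, -1 = none yet
      int i = 0;
      while (i < html.length())
      {
         char c = html.charAt(i);
         if (c == '<')
         {
            int close = html.indexOf('>', i);
            if (close != -1)
            {
               String tag = html.substring(i + 1, close).trim().toLowerCase(Locale.ROOT);
               if (tag.startsWith("/font"))
               {
                  if (!sizes.isEmpty())
                     sizes.pop();
               }
               else if (tag.equals("font") || tag.startsWith("font "))
               {
                  int current = sizes.isEmpty() ? defaultSize : sizes.peek();
                  sizes.push(parseSize(tag, current));
               }
               i = close + 1; // every other tag is simply dropped
               continue;
            }
            // no closing '>': fall through and treat '<' as plain text
         }
         int size = sizes.isEmpty() ? defaultSize : sizes.peek();
         if (lastSize != -1 && size != lastSize)
            out.append(' ');
         out.append(c);
         lastSize = size;
         i++;
      }
      return out.toString();
   }

   // Reads an absolute size=N value (at most 3 digits); otherwise inherits.
   private static int parseSize(String tag, int inherited)
   {
      int at = tag.indexOf("size=");
      if (at == -1)
         return inherited;
      int k = at + 5;
      if (k < tag.length() && (tag.charAt(k) == '"' || tag.charAt(k) == '\''))
         k++;
      int start = k;
      while (k < tag.length() && k - start < 3
         && tag.charAt(k) >= '0' && tag.charAt(k) <= '9')
         k++;
      return (k == start) ? inherited : Integer.parseInt(tag.substring(start, k));
   }

   public static void main(String[] args)
   {
      String html = "<font size=1>foo</font><font size=5>w<font>ww.go</font>"
         + "ogle.com</font><font size=1>bar</font>";
      System.out.println(visibleText(html)); // prints: foo www.google.com bar
   }
}
```

The resulting string could then be handed to a character scanner such as findUrlSimilarPattern, which would see www.google.com as its own token.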

Riven · August 7, 2009, 4:25pm

Not to mention HTML encoding, like: &#NNN;

Now, the thing is: do you want to parse the URLs, or simply remove them?

In the end you simply cannot prevent this behaviour. If you encounter a URL like “foowww.google.combar” (after stripping tags) that’s a clear indication somebody is trying to bend the rules, and it should be handled like any other URL.
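As a rough illustration of the HTML-encoding point, here is a sketch that decodes numeric character references (the &#119; and &#x77; forms) in a single pass before the text is scanned. The class and method names are made up for this example, and named entities such as &amp; are deliberately left alone.

```java
final class EntityDecoder
{
   // Replaces well-formed numeric references with the character they encode.
   static String decodeNumericEntities(String input)
   {
      StringBuilder out = new StringBuilder(input.length());
      int i = 0;
      while (i < input.length())
      {
         char c = input.charAt(i);
         int semi = -1;
         if (c == '&' && i + 2 < input.length() && input.charAt(i + 1) == '#')
            semi = input.indexOf(';', i + 2);

         // Only accept short references so a stray "&#" cannot swallow text.
         if (semi != -1 && semi - i <= 10)
         {
            int codePoint = parse(input, i + 2, semi);
            if (codePoint != -1)
            {
               out.appendCodePoint(codePoint);
               i = semi + 1;
               continue;
            }
         }
         out.append(c);
         i++;
      }
      return out.toString();
   }

   // Returns the code point for the digits in [from, to), or -1 if malformed.
   private static int parse(String s, int from, int to)
   {
      int radix = 10;
      if (from < to && (s.charAt(from) == 'x' || s.charAt(from) == 'X'))
      {
         radix = 16;
         from++;
      }
      if (from >= to)
         return -1;
      int value = 0;
      for (int k = from; k < to; k++)
      {
         int digit = Character.digit(s.charAt(k), radix);
         if (digit == -1)
            return -1;
         value = value * radix + digit;
         if (value > Character.MAX_CODE_POINT)
            return -1;
      }
      return value;
   }

   public static void main(String[] args)
   {
      System.out.println(decodeNumericEntities("&#119;&#x77;w.foo.org")); // prints: www.foo.org
   }
}
```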

Riven · August 7, 2009, 4:42pm

Or… try to pattern match this…


Wildern · August 7, 2009, 5:10pm

I need to be able to identify the destination of the URL, which includes seeing through obfuscation via encoding and redirects.
If the “bad guys” resort to ASCII art or embedded images, that is a win.
I just don’t want to pass any text that has a bad URL that could be clicked or copied with cut/paste.

I have the system for dealing with the URLs in C/C++; I was hoping not to have to rewrite it all from scratch when switching to Java.
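For the encoding half of that requirement, one possible Java starting point is percent-decoding each candidate before comparing it against known destinations. This is only a sketch: the class and method names are hypothetical, it assumes the escapes are UTF-8, and following redirects is not attempted here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PercentDecoder
{
   // Decodes %XX escapes; returns the input unchanged if an escape is malformed.
   static String percentDecode(String candidate)
   {
      try
      {
         // URLDecoder follows form encoding and would turn '+' into a space,
         // so literal plus signs are protected before decoding.
         return URLDecoder.decode(candidate.replace("+", "%2B"), "UTF-8");
      }
      catch (UnsupportedEncodingException e)
      {
         throw new IllegalStateException(e); // UTF-8 is always available
      }
      catch (IllegalArgumentException e)
      {
         return candidate; // e.g. "%zz" or a trailing "%"
      }
   }

   public static void main(String[] args)
   {
      System.out.println(percentDecode("http://%77%77%77.foo.org/a+b")); // prints: http://www.foo.org/a+b
   }
}
```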