HTML Tidy is a neat program that's been around for years and is open source (http://tidy.sf.net). It works perfectly: you pump an HTML file into stdin, and it prints a “tidied” version on stdout.
So, I use this:
/*
* Launch the external process, a single-file linux binary sitting in the current directory
*/
Process p = Runtime.getRuntime().exec( new String[] { "./tidy", "-i", "--doctype", "omit", "--show-body-only", "true", "-f", "htmltidy-error.log" } );
/*
* attach to the stdin, and send the HTML file to Tidy
*/
PrintWriter htmlTidyWriter = new PrintWriter( p.getOutputStream() );
htmlTidyWriter.print( fragment );
htmlTidyWriter.flush();
htmlTidyWriter.close();
/*
* attach to stdout, and read the result back from Tidy
*/
logger.info( "Reading data from external linux process = \"tidy\"");
BufferedReader br = new BufferedReader( new InputStreamReader( p.getInputStream() ) );
String line = null;
while( (line = br.readLine()) != null )
{
    logger.debug( "read line from Tidy: "+line );
    result.append( line+"\n" );
}
br.close();
logger.info( "Completed discussion with external linux process = \"tidy\"; kill'ing it since java has no wait-for-exit command");
And, when run locally, it always works.
But…when run on the JGF server (same OS, almost exactly the same kernel, same software installed, same JVM - 1.4.2_05), most of the time it hangs. And…it hangs on the first attempted read from the BufferedReader.
i.e. most of the time, either the JVM is failing to launch this process, or the process itself is hanging. Given how widely used HTMLTidy is, and how mature, I find it hard to believe Tidy itself is at fault; if it were, plenty of other people would have hit the same problem by now :).
In the API docs, it mentions something about how the process could deadlock if you read/write to it at the wrong times. This seems strange to me - block, sure. Deadlock? :o. Maybe that’s what’s happening? Especially given it never happens locally but often (not always) happens remotely (i.e. it’s probably a race condition of some sort)…
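For reference, the deadlock the API docs warn about comes from the limited size of the OS pipe buffers: if the parent blocks writing stdin while the child blocks writing stdout or stderr, neither side ever makes progress. Whether that is what is happening here is only a guess. The usual defensive pattern is to drain the child's output on separate threads while feeding its input. Below is a minimal sketch of that pattern, not a confirmed fix. It is written for a current JDK rather than 1.4.2, and the names PipedProcess, Drainer and run are made up for illustration. It also uses Process.waitFor(), which does exist and blocks until the child exits.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of the original post): runs a command,
// feeds it `input` on stdin, and returns everything it wrote to stdout.
final class PipedProcess {

    // Background thread that reads a stream to EOF, so the child never
    // blocks on a full stdout/stderr pipe while we are still writing stdin.
    private static final class Drainer extends Thread {
        private final InputStream in;
        private final ByteArrayOutputStream collected = new ByteArrayOutputStream();
        private volatile IOException failure;

        Drainer(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            byte[] buf = new byte[8192];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    collected.write(buf, 0, n);
                }
            } catch (IOException e) {
                failure = e; // reported later by text()
            }
        }

        // Call only after join(); assumes the child emits UTF-8.
        String text() throws IOException {
            if (failure != null) {
                throw failure;
            }
            return new String(collected.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    static String run(String[] command, String input)
            throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(command);

        // Start draining BOTH output streams before writing anything.
        Drainer stdout = new Drainer(p.getInputStream());
        Drainer stderr = new Drainer(p.getErrorStream());
        stdout.start();
        stderr.start();

        // Feed stdin, then close it so the child sees end-of-file.
        OutputStream stdin = p.getOutputStream();
        try {
            stdin.write(input.getBytes(StandardCharsets.UTF_8));
        } finally {
            stdin.close();
        }

        // Wait for the child to exit; the exit code is ignored in this sketch.
        p.waitFor();
        stdout.join();
        stderr.join();
        return stdout.text();
    }
}
```

With the command array from the post, the call would look like PipedProcess.run( new String[] { "./tidy", "-i", ... }, fragment ). Draining stderr matters even with -f in place, since anything the child prints there outside the log file would otherwise sit in an unread pipe.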
Other than that I’m stumped. And it’s kind of hard to put out a piece of software and say “hey, it hangs most of the time, but I have no idea why. Just keep trying and it will work eventually”.
PS: no, I can’t use the so-called java port: the maintainers stopped maintaining it 3 years ago and it lacks most of the features of the real HTMLTidy, certainly at least 3 I cannot do without. Sob.