java.lang.Runtime.exec and Process == hang

Neat program called HTML Tidy, been around for years, open source (http://tidy.sf.net). Works perfectly: you pump in an HTML file to stdin, it prints out a “tidied” version on stdout.

So, I use this:


/*
* Launch the external process, a single-file linux binary sitting in the current directory
*/

      Process p = Runtime.getRuntime().exec( new String[] { "./tidy", "-i", "--doctype", "omit", "--show-body-only", "true", "-f", "htmltidy-error.log" } );

/*
 * attach to the stdin, and send the HTML file to Tidy
 */
      PrintWriter htmlTidyWriter = new PrintWriter( p.getOutputStream() );
      htmlTidyWriter.print( fragment );
      htmlTidyWriter.flush();
      htmlTidyWriter.close();
      
/*
 * attach to stdout, and read the result back from Tidy
 */            
      logger.info( "Reading data from external linux process = \"tidy\"");
      
      BufferedReader br = new BufferedReader( new InputStreamReader( p.getInputStream() ) );
      String line = null;
      while( (line = br.readLine()) != null )
      {
            logger.debug( "read line from Tidy: "+line );
            result.append( line+"\n" );
      }
      br.close();
      
      logger.info( "Completed discussion with external linux process = \"tidy\"; kill'ing it since java has no wait-for-exit command");

And, when run locally, it always works.

But…when run on the JGF server (same OS, almost exactly the same kernel, same software installed, same JVM - 1.4.2_05), most of the time it hangs. And…it hangs on the first attempted read from the BufferedReader.

i.e. most of the time, either the JVM is failing to launch this process, or the process itself is hanging. Given how widely used HTMLTidy is, and how mature, I find it hard to believe that anyone else experiences the same problem :).

In the API docs, it mentions something about how the process could deadlock if you read/write to it at the wrong times. This seems strange to me - block, sure. Deadlock? :o. Maybe that’s what’s happening? Especially given it never happens locally but often (not always) happens remotely (i.e. it’s probably a race condition of some sort)…
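For reference, the deadlock the docs warn about comes from the finite pipe buffers between the JVM and the child: if the parent is blocked writing to the child's stdin while the child is blocked writing to a stdout that nobody is reading yet, neither side can make progress. Whether that is what is happening here is not established. The following is only a sketch of the usual workaround, feeding stdin from a helper thread while the calling thread reads stdout. The class name `PipedExec` and the method `run` are made up for illustration, and stderr is deliberately ignored. Note also that `Process.waitFor()` does exist and blocks until the child exits.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper, not part of the original code.
final class PipedExec {

    /**
     * Runs a filter-style command: a helper thread feeds "input" to the
     * child's stdin while the calling thread reads the child's stdout.
     */
    static String run(String[] command, final String input)
            throws IOException, InterruptedException {
        final Process p = Runtime.getRuntime().exec(command);

        // Feeder thread: write everything to stdin, then close it so the
        // child sees end-of-file.
        Thread feeder = new Thread(new Runnable() {
            public void run() {
                try {
                    Writer w = new OutputStreamWriter(p.getOutputStream());
                    try {
                        w.write(input);
                    } finally {
                        w.close();
                    }
                } catch (IOException e) {
                    // The child may exit before reading everything;
                    // ignored in this sketch.
                }
            }
        });
        feeder.start();

        // Calling thread: drain stdout until the child closes it.
        StringBuffer result = new StringBuffer();
        BufferedReader br = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                result.append(line).append('\n');
            }
        } finally {
            br.close();
        }

        feeder.join();
        p.waitFor(); // blocks until the child process has exited
        return result.toString();
    }
}
```

With the command from the original post this would be called as `PipedExec.run( new String[] { "./tidy", ... }, fragment )`.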

Other than that I’m stumped. And it’s kind of hard to put out a piece of software and say “hey, it hangs most of the time, but I have no idea why. Just keep trying and it will work eventually”.

PS: no, I can’t use the so-called java port: the maintainers stopped maintaining it 3 years ago and it lacks most of the features of the real HTMLTidy, certainly at least 3 I cannot do without. Sob.

I would mod the script to handle the error out stream of the process also. I have had Process hang when it’s dumping to both the output and the error stream and I wasn’t reading both.
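Something along these lines is what I mean, as a rough sketch only: a background thread that keeps emptying a stream so the child can never block on a full stderr pipe. `StreamDrainer` is a made-up name, not anything from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads a stream to exhaustion on its own thread.
final class StreamDrainer extends Thread {
    private final InputStream in;
    private final ByteArrayOutputStream captured = new ByteArrayOutputStream();

    StreamDrainer(InputStream in) {
        this.in = in;
        setDaemon(true); // never keep the JVM alive just for this thread
    }

    public void run() {
        byte[] buf = new byte[4096];
        try {
            int n;
            // read() returns -1 once the other end closes the stream
            while ((n = in.read(buf)) != -1) {
                captured.write(buf, 0, n);
            }
        } catch (IOException e) {
            // Stream closed underneath us; nothing more to drain.
        }
    }

    /** Everything read so far, decoded with the platform default charset. */
    String contents() {
        return captured.toString();
    }
}
```

You would start it right after the exec, e.g. `new StreamDrainer( p.getErrorStream() ).start();`, and carry on writing stdin and reading stdout as before.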

Good point. However…

The official docs for “tidy” say that nothing will be output to stderr if you divert errors to a log file (which is what the last two parameters do).

I’ve verified that it is indeed writing its errors to that error file, and it does this on every launch.

When it hangs, and when it doesn’t, is on the exact same input data, and all input data has at least one error - yet sometimes it Just Works. So, I fear that “not reading from stderr” isn’t the cause of this problem :(.

Crap.

Seems that Runtime.exec(…) is basically just buggy and unloved by Sun, with outstanding unfixed bugs going back as far as 1.1.x:

http://bugs.sun.com/bugdatabase/view_bug.do;:WuuT?bug_id=4103432

c.f. the comment on xxxxx@xxxxx 2004-03-29:


The evaluator states that the program "test" should work. So it seems
this bug report shouldn't be closed as "not a bug" since the original and
main focus of the bug report is the program "test" and that is still
reproducible with JDK 1.5 beta2 (build 43) on Windows 98/SE. It hangs there.
But as the previous evaluator notes, it exits properly on Windows 2000 and XP.
This bug was the cause of 5007388 which also manifests only on windows 9X.

But…some digging on the net threw up that other people were finding the precise same code on java 1.4.x will “never hang with server VM, and randomly hang with client VM”, so…I’m off to check whether I get a difference with different VMs.
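To be sure which VM a given launch actually ends up with, something like the following could be run under each setting. It is only a sketch: `WhichVm` is a made-up class name, and it assumes a HotSpot JVM, where the `java.vm.name` system property typically contains "Client VM" or "Server VM" and the VM is picked with the `-client` / `-server` launcher flags.

```java
// Hypothetical helper: reports which VM the current process is running on.
final class WhichVm {

    /** VM name plus Java version, both read from standard system properties. */
    static String describe() {
        return System.getProperty("java.vm.name")
                + " (" + System.getProperty("java.version") + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```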

Just an odd thought:

http://jtidy.sourceforge.net/

[quote]PS: no, I can’t use the so-called java port: the maintainers stopped maintaining it 3 years ago and it lacks most of the features of the real HTMLTidy, certainly at least 3 I cannot do without. Sob.
[/quote]
I tested it. The most important of the config options it outputs from its own built-in help DON’T ACTUALLY WORK, and generate errors telling you to see the help for supported options. Stupid, stupid, stupid.

Sorry, I missed your “P.S.” clause. :stuck_out_tongue:

FYI, while the last stable release was three years ago, they’ve nearly got a new release finished. Go to the homepage, and you should find links to their nightly builds. You may find that the nightlies greatly improve upon that three-year-old version. :slight_smile:

[quote]Sorry, I missed your “P.S.” clause. :stuck_out_tongue:

FYI, while the last stable release was three years ago, they’ve nearly got a new release finished. Go to the homepage, and you should find links to their nightly builds. You may find that the nightlies greatly improve upon that three-year-old version. :slight_smile:
[/quote]
/me drools

Thanks for that. Although I looked around extensively, through the forums etc., all I found were long threads from 2003 lamenting the fact that it was no longer maintained.

Looked dead as a dodo. But, clearly, I need to re-examine.

If the unit tests are to be believed, there are only two bugs that make it unusable.

Unfortunately, one of these is that it deletes the content from textareas - which is rather a big problem for a CMS, where almost every page has a textarea :(.

[quote]Unfortunately, one of these is that it deletes the content from textareas - which is rather a big problem for a CMS, where almost every page has a textarea :(.
[/quote]
Q: Is the CMS itself going to be tidied on the fly, or is it the CMS content that is to be tidied? If it’s the latter, is this actually an issue? If it’s the former, then perhaps you could JavaScript around it. e.g.:

<script>
    var text="Lots of CMS text right here, for your viewing pleasure!";
    
    document.forms[0].mytextbox.value = text;
</script>

Alternatively, you could dive into the JTidy code and fix the bug yourself. :slight_smile:

Good question. It’s just the content that’s being tidied at the moment (although, eventually, when online editing of the CMS code goes live we’ll obviously want to tidy that too - for pure convenience of making it easier to edit! At least I, personally, am not going to edit raw HTML templates without some tidying ;)).

So, you’re right: the textarea bug is not an issue right now. Gah. Stressed, tired. I’m making stupid mistakes!

Anyway, FYI, tidying is currently only necessary for these use-cases:

  1. User who tries to insert malicious HTML code; need to prune or nullify nasty tags
  2. User who tries to insert undesirable but non-malicious HTML code; use of IMG or A HREF tags in places where the only possible reason for them is to promote spam etc

…in both these cases, the tidying can be intensive. In both these cases, the check is necessary “once only”, but the output will be read thousands if not millions of times.

Hence, check must be on the server, and needs to be prior to storing in DB. This assumes that pages are read a lot more often than they are modified (which seems fair)

  3. Nice user who is crap at writing HTML
  4. Nice user who copies/pastes HTML from a Word document

…only difference is that it would be “safe” to do the check on the client browser, since in these cases the user does not want to circumvent it

  5. Nice users who have javascript disabled or broken

…we’re back to cases 1 and 2 again, as far as “where” the tidying can be done


So, tidying needs to be:

  • server-side
  • at form-submit time
  • good at removing nasties

At the moment, we have a custom set of filters written using regexps (because they’re so damn good at filtering nasty HTML; it’s very easy to cover all the random deliberate attempts to fool HTML parsers WITHOUT having to construct full HTML parse trees). After those, we attempt to run Tidy as a final-stage process that does most of the “coercing into well-formed HTML, using only modern tags, and undoing any Word/etc nastiness”.

Although Tidy does the full tree parse I don’t trust the security of the HTML to it. Yet. Maybe I will in the future; at any rate, the regexp security parsing was added before starting with Tidy, so we might as well keep it for now.
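To give a flavour of the regexp stage, here is a deliberately tiny sketch. It is not the real filter set, and it is nowhere near a complete sanitiser. The class name `NastyTagFilter` and both patterns are invented for illustration: one drops whole script blocks, the other strips A and IMG tags while keeping any text between them.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; the actual filters are not shown in this thread.
final class NastyTagFilter {

    // <script ...> ... </script>, any case, contents may span lines.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Opening or closing A and IMG tags; \b stops <abbr> etc. from matching.
    private static final Pattern LINK_OR_IMAGE_TAG = Pattern.compile(
            "</?(?:a|img)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    /** Removes script blocks entirely, then strips A/IMG tags but keeps their text. */
    static String strip(String html) {
        String withoutScripts = SCRIPT_BLOCK.matcher(html).replaceAll("");
        return LINK_OR_IMAGE_TAG.matcher(withoutScripts).replaceAll("");
    }
}
```

The output of `NastyTagFilter.strip( fragment )` would then be what gets handed to Tidy for the final clean-up.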