Default XML parser grabs DTD from w3.org every time

CommanderKeith · January 5, 2012, 7:25pm

Hey, have you guys come across this problem? The default Java XML parser fetches the XML file’s DTD from w3.org every single time it runs. I’ve spent the last 3 days trying to figure out why my app takes so long to load, and this turned out to be the cause. Unbeknownst to me, my app was downloading the DTD from the W3C’s site on each run, which caused a delay of about 30 seconds… Geez, that’s frustrating!

And smart people are having this problem too:
http://weblogs.java.net/blog/cayhorstmann/archive/2011/12/12/sordid-tale-xml-catalogs

So if I leave out this line from the XML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

then the javax.xml.parsers.SAXParser parses the file straight away.

But if I do that, the file no longer declares its proper DTD, so the better fix is to keep the DOCTYPE and set up the SAX parser so that it does not load the external DTD (http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/#comment-376):

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
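
For anyone who wants to try it, here is a minimal, self-contained sketch of that setup. The class name DtdFreeSax and the countElements helper are made up for illustration, and it assumes the JDK's built-in Xerces-based parser is the one SAXParserFactory returns; another parser implementation may not recognise this feature URI and would throw from setFeature.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; not from the original post.
final class DtdFreeSax {
    // Xerces-specific feature; assumed to be supported by the parser in use.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses the given XML text and returns the number of elements seen,
    // without fetching the external DTD named in the DOCTYPE.
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    count[0]++;
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return count[0];
    }
}
```

One caveat: with the external DTD skipped, entities that are declared only in that DTD, such as &nbsp;, are no longer declared for the parser.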

That line is not documented or mentioned anywhere on oracle.com except in the 376th comment on that w3.org blog post… gah!

Apparently the W3C serves up 100 million DTD downloads a day, and the W3C guy says in the comments that a quarter of these come from Java apps:
http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/#comment-359

I couldn’t believe how silly this problem is, so I felt the need to air my frustration.

Riven · January 5, 2012, 7:34pm

It’s indeed one of the biggest design flaws ever. You can bring down tens of thousands of applications by attacking this single point of failure.

CommanderKeith · January 5, 2012, 7:41pm

It’s bizarre, I would never have guessed that a simple XML parse would hook my app up to some random website.

How did you learn about the flaw?

It doesn’t seem like a well-known problem. I googled ‘SAX pause’, ‘xml delay’, ‘xml SAX stall’, and many variations but couldn’t find anything which indicated that this was my problem.

That w3 blog post was sometimes in the hits, but of course I never read so far down in the comments to see the solution.

Riven · January 5, 2012, 7:46pm

Coincidence: I just stumbled upon a webpage about it a few years ago.

pjt33 · January 5, 2012, 10:16pm

I haven’t come across this in Java, because I use an XML parser which doesn’t check against the DTD, but I have come across it in .NET. I solved it there by downloading the DTDs and then hacking the XML files before putting them through the parser.

aazimon · January 5, 2012, 11:34pm

Try using dom4j instead.

CommanderKeith · January 7, 2012, 1:33am

I tried using an XML file without the DOCTYPE declaration, but then special entities such as the non-breaking space (&nbsp;) threw errors.

CommanderKeith · January 7, 2012, 8:20am

So I switched from SAX to DOM and ran into similar trouble. I found this project, which has worked well:

http://code.google.com/p/java-xhtml-cache-dtds-entityresolver/
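
The general idea of an EntityResolver-based fix can be sketched roughly like this. It is not that project's code: LocalDtdResolver, STUB_DTD and bodyText are hypothetical names, and the one-line stub DTD is only a stand-in so the example runs on its own. A real setup would return a stream over a local copy of xhtml1-strict.dtd instead, and would also have to resolve the entity files that DTD pulls in (e.g. xhtml-lat1.ent).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical resolver for illustration; names are not from any library.
final class LocalDtdResolver implements EntityResolver {
    static final String XHTML_STRICT = "-//W3C//DTD XHTML 1.0 Strict//EN";

    // Stand-in for a locally stored DTD: declares only the nbsp entity.
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (XHTML_STRICT.equals(publicId)) {
            // Serve the DTD from memory instead of letting the parser hit w3.org.
            InputSource source = new InputSource(new StringReader(STUB_DTD));
            source.setPublicId(publicId);
            source.setSystemId(systemId);
            return source;
        }
        // Anything else resolves to an empty entity, so nothing goes to the network.
        return new InputSource(new StringReader(""));
    }

    // Parses XHTML text with the resolver installed and returns the body's text.
    static String bodyText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setEntityResolver(new LocalDtdResolver());
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            return doc.getElementsByTagName("body").item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XHTML parse failed", e);
        }
    }
}
```

Because the DOCTYPE stays in the file and the resolver supplies the declarations, entities like &nbsp; still expand. The same resolver can be set on a SAX XMLReader via setEntityResolver.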

pjt33 · January 7, 2012, 9:20am

Yes, that’s why I mentioned downloading the DTDs. The hack was to remove the public DTD references and replace them with system ones.

CommanderKeith · January 7, 2012, 4:08pm

Ah, I see. It’s so bizarre that this is not done by default in the Java XML libraries, and that tutorials do not show how to do it.