Friday, May 6, 2011

How to parse an XHTML file that is not 100% valid?

I have XHTML files whose source is not completely valid: it does not conform to the DTD of an XML document.

For example, in some places it uses &ldquo; for " and &rsquo; for apostrophes. This causes exceptions in my C# code.

So is there any method or web link that I can use to get around this?

From Stack Overflow
  • Well, by the nature of XML, it needs to be valid; otherwise it won't render at all. I'd first see what type of errors it generates with W3C's validator: http://validator.w3.org/

    Also consider using HTML Tidy, which can be configured to fix XML as well.

    We use Hpricot to fix our XML, but then again we are building Rails apps. Not sure about C#.

    porneL : XML does not need to be valid (in the meaning of this word defined in the spec), it needs to be well-formed.
  • You could parse the document as HTML instead, since both end up in a DOM and HTML parsers scoff at these pansy quotation mark problems. Going along with unknown's HTML Tidy idea, you could then serialize the DOM back into a valid XHTML file. (This is identical to using HTML Tidy, which presumably uses an HTML parser anyway, except you'd do it from C# programmatically.)

  • If the file is otherwise well-formed, you can define the character entities in your own DTD.

    If the file is ill-formed, the HTML Agility Pack from CodePlex will parse it.

    JasonS : +1 for the Agility Pack. Saved me recently.
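
The "define the character entities in your own DTD" suggestion could be sketched roughly as follows, using only System.Xml. This is a minimal sketch, not a definitive implementation. It assumes the input is otherwise well-formed and carries no XML declaration or DOCTYPE of its own, so a small internal DTD subset can simply be prepended. The class name EntityAwareXhtmlLoader, the Load method, and the particular set of entities are hypothetical and chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class EntityAwareXhtmlLoader
{
    // Internal DTD subset declaring only the entities the files are assumed to use.
    // Extend this list with whatever other named entities appear in your input.
    private const string Doctype =
        "<!DOCTYPE html [\n" +
        "  <!ENTITY ldquo \"&#8220;\">\n" +
        "  <!ENTITY rdquo \"&#8221;\">\n" +
        "  <!ENTITY lsquo \"&#8216;\">\n" +
        "  <!ENTITY rsquo \"&#8217;\">\n" +
        "]>\n";

    // Assumes xhtml has no XML declaration or DOCTYPE of its own.
    public static XmlDocument Load(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Allow the prepended DOCTYPE to be parsed so the entities are known.
            DtdProcessing = DtdProcessing.Parse,
            // Do not fetch any external DTDs or entities.
            XmlResolver = null
        };

        using (var reader = XmlReader.Create(new StringReader(Doctype + xhtml), settings))
        {
            var doc = new XmlDocument();
            doc.XmlResolver = null;
            doc.Load(reader);
            return doc;
        }
    }
}
```

Calling EntityAwareXhtmlLoader.Load on a string such as `<p>It&rsquo;s fine</p>` should then load without the undeclared-entity exception. If the files are not well-formed in other ways (unclosed tags, for instance), this approach will still throw, and an HTML parser such as the Agility Pack mentioned above is the better fit.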
