- 04-08-2008 #1
XML parser to cut a div from html doc
There's no shortage of command-line XML parsers, it seems, but I've yet to get any one of them to do my (rather simple) bidding. All I want to do is parse an HTML file and have it return a specific section selected by an XPath expression on its id, such as //*[@id="article"], so that I can save that section into a new HTML file.
However, most of those I've tried (xmlstarlet, xpath, xmllint, ...) seem to choke on the fact that the HTML file in question isn't well-formed (I don't give a damn - I just want that section) or give me strange errors like xpath's ("read error at /usr/lib/perl5/XML/Parser/Expat.pm line 469."). Can anybody here suggest some other tool or tell me what I'm doing wrong?
- 04-08-2008 #2
Well, first off, HTML is not XML. XHTML is XML, but regular HTML is not. So don't expect an XML parser to be able to understand an HTML document.
Secondly, if your document is not well-formed, how do you expect an XML parser to find out what it actually means? Web browsers can cope because they know the basic meaning of each tag, so when a document is not well-formed they try to reason about what you want the page to look like. XML parsers expect well-formed documents.
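To make the well-formedness point concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree, which is a strict XML parser. The two sample strings and the helper names is_well_formed and find_by_id are made up for illustration; they are not taken from any of the tools mentioned in this thread.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def find_by_id(markup, target_id):
    """Return the first descendant element whose id attribute matches."""
    root = ET.fromstring(markup)
    return root.find(f".//*[@id='{target_id}']")


# Typical HTML: <p> and <br> are never closed, so this is not XML.
tag_soup = '<body><div id="article"><p>one<br>two</div></body>'
# The same content written as XHTML: every element is closed.
xhtml = '<body><div id="article"><p>one<br/>two</p></div></body>'

print(is_well_formed(tag_soup))  # False
print(is_well_formed(xhtml))     # True
print(ET.tostring(find_by_id(xhtml, "article"), encoding="unicode"))
```

The id lookup only works on the second string; on the first one the parser gives up before any query can run, which is roughly what the command-line tools are doing too.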
Have you tried running one of these on a well-formed XML document?
- 04-08-2008 #3
Well, I guess it's true that it might prove difficult to figure out where each element begins and ends... Yeah, I've used xmlstarlet to obtain some URLs from an RSS feed, a somewhat simpler task...
So would I have to resort to simple line-oriented text tools, like grep and the like, to get something like this done? That is, if I can even get such a script to recognise where the <div> ends...
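One possible middle ground between a strict XML parser and grep is a tolerant HTML tokenizer. The sketch below uses Python's standard html.parser and finds the end of the element by counting nested tags of the same name. It is only an illustration under a few assumptions: the class ElementById and the function extract_by_id are hypothetical names, the target is assumed to be an element with a real closing tag (a div, not an img), and closing tags are re-emitted in normalised form rather than copied byte for byte.

```python
from html.parser import HTMLParser


class ElementById(HTMLParser):
    """Collect the markup of the first element carrying a given id."""

    def __init__(self, target_id):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.tag = None    # tag name of the matched element
        self.depth = 0     # open tags of that name, the match included
        self.done = False
        self.parts = []

    def _capturing(self):
        return self.tag is not None and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.tag is None:
            if dict(attrs).get("id") != self.target_id:
                return
            self.tag = tag
        if tag == self.tag:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if self._capturing():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        self.parts.append(f"</{tag}>")
        if tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if self._capturing():
            self.parts.append(f"<!--{data}-->")


def extract_by_id(html_text, target_id):
    """Return the matched element as a string, or None if the id is absent."""
    parser = ElementById(target_id)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts) if parser.tag is not None else None


sample = ('<html><body><p>intro<div id="article"><p>one'
          '<div class="x">two</div><br>three &amp; four</div>'
          '<p>outro</body></html>')
print(extract_by_id(sample, "article"))
```

Because only tags with the same name as the matched element are counted, unclosed <p> or <br> tags inside the section don't throw the depth off. If the matching closing tag is missing altogether, everything up to the end of the file is returned.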

