  1. #1
    Just Joined! chochem's Avatar
    Join Date
    Feb 2008
    Posts
    7

    XML parser to cut a div from html doc

    There's no shortage of command-line XML parsers, it seems, but I've yet to get any one of them to do my - rather simple - bidding. All I want to do is parse an HTML file and have it return a specific section based on an XPath id expression such as //*[@id="article"], so that I can save that section into a new HTML file.

    However, most of those that I've tried (xmlstarlet, xpath, xmllint, ...) seem to choke on the fact that the HTML file in question isn't well-formed (I don't give a damn - I just want that section) or give me strange errors like xpath's ("read error at /usr/lib/perl5/XML/Parser/Expat.pm line 469."). Can anybody here suggest some other tool or tell me what I'm doing wrong...?

  2. #2
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Well, first off, HTML is not XML. XHTML is XML, but regular HTML is not. So don't expect an XML parser to be able to understand an HTML document.

    Secondly, if your document is not well-formed, how do you expect an XML parser to find out what it actually means? Web browsers can cope because they know the basic meaning of each tag, so when a document is not well-formed they try to reason about what you want the page to look like. XML parsers expect well-formed documents and stop at the first violation.

    Have you tried running one of these on a well-formed XML document?
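
    To make the difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree, which is a strict XML parser. The two sample strings and the helper names (find_by_id, try_extract) are made up for illustration and are not from the original poster's file. The well-formed input yields the element; the tag-soup input (unclosed <p> and <br>, missing </html>) is rejected with a parse error rather than guessed at.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample inputs, not taken from the thread.
well_formed = '<html><body><div id="article"><p>Hello</p></div></body></html>'
tag_soup = '<html><body><div id="article"><p>Hello<br></div></body>'


def find_by_id(markup, wanted_id):
    """Parse markup as XML and return the first element with that id, or None."""
    root = ET.fromstring(markup)  # raises ET.ParseError if not well-formed
    return root.find(f".//*[@id='{wanted_id}']")


def try_extract(markup, wanted_id):
    """Return the serialized element, None if absent, or a parse error message."""
    try:
        element = find_by_id(markup, wanted_id)
    except ET.ParseError as exc:
        return f"parse error: {exc}"
    if element is None:
        return None
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    print(try_extract(well_formed, "article"))
    print(try_extract(tag_soup, "article"))
```

    The same thing happens with the command-line tools: they are doing exactly what an XML parser is supposed to do.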
    DISTRO=Arch
    Registered Linux User #388732

  3. #3
    Just Joined! chochem's Avatar
    Join Date
    Feb 2008
    Posts
    7
    Well, I guess it's true that it might prove difficult to figure out where each element begins and ends... Yeah, I've used xmlstarlet to obtain some URLs from an RSS feed - a somewhat simpler task...

    So would I have to resort to simple line-oriented text tools, like grep and the like, to get something like this done? That is, if I can even get such a script to recognise where the <div> ends...
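
    One way to find where the <div> ends without a strict XML parser is to count nesting depth with a lenient HTML tokenizer. Below is a rough sketch using Python's standard html.parser module; the class name IdExtractor and the function extract_by_id are hypothetical names chosen for this example. It assumes the element with the wanted id is a normal container such as <div> (not a void element like <img>) and that its own closing tag is actually present. Only tags with the same name as the matched element change the depth, so unclosed <p> or <br> tags inside it do no harm.

```python
from html.parser import HTMLParser


class IdExtractor(HTMLParser):
    """Collect the raw markup of the first element whose id matches."""

    def __init__(self, wanted_id):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.wanted_id = wanted_id
        self.match_tag = None  # tag name of the matched element
        self.depth = 0         # open tags of that name, including the match
        self.done = False
        self.parts = []

    def _recording(self):
        return self.match_tag is not None and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.match_tag is None:
            if dict(attrs).get("id") != self.wanted_id:
                return
            self.match_tag = tag
        if tag == self.match_tag:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self._recording():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._recording():
            return
        self.parts.append(f"</{tag}>")
        if tag == self.match_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # this was the matching close tag

    def handle_data(self, data):
        if self._recording():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._recording():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._recording():
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if self._recording():
            self.parts.append(f"<!--{data}-->")


def extract_by_id(html, wanted_id):
    """Return the markup of the element with the given id, or None."""
    parser = IdExtractor(wanted_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.match_tag is not None else None


if __name__ == "__main__":
    # Hypothetical tag soup: unclosed <p> and <br>, no </html>.
    page = ('<html><body><div id="nav">x</div><div id="article"><p>One<br>'
            '<div class="inner">Two</div></div><p>after</body>')
    print(extract_by_id(page, "article"))
```

    This is the depth-counting idea a grep-style script would struggle with, since grep works line by line and has no notion of nesting.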
