- 01-05-2011 #1
[SOLVED] Bash XML Parsing using Perl XPath
I have a bash script that needs to read input from an XML file, which includes varying numbers of a certain type of child node. I want to be able to iterate through all the child nodes of a given parent. I installed the Perl XML-XPath package from search.cpan.org. Once it's installed, from bash, we can do queries like
xpath -e "//ConfigurationData/DataItem/ClassInstances/ClassInstance[1]" input.xml
This query returns the first ClassInstance node in this path. However, I don't know how to query how many nodes there are of this type, or how to step through them one at a time.
Googling around, I found references to a count() function that returns a number and to an fn:count() function, but couldn't get either to work inside an xpath command called from bash.
Any suggestions? Thanks!
- 01-05-2011 #2
I want to preface my answer by saying that I really don't know much about XML or xpath. I am attempting an answer in part to educate myself. I hope it will be useful...
Looking at the Perl source of /usr/bin/xpath, I don't think there's a way to do what you describe - not directly, anyway. "-e" argument strings are fed into calls to XPath::find(). (And successive "-e" arguments replace each result from the previous "-e" argument with the results of the new query evaluated in the context of the result being replaced.)
There are some hokey ways you could do it:
for instance, by default (unless you specify "quiet mode") xpath produces some stderr chatter:
Code:
$ xpath -e '//bookstore/book[3]/author' books.xml
Found 5 nodes in books.xml:
-- NODE --
<author>James McGovern</author>
-- NODE --
<author>Per Bothner</author>
-- NODE --
<author>Kurt Cagle</author>
-- NODE --
<author>James Linn</author>
-- NODE --
<author>Vaidyanathan Nagarajan</author>
The XML data there is coming out on STDOUT, but the "-- NODE --" lines are coming out on STDERR. So to count how many books there are, you could do something like this:
Code:
$ xpath -e '//bookstore/book' books.xml 2>&1 >/dev/null | egrep -e "^-- NODE --$" | wc --lines
4
The reason this is a hokey solution is that the xpath program is still dumping out the whole contents of each "book" (in this case, pretty much the whole file) just so you can pick out one bit of information: how many times does "-- NODE --" appear on STDERR? Even though all the XML data is getting dumped to /dev/null, you're still doing the work of reading it off the disk and into memory for no good reason...
(EDIT): Another method:
Code:
xpath -e '//bookstore/book' books.xml 2>&1 >/dev/null | head -1 | sed -e 's/^Found \(.*\) nodes.*$/\1/;'
This works like the previous one, except instead of counting the "-- NODE --" lines it uses the count that's already printed on STDERR before xpath starts printing the nodes themselves. Because it pipes the STDERR output through "head -1", taking just the first line, the xpath process should receive SIGPIPE as soon as it tries to write the first "-- NODE --" marker, which kills it once it has produced the info we were after. So it's more efficient than the previous version...
Another method would be to simply capture the whole output of the xpath query - then you can count how many "book" top-level entries you've got and pull out individual ones as you need them... But this is also a hokey solution, because you've reduced an XML parsing problem to another XML parsing problem.
Really, a better solution would probably be to write your script in Perl and use the XPath library directly... Or at least modify the xpath utility to better suit your needs. For instance, it'd be simple to insert an "-n" option which would result in printing out the number of matches instead of printing the match data itself...
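A minimal sketch of that "-n" idea, done as a shell wrapper rather than a patch to the Perl script. It assumes xpath prints a "Found N nodes in FILE:" header on STDERR, as in the output above. The function name xpath_count is made up for illustration.

```shell
# xpath_count EXPR FILE
# Print only the number of nodes matched by EXPR in FILE.
# Assumption: the xpath utility writes a first line of the form
# "Found N nodes in FILE:" to stderr.
xpath_count() {
    # Route stderr into the pipe and discard stdout (the node data).
    # head keeps just the header line; sed prints the number only if
    # the line really has the expected "Found N nodes" shape.
    xpath -e "$1" "$2" 2>&1 >/dev/null |
        head -n 1 |
        sed -n 's/^Found \([0-9][0-9]*\) nodes.*$/\1/p'
}
```

Called as xpath_count '//bookstore/book' books.xml, this should print 4 for the books.xml used above, and nothing at all if the first STDERR line is not a "Found N nodes" header.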
Largely-unrelated rant follows:
This kind of problem fascinates me, and I see it as one of the inherent limitations of the shell: it's not quite impossible, but it's very awkward to try to bind a library like this to the shell. Without a notion of "objects" in the shell (even "coprocesses" as in ksh, etc. could work as a rudimentary form of "objects") it's difficult to make repeat queries of a utility like this in a way that doesn't involve repeatedly re-parsing the file each time you make a query. And the shell itself doesn't have the facilities for breaking up an XML stream and assigning top-level nodes of it to array elements... And if it did, you probably wouldn't need this xpath tool... And so as a result you can't quite dump out results from xpath and post-process them in the shell, either.
It seems to me that Unix shells should have at least got an "--xml" option to the "read" built-in by now... I can understand that a lot of people aren't fond of XML, probably, and traditionalists prefer to keep the shell data-format-agnostic... But if the shell supported at least one comprehensive, extensible data format internally, then others could be implemented in terms of that one...
- 01-06-2011 #3
Solved
Thanks, tetsujin, for a very ingenious solution! Coincidentally, someone else suggested a somewhat simpler solution to me:
xpath -e "count(//ConfigurationData/DataItem/ClassInstances/ClassInstance)" input.xml 2>/dev/null
The count function seems to do the trick. I must have been doing something wrong when I tried it before. The 2>/dev/null has the effect of suppressing some textual message, so that only the desired count appears.
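For the other half of the original question, stepping through the nodes one at a time, here is a rough sketch that combines count() with the [N] position predicate from the first post. The function name for_each_node is hypothetical. It assumes that count() prints a plain number on STDOUT and that all the matched nodes share a single parent, since PATH[i] picks the i-th child within each parent rather than the i-th match overall.

```shell
# for_each_node PATH FILE
# Print each node matched by PATH in FILE, running one xpath query per node.
# Assumptions: `xpath -e 'count(PATH)'` prints the count on stdout, and
# the matched nodes are all children of the same parent.
for_each_node() {
    # Get the match count; strip whitespace and any fractional part so
    # the shell can treat it as an integer.
    _n=$(xpath -e "count(${1})" "$2" 2>/dev/null | tr -d '[:space:]')
    _n=${_n%%.*}
    # Give up if the result is empty or not a plain non-negative integer.
    case "$_n" in
        ''|*[!0-9]*) return 1 ;;
    esac
    _i=1
    while [ "$_i" -le "$_n" ]; do
        # XPath positions are 1-based.
        xpath -e "${1}[${_i}]" "$2" 2>/dev/null
        _i=$((_i + 1))
    done
}
```

Something like for_each_node '//ConfigurationData/DataItem/ClassInstances/ClassInstance' input.xml would then print the nodes in order. Every iteration re-parses the whole file, so this is only reasonable for small inputs.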
Thanks also for your thoughts on XML generally and the problems with parsing it from bash.
- 01-06-2011 #4



