  1. #1
    Just Joined!
    Join Date
    Apr 2013
    Posts
    10

    XML transform


    Hello Everyone,

    As I mentioned in the coffee lounge, I have been using Linux for years, but only when necessary, and it usually isn't. Also, what I have written has been very simple stuff. I have a problem now that I was hoping someone could help me with. I have an XML file formatted like this:

    Code:
       </item>
        <item name="VALUE_A">
          <properties>
            <name>VALUE_B</name>
            <path>/VALUE_C</path>
    I need to transform this to:

    Code:
        </item>
        <item name="VALUE_B" path="VALUE_C">
          <properties>
    So to be clear, VALUE_B should be swapped for VALUE_A and VALUE_A is gone completely. The VALUE_C in the <path></path> tags needs to be pulled out and dropped into the <item...> tag with the designation "path=" and the tags <name></name> and <path></path> need to be deleted.

    Any help would be greatly appreciated.

    Thanks!

    -Dan

  2. #2
    Trusted Penguin
    Join Date
    May 2011
    Posts
    4,353
    Hello and welcome!

    I've moved your post to the Programming/Scripting forum, where it will hopefully get better attention.

  3. #3
    Trusted Penguin
    Join Date
    May 2011
    Posts
    4,353
    My instinct here is to use Perl and the XML::Simple module. That can be intimidating to those not steeped in programming, though, so let's stick with Bash.

    Here's a little bit of code that I think will do what you want. It loops thru the XML file once (I've assumed that the file is named "test.xml" and is in the current working dir), gathering data (i.e., for every "item name" it saves "name" and "path" data to arrays). Then it loops thru the file again, printing each line as is, except for the "item name" line: in its place it prints a line built from the captured info.

    Try it and see. Report problems!

    Code:
    #!/bin/bash
    
    declare -a names   # original name="..." value of each item (captured, not used in the output)
    declare -a props   # <name> value captured for each item
    declare -a paths   # <path> value captured for each item
    
    # index of the current "item name" line (-1 until the first one is seen)
    declare -i n=-1
    
    # loop thru the file line by line, gathering info
    # (IFS= and -r keep leading whitespace and backslashes intact)
    while IFS= read -r line; do
      name=$(echo "$line"|awk -F= '/<item name/{print $2}')
      if [ -n "$name" ]; then
        name=$(echo "$name"|sed -e 's|"||g;s|>$||')
        n=$((n+1))
        names[$n]=$name
      else
        prop=$(echo "$line"|awk -F\> '/<name/{print $2}')
        if [ -n "$prop" ]; then
          prop=$(echo "$prop"|sed -e 's|</name||')
          props[$n]=$prop
        else
          path=$(echo "$line"|awk -F\> '/<path/{print $2}')
          if [ -n "$path" ]; then
            path=$(echo "$path"|sed -e 's|</path||')
            paths[$n]=$path
          fi
        fi
      fi
    done < test.xml
    
    # reset the counter
    n=-1
    
    # loop thru the file again, this time printing the substitute values
    while IFS= read -r line; do
      name=$(echo "$line"|awk -F= '/<item name/{print $2}')
      if [ -n "$name" ]; then
        n=$((n+1))
        prop=${props[$n]}
        path=${paths[$n]}
        # keep the indentation of the line being replaced
        echo "${line%%<*}<item name=\"$prop\" path=\"$path\">"
      else
        # print as data, so % and \ in the XML are not treated as printf format codes
        printf '%s\n' "$line"
      fi
    done < test.xml

  4. #4
    Just Joined!
    Join Date
    Apr 2013
    Posts
    10
    Thank you so much! I'll give it a try and let you know!

  5. #5
    Just Joined!
    Join Date
    Apr 2013
    Posts
    10
    I did run into a problem. I should have mentioned this, but in this schema there are a lot of lines with the word "name" in them. There is only one line in the schema, however, with both the word "item" and "name." Also, the particular value I am after between the <name> tags will always fall immediately after a <properties> tag.

    So, to sum up, I need to find each instance of <item name="VALUE_A" and replace VALUE_A with the value of the very next <name> tag, which will fall immediately after the <properties> tag, which will always come next, exactly as you see above. After that there will be any number of <name> tags, which can all be ignored until you get to the next <item name=". All other lines should be left intact, however.

    I hope this makes sense... Here's a more expanded example of the schema I am dealing with:

    Code:
    <pixxml version="1.1">
      <items>
        <item name="VALUE_A">
          <properties>
            <name>VALUE_B</name>Use this name ignore all others until we get to the next item.
            <path>VALUE_C</path>
            <description>This is a description</description>
            <status></status>
            <approved></approved>
            <item_type></item_type>
            <created_by id="128799">
              <name>Some Dude</name>
            </created_by>
            <created_timestamp>2012-04-03T07:14:03Z</created_timestamp>
            <modified_by id="32105547">
              <name>Another Dude</name>
            </modified_by>
            <modified_timestamp>2013-04-19T00:56:02Z</modified_timestamp>
            <width>1280</width>
            <height>720</height>
            <timebase>23.976</timebase>
            <mime_type>video/quicktime</mime_type>
          </properties>
          <attributes />
          <tags />
          <notes>
            <note id="31364363">
              <created_by id="23306">
                <name>Director</name>
              </created_by>
              <created_timestamp>2012-04-03T23:09:29Z</created_timestamp>
              <modified_timestamp>2012-04-03T23:09:29Z</modified_timestamp>
              <text>Note text</text>
              <has_markup>false</has_markup>
              <start_frame>1270</start_frame>
            </note>
            <note id="31364499">
              <created_by id="23306">
                <name>Director</name>
              </created_by>
              <created_timestamp>2012-04-03T23:09:58Z</created_timestamp>
              <modified_timestamp>2012-04-03T23:09:58Z</modified_timestamp>
              <text>Note Text</text>
              <has_markup>false</has_markup>
              <start_frame>3499</start_frame>
            </note>
          </notes>
          <approvals />
        </item>
        <item name="VALUE_A">next Item
          <properties>
            <name>VALUE_B</name>Use this value. Ignore all other name tags until the next item.
    etc...
    Rinse and repeat throughout the document.

    Thanks!

    Dan

  6. #6
    Just Joined!
    Join Date
    Apr 2013
    Posts
    10
    The one other thing is I need to replace this:
    Code:
    <properties>
    <name>CH300Plate1_01 driver side.mov</name>
    <path>/Driving Plates</path>
    <description>Day - Industrial Baltimore - 1:35 - Driver side</description>
    With this:
    Code:
    <properties>
    <description>Day - Industrial Baltimore - 1:35 - Driver side</description>
    In other words, once I move the name and path values into the <item> tag I need to delete the lines they came from. However, all other <name> tags should remain. I just need to delete the first one after properties. There will be no more instances of <path>.

    Thanks again!

    -Dan

  7. #7
    Trusted Penguin
    Join Date
    May 2011
    Posts
    4,353
    Quote Originally Posted by strngr12
    I hope this makes sense... Here's a more expanded example of the schema I am dealing with:
    Quote Originally Posted by strngr12
    In other words, once I move the name and path values into the <item> tag I need to delete the lines they came from. However, all other <name> tags should remain. I just need to delete the first one after properties. There will be no more instances of <path>.
    yeah, i figured there'd be more to the XML file than just that. okay, i'll take a look at it later. in the meantime, try your hand at it! the most important line to look at is:

    Code:
    value=$(echo $line|awk -F\> '/string_to_match/{print $2}')
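
To see what that line does in isolation, here is a minimal sketch. The function name `extract_tag_value` and the sample lines are made up for illustration; the awk field split on `>` and the sed cleanup are the ones from the script in post #3. It uses awk's `index()` instead of a regex so the tag name can be passed in as a parameter.

```shell
#!/bin/bash

# hypothetical helper: print the text between <TAG> and </TAG> on a single line
extract_tag_value() {
  local tag=$1 line=$2
  # split the line on ">" and print field 2 if the line contains "<TAG",
  # then strip the "</TAG" that is left over from the closing tag
  printf '%s\n' "$line" \
    | awk -F\> -v pat="<$tag" 'index($0, pat) {print $2}' \
    | sed -e "s|</$tag||"
}

extract_tag_value name '    <name>VALUE_B</name>'       # prints VALUE_B
extract_tag_value path '    <path>/VALUE_C</path>'      # prints /VALUE_C
extract_tag_value name '      <name>Some Dude</name>'   # prints Some Dude
```

The third call shows why this alone is not enough for the expanded file: the match fires on every line containing the tag, including the nested `<name>` lines under `<created_by>`.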

  8. #8
    Trusted Penguin
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,391
    A bit late to the party, I was working on the requirement of post #1.
    Hence my code misses the "both item and name" part.

    But fwiw: The nokogiri approach should be correct, as it understands xml.
    Basically a condition is missing to meet your requirements, but you might be able to add that yourself.
    Enough xml for me today

    Anyway, for a given input.xml
    Code:
    <?xml version="1.0"?>
    <items>
      <item name="VALUE_A">
        <properties>
          <name>VALUE_B</name>
          <path>/VALUE_C</path>
        </properties>
      </item>
      <item name="VALUE_D">
        <properties>
          <name>VALUE_E</name>
          <path>/VALUE_F</path>
        </properties>
      </item>
    </items>
    This ruby script should work:
    Code:
    #!/usr/bin/env ruby
    
    require 'nokogiri'
    
    input = Nokogiri::XML(File.open("input.xml")) do |config|
      config.nonet.strict
    end
    
    xml = <<-'EOF'
    <items>
    </items>
    EOF
    
    output = Nokogiri::XML(xml)
    o_nodes = output.root
    
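    input.xpath('//items/item').each do |item|
      # create an empty <item> node that belongs to the output document
      copynode = Nokogiri::XML::Node.new(item.name, output)
    
      # first <name> and <path> found under this item's <properties>
      attr_name = item.xpath('.//properties/name').first.text
      attr_path = item.xpath('.//properties/path').first.text
    
      copynode.set_attribute('name', attr_name)
      copynode.set_attribute('path', attr_path)
    
      o_nodes << copynode
      # note: this appends an empty <properties/> next to the new <item>, not inside it
      o_nodes << "<properties></properties>"
    end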
    
    
    puts output.to_xml

    The output is
    Code:
    <?xml version="1.0"?>
    <items>
    <item name="VALUE_B" path="/VALUE_C"/><properties/><item name="VALUE_E" path="/VALUE_F"/><properties/></items>
    Yes, the formatting is off.
    The reason is found in the Nokogiri mailing list, but I wasn't able to get that "noblanks" hint working.
    In XML, whitespace can be considered meaningful. If you parse a document that contains whitespace nodes, libxml2 will assume that whitespace nodes are meaningful and will not insert them for you.
    You can tell libxml2 that whitespace is not meaningful by passing the "noblanks" flag to the parser.

  9. #9
    Just Joined!
    Join Date
    Apr 2013
    Posts
    10
    Quote Originally Posted by atreyu
    yeah, i figured there'd be more to the XML file than just that. okay, i'll take a look at it later. in the meantime, try your hand at it! the most important line to look at is:

    Code:
    value=$(echo $line|awk -F\> '/string_to_match/{print $2}')
    Thanks, I can see the line and I see what it does. The problem I have is that I don't know how to do two things. First, make sure I only catch the first instance of <name> and ignore the others. Second, only delete the first instance of <name> and not the others. However, now that I think about it, I can delete every instance of name because the value will not carry over to the new files...

    Anyhoo, yeah, then just the first one. I can't figure out how to only catch the value of that first instance of <name> per item.
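
One way to catch only the first <name> per item is to keep a small state variable that tracks where you are in the file, and let awk do the whole pass. The following is a rough sketch, not something tested against the real file. It assumes each `<item name=...>` line is followed directly by `<properties>`, then `<name>`, then `<path>`, each on its own line, as in the example from post #5. The function name `transform_items` and the sample input are made up. Any text after the closing tags on those lines, and any other attributes on the item line, are dropped; the path value is copied as-is, leading slash included.

```shell
#!/bin/bash

# hypothetical helper: reads XML on stdin (or from file arguments), writes to stdout
transform_items() {
  awk '
    # return the text between <tag> and </tag> on one line
    function value(line, tag,    s) {
      s = line
      sub(".*<" tag ">", "", s)
      sub("</" tag ">.*", "", s)
      return s
    }
    # print any held lines unchanged (used when the expected order is broken)
    function release() {
      if (state >= 1) print item
      if (state >= 2) print props
      if (state >= 3) print nameline
      state = 0
    }
    # state: 0 = normal, 1 = item seen, 2 = properties seen, 3 = name captured
    /<item name=/                { release(); item = $0; state = 1; next }
    state == 1 && /<properties>/ { props = $0; state = 2; next }
    state == 2 && /<name>/       { nameline = $0; name = value($0, "name"); state = 3; next }
    state == 3 && /<path>/       {
      indent = item; sub(/<.*/, "", indent)   # keep the original indentation
      printf "%s<item name=\"%s\" path=\"%s\">\n", indent, name, value($0, "path")
      print props
      state = 0; next
    }
    { release(); print }
    END { release() }
  ' "$@"
}

# small made-up sample, modeled on the structure shown in post #5
transform_items <<'EOF'
<items>
  <item name="VALUE_A">
    <properties>
      <name>VALUE_B</name>
      <path>/VALUE_C</path>
      <created_by id="128799">
        <name>Some Dude</name>
      </created_by>
    </properties>
  </item>
</items>
EOF
```

On a real file it could be run as `transform_items < test.xml > new.xml`. Because the state drops back to 0 once the `<path>` line has been consumed, later `<name>` lines such as the one under `<created_by>` pass through untouched. If an item does not follow the expected order, the held lines are printed unchanged rather than lost.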

  10. #10
    Just Joined!
    Join Date
    Apr 2013
    Posts
    10
    Quote Originally Posted by Irithori
    Dude, I really appreciate the help, but that's so far over my head. I'm just learning Linux. I've considered learning Ruby, but I went with Python instead for right now. Of course, I'm only a couple weeks into my Learning Python book so that's no help with this particular problem.

    Thanks, anyway.

    -Dan
