  1. #1
    Just Joined!
    Join Date
    Apr 2010
    Posts
    2

    cutting an html file apart - perl? sed? awk?

    I have an HTML file like this:

    HTML Code:
    ....HTML code....
    <div 
    class="IOSSectionTitle">A title</div>
                                                           <div><div 
    class="level2">Presentation</div>
    <div class="p">Sometext on multiple lines like this.Sometext on 
    multiple lines like this.Sometext on multiple lines like this.
    Sometext on multiple lines like this.Sometext on multiple lines like this.</div>
    </div>
                                                        </td>
                                                     </tr>
                                                  </tbody></table>
                                               </div>
                                            </div>
                                            <div style="margin-left: 20px;"><img
    ...some more HTML code...
    I would like to cut the above text so I get this:
    Sometext on multiple lines like this.Sometext on
    multiple lines like this.Sometext on multiple lines like this.
    Sometext on multiple lines like this.Sometext on multiple lines like this.


    There are other HTML files with similar cuts I need to do, but once I have the method for doing one, I am sure I can do the others.

    I think the two logical strings to cut between would be:
    class="IOSSectionTitle"

    and

    <img

    I am not sure whether these strings are always at the start and end of a line respectively, if that makes a difference. The HTML tags would then need to be stripped to leave the text on its own.


    I know the commands for removing tags, but searching for a string like class="IOSSectionTitle" and cutting everything before it (and likewise everything after the end string) is what I am finding challenging.


    Just thought I would add that the HTML does not necessarily break onto logical new lines throughout the file, and there may be unexpected line breaks, but as far as I know class="IOSSectionTitle" and <img always appear as unbroken strings, with no line breaks inside them.

  2. #2
    Linux Engineer GNU-Fan's Avatar
    Join Date
    Mar 2008
    Posts
    935
    Debian GNU/Linux -- You know you want it.

  3. #3
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    Many choices, here are a few (including html2text as noted by GNU-Fan):
    Code:
    #!/usr/bin/env bash
    
    # @(#) s1	Demonstrate text extraction from html fragment.
    
    # Uncomment to run script as external user.
    # export PATH="/usr/local/bin:/usr/bin:/bin"
    # Infrastructure details, environment, commands for forum posts. 
    set +o nounset
    pe() { for i;do printf "%s" "$i";done; printf "%s\n"; }
    LC_ALL=C ; LANG=C ; export LC_ALL LANG
    pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
    pe "(Versions displayed with local utility \"version\")"
    c=$( ps | grep $$ | awk '{print $NF}' )
    version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
    [ "$c" = "$s" ] && p="$s" || p="$c"
    version >/dev/null 2>&1 && version "=o" $p printf specimen html2text html2 lynx
    set -o nounset
    pe
    
    # Input file: first argument, defaulting to data1.
    FILE=${1-data1}
    
    # Sample file, with head & tail as a last resort.
    pe " || start [ first:middle:last ]"
    specimen 10 "$FILE" \
    || { pe "(head/tail)"; head -n 5 "$FILE"; pe " ||"; tail -n 5 "$FILE"; }
    pe " || end"
    
    pe
    pe " Results, html2text:"
    html2text "$FILE"
    
    pe
    pe " Results, html2:"
    html2 < "$FILE"
    
    pe
    pe " Results, lynx:"
    lynx -dump -force_html "$FILE"
    
    exit 0
    producing:

    Code:
    % ./s1
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
    Distribution        : Debian GNU/Linux 5.0 
    GNU bash 3.2.39
    printf - is a shell builtin [bash]
    specimen (local) 1.15
    html2text - ( /usr/bin/html2text Jul 23 2008 )
    html2 - ( /usr/bin/html2 Aug 26 2009 )
    Lynx Version 2.8.7dev.9 (27 Apr 2008)
    
     || start [ first:middle:last ]
    Whole: 10:0:10 of 16 lines in file "data1"
    <div 
    class="IOSSectionTitle">A title</div>
                                                           <div><div 
    class="level2">Presentation</div>
    <div class="p">Sometext on multiple lines like this.Sometext on 
    multiple lines like this.Sometext on multiple lines like this.
    Sometext on multiple lines like this.Sometext on multiple lines
    like this.</div>
    </div>
                                                        </td>
                                                     </tr>
                                                  </tbody></table>
                                               </div>
                                            </div>
                                            <div style="margin-left:
    20px;"><img
     || end
    
     Results, html2text:
    A title
    Presentation
    Sometext on multiple lines like this.Sometext on multiple lines like
    this.Sometext on multiple lines like this. Sometext on multiple lines like
    this.Sometext on multiple lines like this.
    
     Results, html2:
    /html/body/div/@class=IOSSectionTitle
    /html/body/div=A title
    /html/body= 
    /html/body/div/div/@class=level2
    /html/body/div/div=Presentation
    /html/body/div= 
    /html/body/div/div/@class=p
    /html/body/div/div=Sometext on multiple lines like this.Sometext on multiple lines like this.Sometext on multiple lines like this. Sometext on multiple lines like this.Sometext on multiple lines like this.
    /html/body/div= 
    error: Unexpected end tag : td
    error: Unexpected end tag : tr
    error: Unexpected end tag : tbody
    error: Unexpected end tag : table
    error: Unexpected end tag : div
    error: Unexpected end tag : div
    /html/body= 
    /html/body/div/@style=margin-left: 20px;
    error: Couldn't find end of Start Tag img
    /html/body/div/img
    
     Results, lynx:
       A title
    
       Presentation
       Sometext on multiple lines like this.Sometext on multiple lines like
       this.Sometext on multiple lines like this. Sometext on multiple lines
       like this.Sometext on multiple lines like this.
    See man pages for details ... cheers, drl

  4. #4
    Just Joined!
    Join Date
    Apr 2010
    Posts
    2
    That's excellent. I am not in front of my Linux PC right now. Will that cut away the rest of the file too, so that only the fragment above is left?
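
    Going by the run shown in post #3, the tools convert the file they are given in full; data1 there held only the 16-line fragment, so nothing else had to be removed. On a complete page the surrounding HTML would be converted as well, so the cut has to happen first. What follows is only a rough sketch of that cut step in shell and awk, not a tested solution for the real pages. It assumes each marker occurs once and is never split across lines (as stated in post #1); cut_fragment and page.html are made-up names.

```shell
#!/bin/sh
# Sketch: keep only the text between class="IOSSectionTitle" and <img.
# Assumption: each marker occurs once and is never broken across lines.

cut_fragment() {
    awk '
        { text = text $0 " " }    # join lines so tags split across lines become whole
        END {
            start = index(text, "class=\"IOSSectionTitle\"")
            if (start == 0) exit 1               # start marker not found
            text = substr(text, start)           # drop everything before the marker
            stop = index(text, "<img")
            if (stop > 0) text = substr(text, 1, stop - 1)   # drop <img and the rest
            sub(/^[^>]*>/, "", text)             # drop the tail of the tag the cut landed in
            gsub(/<[^>]*>/, "\n", text)          # every remaining tag becomes a line break
            n = split(text, part, "\n")
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]+/, " ", part[i])     # squeeze runs of blanks
                sub(/^ /, "", part[i]); sub(/ $/, "", part[i])
                if (part[i] != "") print part[i]
            }
        }
    ' "$@"
}

# Usage (hypothetical file name): sh cut.sh page.html
if [ "$#" -gt 0 ]; then
    cut_fragment "$@"
fi
```

    Because the lines are joined before the tags are stripped, each text block comes out on a single line; the original line breaks inside the paragraph are not preserved. The title and the "Presentation" heading are still printed, since they lie between the two markers.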

  5. #5
    Linux Enthusiast scathefire's Avatar
    Join Date
    Jan 2010
    Location
    Western Kentucky
    Posts
    616
    Perl also has a module called HTML::Strip.
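
    Tag strippers keep every piece of text they find, so in the html2text and lynx output above the title and the "Presentation" heading survive alongside the paragraph. If only the class="p" text is wanted, as in the original request, something along these lines might do. It is a sketch in shell and awk rather than Perl, and it assumes the paragraph div contains no nested <div> and that the attribute is written exactly as class="p"; extract_paragraphs is a made-up name.

```shell
#!/bin/sh
# Sketch: print the text of every <div class="p"> ... </div> element.
# Assumptions: no <div> nested inside the paragraph div, and the
# attribute is written exactly as class="p".

extract_paragraphs() {
    awk '
        { text = text $0 " " }    # join lines; a tag broken across lines becomes whole
        END {
            while (match(text, /<div[ \t]+class="p"[^>]*>/)) {
                text = substr(text, RSTART + RLENGTH)   # skip past the opening tag
                stop = index(text, "</div>")
                body = (stop > 0) ? substr(text, 1, stop - 1) : text
                gsub(/<[^>]*>/, "", body)               # strip any inline tags
                gsub(/[ \t]+/, " ", body)               # squeeze runs of blanks
                sub(/^ /, "", body); sub(/ $/, "", body)
                if (body != "") print body
                if (stop == 0) break                    # unclosed div: nothing more to scan
                text = substr(text, stop)
            }
        }
    ' "$@"
}

# Usage (hypothetical file name): sh para.sh page.html
if [ "$#" -gt 0 ]; then
    extract_paragraphs "$@"
fi
```

    Each matching paragraph is printed on one line, so the line breaks inside the original text are lost. For anything less regular than this sample, a real HTML parser is the safer choice.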
    linux user # 503963
