- 04-13-2010 #1 Just Joined!
- Join Date
- Apr 2010
- Posts
- 2
cutting an html file apart - perl? sed? awk?
I have an html file like this
HTML Code:
....HTML code....
<div class="IOSSectionTitle">A title</div>
<div><div class="level2">Presentation</div>
<div class="p">Sometext on multiple lines like this.Sometext on multiple lines like this.Sometext on multiple lines like this.
Sometext on multiple lines like this.Sometext on multiple lines like this.</div>
</div>
</td>
</tr>
</tbody></table>
</div>
</div>
<div style="margin-left: 20px;"><img
...some more HTML code...
I would like to cut the above text so I get this:
Sometext on multiple lines like this.Sometext on
multiple lines like this.Sometext on multiple lines like this.
Sometext on multiple lines like this.Sometext on multiple lines like this.
There are other HTML files with similar cuts I need to do, but once I have the method for doing one, I am sure I can do the others.
I think the two logical strings to cut between would be:
class="IOSSectionTitle"
and
<img
I am not sure whether these strings are always at the start and end of a line respectively, if that makes a lot of difference! The HTML tags would then need to be stripped to get the text on its own.
I know the commands for removing tags, but searching for a string like class="IOSSectionTitle" and cutting everything before it (and likewise everything from <img onwards) is something I am finding challenging.
Just thought I would add that the HTML does not necessarily appear on logical new lines throughout the file, and there may be unexpected new lines, but as far as I know class="IOSSectionTitle" and <img always appear as strings without any new lines between their characters.
- 04-13-2010 #2 Debian GNU/Linux -- You know you want it.
- 04-13-2010 #3 Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
Hi.
Many choices, here are a few (including html2text as noted by GNU-Fan):
Code:
#!/usr/bin/env bash

# @(#) s1 Demonstrate text extraction from html fragment.

# Uncomment to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
# Infrastructure details, environment, commands for forum posts.
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "%s\n"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe ; pe "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
pe "(Versions displayed with local utility \"version\")"
c=$( ps | grep $$ | awk '{print $NF}' )
version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
[ "$c" = "$s" ] && p="$s" || p="$c"
version >/dev/null 2>&1 && version "=o" $p printf specimen html2text html2 lynx
set -o nounset
pe

FILE=${1-data1}

# Sample file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen 10 $FILE \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"

pe
pe " Results, html2text:"
html2text data1

pe
pe " Results, html2:"
html2 < data1

pe
pe " Results, lynx:"
lynx -dump -force_html data1

exit 0
producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0
GNU bash 3.2.39
printf - is a shell builtin [bash]
specimen (local) 1.15
html2text - ( /usr/bin/html2text Jul 23 2008 )
html2 - ( /usr/bin/html2 Aug 26 2009 )
Lynx Version 2.8.7dev.9 (27 Apr 2008)

 || start [ first:middle:last ]
Whole: 10:0:10 of 16 lines in file "data1"
<div class="IOSSectionTitle">A title</div>
<div><div class="level2">Presentation</div>
<div class="p">Sometext on multiple lines like this.Sometext on multiple lines like this.Sometext on multiple lines like this.
Sometext on multiple lines like this.Sometext on multiple lines like this.</div>
</div>
</td>
</tr>
</tbody></table>
</div>
</div>
<div style="margin-left: 20px;"><img
 || end

 Results, html2text:
A title
Presentation
Sometext on multiple lines like this.Sometext on multiple lines like this.Sometext on multiple lines like this.
Sometext on multiple lines like this.Sometext on multiple lines like this.

 Results, html2:
/html/body/div/@class=IOSSectionTitle
/html/body/div=A title
/html/body=
/html/body/div/div/@class=level2
/html/body/div/div=Presentation
/html/body/div=
/html/body/div/div/@class=p
/html/body/div/div=Sometext on multiple lines like this.Sometext on multiple lines like this.Sometext on multiple lines like this. Sometext on multiple lines like this.Sometext on multiple lines like this.
/html/body/div=
error: Unexpected end tag : td
error: Unexpected end tag : tr
error: Unexpected end tag : tbody
error: Unexpected end tag : table
error: Unexpected end tag : div
error: Unexpected end tag : div
/html/body=
/html/body/div/@style=margin-left: 20px;
error: Couldn't find end of Start Tag img
/html/body/div/img

 Results, lynx:
A title
Presentation
Sometext on multiple lines like this.Sometext on multiple lines like this.Sometext on multiple lines like this.
Sometext on multiple lines like this.Sometext on multiple lines like this.
See man pages for details ... cheers, drl
Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
We look forward to helping you with the challenge of the other 10%.
( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )
- 04-14-2010 #4 Just Joined!
- Join Date
- Apr 2010
- Posts
- 2
That's excellent. I am not in front of my Linux PC right now, so I cannot try it yet. Will that cut the rest of the file apart too, so that just the fragment above is left?
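The converters used earlier in the thread (html2text, html2, lynx) turn whatever input they are given into text, so isolating the fragment is a separate step that has to come first. A minimal sketch of that step follows, assuming the two marker strings from the first post. The function name cut_fragment is made up for illustration. It collects the whole input into one string before searching, so it should not matter where the line breaks fall or whether the markers sit in the middle of a line.

```shell
# cut_fragment is a hypothetical helper name, not something from the thread.
# It prints everything from the tag that contains class="IOSSectionTitle"
# up to, but not including, the first "<img" that follows it.
cut_fragment() {
    awk -v begin_marker='class="IOSSectionTitle"' -v end_marker='<img' '
        { text = text $0 "\n" }                 # collect the whole input
        END {
            s = index(text, begin_marker)
            if (s == 0) exit 1                  # start marker not found
            # back up to the "<" that opens the tag holding the marker
            while (s > 1 && substr(text, s, 1) != "<") s--
            rest = substr(text, s)
            e = index(rest, end_marker)
            if (e > 0) rest = substr(rest, 1, e - 1)
            print rest
        }' "$@"
}
```

With a file called page.html (a placeholder name), the result could then be piped into one of the converters, for example `cut_fragment page.html | lynx -dump -force_html -stdin`. If the end marker is missing, this sketch keeps everything from the start marker to the end of the input.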
- 04-14-2010 #5
Perl also has a module called HTML::Strip.
linux user # 503963
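Where neither HTML::Strip nor html2text is installed, a rough approximation can be sketched with awk alone. This is only a regex-based strip, not an HTML parser: it assumes no ">" appears inside attribute values, it leaves entities such as &amp; untouched, and an unterminated tag is left in place. The function name strip_tags is made up for illustration.

```shell
# strip_tags is a hypothetical helper name, not something from the thread.
# Crude regex-based tag removal; it does not parse HTML.
strip_tags() {
    awk '
        { text = text $0 "\n" }             # collect the input so tags may span lines
        END {
            gsub(/<[^>]*>/, "", text)       # drop anything that looks like a complete tag
            printf "%s", text
        }' "$@"
}
```

It reads standard input when no file name is given, so it can follow whatever step isolates the wanted part of the file.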

