- 01-08-2012 #1 Just Joined!
- Join Date: Oct 2008
- Posts: 9
Parsing an HTML table
Hi there,
I need to change this:
<head> <body> <table border="1"> <tr> <th>AAA</th> <th>BBB</th> <th>CCC</th> </tr> <tr> <td>1</td> <td>11</td> <td>111</td> </tr> <tr> <td>2</td> <td>22</td> <td>222</td> </tr> <tr> <td>3</td> <td>33</td> <td>333</td> </tr> </table> </body> </head>
into a CSV table.
I have learned how to get rid of the newline characters:
sed -i ':a;N;$!ba;s/\n/ /g' t1.html
The next step would be to copy everything between <table border="1"> and </table> to a new file. So the question is: how do I use sed, grep, awk, etc. to accomplish that?
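For reference, the extraction step asked about above can be sketched with tr, sed and awk alone. This is only a rough sketch, assuming a single simple table with no nested tables, no attributes that contain ">", and no commas or markup inside the cells; the function name table_to_csv is made up for illustration, and a real HTML parser is the safer choice for anything less regular.

```shell
#!/usr/bin/env bash
# Sketch only: convert ONE simple HTML table read from stdin to CSV.
# Assumes no nested tables and no commas or tags inside the cells.
# table_to_csv is a hypothetical helper name, not an existing utility.
table_to_csv() {
  # 1. Join all input lines into a single line.
  tr '\n' ' ' |
  # 2. Keep only what lies between <table ...> and </table>.
  sed -e 's/.*<table[^>]*>//' -e 's/<\/table>.*//' |
  # 3. Put each table row on a line of its own.
  awk '{ gsub(/<\/tr>/, "\n"); print }' |
  # 4. Cell boundaries become commas, leftover tags are dropped,
  #    surrounding whitespace is trimmed, empty lines are deleted.
  sed -e 's/<\/t[dh]>[[:space:]]*<t[dh][^>]*>/,/g' \
      -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
      -e '/^$/d'
}

# Demo on the sample data from this post.
printf '%s\n' '<head> <body> <table border="1"> <tr> <th>AAA</th> <th>BBB</th> <th>CCC</th> </tr> <tr> <td>1</td> <td>11</td> <td>111</td> </tr> <tr> <td>2</td> <td>22</td> <td>222</td> </tr> <tr> <td>3</td> <td>33</td> <td>333</td> </tr> </table> </body> </head>' |
  table_to_csv
```

On the sample above this should print AAA,BBB,CCC followed by 1,11,111, 2,22,222 and 3,33,333, one row per line. To process a file, something like table_to_csv < t1.html > t1.csv would do.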
- 01-08-2012 #2 Linux Guru
- Join Date: May 2011
- Posts: 1,843
If you have Python installed (and chances are you do), you can try the text2html.py Python script. It will convert your HTML to plain text. You can then redirect the output to a file.
Last edited by atreyu; 01-08-2012 at 08:20 PM. Reason: removed direct link to script
- 01-09-2012 #3 Linux Engineer
- Join Date: Apr 2006
- Location: Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts: 1,117
Hi.
A utility with a name similar to the one atreyu mentioned, html2text, but written in C++, is also available. Here is an example showing the context, your input file, the intermediate results file, and those results after being changed by sed:
Code:
#!/usr/bin/env bash

# @(#) s1 Demonstrate conversion of HTML to text, tables, html2text.
# http://www.mbayer.de/html2text/

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C html2text

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results, intermediate file:"
html2text $FILE |
tee f1 |
sed -e 's/^|//' -e 's/|$//' -e 's/|/,/g' > f2
cat f1

pl " Results, final (massaged by sed):"
cat f2

exit 0

producing:

Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
GNU bash 3.2.39
html2text - ( /usr/bin/html2text, 2008-07-23 )

-----
 Input data file data1:
<head> <body> <table border="1"> <tr> <th>AAA</th> <th>BBB</th> <th>CCC</th> </tr> <tr> <td>1</td> <td>11</td> <td>111</td> </tr> <tr> <td>2</td> <td>22</td> <td>222</td> </tr> <tr> <td>3</td> <td>33</td> <td>333</td> </tr> </table> </body> </head>

-----
 Results, intermediate file:
|AAA|BBB|CCC|
|1 |11 |111|
|2 |22 |222|
|3 |33 |333|

-----
 Results, final (massaged by sed):
AAA,BBB,CCC
1 ,11 ,111
2 ,22 ,222
3 ,33 ,333

The code is available already compiled in the repositories for (at least) Debian and CentOS, and the source is available at the URL noted in the script comments.
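The final results above still carry the padding spaces that html2text uses to align the columns ("1 ,11 ,111"). If those are unwanted, the sed step can be widened to swallow whitespace around each "|". The following is a small sketch, not part of the original script: the function name rows_to_csv is made up for illustration, and the sample rows are copied from the intermediate results shown above so that it runs without html2text installed.

```shell
#!/usr/bin/env bash
# Sketch only: convert html2text-style table rows (|a |b |c|) read from
# stdin into CSV, dropping the padding spaces around each separator.
# rows_to_csv is a hypothetical helper name.
rows_to_csv() {
  sed -e 's/^[[:space:]]*|[[:space:]]*//' \
      -e 's/[[:space:]]*|[[:space:]]*$//' \
      -e 's/[[:space:]]*|[[:space:]]*/,/g'
}

# Demo on the intermediate rows from the output above.
printf '%s\n' '|AAA|BBB|CCC|' '|1 |11 |111|' '|2 |22 |222|' '|3 |33 |333|' |
  rows_to_csv
```

With html2text installed, the equivalent pipeline would be html2text data1 | rows_to_csv > f2. Note that this also strips leading and trailing spaces that genuinely belong to a cell, which is usually what is wanted for CSV but is still an assumption.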
Best wishes ... cheers, drl