  1. #1
    Just Joined!
    Join Date
    Oct 2008
    Posts
    9

    Parsing an HTML table

    Hi there,
    I need to change this:

    <head> <body> <table border="1"> <tr> <th>AAA</th> <th>BBB</th> <th>CCC</th> </tr> <tr> <td>1</td> <td>11</td> <td>111</td> </tr> <tr> <td>2</td> <td>22</td> <td>222</td> </tr> <tr> <td>3</td> <td>33</td> <td>333</td> </tr> </table> </body> </head>

    into a CSV table.

    I have learned how to replace the newline characters with spaces, so the whole file ends up on one line:
    sed -i ':a;N;$!ba;s/\n/ /g' t1.html

    The next step would be to copy everything between <table border="1"> and </table> to a new file. So the question is: how do I use sed, grep, awk, etc. to accomplish that?
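
One possible way to do the extraction step asked about above, as a minimal sketch rather than a definitive answer. It assumes the newlines have already been joined (so the whole table sits on one line) and that the file contains exactly one table. The output name `table.html` is made up for illustration.

```shell
# Recreate the one-line input from the post (t1.html after joining lines).
printf '%s\n' '<head> <body> <table border="1"> <tr> <th>AAA</th> <th>BBB</th> <th>CCC</th> </tr> <tr> <td>1</td> <td>11</td> <td>111</td> </tr> <tr> <td>2</td> <td>22</td> <td>222</td> </tr> <tr> <td>3</td> <td>33</td> <td>333</td> </tr> </table> </body> </head>' > t1.html

# Keep only the span from <table border="1"> through </table>.
# -n suppresses the default output; the trailing p prints a line only
# when the substitution matched, so lines without a table are dropped.
sed -n 's/.*\(<table border="1">.*<\/table>\).*/\1/p' t1.html > table.html

cat table.html
```

With more than one table on the line, the greedy `.*` would not split them cleanly, so a real HTML parser is the safer choice for anything beyond this simple case.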

  2. #2
    Linux Guru
    Join Date
    May 2011
    Posts
    1,843
    If you have Python installed (and chances are you do), you can try the text2html.py Python script. It converts your HTML to plain text, and you can then redirect the output to a file.
    Last edited by atreyu; 01-08-2012 at 08:20 PM. Reason: removed direct link to script

  3. #3
    drl
    Linux Engineer
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    A utility with a name similar to the one atreyu mentioned, html2text, is also available; it is written in C++. Here is an example showing the context, your input file, the intermediate results file, and those results after being changed by sed:
    Code:
    #!/usr/bin/env bash
    
    # @(#) s1	Demonstrate conversion of HTML to text, tables, html2text.
    # http://www.mbayer.de/html2text/
    
    # Utility functions: print-as-echo, print-line-with-visual-space, debug.
    # export PATH="/usr/local/bin:/usr/bin:/bin"
    pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
    pl() { pe;pe "-----" ;pe "$*"; }
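    db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
    # Redefine db as a no-op to switch debug output off.
    db() { : ; }
    # Show the environment with the local "context" helper, if it exists.
    C=$HOME/bin/context && [ -f "$C" ] && "$C" html2text
    
    # Input file: first argument, or data1 by default.
    FILE=${1-data1}
    
    pl " Input data file $FILE:"
    cat "$FILE"
    
    pl " Results, intermediate file:"
    # Save the raw html2text output in f1, then strip the outer pipes
    # and turn the inner pipes into commas, writing the result to f2.
    html2text "$FILE" |
    tee f1 |
    sed -e 's/^|//' -e 's/|$//' -e 's/|/,/g' > f2
    cat f1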
    
    pl " Results, final (massaged by sed):"
    cat f2
    
    exit 0
    producing:
    Code:
    % ./s1
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
    Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
    GNU bash 3.2.39
    html2text - ( /usr/bin/html2text, 2008-07-23 )
    
    -----
     Input data file data1:
    <head> <body> <table border="1"> <tr> <th>AAA</th> <th>BBB</th> <th>CCC</th> </tr> <tr> <td>1</td> <td>11</td> <td>111</td> </tr> <tr> <td>2</td> <td>22</td> <td>222</td> </tr> <tr> <td>3</td> <td>33</td> <td>333</td> </tr> </table> </body> </head>
    
    -----
     Results, intermediate file:
                
    |AAA|BBB|CCC|
    |1  |11 |111|
    |2  |22 |222|
    |3  |33 |333|
    
    -----
     Results, final (massaged by sed):
                
    AAA,BBB,CCC
    1  ,11 ,111
    2  ,22 ,222
    3  ,33 ,333
    html2text is available precompiled in the repositories for (at least) Debian and CentOS, and the source is available at the URL noted in the script comments.
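
The final output above still carries the column padding from the intermediate table (for example `1  ,11 ,111`). If that matters to whatever reads the CSV, the sed step can be adjusted to drop it. This is a sketch only: it assumes the intermediate format shown above and cells that contain no `|` or `,` characters, and the output name `f3` is made up here.

```shell
# Intermediate table as shown in the example output above.
printf '%s\n' '            ' '|AAA|BBB|CCC|' '|1  |11 |111|' '|2  |22 |222|' '|3  |33 |333|' > f1

# 1. Drop lines that contain no pipe at all (the blank padding line).
# 2. Remove the leading and trailing pipe together with nearby spaces.
# 3. Replace each remaining pipe, plus the spaces around it, with a comma.
sed -e '/|/!d' \
    -e 's/^ *| *//' \
    -e 's/ *| *$//' \
    -e 's/ *| */,/g' f1 > f3

cat f3
```

Cells that themselves contain commas or quotes would need proper CSV quoting, which this sketch does not attempt.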

    Best wishes ... cheers, drl
