  1. #1
    Just Joined!
    Join Date
    Aug 2008
    Posts
    16

    Replace path in multiple HTML files


    Hi,

    I've used wget (with --mirror) to download a site. That part worked fine, but the paths in the HTML files are now incorrect. They refer to URLs served from a DB rather than to the static HTML files.

    I've got a list of the incorrect hrefs, and I have a list of what I would like them to be.

    How can I automate going through all the files, searching for the bad hrefs and replacing them with the good?

    So, for example, an incorrect href may be

    <a href="/meat-veg/meat_veg_more_meat">meat</a>

    desired href would be

    <a href="meat_veg_more_meat.html">meat</a>

    They won't all be meat related, and the number of special characters such as _ changes.

    Any ideas please?

    Thank you in advance

  2. #2
    Trusted Penguin
    Join Date
    May 2011
    Posts
    4,353
    Hi,

    You could do this with a sed command in a bash loop. Here's an example (assuming your bad hrefs are in *.html files in the current dir):

    Code:
    while read file; do
      cat "$file" | sed -e 's|a href="/.*/\(.*\)">|a href="\1.html">|'
    done < <(ls *.html)
    Test that and see if it does what you want. If so, you can replace

    Code:
    cat "$file" | sed -e ...
    with

    Code:
    sed -i.bak ...
    to have sed modify the HTML files (and also save backups of the originals with a ".bak" file extension).
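One thing worth checking before running the substitution on real pages: both `.*` parts of the pattern are greedy, so a line holding more than one link may be collapsed into a single href. The sketch below compares the pattern from this post with a tighter variant that uses `[^"]*` and the `g` flag. The sample line and the variable names (`line`, `greedy`, `tight`) are made up for illustration, and the tighter pattern is only one possible adjustment, not something taken from the thread.

```shell
#!/bin/bash
# Hypothetical sample line with two links on it (not from the original site).
line='<a href="/meat-veg/meat_veg_more_meat">meat</a> <a href="/fruit/apple_pie">pie</a>'

# Pattern from the post: each .* is greedy, so the match can run from the
# first href up to the last "> on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's|a href="/.*/\(.*\)">|a href="\1.html">|')

# Tighter variant (an assumption, not the poster's code): [^"]* cannot cross
# a closing quote, and the g flag rewrites every link on the line.
tight=$(printf '%s\n' "$line" | sed -e 's|a href="/[^"]*/\([^"/]*\)">|a href="\1.html">|g')

printf 'greedy: %s\n' "$greedy"
printf 'tight:  %s\n' "$tight"
```

If your pages always have one link per line, the original pattern behaves the same as the tighter one.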

  3. #3
    Just Joined!
    Join Date
    Aug 2008
    Posts
    16
    Hi,

    Thanks for your help.

    The test of the script looked good, so I changed it as instructed to
    Code:
    #!/bin/bash
    while read file; do
      sed -i.bak 's|a href="/.*/\(.*\)">|a href="\1.html">|'
    done < <(ls *.html)
    but now it's echoing this

    Code:
    sed: no input files
    All the .html files are in the folder the script is being run from...

    Thanks

    Last edited by atreyu; 05-02-2013 at 10:32 PM. Reason: added CODE tags

  4. #4
    Trusted Penguin
    Join Date
    May 2011
    Posts
    4,353
    Quote Originally Posted by yeleek View Post
    The test of the script looked good, so I changed it as instructed to

    #!/bin/bash
    while read file; do
    sed -i.bak 's|a href="/.*/\(.*\)">|a href="\1.html">|'
    done < <(ls *.html)

    but now it's echoing this

    sed: no input files
    Duh, that's because I forgot to add the filename at the end of the sed line. You have to do that because with -i sed edits the named file in place, instead of filtering the text that cat piped into it in the test code. So add the file name to the end of the sed line:

    Code:
      sed -i.bak 's|a href="/.*/\(.*\)">|a href="\1.html">|' "$file"
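Putting the pieces of the thread together, a complete run might look like the sketch below. It is a minimal example under stated assumptions: the temporary directory and the two sample pages (`page1.html`, `page2.html`) are invented so the script is safe to try, and the loop iterates over the `*.html` glob directly rather than reading `ls` output, which is a swapped-in variation on the loop shown earlier. The sed expression is the one from the thread.

```shell
#!/bin/bash
# Work in a throwaway directory with hypothetical sample pages,
# so nothing real is modified while testing.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '<a href="/meat-veg/meat_veg_more_meat">meat</a>' > page1.html
printf '%s\n' '<a href="/fruit/apple_pie">pie</a>' > page2.html

# Loop over the glob itself; quoting "$file" keeps names with spaces intact.
for file in *.html; do
  # -i.bak edits the file in place and keeps the original as <name>.bak.
  # The file name at the end is required when -i is used.
  sed -i.bak 's|a href="/.*/\(.*\)">|a href="\1.html">|' "$file"
done

# Show the rewritten pages; the untouched originals are in *.html.bak.
cat page1.html page2.html
```

Once the output looks right, the same loop can be run in the directory holding the mirrored site, and `diff page1.html.bak page1.html` (or similar) shows exactly what changed in a given file.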
