  1. #1
    Linux Newbie egan
    Join Date
    Feb 2009
    Location
    Mountain View, CA
    Posts
    132

    Complex Substitution - Stuck

    I'm trying to write a shell script that will prepare website resources for offline use. Even once you have saved the entire web page, the links are broken because they are given as URLs rather than relative paths.

    I want to be able to scan index.html for <a href="*"*</a> and then replace the link with a file path. For example, replace:
    <a href="http://www.example.org/stuff/fun.html"Fun Stuff</a>
    with
    <a href="./stuff/fun.html"Fun Stuff</a>

    I have limited knowledge of perl and sed, and no knowledge of awk. Any advice would be greatly appreciated. I think the hardest part for me is passing the discovered file name to the text processor to be substituted.

  2. #2
    Linux User
    Join Date
    May 2008
    Location
    NYC, moved from KS & MO
    Posts
    251
    I guess you missed a > in
    <a href="http://www.example.org/stuff/fun.html"Fun Stuff</a>
    [ right before Fun ]

    Anyway try the code below to see if it works:
    Code:
    perl -p0e 's|<a href="(http://.*?\/)(.*")|<a href="./$2|g' index.html

  3. #3
    Just Joined!
    Join Date
    Apr 2009
    Posts
    33
    original: <a href="http://www.example.org/stuff/fun.html">Fun Stuff</a>

    sed 's?\(href="\)http://www\.example\.org\([^>]*\)?\1.\2?g'

    result: <a href="./stuff/fun.html">Fun Stuff</a>

    actually you don't need the leading dot '.':

    sed 's?\(href="\)http://www\.example\.org/\([^>]*\)?\1\2?g'

    result: <a href="stuff/fun.html">Fun Stuff</a>

    so replacing
    <a href="http://www.example.org/stuff/fun.html">Fun Stuff</a>
    with
    <a href="stuff/fun.html">Fun Stuff</a>
    would work perfectly.

    I assume that it's in lowercase, but to be on the safe side, if you want
    to cover cases where you have
    HREF=... or Href=... etc.,

    you'd better replace this

    sed 's?\(href="\)http://www\.example\.org/\([^>]*\)?\1\2?g'

    with this

    sed 's?[hH][rR][eE][fF]="http://www\.example\.org/\([^>]*\)?href="\1?g'
    Last edited by vonbiber; 07-05-2009 at 02:37 PM. Reason: typo
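
A quick way to check a substitution like this against the sample link is to wrap it in a shell function and feed it one line. This is only a sketch: it assumes the pattern is written with the site prefix as http://www.example.org/, and the function name localize_example_org is made up for illustration.

```shell
# Hypothetical helper: strip the hard-coded http://www.example.org/ prefix
# from href attributes, whatever the case of "href".
localize_example_org() {
    sed 's?[hH][rR][eE][fF]="http://www\.example\.org/\([^>]*\)?href="\1?g'
}

# Try it on the sample link, written with an uppercase HREF.
printf '%s\n' '<a HREF="http://www.example.org/stuff/fun.html">Fun Stuff</a>' \
    | localize_example_org
# prints: <a href="stuff/fun.html">Fun Stuff</a>
```

Links that point at any other host or protocol pass through unchanged, because the prefix is hard coded in the pattern.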

  4. #4
    Linux Newbie egan
    Join Date
    Feb 2009
    Location
    Mountain View, CA
    Posts
    132
    The missing > was just a typo...

    Looks like you guys gave me quite a lot to chew on. I'll see if I can decipher that...

    Thanks a bunch.

  5. #5
    Linux Newbie egan
    Join Date
    Feb 2009
    Location
    Mountain View, CA
    Posts
    132
    I'm sorry; upon looking at this I see that I may not have explained adequately. I am trying to get a general purpose script, where the domains and protocols of the links themselves are not known.

    It looks like the perl script does this though and I ought to be able to hack out a regexp for the sed, it being my preference.

  6. #6
    Linux User
    Join Date
    May 2008
    Location
    NYC, moved from KS & MO
    Posts
    251
    where the domains and protocols of the links themselves are not known.
    Simply modify the original perl one-liner into
    Code:
    perl -p0e 's|<a href="(.*://.*?/)(.*")|<a href="./$2|g' index.html
    should solve the problem.

    ...hack out a regexp for the sed, it being my preference.
    I was trying to use sed initially, but it turned out to be quite hard to make sed non-greedy. For example, the regexp
    .*://.*/ in sed
    will try to match
    http://www.blahblahexample.com/stuff/
    instead of
    http://www.blahblahexample.com/

    That's why I turned to perl instead. But if you have to use sed, you might try ssed, which is said to support non-greedy matching.
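
One common workaround in plain sed, which has no non-greedy quantifier, is a negated character class: [^/"]* cannot cross a slash or a quote, so it stops at the end of the host name. A minimal sketch, assuming every link has a path after the host and that the saved files mirror the site's directory layout; the function name strip_origin is made up for illustration.

```shell
# Hypothetical helper: rewrite href="scheme://host/path" to href="./path".
# [a-zA-Z][a-zA-Z]* matches the protocol name; [^/"]* matches the host only,
# because it cannot run past the first slash or the closing quote.
strip_origin() {
    sed 's|href="[a-zA-Z][a-zA-Z]*://[^/"]*/|href="./|g'
}

printf '%s\n' '<a href="http://www.blahblahexample.com/stuff/fun.html">Fun Stuff</a>' \
    | strip_origin
# prints: <a href="./stuff/fun.html">Fun Stuff</a>
```

Because the class cannot overrun, this also copes with several links on one line, which a greedy .* would merge into a single match.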

  7. #7
    Banned
    Join Date
    Jun 2009
    Posts
    68
    It's easy to do what you're talking about using sed, secondmouse. I'd have to brush up on sed, but it's easy once you know the details. PERL is also greedy. I use PERL and read a lot of tutorials/documentation on it. It's greedy, greedy, greedy. If you know how to do things, you know how to do things. Same with sed. No offense.

  8. #8
    Linux User
    Join Date
    May 2008
    Location
    NYC, moved from KS & MO
    Posts
    251
    @nopycckn, it's true that regex matching is greedy by default, even in perl, but you can turn that around by using the ? modifier. I didn't say sed cannot do things like this; I just said perl makes this kind of task easier.

    Also, the domain name and protocol in the sed example are hard coded. If there is a link like https://site.example.com/whatever/index.html, you'd have to write another sed for that link too?

  9. #9
    Banned
    Join Date
    Jun 2009
    Posts
    68
    I didn't notice the "general purpose". The PERL won't work if the original href is, say, "/some.jpg".
    Most of what I know about sed comes from the sed tutorials on IBM's site. Part III of that tutorial shows how a sed script makes very complex jobs easy to do. If you don't work it out yourself or get help soon, or if you just want to brush up on sed, read the tutorial anyway:
    Common threads: Sed by example, Part 1 is the first of their three-page tutorial.
    Didn't mean any offense, secondmouse, just pointing out that what is known is what is easy.
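
For the root-relative case such as "/some.jpg", a second sed expression can sit next to the one for full URLs. This is a sketch under stated assumptions, not a tested general solution: it assumes double-quoted lowercase href attributes and a saved copy that mirrors the site's directory layout next to index.html. The function name localize_links is made up for illustration.

```shell
# Hypothetical helper combining two rewrites:
#   1. href="scheme://host/path"  ->  href="./path"
#   2. href="/path"               ->  href="./path"
# The [^/] in the second expression skips protocol-relative links ("//host/...").
localize_links() {
    sed -e 's|href="[a-zA-Z][a-zA-Z]*://[^/"]*/|href="./|g' \
        -e 's|href="/\([^/]\)|href="./\1|g'
}

printf '%s\n' '<a href="/some.jpg">pic</a> <a href="https://site.example.com/whatever/index.html">w</a>' \
    | localize_links
# prints: <a href="./some.jpg">pic</a> <a href="./whatever/index.html">w</a>
```

To rewrite a saved page, the same function can read from the file: localize_links < index.html > index.local.html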

  10. #10
    Linux Newbie egan
    Join Date
    Feb 2009
    Location
    Mountain View, CA
    Posts
    132
    Thanks for all the help and advice guys.

    I will go and read the IBM sed information. The only stuff I know comes from the man pages, which are a bit cryptic for things such as sed.
