- 07-04-2009 #1
Complex Substitution - Stuck
I'm trying to write a shell script that will prepare website resources for offline use. Even once you have saved the entire web page, the links are broken because they are given as URLs rather than relative paths.
Basically I want to be able to scan index.html for <a href="*"*</a> and then replace each link with a file path. For example, replacing:
<a href="http://www.example.org/stuff/fun.html"Fun Stuff</a>
with
<a href="./stuff/fun.html"Fun Stuff</a>
I have limited knowledge of perl and sed, and no knowledge of awk. Any advice would be greatly appreciated. I think the hardest part for me is passing the discovered file name to the text processor to be substituted.
- 07-05-2009 #2 Linux User
- Join Date
- May 2008
- Location
- NYC, moved from KS & MO
- Posts
- 251
I guess you missed a > in
<a href="http://www.example.org/stuff/fun.html"Fun Stuff</a>
[ right before Fun ]
Anyway try the code below to see if it works:
Code:
perl -p0e 's|<a href="(http://.*?\/)(.*")|<a href="./$2|g' index.html
- 07-05-2009 #3 Just Joined!
- Join Date
- Apr 2009
- Posts
- 33
original: <a href="http://www.example.org/stuff/fun.html">Fun Stuff</a>
sed 's?\(href="\)http://www\.example\.org\([^>]*\)?\1.\2?g'
result: <a href="./stuff/fun.html">Fun Stuff</a>
actually you don't need the leading dot '.':
sed 's?\(href="\)http://www\.example\.org/\([^>]*\)?\1\2?g'
result: <a href="stuff/fun.html">Fun Stuff</a>
would work perfectly, i.e. replacing
<a href="http://www.example.org/stuff/fun.html">Fun Stuff</a>
with
<a href="stuff/fun.html">Fun Stuff</a>
I assume that it's in lowercase but to be on the safe side, if you want
to cover cases where you have
HREF=... or Href=... etc
you'd better replace this
sed 's?\(href="\)http://www\.example\.org/\([^>]*\)?\1\2?g'
by that
sed 's?[hH][rR][eE][fF]="http://www\.example\.org/\([^>]*\)?href="\1?g'
Last edited by vonbiber; 07-05-2009 at 02:37 PM. Reason: typo
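A quick way to check a substitution like this is to pipe one sample line through sed before running it on a real file. This is only a minimal sketch: the sample line and the variable names `sample` and `result` are made up for illustration, and the domain is still hard-coded as in the commands above.

```shell
#!/bin/sh
# Sample input with a mixed-case attribute name (made up for this check).
sample='<a HREF="http://www.example.org/stuff/fun.html">Fun Stuff</a>'

# Match "href" in any letter case, drop the scheme and host, keep the path.
# '?' is the s-command delimiter, so the slashes need no escaping.
# \( ... \) captures the path; \1 puts it back in the replacement.
result=$(printf '%s\n' "$sample" |
  sed 's?[hH][rR][eE][fF]="http://www\.example\.org/\([^>]*\)?href="\1?g')

printf '%s\n' "$result"
```

With the sample line above this prints `<a href="stuff/fun.html">Fun Stuff</a>`. The capture group is what carries the "discovered" file name from the match into the replacement.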
- 07-05-2009 #4
The missing > was just a typo...
Looks like you guys gave me quite a lot to chew on. I'll see if I can decipher that...
Thanks a bunch.
- 07-05-2009 #5
I'm sorry; upon looking at this I see that I may not have explained adequately. I am trying to get a general purpose script, where the domains and protocols of the links themselves are not known.
It looks like the perl script does this though and I ought to be able to hack out a regexp for the sed, it being my preference.
- 07-05-2009 #6 Linux User
Quote: "where the domains and protocols of the links themselves are not known."
Simply modifying the original perl one-liner into
Code:
perl -p0e 's|<a href="(.*://.*?/)(.*")|<a href="./$2|g' index.html
should solve the problem.
Quote: "hack out a regexp for the sed, it being my preference."
I was trying to use sed initially, but it turned out it's quite hard to make sed non-greedy. As you can see, for example, the regexp
.*://.*/ in sed
will try to match
http://www.blahblahexample.com/stuff/
instead of
http://www.blahblahexample.com/
That's why I turned to perl instead. But if you have to use sed, you might try ssed, which is said to support non-greedy matching.
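Plain sed has no non-greedy quantifier, but a negated character class often gets the same effect: `[^/"]*` cannot run past the first slash, so it stops at the end of the host name. The following is a sketch only, assuming absolute links of the form scheme://host/path inside double-quoted href attributes; the function name `localize_links` is made up for this example.

```shell
#!/bin/sh
# localize_links: hypothetical helper name; reads HTML on stdin.
# [a-zA-Z][a-zA-Z0-9+.-]* matches a URL scheme such as http, https or ftp.
# [^/"]* matches the host and stops at the first '/', which is the job
# the non-greedy .*? does in the perl one-liner.
localize_links() {
  sed 's#\(href="\)[a-zA-Z][a-zA-Z0-9+.-]*://[^/"]*/#\1./#g'
}

# Two links on one line, with different schemes and hosts.
printf '%s\n' \
  '<a href="http://www.example.org/stuff/fun.html">Fun</a> <a href="https://site.example.com/whatever/index.html">Other</a>' |
  localize_links
```

Because neither the scheme nor the host is spelled out, both links on the sample line are rewritten, to `./stuff/fun.html` and `./whatever/index.html`. To run it on a file, something like `localize_links < index.html > index.local.html` should do (the output file name is arbitrary).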
- 07-06-2009 #7 Banned
- Join Date
- Jun 2009
- Posts
- 68
It's easy to do what you're talking about using sed, secondmouse. I'd have to brush up on sed, but it's easy once you know the details. PERL is also greedy. I use PERL and read a lot of tutorials/documentation on it. It's greedy, greedy, greedy. If you know how to do things, you know how to do things. Same with sed. No offense.
- 07-06-2009 #8 Linux User
@nopycckn, it's true that regex matching is greedy by default, even in perl, but you can turn that around by putting the ? modifier after a quantifier. I didn't say sed cannot do things like this, I just said perl makes this kind of task easier.
Also, the domain name and protocol in your sed example are hard-coded. If there is a link like https://site.example.com/whatever/index.html, would you have to write another sed command for that link too?
- 07-06-2009 #9 Banned
I didn't notice the "general purpose". The PERL won't work if the original href is, say, "/some.jpg".
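A root-relative link such as "/some.jpg" has no scheme or host, so any pattern that requires `://` never matches it. One possible extension is a second sed expression that prefixes such paths with a dot. This is a sketch under the assumption that the saved page sits at the top of the mirrored tree; the function name `fix_links` is made up.

```shell
#!/bin/sh
# fix_links: hypothetical helper name; reads HTML on stdin.
# Expression 1 strips scheme://host/ from absolute links.
# Expression 2 turns a root-relative path ("/x") into "./x"; the [^/]
# guard leaves protocol-relative links ("//host/x") untouched.
fix_links() {
  sed -e 's#\(href="\)[a-zA-Z][a-zA-Z0-9+.-]*://[^/"]*/#\1./#g' \
      -e 's#\(href="\)/\([^/]\)#\1./\2#g'
}

printf '%s\n' '<a href="/some.jpg">Picture</a>' | fix_links
```

For the sample line this prints `<a href="./some.jpg">Picture</a>`. Links that are already relative, such as `stuff/fun.html`, match neither expression and pass through unchanged.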
The sed tutorials on IBM's site taught me most of what I know about sed. If you look at part III of the tutorial, you'll see that using a sed script makes very complex jobs very easy to do. In case you don't work it out yourself or get help soon, or if you just want to brush up on sed or read the tutorial anyway:
Common threads: Sed by example, Part 1 is the first of their three-page tutorial.
Didn't mean any offense, secondmouse, just pointing out that what is known is what is easy.
- 07-07-2009 #10
Thanks for all the help and advice guys.
I will go and read the IBM sed information. The only stuff I know comes from the man pages, which are a bit cryptic for things such as sed.


