- 07-06-2007 #1 Just Joined!
- Join Date
- Jul 2007
- Posts
- 59
Extracting website headlines
Need help here... Which command can I use to extract the headline from a website using the Linux Konsole?
- 07-06-2007 #2
Not sure I fully understand you, but maybe wget is the command you're looking for?
Code:wget http://news.bbc.co.uk
Registered Linux user #388328 || Registered LFS user #15880
AMD 64 X2 4600+ :: 2X1GB DDR2 800 :: GeForce 9400 GT 512MB :: ASUS M2N32 Deluxe :: 4X250GB SATAII
Need instant help? Try us on IRC -- #linuxforums on freenode
- 07-06-2007 #3 Just Joined!
- 07-06-2007 #4
- 07-09-2007 #5 Just Joined!
i got it... thx a lot =)
- 07-09-2007 #6 Just Joined!
- 07-09-2007 #7
As you may or may not know, sed often uses regular expressions to do its work. A regular expression is a way to match a pattern, as opposed to a block of literal text. For instance, if I want to match "Fred and Barney" and "Frank and Bob", I might use the regular expression "F[a-z]+ and B[a-z]+", which means 'F', followed by 1 or more lowercase letters, followed by ' and ', followed by 'B', followed by 1 or more lowercase letters.
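To try the "Fred and Barney" pattern yourself, here is a minimal sketch using grep -E (extended regular expressions, so that '+' means "1 or more"). The function name match_names and the sample lines are made up for illustration.

```shell
# match_names is a hypothetical helper: it prints only the input lines
# that contain a match for the pattern from the post.
match_names() {
  grep -E 'F[a-z]+ and B[a-z]+'
}

# 'F and B' is rejected because '+' needs at least one lowercase letter;
# 'fred and barney' is rejected because there is no capital 'F' or 'B'.
printf '%s\n' 'Fred and Barney' 'Frank and Bob' 'F and B' 'fred and barney' | match_names
# prints:
# Fred and Barney
# Frank and Bob
```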
In a regular expression, anything surrounded by [...] is called a character class. A character class is only one character long in the match, and defines what may be matched in that space. In this particular case, the character class begins with a '^', which means "This character class contains everything EXCEPT what I list". So this character class means "Everything except < and >".
The '*' is code for 0 or more of the preceding.
So in this case, the expression
Code:
s/<[^<>]*>//g
means "Find every instance of a <...> surrounding characters that are not '<' or '>', and replace them with nothing."
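A quick way to see the expression at work is to feed sed a line of HTML by hand. This is only a sketch: the helper name strip_tags and the sample markup are made up for illustration.

```shell
# strip_tags is a hypothetical helper name; the sed expression is the
# one explained above.
strip_tags() {
  sed 's/<[^<>]*>//g'
}

# Sample input invented for this example
echo '<h1>Breaking <b>news</b> today</h1>' | strip_tags
# prints: Breaking news today
```

The same helper should work on a downloaded page, for example by piping the output of wget -q -O - followed by the URL into it, although a real page will leave a lot of non-headline text behind.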
DISTRO=Arch
Registered Linux User #388732
- 07-09-2007 #8 Linux Enthusiast
- Join Date
- Aug 2006
- Posts
- 631
I don't think the ">" is necessary in the class [^<>], so this should also do the job:
Code:
s/<[^<]*>//g
Regards
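For ordinary markup the two expressions should behave the same, but they can differ when the text itself contains a literal '>'. The sketch below compares them on a made-up line; the helper names strip_strict and strip_loose are hypothetical.

```shell
# Both helper names are hypothetical; the expressions are the two
# variants discussed in the thread.
strip_strict() {
  sed 's/<[^<>]*>//g'   # class excludes both '<' and '>'
}
strip_loose() {
  sed 's/<[^<]*>//g'    # class excludes only '<'
}

# Sample line invented for this example; note the bare '>' in "1 > 0"
line='a <b>bold</b> 1 > 0 <i>x</i>'

echo "$line" | strip_strict
# prints: a bold 1 > 0 x

echo "$line" | strip_loose
# prints: a bold 0 x
```

In the second case [^<]* is allowed to run past the end of the tag, so the match starting at "</b>" extends to the later '>' and swallows " 1 " along with it.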
- 07-13-2007 #9 Just Joined!
- 07-13-2007 #10
As I mentioned, '*' means 0 or more of the preceding. So the original regular expression said to match <...>, with anything (other than '<' or '>') in between.
By moving the '*', your new regular expression says "Match 0 or more '<'s, followed by a single character that is not a '>' or '<', followed by a '>'".
The following match your new expression:
Code:
h>    # 0 '<'s at the beginning
<h>
<<h>
<<<<<<<<<<<<<<<<h>
etc.
This does not match as a whole:
Code:
<hello>    # you are only allowed to have a single character between the '<' and '>'
Does that make sense?
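The modified expression itself is not shown in the thread. Assuming it was something like <*[^<>]> (reconstructed from the description above, so treat it as a guess), the behaviour can be checked with grep -x, which only accepts lines that match in full. The helper names are hypothetical.

```shell
# ASSUMPTION: '<*[^<>]>' is inferred from the description in the post;
# the actual expression is not shown in the thread.
whole_match() {
  grep -xE '<*[^<>]>'   # -x: the whole line must match
}
strip_moved_star() {
  sed 's/<*[^<>]>//g'
}

printf '%s\n' 'h>' '<h>' '<<h>' '<hello>' | whole_match
# prints:
# h>
# <h>
# <<h>

# Inside sed the pattern is not anchored, so it still finds "o>" at the
# end of the tag and removes only that part.
echo '<hello>' | strip_moved_star
# prints: <hell
```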


