  1. #1
    Just Joined!
    Join Date
    Jul 2007
    Posts
    59

    Extracting website headlines

    Need help here. Which command can I use to extract the headlines from a website using the Linux Konsole?

  2. #2
    Linux Guru smolloy
    Join Date
    Apr 2005
    Location
    CA, but from N.Ireland
    Posts
    2,413
    Not sure I fully understand you, but maybe wget is the command you're looking for?

    Code:
    wget http://news.bbc.co.uk
    Registered Linux user #388328 || Registered LFS user #15880
    AMD 64 X2 4600+ :: 2X1GB DDR2 800 :: GeForce 9400 GT 512MB :: ASUS M2N32 Deluxe :: 4X250GB SATAII
    Need instant help? Try us on IRC -- #linuxforums on freenode

  3. #3
    Just Joined!
    Join Date
    Jul 2007
    Posts
    59
    Quote Originally Posted by smolloy
    Not sure I fully understand you, but maybe wget is the command you're looking for?

    Code:
    wget http://news.bbc.co.uk


    Actually, I have already fetched the website with wget and grepped out the headlines I want, but besides the headlines there is still other text such as <a href........>. I want to filter out this <a href> markup. What command can I use?

  4. #4
    Linux Guru smolloy
    Join Date
    Apr 2005
    Location
    CA, but from N.Ireland
    Posts
    2,413
    Quote Originally Posted by jeffrey_seeNJ
    Actually, I have already fetched the website with wget and grepped out the headlines I want, but besides the headlines there is still other text such as <a href........>. I want to filter out this <a href> markup. What command can I use?
    Google is your friend

    I found this solution
    Code:
    smolloy@sabayonx86-64 ~ $ cat testfile.txt
    <a href>blah blah</a><br>some more text<a href>more text</a>
    smolloy@sabayonx86-64 ~ $ sed 's/<[^<>]*>//g' testfile.txt
    blah blahsome more textmore text
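
    For the headline case, the same substitution can be chained after grep. This is only a rough sketch: the page content and the grep pattern below are made-up stand-ins, not taken from the real site (by default wget would save the page from the URL above as index.html).

```shell
# Hypothetical page content standing in for the index.html that wget saves.
page=$(printf '%s\n' \
  '<a href="/story1">First headline</a><br>' \
  '<p>Unrelated text</p>' \
  '<a href="/story2">Second headline</a>')

# Keep only the lines that contain a link, then delete every <...> tag.
headlines=$(printf '%s\n' "$page" | grep '<a href' | sed 's/<[^<>]*>//g')
printf '%s\n' "$headlines"
# prints:
# First headline
# Second headline
```

    On a real page the grep pattern would have to be adapted to whatever markup surrounds the headlines.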

  5. #5
    Just Joined!
    Join Date
    Jul 2007
    Posts
    59
    I got it. Thanks a lot =)

  6. #6
    Just Joined!
    Join Date
    Jul 2007
    Posts
    59
    Quote Originally Posted by smolloy
    Google is your friend

    I found this solution
    Code:
    smolloy@sabayonx86-64 ~ $ cat testfile.txt
    <a href>blah blah</a><br>some more text<a href>more text</a>
    smolloy@sabayonx86-64 ~ $ sed 's/<[^<>]*>//g' testfile.txt
    blah blahsome more textmore text


    Yes, thanks a lot. By the way, can you explain what [^<>]* does?

  7. #7
    Trusted Penguin Cabhan
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    As you may or may not know, sed often uses regular expressions to do its work. A regular expression is a way to match a pattern, as opposed to a block of literal text. For instance, if I want to match "Fred and Barney" and "Frank and Bob", I might use the regular expression "F[a-z]+ and B[a-z]+", which means 'F', followed by 1 or more lowercase letters, followed by ' and ', followed by 'B', followed by 1 or more lowercase letters.
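
    The Fred and Barney pattern can be tried out with grep. A small sketch, assuming made-up sample lines; the -E option is used because '+' is an extended regular expression operator in grep.

```shell
# Sample lines; only the first two fit "F<lowercase letters> and B<lowercase letters>".
matches=$(printf '%s\n' 'Fred and Barney' 'Frank and Bob' 'Fred or Barney' 'F and B' \
  | grep -E 'F[a-z]+ and B[a-z]+')
printf '%s\n' "$matches"
# prints:
# Fred and Barney
# Frank and Bob
```

    "F and B" is rejected because '+' demands at least one lowercase letter after each capital.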

    In a regular expression, anything surrounded by [...] is called a character class. A character class is only one character long in the match, and defines what may be matched in that space. In this particular case, the character class begins with a '^', which means "This character class contains everything EXCEPT what I list". So this character class means "Everything except < and >".

    The '*' is code for 0 or more of the preceding.

    So in this case, the expression
    Code:
    s/<[^<>]*>//g
    means "Find every instance of a <...> surrounding characters that are not '<' or '>', and replace it with nothing."
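
    To see the individual pieces at work, here is a short sketch with an invented test line. It compares the substitution with and without the trailing g (which makes sed replace every match on a line rather than only the first), and shows that '*' also allows zero characters between the brackets.

```shell
line='<b>bold</b> and <i>italic</i>'

# Without g: only the first tag on the line is removed.
first_only=$(printf '%s\n' "$line" | sed 's/<[^<>]*>//')

# With g: every tag on the line is removed.
all_tags=$(printf '%s\n' "$line" | sed 's/<[^<>]*>//g')

# '*' means zero or more, so an empty <> is matched as well.
empty_tag=$(printf '%s\n' 'a<>b' | sed 's/<[^<>]*>//g')

printf '%s\n' "$first_only" "$all_tags" "$empty_tag"
# prints:
# bold</b> and <i>italic</i>
# bold and italic
# ab
```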
    DISTRO=Arch
    Registered Linux User #388732

  8. #8
    Linux Enthusiast
    Join Date
    Aug 2006
    Posts
    631
    Quote Originally Posted by Cabhan
    So in this case, the expression
    Code:
    s/<[^<>]*>//g
    means "Find every instance of a <...> surrounding characters that are not '<' or '>', and replace them with nothing."
    I don't think the ">" is necessary in the class [^<>], so this should also do the job:

    Code:
    s/<[^<]*>//g
    Regards
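
    A quick way to compare the two classes is to run both on the same input. This is a sketch with invented test strings: on ordinary markup the results agree, but they can differ when the text itself contains a bare '>', because [^<] is allowed to run past it.

```shell
typical='<a href>blah blah</a><br>some more text'
stray='<b>1 > 0</b>'   # text with a bare '>' in it

strict_typical=$(printf '%s\n' "$typical" | sed 's/<[^<>]*>//g')
loose_typical=$(printf '%s\n' "$typical" | sed 's/<[^<]*>//g')

strict_stray=$(printf '%s\n' "$stray" | sed 's/<[^<>]*>//g')
loose_stray=$(printf '%s\n' "$stray" | sed 's/<[^<]*>//g')

printf '[%s]\n' "$strict_typical" "$loose_typical" "$strict_stray" "$loose_stray"
# prints:
# [blah blahsome more text]
# [blah blahsome more text]
# [1 > 0]
# [ 0]
```

    In the last case the shorter class swallows "<b>1 >" as a single match, so part of the text is lost.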

  9. #9
    Just Joined!
    Join Date
    Jul 2007
    Posts
    59
    Quote Originally Posted by smolloy
    Google is your friend

    I found this solution
    Code:
    smolloy@sabayonx86-64 ~ $ cat testfile.txt
    <a href>blah blah</a><br>some more text<a href>more text</a>
    smolloy@sabayonx86-64 ~ $ sed 's/<[^<>]*>//g' testfile.txt
    blah blahsome more textmore text


    I tried putting the asterisk in front, but how come it doesn't work at all? sed 's/<*[^<>]>//g'

  10. #10
    Trusted Penguin Cabhan
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    As I mentioned, '*' means 0 or more of the preceding. So the original regular expression said to match <...>, with anything (other than '<' or '>') in between.

    By moving the '*', your new regular expression says "Match 0 or more '<'s, followed by a single character that is not a '>' or '<', followed by a '>'".

    The following match your new expression:
    Code:
    h> # 0 '<'s at the beginning
    <h>
    <<h>
    <<<<<<<<<<<<<<<<h>
    etc.
    This is not matched as a whole:
    Code:
    <hello> # only a single character is allowed between the '<'s and the '>', so sed would remove just the trailing 'o>'
    Does that make sense?
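
    Both expressions can be checked side by side. A small sketch; the helper names strip_orig and strip_moved are invented for this example.

```shell
# Original expression: '*' applies to the character class.
strip_orig()  { printf '%s\n' "$1" | sed 's/<[^<>]*>//g'; }
# Modified expression: '*' applies to the leading '<'.
strip_moved() { printf '%s\n' "$1" | sed 's/<*[^<>]>//g'; }

for s in 'h>' '<h>' '<<h>' '<hello>'; do
  printf '%-8s orig=[%s] moved=[%s]\n' "$s" "$(strip_orig "$s")" "$(strip_moved "$s")"
done
# prints:
# h>       orig=[h>] moved=[]
# <h>      orig=[] moved=[]
# <<h>     orig=[<] moved=[]
# <hello>  orig=[] moved=[<hell]
```

    The last line shows why the moved asterisk seems not to work on real tags: with more than one character inside, only the final character and the '>' are removed.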
