  1. #1
    Just Joined!
    Join Date
    Jan 2014
    Posts
    3

    Newbie text manipulation question


    Hi guys,
    I'm new to scripting and feeling my way through.

    I've been working on a bash script using wget to get pieces of information from the internet, but I've run across a problem.

    wget downloads the page source fine, but I'm struggling to grep the useful info out of the resulting file, because there is often more than one piece of useful data per line, i.e. I'm trying to grep particular URLs out of the source and there can be more than one URL on each line.

    My standard way of using grep would be something like

    Code:
    cat wgetdownload.txt | grep urlbasegoeshere | cut -d '/' -f 6

    But this method will miss data points because of the multiple URLs on each line.

    Is there an easy way to ask the computer to read through the file, and every time it sees a particular string, print what's directly afterwards, rather than printing every line where that string occurs?

    Thanks for your help.

  2. #2
    Just Joined!
    Join Date
    Oct 2012
    Posts
    13
    Quote Originally Posted by scattermouse_ View Post
    Is there an easy way to ask the computer to read through the file, and every time it sees a particular string, print what's directly afterwards, rather than printing every line where that string occurs?
    You do not have to use ``cat`` and pipe its output to ``grep``; instead, run ``grep`` directly on the file, for example:

    Code:
    grep <STRING> /path/to/file
    As far as I understand what you're trying to achieve, why not use a regex to get all URLs from a website, for example:

    Code:
    curl --silent <DOMAIN> | grep -Eo '(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[-A-Za-z0-9\+&@#/%=~_|]'
    You can also use ``lynx`` to extract URLs from a website, for example:

    Code:
    lynx -dump -listonly <DOMAIN>
    or you can even use the following:

    Code:
    curl --silent <DOMAIN> | \
    grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' |  \
    sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'
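    On your original problem of multiple URLs per line: the ``-o`` flag tells ``grep`` to print each match on its own line rather than the whole matching line, so none of them get lost. A rough sketch using the placeholder from your example (and assuming the URLs sit inside double quotes in the HTML):

    Code:
    grep -o 'urlbasegoeshere[^"]*' wgetdownload.txt | cut -d '/' -f 6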

  3. #3
    Just Joined!
    Join Date
    Jan 2014
    Posts
    3
    Nice tip about not needing cat, thanks.

    Maybe I should have been more specific about what I'm trying to do.

    I'm the admin of a large group on chess.com, and I'd like to see which of my members haven't played in any of the group matches.
    I can easily grab a list of members, but we've played a lot of matches so the strategy I came up with (which is probably rubbish) is:

    Go to the list of matches. This is spread over many pages, but the format is url/page=1, url/page=2, etc., so it's easy to create a list of these URLs.

    On each of these pages are links to matches. If I can wget (or use one of the other methods you suggested above) the web source and grep out the links to the matches, I can repeat the step for each of the separate match pages, grab the users there, and then sort and compare against my member list.

    It's very clumsy, but it's the best I've got.
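    Roughly, this untested sketch is the kind of thing I have in mind (the URLs, the page count and the link patterns are just guesses on my part):

    Code:
    #!/bin/bash
    # members.txt: one member name per line (I already have this)

    # Walk the paginated match list and collect the match links
    for p in $(seq 1 50); do                      # 50 pages is a guess
        wget -q -O - "http://www.chess.com/groups/team_matches?page=$p"
    done | grep -o 'href="[^"]*team_match[^"]*"' | cut -d '"' -f 2 | sort -u > matchurls.txt
    # (these may need http://www.chess.com prepended if they come out relative)

    # Visit each match page and collect the player names
    while read -r url; do
        wget -q -O - "$url" | grep -o '/members/view/[^"]*' | cut -d '/' -f 4
    done < matchurls.txt | sort -u > players.txt

    # Members who appear in members.txt but in none of the matches
    comm -23 <(sort members.txt) players.txt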

    Can I use the lynx command you suggested with a text file of urls as input? Otherwise I'd need to manually input a lot of urls over two iterations.

  4. #4
    Linux Engineer
    Join Date
    Dec 2013
    Posts
    1,048
    If the text file of URLs is HTML, lynx can parse them. It can download the pages or read them from disk. If on the web:
    Code:
    lynx -dump -listonly http://www.google.ca/
    If on disk:
    Code:
    lynx -dump -listonly /path/to/file/index.html
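    If the list of URLs is plain text rather than HTML (one URL per line), a simple loop will do; a sketch, assuming the file is called urls.txt:

    Code:
    while read -r url; do
        lynx -dump -listonly "$url"
    done < urls.txt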

  5. #5
    Just Joined!
    Join Date
    Jan 2014
    Posts
    3
    Thanks gregm, I'll give that a go this evening.
