  1. #1
    Just Joined!
    Join Date
    Dec 2011
    Posts
    4

    Bash script: wget a website and save URLs containing a specific string

    Hi,
    I've spent hours trying to figure this out without any luck. I would appreciate your help.

    I need to "crawl" a website and search for a specific string in the HTML. If the string is found, the URL of the page containing it should be saved to a file.

    So the end result should be a file containing a list of URLs.

    I tried writing a bash script using wget. My knowledge of Linux is very basic. I am using Cygwin for Windows.

  2. #2
    Linux Enthusiast scathefire
    Join Date
    Jan 2010
    Location
    Western Kentucky
    Posts
    616
    KrazyWorks » Wget examples and scripts

    Number 5 there sounds close to what you are trying to do.
    linux user # 503963

  3. #3
    Just Joined!
    Join Date
    Dec 2011
    Posts
    4
    scathefire, thank you for taking the time to look into it.

    The thing is that I don't have a file of URLs to loop through. I only have the main URL of the website (e.g. example.com), and I want wget to crawl it and check every page for my search string.

  4. #4
    Linux Enthusiast scathefire
    Join Date
    Jan 2010
    Location
    Western Kentucky
    Posts
    616
    Does the --spider option not work?

    Of course, I don't think it will download the page per se. There are other software options out there, though.

  5. #5
    Just Joined!
    Join Date
    Dec 2011
    Posts
    4
    --spider only checks whether the page exists; it doesn't fetch the actual page contents. I need the page contents to search for the string in the HTML.

    My plan B is to download the whole site with wget, and then do the search on my local version:
    Code:
    wget -r -l 2 SITE_URL

    But this will not give me the list of URLs.

  6. #6
    Linux Enthusiast scathefire
    Join Date
    Jan 2010
    Location
    Western Kentucky
    Posts
    616
    Why not do something like:
    Code:
    wget -mk SITE_URL
    Then, with your local copy, filter through the files.

  7. #7
    Just Joined!
    Join Date
    Dec 2011
    Posts
    4
    Yes, this is what I ended up doing. Thank you for your help!
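
    For anyone who finds this later, here is a rough sketch of the "filter through the files" step. It assumes the mirror was made with wget -m, which by default saves pages under a directory named after the host (e.g. ./example.com/...), so a local path can be turned back into a URL by prefixing the scheme. The function name find_urls, the output file urls.txt, and the http default are my own placeholders, not something from the posts above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the URL of every mirrored page containing a string.
# Assumption: the mirror lives in a directory named after the host, which is
# wget's default layout for -m / -r (e.g. ./example.com/about/index.html).
find_urls() {
    local mirror_dir=$1        # directory created by wget, e.g. example.com
    local needle=$2            # string to look for in the page source
    local scheme=${3:-http}    # assumed scheme; pass https if the site uses it

    # -r: recurse into subdirectories
    # -l: print only the names of files that match
    # -F: treat the needle as a fixed string, not a regular expression
    grep -rlF -- "$needle" "$mirror_dir" | sed "s|^|$scheme://|"
}

# Example usage (commented out so that sourcing this file does nothing):
#   wget -m http://example.com/
#   find_urls example.com "my search string" > urls.txt
```

    Note that wget saves directory URLs as index.html, so some entries in the list will end in /index.html rather than a bare slash. Those usually still resolve, but that depends on the server.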
