  1. #1 Linux Guru (Join Date: May 2004, Location: forums.gentoo.org, Posts: 1,817)

    Is wget blocked?


    I tried to download a multipage html document but all I got was the first page and "robots.txt":
    Code:
    User-agent: *
    Disallow: /email/
    Disallow: /cgi-bin/james
    Disallow: /misc/
    Disallow: /stuff/
    Disallow: /test/
    Disallow: /library/images/
    Disallow: /library/js/
    Disallow: /library/css/
    Disallow: /yourselections/
    Disallow: /links/
    Disallow: /estore/
    Disallow: /site/
    Disallow: /comments/
    Disallow: /general-comments/
    Disallow: /register/
    Disallow: /admin/
    Disallow: /oradoc/
    Disallow: /wp/display/117/
    
    User-agent: WebReaper
    User-agent: Anawave
    User-agent: EmailCollector
    User-agent: EmailSiphon
    User-agent: ExtractorPro
    User-agent: FlashSite
    User-agent: Go-Get-It
    User-agent: Grab-a-Site
    User-agent: HotCargo
    User-agent: HttpLoader
    User-agent: MemoWeb
    User-agent: NearSite
    User-agent: NetAttache
    User-agent: Radview
    User-agent: Radview/HttpLoader
    User-agent: Second Site
    User-agent: SecondSite
    User-agent: SiteSnagger
    User-agent: SpidyBot
    User-agent: Teleport
    User-agent: Teleport Pro
    User-agent: Visual Web
    User-agent: VisualWeb
    User-agent: WBI_Client
    User-agent: WebCompass
    User-agent: WebCopy
    User-agent: WebDownloader
    User-agent: WebRetriever
    User-agent: WebSnake
    User-agent: WebVCR
    User-agent: WebWhacker
    User-agent: WebZIP
    User-agent: Wget
    Disallow: /
    I'm pretty sure that I know what that means, but is there a way around it?
    /IMHO
    //got nothin'
    ///this used to look better

  2. #2 Linux Engineer Giro (Join Date: Jul 2003, Location: England, Posts: 1,219)
    You should respect people's robots.txt. But if you really want the files, use a grabber that lets you change its UA (user agent) string, since the block here is keyed to wget's user agent. To change the UA of wget, use the command below.

    Code:
    wget --user-agent=MyGreatBot http://target.com
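
    For a multipage document you will also want wget's recursive mode. Note that wget consults robots.txt on its own when recursing, so it has to be told to ignore it as well. A rough sketch, reusing the placeholder UA and URL from above:

    Code:
    # Recursive grab with a changed user agent. "-e robots=off" makes wget
    # ignore robots.txt; "--no-parent" keeps it from wandering up the tree;
    # "--wait=1" pauses a second between requests to go easy on the server.
    wget --user-agent=MyGreatBot -e robots=off \
         --recursive --no-parent --wait=1 \
         http://target.com/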

  3. #3 Linux Guru (Join Date: May 2004, Location: forums.gentoo.org, Posts: 1,817)
    Thanks, Giro. I don't really know the bounds of good etiquette on this sort of thing. I see that the man page for wget says this about --user-agent:
    Quote Originally Posted by man wget
    Use of this option is discouraged, unless you really know what you are doing.
    I don't even know what that means, except that I do know it's directed to people like me.
    /IMHO
    //got nothin'
    ///this used to look better

  4. #4 Linux Enthusiast puntmuts (Join Date: Dec 2004, Location: Republic Banana, Posts: 562)
    That means that if you use that option, you will be ripping that site against the will of the webmaster. The message tells you to think about that, and whether that is really what you want to do.
    I'm so tired .....
