  1. #1
    Linux User
    Join Date
    Mar 2008
    Posts
    287

    Can't extract web page


    I am trying to extract a web page via Google for processing.
    I am able to create a proper query and test it using cut/paste into the address bar of my firefox browser.
    When I attempt to extract the page with wget:
    wget -O - -q "$query"
    I do not see the information that is present when I used the browser.
    Would someone explain why to me?

  2. #2
    Linux Engineer GNU-Fan's Avatar
    Join Date
    Mar 2008
    Posts
    935
    When a browser requests pages via HTTP, it sends some information identifying itself, such as the name of the browser, its version, and its language settings.
    For wget, it looks like this:
    Code:
    User-Agent: Wget/1.12 (linux-gnu)
    If the website provider does not want to answer calls from wget, it can deny access based on that identification.

    But wget can disguise itself as another browser. Use "-U $AGENTSTRING", where the string is taken from the list of user agent strings at UserAgentString.com.
    Debian GNU/Linux -- You know you want it.
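
The "-U" suggestion above can be sketched as a small shell helper. This is only an illustration: the agent string below is a placeholder in the usual Firefox format, not one taken from the thread, and `fetch_as_browser` is a hypothetical name. Substitute a string copied from a real browser.

```shell
# Placeholder user-agent string (an assumption, not from the thread);
# replace it with one copied from a real browser.
AGENTSTRING="Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

# fetch_as_browser URL
# Hypothetical helper: writes the page at URL to stdout while
# identifying as $AGENTSTRING instead of the default "Wget/<version>".
fetch_as_browser() {
    wget -q -O - -U "$AGENTSTRING" "$1"
}
```

It would be called as `fetch_as_browser "$query" > inf.pag`. A site may still treat the request differently from a real browser for other reasons, so this only rules out the User-Agent check.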

  3. #3
    Linux User
    Join Date
    Mar 2008
    Posts
    287

    Perhaps I need to be more specific. I wrote a script to capture Naval Observatory time and used that to set my computer clock. There I used:
    wget What time is it? --output-document=/tmp/timFil.tmp
    I was able to relate the timFil.tmp information to what one gets just using a browser, i.e. readable.
    Now I am trying to grab the web page output that I would get by entering e.g. 'congo' as the keyword in a search string. I do this with a query that is:
    query=http://www.google.com/#hl=en&q=congo and
    wget -l 1 -O inf.pag -q "$query"
    but inf.pag contains neither the search term from the "search bar" nor the "About xx,xxx,xxx results (0.xx seconds)" line that I see with the browser, although it does contain functions regarding time. This looks to be the same as the page source I got via the Firefox browser when I entered the query directly into the search bar (I have not verified it, though).
    If I use the query for the Naval Observatory, I get a comprehensible source page which I can manipulate.
    Is the difference in content due to what follows the last "/" in both queries? If so, how can I go about capturing the search time?
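
One likely explanation for the question above is the "#" in the query. Everything from "#" onward is a URL fragment, which HTTP clients keep to themselves and never send to the server; a browser can hand it to the page's JavaScript, but wget does not run JavaScript. The sketch below shows what wget would actually request for that query, and one assumed rewrite that moves the same parameters into a real query string on Google's /search path. The variable names are illustrative only.

```shell
query='http://www.google.com/#hl=en&q=congo'

# Strip the fragment: this is the part an HTTP client actually requests.
requested="${query%%#*}"
echo "$requested"     # prints http://www.google.com/

# Assumed rewrite: reuse the fragment's parameters as a query string.
params="${query#*#}"
search_url="http://www.google.com/search?${params}"
echo "$search_url"    # prints http://www.google.com/search?hl=en&q=congo
```

Passing "$search_url" to wget in place of "$query" should at least make the request carry the search term. Whether Google answers a non-browser client with the same page is a separate matter, which is where the user-agent point from the earlier reply comes in.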
