  1. #1
    Linux Guru jmadero's Avatar
    Join Date
    Jul 2007
    Location
    California
    Posts
    2,003

    Script to Extract from Website


    Hi All,

    This is a bit off topic but I'm trying to create an essential reading list for myself and have come across this site:

    Lists of Bests


    What I am hoping to do is make a compilation using books that others have in their list, such as this one:

    jsherry's "Recommended Reading List for the Well Educated Adult" on Lists of Bests


    I've now entered 200 book titles, including author and whatnot, and it's taking too long. So now the question is, can I make an easy script to extract the book titles (and maybe even the links to Amazon that they have) from certain lists? I'm not even sure if this is legal (if it's not, I apologize in advance and don't need to know how to do it).


    Thanks in advance
    Bodhi 1.3 & Bodhi 1.4 using E17
    Dell Studio 17, Intel Graphics card, 4 gigs of RAM, E17

    "The beauty in life can only be found by moving past the materialism which defines human nature and into the higher realm of thought and knowledge"

  2. #2
    Trusted Penguin
    Join Date
    May 2011
    Posts
    4,353
    You mean you want to grab the titles from the page on Amazon? That's perfectly legal. Use wget, something like:
    Code:
     wget -O books-page1.html 'http://www.listsofbests.com/list/2366-recommended-reading-list-for-the-well-educated-adult?name=comment_page&page=1'
    Then parse the output file, books-page1.html. You can use html2text.py (google for it) to convert the HTML to plain text and then grep for the titles. A program called "tidy" might help you clean up the HTML, too. If you want to grab the links as well, you'd have to parse the HTML itself, not a plain-text version of it.
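
    For the links, here is a rough sketch of what parsing the HTML itself could look like, using only Python's standard library. It is not tested against the real page: the assumption that each book's link has "amazon." somewhere in its href is a guess about the markup, and the names AmazonLinkParser and extract_amazon_links are made up for this example.

```python
import sys
from html.parser import HTMLParser


class AmazonLinkParser(HTMLParser):
    """Collect (link text, href) pairs for anchors that point at Amazon."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the Amazon anchor we are inside, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Assumption: book links contain "amazon." somewhere in the URL.
            if "amazon." in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and newlines in the link text.
            title = " ".join("".join(self._text).split())
            self.links.append((title, self._href))
            self._href = None


def extract_amazon_links(html):
    """Return a list of (title, href) tuples found in the HTML string."""
    parser = AmazonLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Each command-line argument is a saved page, e.g. books-page1.html.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            for title, href in extract_amazon_links(f.read()):
                print(f"{title}\t{href}")
```

    Saved as, say, extract_links.py, it would be run as "python extract_links.py books-page1.html" and print one tab-separated title and link per line. If the site wraps the title in something other than the Amazon link, the check in handle_starttag is the part to adjust.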
