  1. #1
    Just Joined!
    Join Date
    Dec 2011
    Posts
    3

    Easy way to extract HTML text from a .htm file


    Suppose I have a .htm file that contains a lot of complex HTML code. I want to run a command from the shell that says "For this file index.htm, extract all anchor tags (<a href...>) and everything inside them, up to and including the closing </a>, print them to standard output, and separate each with a newline." What utility should I use for this? Should I use sed? awk? vi?

  2. #2
    Just Joined!
    Join Date
    Mar 2007
    Location
    Bogotá, Colombia
    Posts
    43
    Well, I haven't tried this yet, but off the top of my head, I think this should work:

    Code:
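    # Break the line after each closing </a> (GNU sed), keep lines with anchor markup, then drop any text before "<a "
    sed 's/<\/a>/&\n/g' index.htm | grep -E '<a |</a>' | sed 's/^.*<a /<a /'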
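
If an anchor is split across several lines, or two anchors sit on one line, a match-only approach may be more robust than splitting on the closing tag. The following is only a sketch, assuming GNU grep built with PCRE support (the -P option); the function name extract_anchors is made up for illustration and is not part of any tool. It is still a regular-expression heuristic, not a real HTML parser, so unusual markup can defeat it.

```shell
# extract_anchors FILE
# Hypothetical helper: print every <a ...>...</a> element in FILE on its own line.
extract_anchors() {
    # tr joins all lines so an anchor spanning several lines is seen as one unit.
    # grep -o prints only the matched text, one match per output line;
    # -i ignores case (<A HREF=...>); -P enables the non-greedy ".*?" so that
    # neighbouring anchors are not merged into a single match.
    tr '\n' ' ' < "$1" | grep -oiP '<a\s[^>]*>.*?</a>'
}
```

It would be called as `extract_anchors index.htm`. Because line breaks are replaced with spaces before matching, an anchor that was wrapped in the source comes out on a single line.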
