  1. #1
    Just Joined!
    Join Date
    Jan 2010
    Posts
    6

    grep / regular expression

    Hey all, first post here. Think this is right forum!

    Basically I'm trying to grep through a massive code base, looking for HTML image tags that don't have an ALT attribute. I'm pretty rubbish when it comes to using regular expressions with grep to be honest, though I have a good understanding of PCRE and a moderate ability with grep.

    So far I've been trying stuff like:

    $ grep -e '<img[(^alt)]*/>';
    $ grep -e '<img[^alt]*/>';

    But just not getting anywhere (too Perl I think really).

    Could anybody please help?

    Thanks

    Adam

  2. #2
    Just Joined! regexorcist's Avatar
    Join Date
    Jan 2010
    Location
    ~/
    Posts
    12
    You can do it with a regex, but
    Code:
    grep -v
    is really what you're looking for.

  3. #3
    Just Joined! regexorcist's Avatar
    Join Date
    Jan 2010
    Location
    ~/
    Posts
    12
    In case I was too vague with my last post...

    It's a simple matter of
    Code:
    cat <your file> | grep -e "<img" | grep -v "alt"
    This assumes that each image tag is on its own line. They usually aren't, so you will need sed to add the newlines first (the g flag splits every tag on a line, not just the first one), with a pipeline like this:

    Code:
    cat <your file> | sed -e 's/<img/\n<img/g' -e 's/<\/img>/<\/img>\n/g' | grep -e "<img" | grep -v "alt"
    That should give you what you're looking for.
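
    As an alternative sketch (not part of the pipeline above), grep's -o option prints each match on its own line, which may remove the need for the sed step. The sample line below is made up for illustration, and the pattern assumes that no tag spans several lines and that no attribute value contains a ">".

```shell
# Hypothetical one-line sample with two <img> tags on the same line.
line='<p><img src="a.png" alt="A"><img src="b.png"></p>'

# -o prints every match on its own line, so each tag is filtered separately.
# [^>]* stops at the first ">", which assumes no ">" inside attribute values.
result=$(printf '%s\n' "$line" | grep -o '<img[^>]*>' | grep -v 'alt')
printf '%s\n' "$result"
```

    To run it against a real file, replace the printf with the file name as the last argument of the first grep.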

  4. #4
    Just Joined!
    Join Date
    Jan 2010
    Posts
    6
    Great thanks, I'll give this a try Monday morning when I'm back in work.

    Think it would need to be:

    Code:
    grep -e "<img" | grep -v "alt="
    ... No? Otherwise it could match images that just have "alt" in the file name or title.

    Thanks again though,
    Adam
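
    A slightly stricter filter along the same lines could require whitespace before "alt" and allow spaces around the "=". This is only a sketch: the sample tags are invented, and a value such as title="see alt = x" would still fool it.

```shell
# Hypothetical tags, one per line, as an earlier pipeline stage might emit them.
tags='<img src="salt.png">
<img src="b.png" title="alt=none">
<img src="c.png" alt="C">
<IMG SRC="d.png" ALT = "D">'

# -E: extended regex, -i: ignore case, -v: keep only lines that do NOT match.
# The pattern needs whitespace directly before "alt", so "salt.png" and
# title="alt=none" are not mistaken for an alt attribute.
no_alt=$(printf '%s\n' "$tags" | grep -Eiv '[[:space:]]alt[[:space:]]*=')
printf '%s\n' "$no_alt"
```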

  5. #5
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Is your HTML document well-formed? Which is to say, is it a valid XHTML document?

    If so, then your best bet would actually be to use XPath. XPath basically parses your XML document and then allows you to find elements in it based on elements and their attributes.

    You and regexorcist have both pointed out the problems with using grep: you have to deal with multiple entries on a single line and you have to worry about matching the wrong "alt" (even your suggestion is theoretically not correct, as an image could be called "alt=.png", for instance). XPath works on an XML parse tree, not lines of text.

    Now, if you want to automate correction of the missing alt attributes, your solution may be good enough (my counterexample above is admittedly rare), but if you're just looking for tags that are missing the attribute, XPath may be easiest.

    In your case, the XPath query that you would want is:
    Code:
    //img[not(@alt)]
    This will select all img elements missing an alt attribute.

    To do all of this, you can use the commandline "xpath" utility. As a test that what I have told you is true:
    Code:
    bricka@joust ~ $ echo "<html><body><img src='hello.jpg' /><img src='goodbye.jpg' alt='Goodbye' /></body></html>" | xpath -e '//img[not(@alt)]'
    Found 1 nodes in stdin:
    -- NODE --
    <img src="hello.jpg" />
    The XML document that I passed in had two <img> elements, one without an alt attribute. And indeed, xpath found it!

    I hope this helps.
    DISTRO=Arch
    Registered Linux User #388732

  6. #6
    Just Joined!
    Join Date
    Jan 2010
    Posts
    6
    The coding is generally XHTML compliant, though there will be some instances I imagine where it's broken/invalid. Would this cause any problems?

    Could you recommend a good xpath tool to use?

    Thanks
    Adam

  7. #7
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    If it is not a valid XML document, that might be an issue. Whether the xpath utility will still work depends on how its XML parser behaves: you might still be able to use xpath for the valid portions of the document.

    I used the "xpath" commandline utility above, which I have used many times before. I like it because of its simplicity. However, I'm sure there are others out there: you could always load up your HTML documents in Firefox and use Firebug and Firefinder to search the document with XPath. This might work a bit better with a malformed document because it probably uses Firefox's interpretation of the page, which may fill in the holes of a malformed document.

    I'm sure that you can find other utilities as well, but these are just two quick suggestions.
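
    One more possibility for markup that is not well-formed, offered as an untested-in-this-thread sketch: libxml2's xmllint has an --html mode that uses a forgiving HTML parser, and an --xpath option for queries. This assumes a reasonably recent xmllint is installed; the sample markup is made up, and the exact output format may differ between versions.

```shell
# Hypothetical sample: unclosed <p> and an unquoted attribute, so not valid XML.
sample='<html><body><p><img src=hello.jpg><img src="goodbye.jpg" alt="Goodbye"></body></html>'

# --html selects libxml2's lenient HTML parser; --xpath prints the matching nodes.
# "-" reads the document from stdin; parser warnings on stderr are discarded.
found=$(printf '%s\n' "$sample" | xmllint --html --xpath '//img[not(@alt)]' - 2>/dev/null)
printf '%s\n' "$found"
```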

  8. #8
    Just Joined!
    Join Date
    Jan 2010
    Posts
    6
    Hmm problem being we're not talking about just 1 document, closer to about 5,000 lol.

    Would the xpath utility still be effective? I tried "man xpath" but it wasn't found on the server. Do you have a link I could download it from?

    Thanks again.
    Adam

  9. #9
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Yeah, this is why a commandline utility is probably better. Much easier to automate.

    I believe that the xpath utility I mentioned is a part of the Perl XML::XPath bundle, but I just found another option called xmlstarlet:
    XMLStarlet Command Line XML Toolkit: Overview

    Code:
    alex@danu:~$ echo "<html><body><img src='hello.jpg' /><img src='goodbye.jpg' alt='Goodbye' /></body></html>" | xmlstarlet sel -t -c '//img[not(@alt)]'
    <img src="hello.jpg"/>
    This should be fairly simple to automate.
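
    A rough sketch of that automation, assuming xmlstarlet is installed and file names contain no newlines. The function name report_missing_alt is hypothetical. Real XHTML files usually declare the XHTML namespace, in which case a plain //img matches nothing, so the query below checks both the un-namespaced and the namespaced form.

```shell
# Hypothetical helper: print "file: N" for every file under directory $1
# that contains N <img> elements without an alt attribute.
report_missing_alt() {
  find "$1" -type f \( -name '*.html' -o -name '*.xhtml' \) |
  while IFS= read -r f; do
    # -N binds the prefix x to the XHTML namespace; count() returns a number.
    n=$(xmlstarlet sel -N x=http://www.w3.org/1999/xhtml \
          -t -v 'count(//img[not(@alt)] | //x:img[not(@alt)])' "$f" 2>/dev/null)
    case $n in
      ''|*[!0-9]*) echo "skipped (not well-formed XML?): $f" >&2 ;;
      0) ;;                                  # every image has an alt attribute
      *) printf '%s: %s\n' "$f" "$n" ;;
    esac
  done
}

# Usage (the directory is a placeholder): report_missing_alt /path/to/codebase
```

    Files that the XML parser rejects are reported on stderr rather than silently ignored, so they can be checked by other means.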
