  1. #1
    Just Joined!
    Join Date
    Sep 2011
    Posts
    4

    Filtering out lines based on substrings


    Hello all,

    Here's what I'm trying to do:

    1.Gather several 'hosts' files
    2. Merge them.
    3. Find and delete duplicates.
    4. Optimize the result.

    I found out how to do 1-3 with 'sort'. Step 4 is what I'm trying to figure out next.
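
    For reference, steps 1-3 can also be sketched in Python. This is only a rough equivalent of merging with 'sort' and dropping duplicates. The function name merge_hosts and the sample strings are made up for illustration, and it assumes each entry is a single "IP domain" line.

```python
def merge_hosts(texts):
    """Merge the contents of several hosts files into sorted, unique entries."""
    unique = set()
    for text in texts:
        for line in text.splitlines():
            # Collapse runs of whitespace so differently spaced copies compare equal.
            fields = line.split()
            # Skip blank lines and comment lines.
            if not fields or fields[0].startswith("#"):
                continue
            unique.add(" ".join(fields))
    return sorted(unique)


# Hypothetical file contents, standing in for the gathered hosts files.
file_a = "127.0.0.1 example.com\n127.0.0.1 other.com\n"
file_b = "# comment\n127.0.0.1   other.com\n127.0.0.1 other2.com\n"

for entry in merge_hosts([file_a, file_b]):
    print(entry)
```

    The set removes duplicates and sorted() gives a stable order.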

    Let's say I have this file:

    127.0.0.1 localhost
    127.0.0.1 sub.example.com
    127.0.0.1 example.com
    127.0.0.1 other.com
    127.0.0.1 other2.com
    127.0.0.1 some.sub.other2.com

    I would like to remove sub.example.com, as it is made redundant by example.com, and some.sub.other2.com, which is made redundant by other2.com.

    I think it can be done by:

    For every line CUR_LINE in file
    - Take part of CUR_LINE that comes after "127.0.0.1 " -> CUR_PART //after space too!
    - For every line COMP_LINE in same file
    -- Take part of COMP_LINE that comes after "127.0.0.1 " -> COMP_PART
    -- If COMP_PART is a substring of CUR_PART that comes at the end of CUR_PART and Length(COMP_PART) is smaller than Length(CUR_PART)
    --- Delete (CUR_LINE)

    Do excuse my mangled up pseudocode.
    What I'm trying to show here is that the actual domain part of every line (without the preceding "127.0.0.1 ") is compared to the domain part of every other line. If one domain appears at the very end of another, longer domain, the line with the longer domain is deleted.
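
    A minimal sketch of that comparison in Python, run on sample entries like those above. The function name prune_subdomains is made up for illustration. It differs from the pseudocode in two ways, both of which are assumptions: it only matches at a dot boundary, so notexample.com is not treated as covered by example.com, and it looks each parent domain up in a set instead of comparing every line against every other line.

```python
def prune_subdomains(lines):
    """Drop hosts entries whose domain is a subdomain of another listed domain."""
    # Parse "IP domain" pairs, skipping blank lines and comments.
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        entries.append((parts[0], parts[1].lower()))

    domains = {domain for _, domain in entries}

    def has_listed_parent(domain):
        # Test every proper suffix that starts at a label boundary, e.g. for
        # "some.sub.other2.com": "sub.other2.com", "other2.com", "com".
        labels = domain.split(".")
        return any(".".join(labels[i:]) in domains for i in range(1, len(labels)))

    return [f"{ip} {domain}" for ip, domain in entries if not has_listed_parent(domain)]


hosts = """\
127.0.0.1 localhost
127.0.0.1 sub.example.com
127.0.0.1 example.com
127.0.0.1 other.com
127.0.0.1 other2.com
127.0.0.1 some.sub.other2.com
"""

for line in prune_subdomains(hosts.splitlines()):
    print(line)
```

    On this input the sub.example.com and some.sub.other2.com lines are dropped and the other four are kept in their original order.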

    The problem is that being the newbie that I am, I have no idea how to achieve this, and frankly, I'm not even sure which search keywords to use.

    What commands would be useful here? How do I do this?

    Thanks.

  2. #2
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,444
    My first impulse was to (again) suggest a DNS server, which might still be valid, as it offers a central point of configuration.

    But what is the deal with having all those hosts point to localhost?
    You must always face the curtain with a bow.

  3. #3
    Just Joined!
    Join Date
    Sep 2011
    Posts
    4
    A hosts file pointing to 127.0.0.1 or 0.0.0.0 is an anti ad/tracking/malware/etc. measure. There are several resources with such files ready to go. I wanted to include some links, but the forum wouldn't let me due to my low post count. A search for the keywords "hosts block ads" brings up relevant results, for those who may be interested.

    I want to create such a file to be used on Android phones. I'm looking to combine some selected files from the available sites with one particular file that specifically includes common mobile tracking domains. I would also like to optimize it in the way I described above, just in case it makes a difference for the weaker mobile CPUs, and because it just makes sense (that is, I like things like that to be neat and tidy).

    This also means that using a DNS server is not an option.

    Edit:
    Based on this thread (note the missing dots) www linuxforums org/forum/networking/182629-does-dns-resolving-subdomains-go-through-main-domain.html this whole optimization idea may not work at all, but if someone wants to give me some ideas about how it could be implemented, I'm still interested just for the learning.
    Last edited by textscript; 09-15-2011 at 02:38 AM.
