Filtering out lines based on substrings
Here's what I'm trying to do:
1. Gather several 'hosts' files.
2. Merge them.
3. Find and delete duplicates.
4. Optimize the result.
I found out how to do 1-3 with 'sort'. Step 4 is what I'm trying to figure out next.
Let's say I have this file:
I would like to remove sub.example.com, as it is made redundant by example.com, and some.sub.other2.com, because of other2.com.
I think it can be done by:
For every line CUR_LINE in file
- Take part of CUR_LINE that comes after "127.0.0.1 " -> CUR_PART //after space too!
- For every line COMP_LINE in same file
-- Take part of COMP_LINE that comes after "127.0.0.1 " -> COMP_PART
-- If COMP_PART is a substring of CUR_PART that comes at the end of CUR_PART and Length(COMP_PART) is smaller than Length(CUR_PART)
--- Delete (CUR_LINE)
Do excuse my mangled up pseudocode.
What I'm trying to show here is that the actual domain part of every line (without the preceding "127.0.0.1 ") is compared to the actual domain part of every other line. If one is a substring of another, but only at the end and of smaller length, the longer one is deleted.
The problem is that being the newbie that I am, I have no idea how to achieve this, and frankly, I'm not even sure which search keywords to use.
What commands would be useful here? How do I do this?
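For reference, here is a minimal Python sketch of the comparison described above. It is only one possible reading of the pseudocode, not a definitive solution. The function name drop_redundant and the sample list are made up for illustration (the sample is not the original file), and it assumes every host entry starts with exactly "127.0.0.1 ".

```python
PREFIX = "127.0.0.1 "  # address plus the trailing space, as in the pseudocode


def drop_redundant(lines):
    """Keep only lines whose domain does not end with a shorter listed domain."""
    # Collect the domain part of every host entry, skipping empty ones.
    domains = {
        line[len(PREFIX):].strip()
        for line in lines
        if line.startswith(PREFIX) and line[len(PREFIX):].strip()
    }
    kept = []
    for line in lines:
        if not line.startswith(PREFIX):
            kept.append(line)  # not a host entry (assumption: leave it alone)
            continue
        cur = line[len(PREFIX):].strip()
        # Redundant if some strictly shorter domain is a suffix of this one.
        redundant = any(
            len(other) < len(cur) and cur.endswith(other) for other in domains
        )
        if not redundant:
            kept.append(line)
    return kept


# Hypothetical sample built from the domain names mentioned in the question.
sample = [
    "127.0.0.1 example.com",
    "127.0.0.1 sub.example.com",
    "127.0.0.1 other2.com",
    "127.0.0.1 some.sub.other2.com",
]

for line in drop_redundant(sample):
    print(line)
# prints:
# 127.0.0.1 example.com
# 127.0.0.1 other2.com
```

Note that this follows the plain "substring at the end" rule literally, so a name like notexample.com would also be dropped because of example.com. Checking for "." + other instead of other alone would restrict the match to real subdomains.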