  1. #1
    Just Joined! computer_freak_8's Avatar
    Join Date
    Sep 2008
    Location
    Des Moines, Iowa, USA
    Posts
    54

    Need someone to check Perl script, please.

    I'm getting ready to do some penetration testing of my system, and I have a wordlist I'd like to use during part of that process. There is a problem though: it has duplicates. So, I tried to write a Perl script to remove all duplicates, but I need someone to review my script before I try it out, just to be safe.
    Here's what I've got:
    Code:
    #!/usr/bin/perl -w
    
    # Try to get the wordlist file ready for use.
    
    $filepath = "wordlist.txt";
    
    sysopen (wl, +>, "$filepath") or die "$filepath cannot be opened.";
    
    while (<wl>)
     {
      $checkforthis="\n"."$_";
      while (<wl>)
       {
        s/"$chekforthis"/"\n";
       }
     }
    
    #EOF
    The file has one item per line, and I want there to be no empty lines where items are removed. It is okay if there is a blank line at the beginning and/or end of the file.

    So, will this work? Or would it crash my system if I tried running it? Please keep in mind that this is the first Perl script I have written myself, so I will probably need specifics if there is anything wrong with it (and I am sure there are many things wrong with it; I just don't know what yet).

  2. #2
    Just Joined! computer_freak_8's Avatar
    Join Date
    Sep 2008
    Location
    Des Moines, Iowa, USA
    Posts
    54

    Well, I tried it out, and made some changes, but I'm still stuck. Here's what I've got now:
    Code:
    #!/usr/bin/perl -w
    
    # Try to get the wordlist file ready for use.
    
    $filepath = "wordlist.txt";
    
    sysopen(WL, $filepath1, O_RDRW) or die "$filepath cannot be opened.";
    
    while (<WL>)
     {
      $checkforthis="\n"."$_";
      while (<WL>)
       {
        s/"$chekforthis"/"\n"/;
       }
     }
    sysclose(WL)
    #EOF
    And here's the output when I try to run it:
    Code:
    Name "main::filepath1" used only once: possible typo at ./perl-fix.pl line 7.
    Name "main::chekforthis" used only once: possible typo at ./perl-fix.pl line 14.
    Name "main::checkforthis" used only once: possible typo at ./perl-fix.pl line 11.
    Argument "O_RDRW" isn't numeric in sysopen at ./perl-fix.pl line 7.
    Use of uninitialized value in sysopen at ./perl-fix.pl line 7.
    wordlist.txt cannot be opened. at ./perl-fix.pl line 7.
    I looked here, but I still can't figure out what's wrong. It's driving me crazy...

    Anyone?

  3. #3
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    First off, the way to remove all duplicates from a wordlist with one word per line from the shell is:
    Code:
    sort wordlist.txt | uniq
    "sort" does exactly what you expect. "uniq" takes a list of lines and collapses any run of adjacent identical lines into exactly one, which is why the list has to be sorted first. So:
    Code:
    one
    one
    two
    becomes

    Code:
    one
    two
    but

    Code:
    one
    two
    one
    stays the same.
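
    To make the "adjacent runs" behaviour concrete, here is a rough Perl sketch of the same idea. The sub name collapse_runs is made up for illustration, and this only imitates what uniq does with adjacent lines; it is not how uniq is actually implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse each run of identical adjacent items into a single item.
# collapse_runs is a hypothetical helper name, not a standard function.
sub collapse_runs {
    my @items = @_;
    my @result;
    for my $item (@items) {
        # Keep the item only if it differs from the one kept just before it.
        push @result, $item if !@result || $result[-1] ne $item;
    }
    return @result;
}

# Adjacent duplicates are collapsed.
print join(" ", collapse_runs(qw(one one two))), "\n";
# Non-adjacent duplicates survive.
print join(" ", collapse_runs(qw(one two one))), "\n";
# Sorting first makes all duplicates adjacent, so they all collapse.
print join(" ", collapse_runs(sort qw(one two one))), "\n";
```

    The third call is the reason for running sort before uniq.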

    Now, to go to your code:

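    First off, don't use sysopen. It is a lower-level call that expects numeric mode flags such as O_RDWR, and those constants come from the Fcntl module. Your script never loads Fcntl and spells the flag "O_RDRW", which is why Perl complains that the argument isn't numeric. Use regular "open" instead.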

    Secondly, you have misspelled the name of your variable inside the second while loop.

    Thirdly, this simply won't work. Modifying $_ doesn't change the contents of a file.

    In Perl, the standard way to do duplicate detection is to use a hash. Therefore, if I were implementing this, I would do it this way:

    Code:
    #!/usr/bin/perl -w
    
    use strict;
    
    my $filepath = "wordlist.txt";
    
    open(my $file, "<", $filepath) or die "$filepath cannot be opened.";
    
    my %words;
    
    while(<$file>)
    {
        $words{$_} = 1;
    }
    
    print "$_\n" for(keys %words);
    What this does is loop through the file and make an entry in our hash for each word. Therefore, we end up with a hash whose keys are all of the words in the file. However, a hash treats duplicate keys as the same, so when we take the keys of the hash, we find a single copy of every word that was in the file. We write these out, and all duplicates have been removed.

    Does this make sense?
    DISTRO=Arch
    Registered Linux User #388732

  4. #4
    Just Joined! computer_freak_8's Avatar
    Join Date
    Sep 2008
    Location
    Des Moines, Iowa, USA
    Posts
    54
    Thanks so much for all that information! I still have a bit of figuring out to do about how this code works, though. I have added comments describing what each line looks like it does to me; could you please verify/correct them, just so I can better understand the code? (I'm trying to learn programming - kind of a rough road for me...)

    Quote Originally Posted by Cabhan
    Code:
    my %words; # initialize the hash
    
    while(<$file>) # while we are not at the end of the file
    {
        $words{$_} = 1; # for each key in %words, set the value to 1 (???)
    }
    
    print "$_\n" for(keys %words); # print each of the keys in %words; don't print the values (?)
    I have put question marks next to the ones that I am more confused about.

    Thanks again for your time, effort, and willingness to help.

  5. #5
    Just Joined! computer_freak_8's Avatar
    Join Date
    Sep 2008
    Location
    Des Moines, Iowa, USA
    Posts
    54
    Okay, I've run into another problem. It just erases the contents of "wordlist.txt". Here's what I've got now:
    Code:
    #!/usr/bin/perl -w
    
    # Try to get the wordlist file ready for use.
    
    use strict;
    
    my $filepath = "wordlist.txt";
    
    open(my $file, "+>", $filepath) or die "$filepath cannot be opened.";
    
    my %words;
    
    while(<$file>)
     {
        $words{$_} = 1;
     }
    
    print "$_\n" for(keys %words);
    
    
    #EOF
    The same thing happens if I add:
    Code:
    close($file)
    just above the "#EOF". Any ideas?

  6. #6
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    So, how does open() work?

    Code:
    open($filehandle, r/w, $filename)
    $filehandle is the handle that will be connected to the file. r/w is the direction of the handle. "<" means read, ">" means write, and ">>" means append. If you choose ">", you will clear the file!
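
    As a small sketch of how the modes differ, the script below tries them on a temporary file. The "+<" and "+>" modes are not covered in the description above and are added here for comparison: "+<" opens for reading and writing without clearing the file, while "+>" clears it first. The helper name slurp and the use of File::Temp are choices made for this example only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the whole content of a file as one string.
sub slurp {
    my ($path) = @_;
    open(my $in, "<", $path) or die "$path cannot be opened: $!\n";
    local $/;                      # read everything in one go
    my $content = <$in>;
    close($in);
    return defined $content ? $content : "";
}

# Start with a temporary file that holds two words.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "one\ntwo\n";
close($tmp);

# ">>" appends to the end and keeps what was already there.
open(my $append, ">>", $path) or die "$path cannot be opened: $!\n";
print {$append} "three\n";
close($append);
my $after_append = slurp($path);

# "+<" opens for reading and writing and leaves the content alone.
open(my $rw, "+<", $path) or die "$path cannot be opened: $!\n";
my $first_line = <$rw>;
close($rw);
my $after_rw = slurp($path);

# "+>" (like ">") empties the file before anything can be read.
open(my $clobber, "+>", $path) or die "$path cannot be opened: $!\n";
my $line_after_clobber = <$clobber>;
close($clobber);
my $after_clobber = slurp($path);

print "after >>: ", length($after_append), " bytes\n";
print "after +<: ", length($after_rw), " bytes\n";
print "after +>: ", length($after_clobber), " bytes\n";
```

    The last step is the behaviour to watch out for: opening with ">" or "+>" throws away the existing content as soon as the file is opened.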

    Now, on to your questions about the code I posted:

    Code:
    my %words; # initialize the hash
    
    while(<$file>) # while we are not at the end of the file
    {
        $words{$_} = 1; # for each key in %words, set the value to 1 (???) [1]
    }
    
    print "$_\n" for(keys %words); # print each of the keys in %words; don't print the values (?) [2]
    [1] What this line does is essentially create a key in the hash for each word. We don't care about the value. We are basically reading every word into memory, but only remembering a certain word once.

    Think about the naive way of reading a word into memory. We would have an array, and push each word onto it:
    Code:
    while(<$file>)
    {
        push @words, $_;
    }
    However, this remembers every word. Remember that no matter how many times you assign a value to a given hash key, the key only exists once (future writes to that key will overwrite the existing value). So let's follow my code execution through the following word list:
    Code:
    one
    two
    three
    two
    two
    one
    First, we are going to set $words{'one'} = 1. %words now has one key: "one". Then we do the same for "two" and "three". Now %words has three keys: "one", "two", and "three".

    Now we set $words{'two'} = 1. Well, %words already has a key "two". So there is no change. And similarly for the next 2 lines.

    After running this, if I print out the keys, I will get "one", "two", and "three", despite the fact that some of these appeared multiple times.
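
    The walk-through above can be replayed with a short sketch that records how many keys the hash holds after each word is read. The sub name key_counts is invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert each word as a hash key and record how many distinct keys
# exist after each insertion. key_counts is a hypothetical helper name.
sub key_counts {
    my @list = @_;
    my %words;
    my @counts;
    for my $word (@list) {
        $words{$word} = 1;                    # a repeat overwrites the same key
        push @counts, scalar(keys %words);    # number of distinct keys so far
    }
    return @counts;
}

# The same word list as in the walk-through above.
my @list = qw(one two three two two one);
print join(" ", key_counts(@list)), "\n";
```

    The count climbs to three and then stays there, because the last three words only overwrite keys that already exist.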


    Code:
    print "$_\n" for(keys %words);
    This line may be obvious now, but remember that we don't care about the values in this hash. We only care about the keys. So we're only printing those out.


    Does this make more sense?

  7. #7
    Just Joined! computer_freak_8's Avatar
    Join Date
    Sep 2008
    Location
    Des Moines, Iowa, USA
    Posts
    54
    Quote Originally Posted by Cabhan
    So, how does open() work?

    Code:
    open($filehandle, r/w, $filename)
    $filehandle is the handle that will be connected to the file. r/w is the direction of the handle. "<" means read, ">" means write, and ">>" means append. If you choose ">", you will clear the file!

    [...]

    Does this make more sense?
    First of all, yes, I think that makes a lot more sense now. Thanks!
    Next, yes, I misread the reference tutorial - I should have had "+<" instead of "+>".

    So, I replaced the ">" with a "<", and ran the script. This time, it hung for a few seconds (as I would expect with a 37.4 MB file), and then spat out text for about a minute or so. However, the text had blank lines in between the entries (not what I hoped for), and the original file was left unmodified.

    So, not to be a pain, but now my questions are:
    1. How do I get rid of the blank lines? (I think "chomp" might come in handy, but I don't know how...)
    2. How do I make it save the new list back to the file? (That is, all the entries once, excluding duplicates.)

    Thanks again for your time. I think I'm learning a lot.

  8. #8
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    No worries about being a pain. I've been doing Perl for a long time, and I've had to learn all of this stuff as well.

    1) You are correct. What is happening here is that we are reading each line of the file. Each line ends with a newline (obviously). However, when we print out the keys at the end, we are still appending a newline. So we end up with a newline followed by a newline: a blank line.

    chomp() will remove the last character of the given variable if that character is a newline. If it is not a newline, it has no effect. If no variable is given, it works on $_. So we modify our code:
    Code:
    #!/usr/bin/perl -w
    
    use strict;
    
    my $filepath = "wordlist.txt";
    
    open(my $file, "<", $filepath) or die "$filepath cannot be opened.";
    
    my %words;
    
    while(<$file>)
    {
        chomp;
        $words{$_} = 1;
    }
    
    print "$_\n" for(keys %words);
    That's the only change. It now chomps the line, and we use this new chomped line as our key (which is really what we wanted: now the key is a word, not a word followed by a newline).
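
    A tiny sketch of what chomp does on its own, assuming the default input record separator (a single newline). The variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# chomp removes one trailing newline and returns how many characters it removed.
my $with_newline  = "word\n";
my $removed_first = chomp($with_newline);   # the trailing newline is removed
my $removed_again = chomp($with_newline);   # nothing left to remove

# With two trailing newlines, only the last one is removed.
my $double = "word\n\n";
chomp($double);

print "first chomp removed $removed_first character(s)\n";
print "second chomp removed $removed_again character(s)\n";
print "length after chomp of a doubled newline: ", length($double), "\n";
```

    Calling chomp on a string that has no trailing newline is harmless, so it is safe to chomp every line as it is read.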

    2) There are two approaches to this.

    Most utilities in UNIX will read from stdin and write to stdout by default. This way, the person who runs your program gets to choose where the input comes from and where the output goes to. In this example, we have hardcoded the input, but we print the output to stdout. Now, to direct it to a file, you run the program with:
    Code:
    ./remove_duplicates > no_duplicates_wordlist.txt
    The ">" here is shell redirection: it sends the program's standard output to the given file (error messages on stderr still go to the terminal). It has nothing at all to do with Perl.

    The other approach is to hardcode the output as well (or take it as an argument to the script, etc.). Suppose we want to output to the file "no_duplicates_wordlist.txt". We do the following:
    Code:
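    #!/usr/bin/perl -w
    
    use strict;
    
    my $filepath = "wordlist.txt";
    
    # Open the word list read-only; $! holds the reason if this fails.
    open(my $file, "<", $filepath) or die "$filepath cannot be opened: $!\n";
    
    my %words;
    
    # Each distinct word becomes one hash key; repeats overwrite the same key.
    while(<$file>)
    {
        chomp;
        $words{$_} = 1;
    }
    close($file);
    
    # ">" creates the output file, or empties it if it already exists.
    open(my $outfile, ">", "no_duplicates_wordlist.txt") or die "no_duplicates_wordlist.txt cannot be opened for output: $!\n";
    
    # keys returns the words in no particular order.
    print {$outfile} "$_\n" for(keys %words);
    close($outfile) or die "no_duplicates_wordlist.txt could not be written: $!\n";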
    We open the file for output just like we open a file for input, just with the output indicator (">"). The print() command then takes an optional filehandle before the list of things to print: if one is given, the output goes there. If you don't specify a filehandle, print() uses the currently selected one, which is "STDOUT" unless you have changed it.
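
    For the "take it as an argument to the script" variant mentioned above, a minimal sketch might look like the following. The sub name dedupe_file and the two-argument command line are assumptions made for this example, and the words are sorted on output only so that the result is predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read $in_path, drop duplicate lines, and write the unique words to $out_path.
# Returns the number of unique words written.
# dedupe_file is a hypothetical helper name for this sketch.
sub dedupe_file {
    my ($in_path, $out_path) = @_;

    open(my $in, "<", $in_path) or die "$in_path cannot be opened: $!\n";
    my %words;
    while (my $line = <$in>) {
        chomp $line;
        $words{$line} = 1;        # repeats overwrite the same key
    }
    close($in);

    open(my $out, ">", $out_path) or die "$out_path cannot be opened for output: $!\n";
    print {$out} "$_\n" for sort keys %words;
    close($out) or die "$out_path could not be written: $!\n";

    return scalar(keys %words);
}

# Assumed usage: ./remove_duplicates wordlist.txt no_duplicates_wordlist.txt
if (@ARGV == 2) {
    my $count = dedupe_file(@ARGV);
    print "$count unique words written to $ARGV[1]\n";
}
```

    Writing to a second file rather than back over the input means the original word list is still intact if something goes wrong part-way through.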

    Does this make sense?

  9. #9
    Linux User
    Join Date
    Aug 2006
    Posts
    458
    Quote Originally Posted by computer_freak_8
    So, I tried to write a Perl script to remove all duplicates, but I need someone to review my script before I try it out, just to be safe.
    To remove duplicates:
    1) See perldoc -q duplicate
    2) perl -ne 'print $_ if !$a{$_}++' file
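
    Spelled out as a script, the one-liner's idiom looks roughly like this. The sub name unique_in_order is made up for the sketch; the point is that, unlike printing the keys of a hash, this keeps each word at the position where it first appeared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the items with duplicates removed, keeping first-seen order.
# unique_in_order is a hypothetical helper name.
sub unique_in_order {
    my @lines = @_;
    my %seen;
    # $seen{$_}++ yields the count from before the increment: false the
    # first time an item is seen and true on every later occurrence,
    # so the negation lets each item through exactly once.
    return grep { !$seen{$_}++ } @lines;
}

print join(" ", unique_in_order(qw(one two three two two one))), "\n";
```

    Because the one-liner prints each line as it reads it, it also starts producing output before the whole file has been read.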

  10. #10
    Just Joined! computer_freak_8's Avatar
    Join Date
    Sep 2008
    Location
    Des Moines, Iowa, USA
    Posts
    54

    Quote Originally Posted by Cabhan
    Does this make sense?
    Yeah, I was wondering about the chomp thing, or if I should not chomp, but just remove the "\n" part.

    Anyhow, here is my final (and working) script:
    Code:
    #!/usr/bin/perl -w
    
    # Try to get the wordlist file ready for use.
    
    use strict;
    
    my $filepath1 = "wordlist1.txt";
    my $filepath2 = "wordlist2.txt";
    
    open(my $filehandle1, "+<", $filepath1) or die "$filepath1 cannot be opened.";
    
    my %words;
    
    while(<$filehandle1>)
     {
        chomp;
        $words{$_} = 1;
     }
    close($filehandle1);
    
    
    open(my $filehandle2, ">", $filepath2) or die "$filepath2 cannot be opened.";
    
    print {$filehandle2} "$_\n" for(keys %words);
    
    close($filehandle2);
    
    
    #EOF
    Thanks a bunch! Oh, and I'm sure this won't be the last you see of me.
