  1. #1
    otkaz
    Just Joined!
    Join Date
    Feb 2009
    Location
    Houston, TX
    Posts
    19

    need help; shell script for editing large txt files

    I'm new to shell scripts and I need one that can edit a large txt file. Every line in the file has a large number on it, and I need to remove every line whose number contains the same character 4 or more times in a row, anywhere within the number. For example, 123456789 would be fine, but 111123456, 023444456, and 018399991 would not. I tried reading a guide on shell scripting, but it's taking me too long to figure out and I need this done for work ASAP. Any help is much appreciated. Sorry if my post and explanation sound rushed; they were.
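
    To pin down the rule before picking a tool, here is a minimal sketch in Python. It assumes that "repeating characters" means the same digit four or more times in a row, which is what the examples in the post suggest; the name has_repeated_digit is made up for illustration.

```python
import re

# Assumption: a line is "bad" when one digit occurs four or more times
# in a row, as in the examples from the post.
# ([0-9]) captures a digit; \1{3} requires that same digit three more times.
REPEAT = re.compile(r"([0-9])\1{3}")

def has_repeated_digit(text):
    """Return True if text contains a run of 4 or more identical digits."""
    return REPEAT.search(text) is not None

# The four numbers quoted in the post.
for number in ["123456789", "111123456", "023444456", "018399991"]:
    print(number, "remove" if has_repeated_digit(number) else "keep")
```

    The backreference avoids listing every digit separately; a run longer than four is still caught because it contains a run of four.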

  2. #2
    tpl
    Linux User
    Join Date
    Jan 2007
    Location
    cleveland
    Posts
    452
    welcome to the forum

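    you can use awk's interval expressions, like this:

    awk --posix '!/1{4}|4{4}|9{4}/ {print}' <filename

    add the other alternatives as you like. r{4} matches exactly 4 r's in a row, so the pattern hits any line containing a run of 4 or more, and the leading ! prints only the lines that do not match. --posix is a gawk option; other awk implementations may not accept it.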
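
    The same idea carries over to other regex engines. A minimal sketch in Python (used here only for illustration; keep_line is a hypothetical helper name) that builds the alternation for all ten digits and keeps only the lines that do not match:

```python
import re

# Build the full alternation 0{4}|1{4}|...|9{4}, one alternative per digit.
pattern = "|".join("%d{4}" % digit for digit in range(10))
matcher = re.compile(pattern)

def keep_line(line):
    """Keep a line only when no digit occurs four times in a row."""
    # search() is unanchored, so a run of five or more digits also matches,
    # because it contains a run of exactly four.
    return matcher.search(line) is None

print(pattern)
for line in ["123456789", "023444456", "7777777"]:
    print(line, keep_line(line))
```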
    the sun is new every day (heraclitus)

  3. #3
    otkaz
    Just Joined!
    Thanks for the response. I get:
    awk: not an option: --posix

  4. #4
    fiomba
    Just Joined!
    Join Date
    Oct 2004
    Posts
    62
    Hi otkaz,
    I see that you have already received help from tpl...
    I don't have time to do your homework as you requested (shell script)...
    But as an exercise for myself, I solved your problem in python.
    Code:
    # rmlines.py
    
    # load text file
    with open('xxx.txt') as f:
        sF = f.read()   # all the file in one big string (sF)

    lF = sF.split('\n')    # generate a list of all the lines (w/out LF)

    # the prohibited sequences: 0000, 1111, 2222 ... 9999
    lC4 = [str(n) * 4 for n in range(10)]

    lNew = []   # init. new list w/out the wrong lines (those w/ prohibited numbers)
    for sLine in lF:
        # keep the line only if it contains none of the prohibited sequences
        if not any(sC4 in sLine for sC4 in lC4):
            lNew.append(sLine)      # add one line to the new list

    with open('yyy.txt', 'w') as fOut:   # save new file in yyy.txt
        fOut.write('\n'.join(lNew))
    
    # run with python rmlines.py from the dir where there is xxx.txt
    
    # For ex. xxx.txt is:
    
    #    I'm new to shell scripts
    #    and I need one that can edit a large txt file.
    #    Every line in the file has a large number on it,
    #    and I need to remove lines containing any number 
    #    with 4 or more repeating characters anywhere
    #    within the number ie 123456789 would be fine
    #    but 111123456,
    #    or also
    #    023444456,
    #    and 018399991 all would not.
    #    I tried reading a guide on shell scripting,
    #    but its taking me too long to figure it out and I need this done for work asap.
    #    Any help is very appreciated.
    #    Sorry if my post and explanation sounds rushed, but it was.
    
    # you obtain yyy.txt:
    
    #    I'm new to shell scripts
    #    and I need one that can edit a large txt file.
    #    Every line in the file has a large number on it,
    #    and I need to remove lines containing any number 
    #    with 4 or more repeating characters anywhere
    #    within the number ie 123456789 would be fine
    #    or also
    #    I tried reading a guide on shell scripting,
    #    but its taking me too long to figure it out and I need this done for work asap.
    #    Any help is very appreciated.
    #    Sorry if my post and explanation sounds rushed, but it was.
    I always use python (instead of bash) whenever possible.
    Nearly every Linux distribution ships with it (check by entering python in the shell).
    The script looks long, but without the comments I think its length is
    comparable to a shell script (even one using awk), and it is faster IMHO.
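
    Since the file in question is described as large, a line-by-line variant may be worth considering, because it never holds the whole file in memory. This is only a sketch under the same reading of the rule (one digit four or more times in a row); filter_file and its parameters are hypothetical names.

```python
import re

# The same digit four times in a row (longer runs contain such a run too).
RUN = re.compile(r"([0-9])\1{3}")

def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that contain a digit run.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:          # the file object yields one line at a time
            if RUN.search(line):
                dropped += 1      # line contains a prohibited number
            else:
                dst.write(line)   # line still carries its own newline
    return dropped

# Example call, using the file names from the script above:
#   filter_file("xxx.txt", "yyy.txt")
```

    Returning the count of dropped lines makes it easy to sanity-check the result against the size of the input.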

    Bye.

  5. #5
    otkaz
    Just Joined!
    Thanks!
    I'm sorry if I was rude in asking someone to do my work for me. This whole time I have been reading tutorials and trying to figure it out myself. I'm just too new to shell scripts to work it out in time to have this ready. You won't see me in here asking a question this way again; I just got caught in a pinch. I'll continue my reading and learn how to do this myself in the future. Thanks so much for the python script. I'm going to go try it now.
