  1. #1
    bla
    Just Joined!
    Join Date
    May 2007
    Posts
    3

    Need Help with Grep

    Hi! I am a researcher trying to compare two sets of data that contain Russian names. I thought I could use the grep command to do this quickly, but whatever I am doing is not working. Here is what I have tried (and what does not work!).
    Please, please help so I don't have to use ctrl+f 4000 times!

    There are two folders: mydata (input files) newdata (files to be searched)
    All of the files in both folders are txt and the names are separated by carriage returns.

    I thought this was best:
    grep -if mydata/1865-1869.txt newdata/*.txt > output1.txt -n
    (but this gives me every line in a file in newdata)

    Then I tried narrowing the grep to one file (not ideal)
    grep -if OurFiles/1865-1869.txt spse_data/1865-1869.txt > output2.txt
    (This ALSO gives me every line in the file grepped)

    The output looks like this in both cases:
    1:Rossisko-Amerikanskaia kompania
    2:St-Petersburgskoe Gazovoe obshchestvo
    3:Rossiskaya Bumagopriadilnia
    4:Obshchestvo Stolichnogo osveshchenia
    5:Obshchestvo St-Petersb. Vodoprov.
    6:Tsarevskaya manufakuta
    7:Sampsonievskaia Bumagopriadilnia
    8:Kompania dla hranenia i zaloga raznih dvigimostei i tovarov
    9:St-Petersburgskii Chast.Kommerch. Bank
    10:BAVARIA pivovaren. Zavod
    11:Ot ognia 1-go Rossiiskoe

    Thanks so much in advance for your help!! And please let me know if you need to know anything else.

  2. #2
    Linux User dxqcanada's Avatar
    Join Date
    Sep 2006
    Location
    Canada
    Posts
    259
    Can you describe what you are trying to achieve ?




  3. #3
    bla
    Just Joined!
    Join Date
    May 2007
    Posts
    3
    Basically, I have two distinct sets of data about Russian companies that I need to match. The files to grep and the input files are broken down by time period and kept in separate folders. Because the information is both historical (1865+) and Russian, the only way to identify the companies uniquely is by their names, which poses its own set of problems (misspellings, abbreviations, etc.). Because the data spans several decades, there are a lot of companies to compare, and since this is for research purposes, I would like a more systematic way of capturing the matches than doing it by hand.

    File contents look something like this with a unique name on each line:

    Kal'nikskogo sveklo-sakharnogo zavoda
    morskikh kupalen v Ekaterinentale, bliz Revelia
    Lodzinskoi fabrichnoi zheleznoi dorogi [S69, item 244]
    S.-Peterburgskoi aktsionernoi pivovarni
    gorodskogo molochnogo khoziaistva
    pozemel'nogo kredita
    Voskresenskoi manufaktury
    Riazansko-Kozlovskoi zheleznoi dorogi
    Shlissel'burgskoi sitse-nabivnoi manufaktury
    gornykh zavodov Vsevolozhskikh

    Does that answer your question? I am open to any solution that you might know of - I am not married to grep.
    Thanks!!

  4. #4
    Linux Guru anomie's Avatar
    Join Date
    Mar 2005
    Location
    Texas
    Posts
    1,692
    Basically, I have two distinct sets of data about russian companies that I need to match.
    Still need more info to help.

    Here's what I understand:
    • You have data set #1 and data set #2. (Where 'data set' refers to a group of ascii text files.) Both contain the names of Russian companies.
    • You need to match every Russian company name in data set #1 with its corresponding Russian company name in data set #2.
    • You need to match using some level of intelligent logic, since Russian company names won't always be exactly alike.


    Is that correct?

    I still have more questions, and I can tell you this probably won't be a simple grep invocation from the command line. At the moment it sounds fairly complex.

  5. #5
    bla
    Just Joined!
    Join Date
    May 2007
    Posts
    3
    You are right, I did not make this totally clear.

    To address your list of understood things...

    * You have data set #1 and data set #2. (Where 'data set' refers to a group of ascii text files.) Both contain the names of Russian companies.

    ~YES!

    * You need to match every Russian company name in data set #1 with its corresponding Russian company name in data set #2.
    ~I do not believe there will be a one-to-one match of company names, but I would like to compare every item in data set #1 with data set #2 in the hope of finding some matches.

    * You need to match using some level of intelligent logic, since Russian company names won't always be exactly alike.
    ~This is ideal but even looking for exact matches would be a start.

    Thanks for following up.

  6. #6
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    I'm not sure I understand the problem, either.

    However, a tool that may be of some value for small spelling discrepancies is:
    agrep - search a file for a string or regular expression, with approximate matching capabilities

    -- excerpt from man agrep
    Best wishes ... cheers, drl
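
    If agrep is not installed, the idea of tolerating small spelling discrepancies can be sketched with a plain edit-distance comparison in awk. This is only an illustration, not how agrep works internally: the function name fuzzy_match, the tab-separated output, and the sample file names are all made up for the example. It compares every name with every candidate line, so it may be slow on lists of several thousand names.

    ```shell
    #!/bin/sh
    # Sketch: approximate matching with a small Levenshtein (edit distance)
    # function in awk. fuzzy_match is a hypothetical helper name.
    # Usage: fuzzy_match NAMES_FILE CANDIDATES_FILE MAX_EDITS
    # Assumes NAMES_FILE contains at least one non-empty line.
    fuzzy_match() {
        awk -v max="$3" '
        # Classic dynamic-programming edit distance between strings a and b.
        function lev(a, b,    la, lb, i, j, cost, d, m) {
            la = length(a); lb = length(b)
            for (i = 0; i <= la; i++) d[i, 0] = i
            for (j = 0; j <= lb; j++) d[0, j] = j
            for (i = 1; i <= la; i++)
                for (j = 1; j <= lb; j++) {
                    cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                    m = d[i-1, j] + 1                                   # deletion
                    if (d[i, j-1] + 1 < m) m = d[i, j-1] + 1            # insertion
                    if (d[i-1, j-1] + cost < m) m = d[i-1, j-1] + cost  # substitution
                    d[i, j] = m
                }
            return d[la, lb]
        }
        { sub(/\r$/, "") }                    # tolerate CRLF line endings
        /^[[:space:]]*$/ { next }             # skip blank lines
        NR == FNR { names[++n] = $0; next }   # first file: names to look for
        {
            # second file: report every name within max edits of this line
            for (k = 1; k <= n; k++) {
                dist = lev(tolower(names[k]), tolower($0))
                if (dist <= max)
                    printf "%s\t%s\t%d\n", names[k], $0, dist
            }
        }' "$1" "$2"
    }

    # Tiny demonstration with made-up data in a temporary directory.
    demo=$(mktemp -d)
    printf 'Voskresenskoi manufaktury\n' > "$demo/names.txt"
    printf 'Voskresenskoy manufaktury\nRossiskaya Bumagopriadilnia\n' > "$demo/candidates.txt"
    fuzzy_match "$demo/names.txt" "$demo/candidates.txt" 2
    ```

    Each output line holds the name searched for, the candidate line it resembles, and the number of single-character edits between them. A threshold of 1 or 2 catches small transliteration differences; larger values will start pairing unrelated names.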

  7. #7
    Linux Guru anomie's Avatar
    Join Date
    Mar 2005
    Location
    Texas
    Posts
    1,692
    Quote Originally Posted by bla
    ~This is ideal but even looking for exact matches would be a start.
    Here's a start, then, looking for exact matches.

    First, here's a look at my test data.
    Code:
    [helen@troy sandbox]$ ls russian-authors/
    file1  file2
    
    [helen@troy sandbox]$ ls mystery-data/
    data  more-data
    
    [helen@troy sandbox]$ cat russian-authors/file1
    Some Guy
    Another Russian Guy
    
    [helen@troy sandbox]$ cat russian-authors/file2
    Another Author
    Russian Writer Guy
    Some Other Person
    
    [helen@troy sandbox]$ cat mystery-data/data 
    Some Guy
    Nothing
    Nobody
    
    [helen@troy sandbox]$ cat mystery-data/more-data 
    Another Russian Guy
    Nothing
    Else
    To
    See Here
    It's intended to be a tiny representation of how you've described your own data. There is a directory 'russian-authors' which contains author names (i.e. a complete name on a single line). The search will be driven from 'russian-authors'. There is a second directory called 'mystery-data', which contains some instances of matches, and some other spurious data.

    For every author name (i.e. single line) that lives in an ascii text file in 'russian-authors', we'll expect to see a match in 'mystery-data' in some cases, and in other cases no match.

    Now for the implementation:
    Code:
    [helen@troy sandbox]$ cat russian-authors/* | xargs -I {} grep -R {} mystery-data/
    mystery-data/data:Some Guy
    mystery-data/more-data:Another Russian Guy
    And the results match our expectations. The 'Some Guy' author existed in 'russian-authors' and we found a match in 'mystery-data'. The grep program even told us exactly which file the match was found in ('data'). Likewise for the author 'Another Russian Guy'.

    Hopefully this simple example will give you an idea of one way to proceed. Once you're ready to do some non-exact matching, you'll have some more research to do with agrep or with regular expressions.
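
    A variation on the same exact-match idea keeps the original `grep -f` approach but cleans the name list first. One common reason for `grep -f` printing every line is an empty line in the pattern file, because an empty pattern matches everything; the thread does not confirm that this was the cause here, so treat it as a guess. The helper names clean_patterns and find_exact and the sample files below are made up for the sketch.

    ```shell
    #!/bin/sh
    # Sketch: exact, case-insensitive matching of every name in a list against
    # a directory of files. clean_patterns and find_exact are hypothetical names.

    # Strip carriage returns and drop empty or whitespace-only lines, because
    # an empty pattern read with -f matches every line.
    clean_patterns() {
        tr -d '\r' < "$1" | grep -v '^[[:space:]]*$'
    }

    # -F: treat names as fixed strings, so '.' and '*' are not regex operators
    # -i: ignore case; -n: print line numbers; -r: search the directory tree
    # -f -: read the pattern list from standard input
    find_exact() {
        clean_patterns "$1" | grep -rinF -f - "$2"
    }

    # Tiny demonstration with made-up data in a temporary directory.
    # The name list has Windows line endings and a blank line on purpose.
    demo=$(mktemp -d)
    mkdir "$demo/mydata" "$demo/newdata"
    printf 'Voskresenskoi manufaktury\r\n\r\ngornykh zavodov Vsevolozhskikh\r\n' > "$demo/mydata/list.txt"
    printf 'Tsarevskaya manufakuta\nvoskresenskoi manufaktury\nBAVARIA pivovaren. Zavod\n' > "$demo/newdata/1865-1869.txt"
    find_exact "$demo/mydata/list.txt" "$demo/newdata"
    ```

    In the demonstration only the second line of the searched file is reported, prefixed with its file name and line number. Adding -x to the grep options would require the whole line to equal the name instead of merely containing it.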

  8. #8
    Linux Newbie
    Join Date
    Sep 2005
    Location
    CZ
    Posts
    164


    Hello,
    Do you need to use a command-line tool for this task? It looks like you could use a spreadsheet application (like Excel), put your data in columns, then sort the columns in some way and see whether that is useful to you.
    There's a hitch in the task as you described it:
    A company's name could be "St. Petersburg oil" in one sheet and "S. Petersburg, oil company" in the other, so I don't think any computer will find that match for you. The name could also appear as "Oil, St. Petersburg", which looks completely different to a machine.

    I think that with a spreadsheet application you can analyze your task precisely and see what you can expect. Good luck, fellow researcher.
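
    Differences that are only punctuation, capitalization, or transliteration apostrophes can be reduced before comparing, whatever tool does the comparison. The sketch below is one assumed way to do that in the shell; normalize and common_names are hypothetical helper names and the sample data is made up. It does not help with reordered names such as "Oil, St. Petersburg" or with abbreviations like "St." versus "S.".

    ```shell
    #!/bin/sh
    # Sketch: normalize two name lists, then print the names they share.
    # normalize and common_names are hypothetical helper names.

    # Lower-case, delete apostrophes, turn other punctuation into spaces,
    # squeeze repeated spaces, trim, drop empty lines, and sort uniquely.
    normalize() {
        tr -d '\r' < "$1" |
            tr '[:upper:]' '[:lower:]' |
            tr -d "'" |
            tr -c '[:alnum:]\n' ' ' |
            tr -s ' ' |
            sed -e 's/^ //' -e 's/ $//' -e '/^$/d' |
            sort -u
    }

    # comm -12 prints only the lines present in both sorted inputs.
    common_names() {
        tmp_a=$(mktemp)
        tmp_b=$(mktemp)
        normalize "$1" > "$tmp_a"
        normalize "$2" > "$tmp_b"
        comm -12 "$tmp_a" "$tmp_b"
        rm -f "$tmp_a" "$tmp_b"
    }

    # Tiny demonstration with made-up variants of names from this thread.
    demo=$(mktemp -d)
    printf "Kal'nikskogo sveklo-sakharnogo zavoda\nS.-Peterburgskoi aktsionernoi pivovarni\n" > "$demo/a.txt"
    printf "S. Peterburgskoi Aktsionernoi Pivovarni\nKalnikskogo sveklo sakharnogo zavoda\npozemel'nogo kredita\n" > "$demo/b.txt"
    common_names "$demo/a.txt" "$demo/b.txt"
    ```

    The output shows the normalized spellings rather than the originals, so it is best treated as a list of candidates to check by hand against the source files.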
