- 05-04-2007 #1 Just Joined!
- Join Date
- May 2007
- Posts
- 3
Need Help with Grep
Hi! I am a researcher trying to compare two sets of data that contain Russian names. I thought I could use the grep command to do this quickly, but whatever I am doing is not working. Below is what I have tried (and it does not work!).
Please, please help so I don't have to use ctrl+f 4000 times!
There are two folders: mydata (input files) and newdata (files to be searched).
All of the files in both folders are .txt files, and the names are separated by carriage returns.
I thought this was best:
grep -if mydata/1865-1869.txt newdata/*.txt > output1.txt -n
(but this gives me every line of every file in newdata)
Then I tried narrowing the grep to one file (not ideal)
grep -if OurFiles/1865-1869.txt spse_data/1865-1869.txt > output2.txt
(This ALSO gives me every line in the file grepped)
The output looks like this in both cases:
1:Rossisko-Amerikanskaia kompania
2:St-Petersburgskoe Gazovoe obshchestvo
3:Rossiskaya Bumagopriadilnia
4:Obshchestvo Stolichnogo osveshchenia
5:Obshchestvo St-Petersb. Vodoprov.
6:Tsarevskaya manufakuta
7:Sampsonievskaia Bumagopriadilnia
8:Kompania dla hranenia i zaloga raznih dvigimostei i tovarov
9:St-Petersburgskii Chast.Kommerch. Bank
10:BAVARIA pivovaren. Zavod
11:Ot ognia 1-go Rossiiskoe
Thanks so much in advance for your help!! And please let me know if you need to know anything else.
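One plausible explanation for the output above, offered as an assumption because the thread never confirms it: with `grep -f`, an empty line in the pattern file is an empty pattern, and an empty pattern matches every line. A pattern file that ends in a blank line, or has one anywhere, would therefore print the whole searched file. Stray carriage returns in the pattern file can cause the opposite problem (no matches). The sketch below uses made-up file contents in the `mydata`/`newdata` layout from the post; `patterns.txt` is a hypothetical name for the cleaned pattern file.

```shell
#!/bin/sh
# Sketch only: directory layout mirrors the post, file contents are invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir mydata newdata

# Pattern file with one CRLF-terminated name and one blank line.
printf 'Voskresenskoi manufaktury\r\n\nTsarevskaya manufaktura\n' > mydata/1865-1869.txt
printf 'Rossiskaya Bumagopriadilnia\nvoskresenskoi Manufaktury\nObshchestvo Stolichnogo osveshchenia\n' > newdata/a.txt
printf 'Tsarevskaya manufaktura\nBAVARIA pivovaren. Zavod\n' > newdata/b.txt

# Naive run: the blank pattern line matches every line of every file.
naive_count=$(grep -i -f mydata/1865-1869.txt newdata/*.txt | wc -l | tr -d ' ')
echo "naive matches: $naive_count"    # naive matches: 5

# Strip carriage returns and drop blank lines before using the file as patterns.
tr -d '\r' < mydata/1865-1869.txt | grep -v '^[[:space:]]*$' > patterns.txt

# -F treats each name as a fixed string, so "." in a name is not a wildcard.
matches=$(grep -i -n -F -f patterns.txt newdata/*.txt)
printf '%s\n' "$matches"
# newdata/a.txt:2:voskresenskoi Manufaktury
# newdata/b.txt:1:Tsarevskaya manufaktura
```

If the real pattern files turn out to have no blank lines, this cleanup is harmless and the cause lies elsewhere.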
- 05-04-2007 #2
Can you describe what you are trying to achieve ?
- 05-04-2007 #3 Just Joined!
- Join Date
- May 2007
- Posts
- 3
Basically, I have two distinct sets of data about Russian companies that I need to match. The files to grep and the input files are broken down by time period and kept in separate folders. Because the information is both historical (1865+) and Russian, the only unique way to identify the companies is by name, which poses its own set of problems (misspellings, abbreviations, etc.). Because the data spans several decades, there are a lot of companies to compare, and since this is for research purposes I would like a more systematic way of capturing the matches than doing it by hand.
File contents look something like this with a unique name on each line:
Kal'nikskogo sveklo-sakharnogo zavoda
morskikh kupalen v Ekaterinentale, bliz Revelia
Lodzinskoi fabrichnoi zheleznoi dorogi [S69, item 244]
S.-Peterburgskoi aktsionernoi pivovarni
gorodskogo molochnogo khoziaistva
pozemel'nogo kredita
Voskresenskoi manufaktury
Riazansko-Kozlovskoi zheleznoi dorogi
Shlissel'burgskoi sitse-nabivnoi manufaktury
gornykh zavodov Vsevolozhskikh
Does that answer your question? I am open to any solution that you might know of - I am not married to grep.
Thanks!!
- 05-04-2007 #4
"Basically, I have two distinct sets of data about russian companies that I need to match."
Still need more info to help.
Here's what I understand:
- You have data set #1 and data set #2. (Where 'data set' refers to a group of ascii text files.) Both contain the names of Russian companies.
- You need to match every Russian company name in data set #1 with its corresponding Russian company name in data set #2.
- You need to match using some level of intelligent logic, since Russian company names won't always be exactly alike.
Is that correct?
I still have more questions, and I can tell you this probably won't be a simple grep invocation from the command line. At the moment it sounds fairly complex.
- 05-04-2007 #5 Just Joined!
- Join Date
- May 2007
- Posts
- 3
You are right, I did not make this totally clear.
To address your list of understood things...
* You have data set #1 and data set #2. (Where 'data set' refers to a group of ascii text files.) Both contain the names of Russian companies.
~YES!
* You need to match every Russian company name in data set #1 with its corresponding Russian company name in data set #2.
~I do not believe there will be a one-to-one match of company names, but I would like to compare every item in data set #1 with data set #2 in the hope of finding some matches.
* You need to match using some level of intelligent logic, since Russian company names won't always be exactly alike.
~This is ideal but even looking for exact matches would be a start.
Thanks for following up.
- 05-04-2007 #6 Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
Hi.
I'm not sure I understand the problem, either.
However, a tool that may be of some value for small spelling discrepancies is:
agrep - search a file for a string or regular expression, with approximate matching capabilities
-- excerpt from man agrep
Best wishes ... cheers, drl
- 05-05-2007 #7
Here's a start, then, looking for exact matches.
First, here's a look at my test data.
Code:
[helen@troy sandbox]$ ls russian-authors/
file1 file2
[helen@troy sandbox]$ ls mystery-data/
data more-data
[helen@troy sandbox]$ cat russian-authors/file1
Some Guy
Another Russian Guy
[helen@troy sandbox]$ cat russian-authors/file2
Another Author
Russian Writer Guy
Some Other Person
[helen@troy sandbox]$ cat mystery-data/data
Some Guy
Nothing
Nobody
[helen@troy sandbox]$ cat mystery-data/more-data
Another Russian Guy
Nothing Else To See Here
It's intended to be a tiny representation of how you've described your own data. There is a directory 'russian-authors' which contains author names (i.e. a complete name on a single line). The search will be driven from 'russian-authors'. There is a second directory called 'mystery-data', which contains some instances of matches, and some other spurious data.
For every author name (i.e. single line) that lives in an ascii text file in 'russian-authors', we'll expect to see a match in 'mystery-data' in some cases, and in other cases no match.
Now for the implementation:
Code:
[helen@troy sandbox]$ cat russian-authors/* | xargs -i grep -R {} mystery-data/
mystery-data/data:Some Guy
mystery-data/more-data:Another Russian Guy
And the results match our expectations. The 'Some Guy' author existed in 'russian-authors' and we found a match in 'mystery-data'. The grep program even told us which exact file he found it in ('data'). Likewise for the author 'Another Russian Guy'.
Hopefully this simple example will give you an idea of one way to proceed. Once you're ready to do some non-exact matching, you'll have some more research to do with agrep or with regular expressions.
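One caveat for the real data, sketched here under assumptions: many of the company names in this thread contain apostrophes (for example "Kal'nikskogo"). By default xargs treats quotes in its input specially, so with GNU xargs such a line can abort the pipeline with an "unmatched single quote" error. GNU xargs also documents `-i` as deprecated in favor of `-I {}`. A plain `while read` loop avoids both issues. The directory names are reused from the example above; the file contents are invented.

```shell
#!/bin/sh
# Sketch only: same directory names as the example above, invented contents.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir russian-authors mystery-data

printf "Kal'nikskogo sveklo-sakharnogo zavoda\nVoskresenskoi manufaktury\n" > russian-authors/file1
printf "Nothing\nKal'nikskogo sveklo-sakharnogo zavoda\n" > mystery-data/data

found=$(
    cat russian-authors/* | while IFS= read -r name; do
        # Skip blank lines: an empty pattern would match every line.
        [ -n "$name" ] || continue
        # -F: fixed string, not a regex; --: the name is never read as an option.
        grep -R -F -n -- "$name" mystery-data/
    done
)
printf '%s\n' "$found"
# mystery-data/data:2:Kal'nikskogo sveklo-sakharnogo zavoda
```

The loop starts one grep per name, which is slow for thousands of names but easy to extend, for example by printing the names that found no match.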
- 05-05-2007 #8 Linux Newbie
- Join Date
- Sep 2005
- Location
- CZ
- Posts
- 164
Hello,
do you need to use a command-line tool for this task? It looks like you could use a spreadsheet application (like Excel) and put your data in columns. Then you could sort the columns in some way and see whether that is useful to you...
There's a hitch in the task as you described it:
The company's name could be "St. Petersburg oil" in one sheet and "S. Petersburg, oil company" in the other, so I don't think any computer will find that match for you. The name could also be written like "Oil, St. Petersburg", which looks completely different to a machine.
I think with a spreadsheet application you can analyze your task precisely and see what you can expect. Good luck, fellow researcher.
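The sort-and-compare idea can also be sketched on the command line. This is only an illustration with made-up file names (`set1.txt`, `set2.txt`) and a hypothetical `normalize` helper: it lowercases each name, turns punctuation into spaces, and then lets `comm` list the names both sets share. It catches differences in case, punctuation, and spacing only. It would not match "St. Petersburg oil" with "S. Petersburg, oil company".

```shell
#!/bin/sh
# Sketch only: file names and contents are invented for illustration.
LC_ALL=C; export LC_ALL            # same byte-wise ordering for sort and comm
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'St-Petersburgskoe Gazovoe obshchestvo\nTsarevskaya manufaktura\nBAVARIA pivovaren. Zavod\n' > set1.txt
printf 'tsarevskaya  Manufaktura\nBavaria pivovaren Zavod\nVoskresenskoi manufaktury\n' > set2.txt

# Hypothetical helper: lowercase, punctuation to spaces, squeeze and trim
# spaces, then sort and de-duplicate (comm needs sorted input).
normalize() {
    tr '[:upper:]' '[:lower:]' < "$1" |
        tr -c '[:alnum:]\n' ' ' |
        tr -s ' ' |
        sed 's/^ //; s/ $//' |
        sort -u
}

normalize set1.txt > set1.norm
normalize set2.txt > set2.norm

common=$(comm -12 set1.norm set2.norm)   # names present in both sets
only1=$(comm -23 set1.norm set2.norm)    # names only in the first set

printf 'In both:\n%s\n' "$common"
# In both:
# bavaria pivovaren zavod
# tsarevskaya manufaktura
printf 'Only in set 1:\n%s\n' "$only1"
# Only in set 1:
# st petersburgskoe gazovoe obshchestvo
```

The "only in set 1" list is the part that still needs a human eye or an approximate-matching tool.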


