- 04-24-2008 #1 Just Joined! (Join Date: Feb 2007, Posts: 5)
Automatic paper classifier
I have to write an application as my "final test" in my faculty.
My idea is to create an automatic database of scientific papers, which should automatically extract all the information from the paper that is needed to correctly index it (the abstract, title and author are enough) and classify it depending on the subject discussed in the paper (maths, medicine, physics, etc.), so the user just has to send a pdf file, and the system does all the work.
Classification using a naive Bayesian learning algorithm works quite O.K., but I cannot find any information on how to extract the title, abstract etc.
Given the more or less fixed structure of scientific papers, I hope that I can use some kind of statistical learning algorithm for finding these elements of the text. Do you think it's possible? Do you know about any similar project or paper on this subject?
Thank you.
P.S.: I apologize for my bad English. It used to be my second language, but after one year living in Italy I have a big mess in my head about languages!
- 04-24-2008 #2
Given the more or less fixed structure of scientific papers, you can probably code by hand something which recognizes up to, say, a half dozen forms, and perform some sort of exceptional processing (outright rejection, formation of an exception list) for those rare papers which do not fit the predefined forms. That would probably be much simpler than throwing artificial intelligence at this particular tiny sub-problem. Simplicity is a virtue.
--
Bill
Old age and treachery will overcome youth and skill.
- 04-24-2008 #3 Just Joined! (Join Date: Feb 2007, Posts: 5)
This was my first approach, and it's detecting the abstract quite well: it's always one of the first paragraphs, it almost always contains the word "abstract", and it is usually the paragraph just before the Introduction.
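The three rules just described (near the top, contains "abstract", sits right before the Introduction) could be sketched roughly as follows. This is only an illustration: the function name `find_abstract`, the cutoff of ten paragraphs, and the exact regular expressions are assumptions, not the code actually used in this project.

```python
import re

def find_abstract(paragraphs):
    """Return the index of the paragraph most likely to be the abstract, or None."""
    head = paragraphs[:10]  # assumption: the abstract is among the first ten paragraphs
    # Rule 1: a paragraph that starts with the word "abstract".
    for i, p in enumerate(head):
        if re.match(r"\s*abstract\b", p, re.IGNORECASE):
            # A bare "Abstract" heading means the text itself is the next paragraph.
            if len(p.split()) <= 2 and i + 1 < len(paragraphs):
                return i + 1
            return i
    # Rule 2: otherwise, the paragraph just before an "Introduction" heading.
    intro = re.compile(r"\s*(\d+\.?\s*|[IVX]+\.?\s+)?introduction\b", re.IGNORECASE)
    for i, p in enumerate(head):
        if i > 0 and intro.match(p):
            return i - 1
    return None
```

Papers that match neither rule return None, so they can be put on an exception list instead of being indexed with a wrong abstract.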
Finding the title and the author's name is a different task, because even though both are always in the first 10 lines of text or so, they don't follow any particular order, and I didn't find a way to tell them apart from, say, the name of the university or even from each other... any idea?
However, I think you have a point when you say simplicity is a virtue. I got so excited developing the Bayesian learning stuff that maybe I'm trying to use it in places where it just doesn't fit.
- 04-24-2008 #4
I'm guessing that in 99% of the cases, the name of the institution will always contain a recognizable string such as universi or institut, and that this will apply no matter what language (English, German, Spanish, French, anything with a Roman alphabet or nearly Roman) is used.
I'm also guessing that the author's name will contain fewer words than the title.
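A rough sketch of these two guesses, assuming the header has already been split into lines. The names `INSTITUTION_HINTS`, `looks_like_institution` and `guess_title_and_author` are made up for illustration, and the word-count rule is exactly as naive as the guess it encodes.

```python
# Substrings suggested above; they match "University", "Universität",
# "Universidad", "Institute", "Institut", and so on.
INSTITUTION_HINTS = ("universi", "institut")

def looks_like_institution(line):
    """True if the line contains one of the institution substrings."""
    lowered = line.lower()
    return any(hint in lowered for hint in INSTITUTION_HINTS)

def guess_title_and_author(header_lines):
    """Drop institution lines, then call the shortest line the author and the longest the title."""
    candidates = [l.strip() for l in header_lines
                  if l.strip() and not looks_like_institution(l)]
    if len(candidates) < 2:
        return None  # not enough lines left to tell title and author apart
    by_words = sorted(candidates, key=lambda l: len(l.split()))
    return {"author": by_words[0], "title": by_words[-1]}
```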
Just guesses here.
--
Bill
- 04-24-2008 #5 Just Joined! (Join Date: Feb 2007, Posts: 5)
Just guesses are O.K., for I'm just guessing myself whether or not this project is feasible, and I appreciate every little bit of help anyone can provide.
I found that in most papers the name of the institution is shown next to its address (logically), which can be easily located, in addition to using a stop list with words like "univ", "instit" or "inc" (because, in fact, most of the papers I work with are from private corporations and not universities).
For telling apart the title and the author(s), I don't pick the shortest one as the author, for a paper written by several authors can easily have a longer author list than title. However, it's quite easy to guess which one is a list of names, and which is not.
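How the list of names is recognized isn't spelled out here, so the following is only one possible sketch, under the assumption that a name list is a series of short groups of capitalized words or initials separated by commas, semicolons, "and" or "&". The function names are hypothetical.

```python
import re

def _name_like(token):
    """An initial such as "J." or a capitalized word such as "Doe" or "Smith-Jones"."""
    if len(token) == 2 and token[0].isupper() and token[1] == ".":
        return True
    core = token.replace("-", "").replace("'", "")
    return core.isalpha() and core[0].isupper()

def is_name_list(line):
    """Guess whether a header line is a list of author names rather than a title."""
    parts = [p.strip() for p in re.split(r",|;|\band\b|&", line) if p.strip()]
    if not parts:
        return False
    for part in parts:
        tokens = part.split()
        # Assumption: each author is written with two to four tokens.
        if not 2 <= len(tokens) <= 4:
            return False
        if not all(_name_like(t) for t in tokens):
            return False
    return True
```

A short title made only of capitalized words (two to four of them) would still be mistaken for a name, so a rule like this probably needs to be combined with the institution and address checks.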
That being said, the first tests I made this afternoon showed quite low precision (2 of 5 tests failed), both failures retrieving useless information as the author. I know 5 tests are probably not statistically significant, and I will be adding more rules to try to improve this. I'll let you know about the progress!
Of course, any new ideas are welcome.
- 04-27-2008 #6 Just Joined! (Join Date: Feb 2007, Posts: 5)
Well, I promised to report any progress, so here it is:
I tried my algorithm on 170 papers downloaded from the arXiv.org e-Print archive and the precision was as low as 36% (61 / 170).
I tried to improve this by using a language detection algorithm. My guess was that both the title and the abstract should show a greater correspondence with a language than other parts such as addresses. It showed some improvement, but the problems I found are just overwhelming.
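To illustrate the idea of "correspondence with a language", here is a minimal stand-in that scores a piece of text by the fraction of common English function words it contains. It is not the language detection algorithm used in these tests (which isn't described); the word list and the name `language_score` are assumptions.

```python
import re

# Tiny English function-word list, for illustration only. A real detector
# would use larger per-language lists or character n-gram profiles.
ENGLISH_COMMON = frozenset(
    "the of and a in to is we that for on with as are this by an be which".split()
)

def language_score(text):
    """Fraction of the words in `text` that are common English function words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in ENGLISH_COMMON for w in words) / len(words)
```

Running prose such as an abstract should score noticeably higher than an address line, which is mostly proper nouns and postal codes. Titles are short, so their score is much noisier.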
One of the most difficult problems is that when I convert the pdfs to text, format information is lost:
    Deutsches Elektronen-Synchrotron DESY, D-22603 Hamburg
    We perform a general analysis on the possibility of obtaining metastable vacua ...
becomes
    Deutsches Elektronen-Synchrotron DESY, D-22603 Hamburg We perform a general analysis on the possibility of obtaining metastable vacua ...
In the original pdf, it's clear that the bold, italic text forms a different block than the normal text, but after using pdftotext it's just a single plain text line. How can I separate the merged address/abstract?
I think that the only solutions would be to either:
- create a new pdf to text conversion tool which separates blocks of differently formatted text into different plain text lines, or
- use some kind of language-analysis system.
As both solutions are beyond my current knowledge, I'm leaving this project. Thank you for your kind help.
- 04-27-2008 #7
Quote:
    One of the most difficult problems is that when I convert the pdfs to text, format information is lost
Don't convert them to text. Convert them to PostScript. You'll keep all the formatting information, which will be in ASCII text. You'll probably be able to figure out what's what. Just know that PostScript doesn't work like this:
Code:
    3 + 5
but like this:
Code:
    3 5 add
because you specify the parameters before you specify the operation.
More examples:
Code:
    /Ariel findfont 12 scalefont setfont
and:
Code:
    /Ariel-Bold findfont 12 scalefont setfont
and:
Code:
    (The rain in Spain stays mainly in the plain) show
and:
Code:
    showpage
(which means transfer to paper everything you've accumulated for this page, and start a new page).
If you need more information on PostScript (you probably won't need it), google this:
Code:
    postscript tutorial
To convert a document from pdf to ps, use the pdf2ps utility. It's a shell script which uses the gs (ghostscript) program.
If you don't already have the pdf2ps script on your system, here it is:
Code:
    #!/bin/sh
    # $Id: pdf2ps,v 1.3 2002/04/23 11:58:35 easysw Exp $
    # Convert PDF to PostScript.

    # Collect leading options (anything starting with "-") to pass on to gs.
    OPTIONS=""
    while true
    do
        case "$1" in
        -?*) OPTIONS="$OPTIONS $1" ;;
        *)  break ;;
        esac
        shift
    done

    # One argument: derive the output name from the input; two: use the second.
    if [ $# -eq 2 ]
    then
        outfile=$2
    elif [ $# -eq 1 ]
    then
        outfile=`basename "$1" \.pdf`.ps
    else
        echo "Usage: `basename $0` [-dASCII85EncodePages=false] [-dLanguageLevel=1|2|3] input.pdf [output.ps]" 1>&2
        exit 1
    fi

    # Doing an initial 'save' helps keep fonts from being flushed between pages.
    # We have to include the options twice because -I only takes effect if it
    # appears before other options.
    exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pswrite "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
Hope this helps.
--
Bill
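As a sketch of how the font information could be used to separate blocks, the following splits PostScript text into runs wherever the font changes. It assumes the simplified form shown in the examples above (`/Font findfont size scalefont setfont` followed by `(text) show`); the output Ghostscript actually produces is likely to wrap these operations in its own abbreviated procedures, so the pattern would need adapting. The name `split_by_font` is hypothetical, and nested parentheses inside strings are not handled.

```python
import re

# Matches either a font selection or a shown string, in the simplified
# form used in the examples above (an assumption, not real gs output).
TOKEN = re.compile(
    r"/(?P<font>[\w-]+)\s+findfont\s+(?P<size>[\d.]+)\s+scalefont\s+setfont"
    r"|\((?P<text>(?:\\.|[^()\\])*)\)\s*show\b"
)

def split_by_font(ps_source):
    """Return (font, size, text) runs, starting a new run whenever the font changes."""
    runs = []
    current = (None, None)  # no font selected yet
    for m in TOKEN.finditer(ps_source):
        if m.group("font"):
            current = (m.group("font"), float(m.group("size")))
        else:
            text = m.group("text")
            if runs and runs[-1][:2] == current:
                # Same font as the previous string: extend the current run.
                runs[-1] = current + (runs[-1][2] + " " + text,)
            else:
                runs.append(current + (text,))
    return runs
```

With input in that form, an address set in a bold font and an abstract set in the regular font come out as two separate runs instead of one merged line.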

