  1. #1
    mij
    Just Joined!
    Join Date
    Feb 2007
    Posts
    5

    Automatic paper classifier

    I have to write an application as my "final test" at my faculty.
    My idea is to create an automatic database of scientific papers. It should extract all the information needed to index a paper correctly (the abstract, title and authors are enough) and classify the paper by subject (maths, medicine, physics, etc.), so the user just has to send a PDF file and the system does all the work.
    Classification using a naive Bayesian learning algorithm works quite well, but I cannot find any information on how to extract the title, abstract, etc.
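    For readers unfamiliar with the technique, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing. It is not the poster's implementation, which is not shown in the thread; the class name, the tokenizer and the toy training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # subject -> word -> count
        self.doc_counts = Counter()              # subject -> number of papers
        self.vocabulary = set()

    def train(self, text, subject):
        """Add one labelled paper (or abstract) to the model."""
        tokens = tokenize(text)
        self.word_counts[subject].update(tokens)
        self.doc_counts[subject] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the subject with the highest log-probability, or None."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_subject, best_score = None, float("-inf")
        for subject, n_docs in self.doc_counts.items():
            counts = self.word_counts[subject]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_subject, best_score = subject, score
        return best_subject


if __name__ == "__main__":
    nb = NaiveBayes()
    # Toy training data, invented for this example.
    nb.train("quantum field vacuum energy particle", "physics")
    nb.train("patient clinical trial treatment disease", "medicine")
    print(nb.classify("metastable vacuum energy"))
```

    Working in log space avoids numeric underflow on long texts, and the smoothing keeps an unseen word from zeroing out a whole subject.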
    Given the more or less fixed structure of scientific papers, I hope I can use some kind of statistical learning algorithm to find these elements of the text. Do you think it's possible? Do you know of any similar project or paper on this subject?
    Thank you.

    P.S.: I apologize for my bad English. It used to be my second language, but after one year living in Italy I have a big mess in my head about languages!

  2. #2
    wje_lf
    Linux Engineer
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192
    Given the more or less fixed structure of scientific papers, you can probably code by hand something which recognizes up to, say, a half dozen forms, and perform some sort of exceptional processing (outright rejection, formation of an exception list) for those rare papers which do not fit the predefined forms. That would probably be much simpler than throwing artificial intelligence at this particular tiny sub-problem. Simplicity is a virtue.
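    A minimal sketch of that suggestion: try a short list of hand-written forms in turn, and put any paper that fits none of them on an exception list. The two forms below are hypothetical placeholders, not forms taken from real papers, and all function names are made up for this example.

```python
import re


def form_labelled_fields(lines):
    """Hypothetical form: explicit 'Title:' and 'Author:' labels."""
    fields = {}
    for line in lines:
        m = re.match(r"(title|authors?)\s*:\s*(.+)", line, re.IGNORECASE)
        if m:
            key = "title" if m.group(1).lower() == "title" else "authors"
            fields[key] = m.group(2).strip()
    return fields if {"title", "authors"} <= set(fields) else None


def form_title_then_authors(lines):
    """Hypothetical form: line 1 is the title, line 2 an author list."""
    if len(lines) >= 2 and re.search(r",| and ", lines[1]):
        return {"title": lines[0], "authors": lines[1]}
    return None


# Forms are tried in order; the first one that matches wins.
KNOWN_FORMS = [form_labelled_fields, form_title_then_authors]


def index_papers(papers):
    """Map paper name -> extracted fields, plus a list of unmatched papers."""
    indexed, exceptions = {}, []
    for name, lines in papers.items():
        for form in KNOWN_FORMS:
            result = form(lines)
            if result is not None:
                indexed[name] = result
                break
        else:
            # No form matched: keep the paper for manual handling.
            exceptions.append(name)
    return indexed, exceptions
```

    Adding support for a new layout then means writing one more small function and appending it to KNOWN_FORMS.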
    --
    Bill

    Old age and treachery will overcome youth and skill.

  3. #3
    mij
    This was my first approach, and it detects the abstract quite well: it's always one of the first paragraphs, it almost always contains the word "abstract", and it is usually the paragraph just before the Introduction.
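    The abstract rules described here could be sketched roughly as follows. This is an illustration, not the poster's code: it assumes the text has already been split into paragraphs, it only accepts "abstract" at the start of a paragraph (a stricter test than "contains the word"), and the function name and the ten-paragraph search limit are invented.

```python
import re


def find_abstract(paragraphs, search_limit=10):
    """Return the likely abstract from a list of paragraph strings, or None."""
    head = paragraphs[:search_limit]

    # Rule 1: a paragraph that starts with the word "abstract".
    for i, para in enumerate(head):
        m = re.match(r"\s*abstract\b[\s.:-]*", para, re.IGNORECASE)
        if m:
            body = para[m.end():].strip()
            if body:
                return body
            # The heading stood alone, so take the paragraph after it.
            if i + 1 < len(paragraphs):
                return paragraphs[i + 1].strip()

    # Rule 2: fall back to the paragraph just before the Introduction.
    for i, para in enumerate(head):
        if i > 0 and re.match(r"\s*(\d+\.?\s*)?introduction\b", para, re.IGNORECASE):
            return paragraphs[i - 1].strip()

    return None
```

    Papers that match neither rule return None and can go onto an exception list.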
    Finding the title and the author's name is a different task. Both are always within the first 10 lines of text or so, but they don't follow any particular order, and I haven't found a way to tell them apart from, say, the name of the university, or even from each other... any ideas?

    However, I think you made a point when you said simplicity is a virtue. I get so excited developing the Bayesian learning stuff that maybe I'm trying to use it in places it just doesn't fit.

  4. #4
    wje_lf
    I'm guessing that in 99% of cases the name of the institution will contain a recognizable string such as universi or institut, and that this will apply no matter what language is used (English, German, Spanish, French, anything written in a Roman or nearly Roman alphabet).

    I'm also guessing that the author's name will contain fewer words than the title.

    Just guesses here.
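    Those two guesses could be sketched like this, assuming the first few lines of the paper are available as a list of strings. The function names and the sample header are hypothetical, and the word-count rule is only the guess stated above, so it will fail on papers with long author lists.

```python
INSTITUTION_HINTS = ("universi", "institut")


def looks_like_institution(line):
    """True if the line contains one of the institution substrings."""
    lowered = line.lower()
    return any(hint in lowered for hint in INSTITUTION_HINTS)


def guess_title_and_author(header_lines):
    """Guess title and author from the first lines of a paper, or None."""
    candidates = [
        ln.strip() for ln in header_lines
        if ln.strip() and not looks_like_institution(ln)
    ]
    if len(candidates) < 2:
        return None
    # Guess: the author line has fewer words than the title.
    by_length = sorted(candidates, key=lambda ln: len(ln.split()))
    return {"author": by_length[0], "title": by_length[-1]}


if __name__ == "__main__":
    # Invented sample header, for illustration only.
    header = [
        "A General Analysis of Metastable Vacua",
        "Jane Doe",
        "Universität Hamburg",
    ]
    print(guess_title_and_author(header))
```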
    --
    Bill


  5. #5
    mij
    Just guesses are O.K., for I'm just guessing myself whether or not this project is feasible, and I appreciate every little bit of help anyone can provide.
    I found that in most papers the name of the institution is shown next to its address, logically enough, and the address can be located easily. I also use a stop list with words like "univ", "instit" or "inc", because most of the papers I work with are in fact from private corporations and not universities.
    To tell the title and the author(s) apart, I don't pick the shortest one as the author, because a paper written by several authors can easily have an author list longer than its title. However, it's quite easy to guess which one is a list of names and which is not.
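    The thread does not show how the list of names is recognized, so the following is only one possible sketch. It assumes author lists are separated by commas, "&" or "and", and that each name is two to four capitalized words or initials in plain ASCII. The stop list is interpreted here as prefixes ("univ", "instit") plus the exact word "inc"; that reading is an assumption, chosen so that title words such as "increasing" do not match.

```python
import re

AFFILIATION_PREFIXES = ("univ", "instit")
AFFILIATION_WORDS = {"inc"}

# One name token: an initial such as "J." or a capitalized word such as "Doe".
NAME_TOKEN = re.compile(r"^(?:[A-Z]\.|[A-Z][a-z]+(?:-[A-Z][a-z]+)?)$")


def is_affiliation(line):
    """True if the line contains a stop-list word for institutions or companies."""
    words = re.findall(r"[a-z]+", line.lower())
    return any(
        w in AFFILIATION_WORDS or w.startswith(AFFILIATION_PREFIXES)
        for w in words
    )


def is_name_list(line):
    """True if every comma/and-separated part of the line looks like a person's name."""
    parts = [p.strip() for p in re.split(r",|&|\band\b", line) if p.strip()]
    if not parts:
        return False
    for part in parts:
        tokens = part.split()
        if not 2 <= len(tokens) <= 4:
            return False
        if not all(NAME_TOKEN.match(tok) for tok in tokens):
            return False
    return True
```

    A short, fully capitalized title such as "Metastable Vacua" would still pass as a name, and names with accented letters would be rejected, which is the kind of case that would need extra rules.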
    That said, the first tests I ran this afternoon showed quite low precision (2 of 5 tests failed), and both failures retrieved useless information as the author. I know 5 tests are probably not statistically significant, and I will be adding more rules to try to improve this. I'll let you know about the progress!

    Of course, any new ideas are welcome.

  6. #6
    mij
    Well, I promised to report any progress, so here it is:
    I tried my algorithm on 170 papers downloaded from the arXiv.org e-Print archive, and the precision was as low as 36% (61 / 170).
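    As a worked check of those figures, the sketch below computes the precision for both test runs mentioned in the thread (3 of 5 and 61 of 170 correct) together with a rough standard error. The standard error uses the usual normal approximation for a proportion, which is an addition made here for illustration and is crude for a sample of 5.

```python
import math


def precision(correct, total):
    """Fraction of papers whose fields were extracted correctly."""
    return correct / total


def standard_error(correct, total):
    """Normal-approximation standard error of that fraction."""
    p = precision(correct, total)
    return math.sqrt(p * (1 - p) / total)


for correct, total in [(3, 5), (61, 170)]:
    p = precision(correct, total)
    se = standard_error(correct, total)
    print(f"{correct}/{total}: precision {p:.1%}, standard error {se:.1%}")
```

    With 170 papers the standard error is only a few percentage points, so the low figure is unlikely to be a sampling accident, whereas the 5-paper run says very little either way.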
    I tried to improve this by using a language detection algorithm. My guess was that both the title and the abstract should show a greater correspondence with a language than other parts, such as addresses. It showed some improvement, but the problems I found are just overwhelming.
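    The language detection algorithm used is not named in the thread. As a simple stand-in for the idea, the sketch below scores a line by the fraction of its words that are common English function words; running prose tends to score higher than an address. The word list is deliberately tiny and illustrative, and the function name is invented.

```python
import re

# A deliberately tiny, illustrative list of English function words.
ENGLISH_FUNCTION_WORDS = frozenset(
    "a an the of on in to for and or is are we that this with by as at".split()
)


def english_score(line):
    """Fraction of the line's words that are common English function words."""
    tokens = re.findall(r"[a-z]+", line.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in ENGLISH_FUNCTION_WORDS)
    return hits / len(tokens)


if __name__ == "__main__":
    address = "Deutsches Elektronen-Synchrotron DESY, D-22603 Hamburg"
    abstract = ("We perform a general analysis on the possibility "
                "of obtaining metastable vacua")
    print(english_score(address), english_score(abstract))
```

    Titles often contain few function words, so a score like this separates abstracts from addresses better than it separates titles from author lists.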
    One of the most difficult problems is that when I convert the pdfs to text, format information is lost:
    Deutsches Elektronen-Synchrotron DESY, D-22603 Hamburg   [bold italic in the PDF]
    We perform a general analysis on the possibility of obtaining metastable vacua ...   [normal text in the PDF]
    becomes
    Deutsches Elektronen-Synchrotron DESY, D-22603 Hamburg We perform a general analysis on the possibility of obtaining metastable vacua ...
    In the original PDF, it's clear that the bold, italic text forms a separate block from the normal text, but after running pdftotext it's just a single line of plain text. How can I separate the merged address and abstract?
    I think the only solutions would be to either:
    -create a new PDF-to-text conversion tool that puts differently formatted blocks of text on separate plain-text lines, or
    -use some kind of language-analysis system.
    As both solutions are beyond my current knowledge, I'm leaving this project. Thank you for your kind help.

  7. #7
    wje_lf
    mij wrote: "One of the most difficult problems is that when I convert the pdfs to text, format information is lost"
    Don't convert them to text. Convert them to PostScript. You'll keep all the formatting information, which will be in ASCII text. You'll probably be able to figure out what's what. Just know that PostScript doesn't work like this:
    Code:
    3 + 5
    but like this:
    Code:
    3 5 add
    because you specify the parameters before you specify the operation.
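    To make the operands-first order concrete, here is a small stack evaluator written in Python rather than PostScript. It handles only numbers and the four arithmetic operators add, sub, mul and div, and is a sketch of the evaluation model, not a PostScript interpreter.

```python
def parse_number(token):
    """Parse an integer if possible, otherwise a real number."""
    try:
        return int(token)
    except ValueError:
        return float(token)


OPERATORS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}


def eval_postfix(expression):
    """Evaluate a postfix expression and return the final operand stack."""
    stack = []
    for token in expression.split():
        if token in OPERATORS:
            # Operands were pushed earlier; the operator consumes the top two.
            b = stack.pop()
            a = stack.pop()
            stack.append(OPERATORS[token](a, b))
        else:
            stack.append(parse_number(token))
    return stack


if __name__ == "__main__":
    print(eval_postfix("3 5 add"))
```

    Each number is pushed onto a stack, and an operator pops its operands and pushes the result, so "3 5 add" leaves 8 on the stack.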

    More examples:
    Code:
    /Arial findfont 12 scalefont setfont
    and:
    Code:
    /Arial-Bold findfont 12 scalefont setfont
    and:
    Code:
    (The rain in Spain stays mainly in the plain) show
    and:
    Code:
    showpage
    (which means transfer to paper everything you've accumulated for this page, and start a new page).
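    Assuming the PostScript really does look like the idealized examples above, a short Python sketch can group the shown text by the font in force, which would separate a bold italic address from a normal-weight abstract. Real converter output is usually much messier (abbreviated procedure names, re-encoded or subsetted fonts, strings split into fragments, nested parentheses), so treat this as an illustration of the idea only; the function name and the sample input are invented.

```python
import re

# A PostScript string in parentheses (with backslash escapes), or any other token.
TOKEN = re.compile(r"\((?:\\.|[^()\\])*\)|/?[^\s()/]+")


def text_blocks_by_font(ps_source):
    """Return (font_name, text) pairs, merging consecutive text in one font."""
    blocks = []
    last_name = pending_font = current_font = last_string = None
    for tok in TOKEN.findall(ps_source):
        if tok.startswith("("):
            # Strip the parentheses and undo simple backslash escapes.
            last_string = re.sub(r"\\(.)", r"\1", tok[1:-1])
        elif tok.startswith("/"):
            last_name = tok[1:]
        elif tok == "findfont":
            pending_font = last_name
        elif tok == "setfont":
            current_font = pending_font
        elif tok == "show" and last_string is not None:
            if blocks and blocks[-1][0] == current_font:
                blocks[-1] = (current_font, blocks[-1][1] + " " + last_string)
            else:
                blocks.append((current_font, last_string))
            last_string = None
    return blocks


if __name__ == "__main__":
    # Hand-written sample in the style of the examples above, not real converter output.
    sample = """
    /Helvetica-BoldOblique findfont 10 scalefont setfont
    (Deutsches Elektronen-Synchrotron DESY, D-22603 Hamburg) show
    /Helvetica findfont 10 scalefont setfont
    (We perform a general analysis on the possibility) show
    (of obtaining metastable vacua ...) show
    """
    for font, text in text_blocks_by_font(sample):
        print(font, "->", text)
```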

    If you need more information on PostScript (you probably won't need it), google this:
    Code:
    postscript tutorial
    To convert a document from pdf to ps, use the pdf2ps utility. It's a shell script (note the #!/bin/sh line) which runs the gs (Ghostscript) program.

    If you don't already have the pdf2ps script on your system, here it is:
    Code:
    #!/bin/sh
    # $Id: pdf2ps,v 1.3 2002/04/23 11:58:35 easysw Exp $
    # Convert PDF to PostScript.
    
    OPTIONS=""
    while true
    do
            case "$1" in
            -?*) OPTIONS="$OPTIONS $1" ;;
            *)  break ;;
            esac
            shift
    done
    
    if [ $# -eq 2 ] 
    then
        outfile=$2
    elif [ $# -eq 1 ]
    then
        outfile=`basename "$1" \.pdf`.ps
    else
        echo "Usage: `basename $0` [-dASCII85EncodePages=false] [-dLanguageLevel=1|2|3] input.pdf [output.ps]" 1>&2
        exit 1
    fi
    
    # Doing an initial 'save' helps keep fonts from being flushed between pages.
    # We have to include the options twice because -I only takes effect if it
    # appears before other options.
    # Note: newer Ghostscript releases no longer ship the pswrite device;
    # ps2write is its replacement there.
    exec gs $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pswrite "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"
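    If the rest of the application is written in a language such as Python, the script can be called from there. The sketch below assumes pdf2ps is on the PATH and follows the usage line printed by the script (options first, then input.pdf and an optional output.ps); the helper names are made up. Unlike the script's own default, which writes to the current directory, this helper places the .ps file next to the PDF.

```python
import subprocess
from pathlib import Path


def build_pdf2ps_command(pdf_path, ps_path=None, options=()):
    """Build the argument list for pdf2ps without running it."""
    if ps_path is None:
        # Assumption: write the output next to the input, with a .ps suffix.
        ps_path = Path(pdf_path).with_suffix(".ps")
    return ["pdf2ps", *options, str(pdf_path), str(ps_path)]


def convert(pdf_path, ps_path=None, options=()):
    """Run pdf2ps and return the output path; raises if the command fails."""
    command = build_pdf2ps_command(pdf_path, ps_path, options)
    subprocess.run(command, check=True)
    return command[-1]
```

    For example, convert("paper.pdf") would run pdf2ps paper.pdf paper.ps. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.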
    Hope this helps.
    --
    Bill

