  1. #1
    Just Joined! (Join Date: Feb 2007, Posts: 2)

    Antiword parsing MSWord docs - code


    I am trying to use antiword to parse MSWord documents on the fly: a person uploads an MSWord document, the script copies it into a temp dir so antiword can extract the text for a character count, and that count is sent to another script that divides it by 5 to arrive at a "wordcount". The count needs to include white space (up to 2 spaces in a row are allowed after a sentence ending: . ! ?) and tabs. Nothing else.
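
    The counting rule described above can be sketched roughly as follows. This is only one reading of the requirement, not the script's actual behaviour: it assumes single spaces between words are counted, runs of spaces are collapsed to one (or to two after . ! ?), tabs are kept, and line breaks are dropped. The function names normalize_for_count and estimate_word_count are made up for illustration.

    ```php
    <?php
    // Hypothetical helper: normalises extracted text before counting.
    // Assumed rule: tabs kept, line breaks dropped, at most two spaces
    // after a sentence ending (. ! ?), at most one space anywhere else.
    function normalize_for_count(string $text): string
    {
        // Drop the picture placeholder and line breaks.
        $text = str_replace('[pic]', '', $text);
        $text = preg_replace('/[\r\n]+/', '', $text);
        // Three or more spaces right after . ! ? become exactly two.
        $text = preg_replace('/(?<=[.!?]) {3,}/', '  ', $text);
        // Any other run of two or more spaces becomes one.
        $text = preg_replace('/(?<![.!? ]) {2,}/', ' ', $text);
        return $text;
    }

    // Hypothetical helper: character count divided by 5, as in the post.
    // strlen() counts bytes, so multi-byte characters count more than once.
    function estimate_word_count(string $text): float
    {
        return strlen(normalize_for_count($text)) / 5;
    }

    echo estimate_word_count("One.    Two  three."), "\n";
    ```

    Lookbehind assertions are used so the character before the spaces is not consumed, which also lets a run of spaces at the very start of the text collapse to one.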

    The MSWord docs might have pics, but I've told the script to ignore those. The problem is that the counts vary from near perfect to off by +/- 5%, and I need them closer to perfect.

    The portion of the script that handles spaces is the run of str_replace/preg_replace calls just before the character count:

    // Extract the files and count the characters in each file.
    $ext = strtolower(substr($target, strrpos($target, '.') + 1));
    if ($ext == 'zip' || $ext == 'doc' || $ext == 'txt') {
        $tmp_dir = 'temp_wordcount/'.$flname;
        mkdir($tmp_dir);
        if ($ext == 'zip') {
            // Unpack the archive into the temp dir.
            $archive = new PclZip($target);
            $archive->extract(PCLZIP_OPT_PATH, $tmp_dir);
        } else {
            copy($target, $tmp_dir.'/'.$flname);
        }
        $total_chars = 0;
        if (is_dir($tmp_dir)) {
            if ($dh = opendir($tmp_dir)) {
                while (($file = readdir($dh)) !== false) {
                    if (is_dir($tmp_dir.'/'.$file)) continue;
                    $pos = strrpos($file, '.');
                    if ($pos !== false) {
                        $ext = strtolower(substr($file, $pos + 1));
                        $content = '';
                        switch ($ext) {
                            case 'doc':
                                // escapeshellarg() protects against quotes and shell metacharacters in the file name.
                                // Note: 2>&1 merges antiword's error messages into the text that gets counted.
                                $fp = popen('antiword '.escapeshellarg($tmp_dir.'/'.$file).' 2>&1', 'r');
                                echo 'For file: '. $file .'<br/>';
                                echo '=======================================<br/>';
                                echo '<pre>';
                                while (($line = fgets($fp)) !== false) {
                                    $content .= $line;
                                    // Escape the line so stray < or & cannot break the page.
                                    echo htmlspecialchars($line);
                                }
                                pclose($fp);
                                echo '</pre>';
                                echo '<br/>=======================================<br/>';

                                break;
                            case 'txt':
                                $content = file_get_contents($tmp_dir.'/'.$file);
                                break;
                        }
                        // Whitespace handling starts here.
                        // Drop the picture placeholder.
                        $content = str_replace('[pic]', '', $content);
                        // Strip carriage returns, line feeds and tabs.
                        $content = preg_replace('/[\r\n\t]/', '', $content);
                        // Remove every run of spaces that follows a character other than . ! ? " '
                        $content = preg_replace('/([^\.\!\?"\'])[ ]+/', '$1', $content);
                        // Remove a period followed by three or more spaces (the period goes too).
                        $content = preg_replace('/\.[ ]{3,}/', '', $content);
                        echo 'Total character count for '. $file.': '. strlen($content).'<br/>';
                        $total_chars += strlen($content);
                    }
                }
                closedir($dh);
            }
        }
        deltree($tmp_dir);
    }
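
    Regarding the popen call in the script above: because of 2>&1, anything antiword prints to stderr ends up in the counted text, and antiword's text mode wraps lines, which matters if between-word spaces are meant to be counted. Neither is confirmed as the cause of the variance. The sketch below shows one way to read only stdout and to ask for unwrapped paragraphs with -w 0 (documented in the antiword man page). The helper names are hypothetical, and /dev/null assumes a Unix-like host.

    ```php
    <?php
    // Hypothetical helper: builds the antiword command line for one file.
    function build_antiword_command(string $path): string
    {
        // -w 0: per the antiword man page, put each paragraph on a single line.
        return 'antiword -w 0 ' . escapeshellarg($path);
    }

    // Hypothetical helper: returns antiword's stdout only.
    function read_doc_text(string $path): string
    {
        $spec = [
            1 => ['pipe', 'w'],              // stdout: the extracted text
            2 => ['file', '/dev/null', 'w'], // stderr: discarded, so it is never counted
        ];
        $proc = proc_open(build_antiword_command($path), $spec, $pipes);
        if (!is_resource($proc)) {
            throw new RuntimeException('could not start antiword');
        }
        $text = stream_get_contents($pipes[1]);
        fclose($pipes[1]);
        if (proc_close($proc) !== 0) {
            throw new RuntimeException('antiword failed for ' . $path);
        }
        return $text === false ? '' : $text;
    }
    ```

    In the loop above, $content = read_doc_text($tmp_dir.'/'.$file); would stand in for the popen/fgets block, with the echo calls kept only if the debug output is still wanted.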

    Would GREATLY appreciate any help - can't finish project without getting closer to perfect on counts!

  2. #2
    Just Joined! (Join Date: Feb 2007, Posts: 2)

    This must be the wrong forum?

    As 9 people have looked and none have answered, perhaps this is the wrong forum for antiword queries? Or did I just stumble on an unpopular little program? (The second guess wouldn't surprise me...)
