- 02-18-2007 #1 Just Joined! (Join Date: Feb 2007, Posts: 2)
Antiword parsing MSWord docs - code
I am trying to use antiword to parse MSWord documents on the fly: a person uploads an MSWord document, the script copies it into a temp dir so antiword can extract the text, does a character count, and sends that count to another script that divides it by 5 to arrive at a "wordcount". The count needs to include white space (up to 2 spaces in a row are allowed after sentence endings: . ! ?) and tabs. Nothing else.
The MSWord docs might have pics, but I've told it to ignore those. The problem is that the counts vary from near perfect to +/- 5% off, and I need closer to perfect.
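To make the counting rule above concrete, here is a minimal sketch of it as standalone functions. The function names (normalize_for_count, count_characters, estimate_words) are hypothetical and not part of the script in this post. It assumes the text is UTF-8, that line breaks should not be counted, that tabs should be counted, and that runs of spaces collapse to one except after . ! ? where up to two are kept. This is one reading of the rule, not a confirmed fix.

```php
<?php
// Hypothetical helpers illustrating the counting rule; names are assumptions.

// Normalize whitespace before counting.
function normalize_for_count($text)
{
    // Drop the image placeholder that antiword prints for pictures.
    $text = str_replace('[pic]', '', $text);
    // Assumption: line breaks are not counted; tabs are left in place.
    $text = preg_replace('/[\r\n]+/', '', $text);
    // After sentence-ending punctuation, cap a run of spaces at two.
    $text = preg_replace('/([.!?]) {3,}/', '$1  ', $text);
    // Everywhere else, collapse a run of spaces to a single space.
    $text = preg_replace('/(?<![.!? ]) {2,}/', ' ', $text);
    return $text;
}

// Count characters rather than bytes (assumes UTF-8 input).
function count_characters($text)
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    // Fallback without mbstring: count UTF-8 characters with PCRE.
    return preg_match_all('/./su', $text);
}

// The "wordcount" as described in the post: characters divided by 5.
function estimate_words($char_count)
{
    return $char_count / 5;
}
```

Calling count_characters(normalize_for_count($content)) in place of strlen($content) would apply this rule. Note that strlen() counts bytes, so any multi-byte characters in the antiword output would inflate the original count.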
The script is below; the portion that handles spaces is the block of str_replace/preg_replace calls just before the character count:
//Extract the files and count the words in each file.
$ext = strtolower(substr($target, strrpos($target, '.')+1));
if ($ext == 'zip' || $ext == 'doc' || $ext == 'txt') {
    $tmp_dir = 'temp_wordcount/'.$flname;
    mkdir($tmp_dir);
    if ($ext == 'zip') {
        $archive = new PclZip($target);
        $archive->extract(PCLZIP_OPT_PATH, $tmp_dir);
    } else {
        copy($target, $tmp_dir.'/'.$flname);
    }
    $total_chars = 0;
    if (is_dir($tmp_dir)) {
        if ($dh = opendir($tmp_dir)) {
            while (($file = readdir($dh)) !== false) {
                if (is_dir($tmp_dir.'/'.$file)) continue;
                $pos = strrpos($file, '.');
                if ($pos !== false) {
                    $ext = strtolower(substr($file, $pos + 1));
                    $content = '';
                    switch($ext) {
                        case 'doc':
                            // 2>&1 also captures antiword's error messages into $content.
                            $fp = popen('antiword "'.$tmp_dir.'/'.$file.'" 2>&1', 'r');
                            echo 'For file: '. $file .'<br/>';
                            echo '=======================================<br/>';
                            echo '<pre>';
                            while (($line = fgets($fp)) !== false) {
                                $content .= $line;
                                echo $line;
                            }
                            pclose($fp); // close the pipe once the output has been read
                            echo '</pre>';
                            echo '<br/>=======================================<br/>';
                            break;
                        case 'txt':
                            $content = file_get_contents($tmp_dir.'/'.$file);
                            break;
                    }
                    // Drop antiword's image placeholder.
                    $content = str_replace('[pic]', '', $content);
                    // Remove line breaks and tabs.
                    $content = preg_replace('/[\r\n\t]/', '', $content);
                    // Remove runs of spaces that follow anything other than . ! ? " or '
                    $content = preg_replace('/([^\.\!\?"\'])[ ]+/', '$1', $content);
                    // Remove a period followed by 3 or more spaces (the period is removed too).
                    $content = preg_replace('/\.[ ]{3,}/', '', $content);
                    echo 'Total character count for '. $file.': '. strlen($content).'<br/>';
                    $total_chars += strlen($content);
                }
            }
            closedir($dh);
        }
    }
    deltree($tmp_dir);
}
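One possible source of the variance, offered as an assumption rather than a diagnosis, is the way antiword is invoked: it wraps lines at a fixed width by default, and 2>&1 mixes any warnings into the text that gets counted. The sketch below shows an alternative invocation. The helper names build_antiword_command and read_antiword_text are hypothetical. The -w 0 option is described in the antiword man page as putting an entire paragraph on one line; whether it improves the counts for these documents is untested here.

```php
<?php
// Hypothetical helpers; not part of the original script.

// Build the shell command that runs antiword on one file.
function build_antiword_command($path)
{
    // -w 0: do not wrap paragraphs, so no line breaks are added mid-sentence.
    // escapeshellarg(): protects against spaces and quotes in uploaded file names.
    // 2>/dev/null: keeps antiword's warnings out of the counted text.
    return 'antiword -w 0 ' . escapeshellarg($path) . ' 2>/dev/null';
}

// Run antiword and return its text output ('' if it could not be run).
function read_antiword_text($path)
{
    $fp = popen(build_antiword_command($path), 'r');
    if ($fp === false) {
        return '';
    }
    $content = stream_get_contents($fp);
    pclose($fp);
    return $content === false ? '' : $content;
}
```

In the script above, $content = read_antiword_text($tmp_dir.'/'.$file); would stand in for the popen/fgets loop in the 'doc' case.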
Would GREATLY appreciate any help - can't finish project without getting closer to perfect on counts!
- 02-19-2007 #2 Just Joined! (Join Date: Feb 2007, Posts: 2)
This must be the wrong forum?
Since 9 people have looked and none have answered, perhaps this is the wrong forum for antiword queries? Or did I just stumble on an unpopular little program? (The second guess wouldn't surprise me...)

