Using Perl to extract specific text from multiple files

So after I finish this book I will need to review it again. I will try to get 80% or more retention before I sit the test. To help me do that I am following the review spreadsheet provided and will do the mind-map activities.

Another step I will add is to use my vocabulary app to help remember the key terms from the text. They are provided at the end of each chapter:
keywords.png

I have an HTML copy of this book that I read on my tablet, so all I need to do is write a Perl script like this:

use Modern::Perl;
use HTML::Strip;

# Set the relevant UTF-8 modes
binmode(STDOUT, ':utf8');
binmode(STDIN, ':utf8');

# Append the extracted terms to the word list
open(my $MYFILE, '>>', 'ccnawordlist.txt')
        or die "Error opening ccnawordlist.txt: $!";

my $strip = HTML::Strip->new();

# Read all files from @ARGV (or STDIN) into one string
my $line = '';
while (<<>>) {
        $line .= $_;
}

# This HTML file word-wraps at 72 characters, ending each wrapped line
# with "=" (this looks like a quoted-printable soft line break).
# Remove each "=" plus the newline after it; \R matches any kind of newline.
$line =~ s/=\R//g;

# Match everything between these phrases
my @matches = $line =~ /Key Terms You Should Know(.*?)Command References/msg
        or die "No key terms section found";

for my $match (@matches) {
        # Strip all HTML formatting
        my $cleantext = $strip->parse($match);
        $strip->eof;
        print $MYFILE $cleantext;
}

close($MYFILE) or die "Error closing ccnawordlist.txt: $!";
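To see what the two regex steps do on a small scale, here is a minimal sketch using a made-up sample string (the sample text and the variable names are mine, not from the book). It joins the "=" wrapped lines and then captures whatever sits between the two marker phrases:

```perl
use strict;
use warnings;

# Made-up sample: text wrapped with "=" soft line breaks, using both
# Unix (\n) and Windows (\r\n) newlines
my $sample = "intro Key Terms You Sho=\nuld Know <b>router</b>, "
           . "<b>swi=\r\ntch</b> Command References outro";

# Join the wrapped lines: delete each "=" together with the newline after it
(my $joined = $sample) =~ s/=\R//g;

# Non-greedy capture of everything between the two marker phrases;
# /s lets "." match newlines, /g collects every occurrence
my @matches = $joined =~ /Key Terms You Should Know(.*?)Command References/sg;

print "[$_]\n" for @matches;
```

On this sample it should print a single line, `[ <b>router</b>, <b>switch</b> ]`, with the HTML tags still in place, which is why the real script passes each match through HTML::Strip afterwards.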

 

Now this code DOESN’T work. I think it’s because of a bug in Perl: as the string
gets too large, Perl stops matching what’s in the regexp and matches everything!

So I wrote this shell script to run it on a single file at a time.

#!/bin/sh
# Run the extraction script on one .mhtml file at a time
for file in ./*.mhtml; do
        perl build.pl "$file"
done
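The same one-file-at-a-time idea can also be done inside Perl. I have not confirmed what causes the problem, but one thing worth ruling out is how the non-greedy match behaves on a combined string: if any file has the "Key Terms You Should Know" phrase without a later "Command References", the match keeps going into the next file until it finds one. The sketch below slurps and matches each file separately so that cannot happen. The helper name `extract_sections` is hypothetical, and the HTML::Strip step is left out to keep the example to core Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: return every section between the two marker
# phrases in one file's text, after joining "=" soft line breaks.
sub extract_sections {
    my ($text) = @_;
    $text =~ s/=\R//g;
    return $text =~ /Key Terms You Should Know(.*?)Command References/sg;
}

# Slurp and match each file on its own, so a match can never run from
# one file into the next.
for my $file (@ARGV) {
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    my $text = do { local $/; <$in> } // '';
    close($in);

    my @sections = extract_sections($text);
    warn "No key terms section found in $file\n" unless @sections;

    # The original script would pass each section through HTML::Strip here
    print $_ for @sections;
}
```

Saved under a name of your choosing, it could be run as `perl yourscript.pl ./*.mhtml > ccnawordlist.txt`. The warning for files without a match also makes it easier to spot which chapter is missing one of the two phrases.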

outputlog.png
I will try to see if the original program runs on the latest version of Perl, and
if the bug exists I will report it.

So at the end of it I only have 400 or so words to work through!
ccnawordlist

But it just goes to show how a little bit of knowledge of a programming language
can help.
