Problem: You’re developing a Perl program, and you need to process every “word” in a text file.
Solution: The right approach depends on how you define a “word.” I’m going to go with a very simple definition: a word is any run of characters surrounded by whitespace. With that definition, the Perl split function can break each line into words.
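Before the full program, here is a minimal sketch of how split behaves in its whitespace mode. The sample line is made up for illustration. Calling split with no arguments inside a loop over $_ is equivalent to the explicit split ' ', $_ form used here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line with leading, trailing, and repeated whitespace.
my $line = "  Four score   and\tseven years \n";

# split ' ' is the special whitespace mode: leading whitespace is skipped,
# and any run of spaces, tabs, or newlines counts as one separator.
my @words = split ' ', $line;

# Show how many words were found, with a visible delimiter between them.
print scalar(@words), " words: ", join('|', @words), "\n";
```

Because empty leading and trailing fields are dropped in this mode, you don't need to trim or chomp the line first.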
Here’s the source code for a Perl program that reads its input from STDIN, uses the Perl split function to split each input line into a list of words, loops through those words with a Perl for loop, and prints each word from within that loop:
#!/usr/bin/perl
#
# purpose: this is a perl program that demonstrates
# how to read file contents from STDIN,
# use the perl split function to split each line in
# the file into a list of words, and then print each word.
#
# usage: perl this-program.pl < input-file

use strict;
use warnings;

# read from STDIN, one line at a time
while (<>) {
    # split each input line; words are separated by whitespace
    for my $word (split) {
        # do whatever you need to here. in my case
        # i'm just printing each "word" on a new line.
        print $word . "\n";
    }
}
As mentioned above, this Perl program reads from STDIN (standard input), so the script should be run like this:
perl this-program.pl < input-file
Or, if you make the file executable on a Linux or Unix system using chmod +x, you can run this Perl script directly from the directory that contains it:
./this-program.pl < input-file
Sample output (from our Perl split and stdin example)
When I run this Perl script against a text file that contains the Gettysburg Address, and then use the Unix head command to show the first 30 lines of output, I get these results:
prompt> perl process-every-word-file.pl < gettysburg-address | head -30
Four
score
and
seven
years
ago
our
fathers
brought
forth
on
this
continent
a
new
nation,
conceived
in
Liberty,
and
dedicated
to
the
proposition
that
all
men
are
created
equal.
As you can see, when you use the Perl split function to split a line on whitespace characters, some “words” can end up containing other characters, like commas or periods. You can strip those characters out with regular expression patterns, but for now I’m out of time, and I’m going to have to leave that as an exercise for the reader.
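One possible starting point for that exercise is sketched below. It assumes that "stripping" means removing punctuation only from the start and end of each word, so that an interior apostrophe (as in "don't") survives. The helper name clean_word and the sample line are hypothetical, not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# clean_word is a hypothetical helper name used only for this sketch.
# It removes punctuation from both ends of a token and leaves any
# interior characters untouched.
sub clean_word {
    my ($word) = @_;
    $word =~ s/^[[:punct:]]+//;   # strip leading punctuation
    $word =~ s/[[:punct:]]+$//;   # strip trailing punctuation
    return $word;
}

# Sample line (taken from the output above) instead of STDIN, so the
# sketch runs on its own.
my $line = "a new nation, conceived in Liberty, and dedicated";

# Split on whitespace, clean each token, and drop tokens that were
# nothing but punctuation.
my @words = grep { $_ ne '' } map { clean_word($_) } split ' ', $line;

print "$_\n" for @words;
```

To use this in the word-printing program, you could call clean_word($word) inside the for loop before printing. Whether hyphens, apostrophes, or digits should count as part of a word is a design decision this sketch does not settle.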