Perl CSV column data extraction FAQ: Can you share an example of how to extract one or more columns from a Perl CSV file or other similarly-formatted flat text file?
Perl is a terrific language for text processing, but several readers have written wondering about how to extract columns of data from text files with Perl. For instance, when you have a text database that looks like this:
simpson bart springfield flinstone fred bedrock rubble barney bedrock
whats the best way to extract the last name from each database record?
When every record is separated by a delimiter (such as a ',
', ':
', or '|
' symbol) it’s easy to use the split
function. But in a case like this, where data is stored in fixed-width fields, and no field delimiter is used, how do you extract your data?
Perl’s substr function
Although you can’t jump right in and easily use the split
function, the solution is almost as easy, if not as obvious. To solve the “fixed-width field” problem, just use Perl’s substr
function.
Using our three line database above as an example, it appears that the fields of the database are defined like this:
Field Number Field Description Field Width 1 Last name 9 2 First name 9 3 City 11
Given these definitions, here’s how I’d grab the data from each record from a file named database:
open (IN, 'database'); while (<IN>) { chomp; $last_name = substr $_, 0, 9; # extract the last name field $first_name = substr $_, 9, 9; # extract the first name field $city = substr $_, 18, 11; # extract the city field # do something useful here with each value ... print "$first_name $last_name lives in $city\n"; } close (IN);
Perl substr function details
As you can see from this code fragment, the Perl substr
function is an important part of the solution. The substr
function lets you extract fields of information from strings. All you have to do is define three items:
- The name of the string
- The offset, or starting point
- The length of the substring to extract
In the example above, the name of the string we’re extracting information from is $_
— the default string when you’re reading from a data file like this. This string holds the value of each record that we’re reading from the database file.
The offset is really just the starting point of each field. In the sample database, the last_name
field begins in column 1, the first_name
field begins in column 10, and the city
field begins in column 19. Given these values, the offset for the fields are 0, 9, and 18, respectively. (Remember that Perl, like C, C++, Java, and other languages, begin counting at 0. Therefore, if a field starts in column 10, it’s really going to be the 9th element of a string; columns 1-9 of the database will be stored in the string elements numbered 0 through 8.)
The length of each substring is also defined by the database. In our example, the lengths of the three fields are the same as the field widths, 9, 9, and 11, respectively.
Summary: Perl column extraction
There are many other examples I could show here, but hopefully this Perl “column extraction” example is enough to get you started down the right path. Stated simply, the solution can be phrased like this:
- Read in each database record.
- Extract the fields you want from the record by using the Perl
substr
function. - Use the fields as desired.