Perl CSV file column extraction

Perl CSV column data extraction FAQ: Can you share an example of how to extract one or more columns from a Perl CSV file or other similarly-formatted flat text file?

Perl is a terrific language for text processing, but several readers have written wondering about how to extract columns of data from text files with Perl. For instance, when you have a text database that looks like this:

simpson		bart	springfield
flinstone	fred	bedrock
rubble		barney	bedrock 

whats the best way to extract the last name from each database record?

When every record is separated by a delimiter (such as a ',', ':', or '|' symbol) it’s easy to use the split function. But in a case like this, where data is stored in fixed-width fields, and no field delimiter is used, how do you extract your data?

Perl’s substr function

Although you can’t jump right in and easily use the split function, the solution is almost as easy, if not as obvious. To solve the “fixed-width field” problem, just use Perl’s substr function.

Using our three line database above as an example, it appears that the fields of the database are defined like this:

Field Number	Field Description	Field Width 
1 	Last name	9
2 	First name	9
3 	City	11

Given these definitions, here’s how I’d grab the data from each record from a file named database:

open (IN, 'database');

while (<IN>) {
    chomp;
    $last_name = substr $_, 0, 9;     # extract the last name field
    $first_name = substr $_, 9, 9;    # extract the first name field
    $city = substr $_, 18, 11;        # extract the city field
    # do something useful here with each value ...
    print "$first_name $last_name lives in $city\n";
}

close (IN);

Perl substr function details

As you can see from this code fragment, the Perl substr function is an important part of the solution. The substr function lets you extract fields of information from strings. All you have to do is define three items:

  1. The name of the string
  2. The offset, or starting point
  3. The length of the substring to extract

In the example above, the name of the string we’re extracting information from is $_ — the default string when you’re reading from a data file like this. This string holds the value of each record that we’re reading from the database file.

The offset is really just the starting point of each field. In the sample database, the last_name field begins in column 1, the first_name field begins in column 10, and the city field begins in column 19. Given these values, the offset for the fields are 0, 9, and 18, respectively. (Remember that Perl, like C, C++, Java, and other languages, begin counting at 0. Therefore, if a field starts in column 10, it’s really going to be the 9th element of a string; columns 1-9 of the database will be stored in the string elements numbered 0 through 8.)

The length of each substring is also defined by the database. In our example, the lengths of the three fields are the same as the field widths, 9, 9, and 11, respectively.

Summary: Perl column extraction

There are many other examples I could show here, but hopefully this Perl “column extraction” example is enough to get you started down the right path. Stated simply, the solution can be phrased like this:

  1. Read in each database record.
  2. Extract the fields you want from the record by using the Perl substr function.
  3. Use the fields as desired.