Perl 'split' function - how to process text data files

Perl FAQ: How can I split a string in Perl, such as the strings in a pipe-delimited text file?

Many times you need a Perl script that can open a plain text file, and essentially treat that file as a database. Typically these files have variable-length fields and records, and the fields in each record are delimited by some special character, usually a : or | character. When processing these files, you can use the Perl split function, which I’ll demonstrate in two short programs here.

Perl split string - example #1

In this first “Perl split” example program, I’ll read all of the fields in each record into an array named @fields, and then I’ll show how to print out the first field from each row. This example shows several things, including how to split a record by the : character, which is the column delimiter in the Linux /etc/passwd file.

#!/usr/bin/perl

# perl split function example 1
# purpose:   read the /etc/passwd file, whose columns are separated by ':'
# usage:     perl read-passwd-file.pl

# sample /etc/passwd record:
# nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

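use strict;
use warnings;

my $filename = '/etc/passwd';

# three-argument open with a lexical filehandle; $! holds the system error message
open(my $fh, '<', $filename) or die "Could not read from $filename, program halting: $!";
while (<$fh>)
{
  # get rid of the pesky newline character
  chomp;

  # skip comment and blank lines (some systems, macOS for one, start this file with # lines)
  next if /^\s*#/ or /^\s*$/;

  # read the fields in the current record into an array
  my @fields = split(/:/, $_);

  # print the first field (the username)
  print "$fields[0]\n";
}
close $fh;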

As you can see from that code, each line is split on the : character, the resulting fields are stored in the @fields array, and the first field of each line is printed with $fields[0].
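If you want to see what split returns without reading a real file, here is a minimal sketch that splits the sample record from the comments above and prints every field with its array index. The $record, $count, and $i variable names are just ones I picked for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# the sample /etc/passwd record shown in the comments above
my $record = 'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false';

# split returns a list of fields; array indexes start at 0
my @fields = split(/:/, $record);

# an array in scalar context gives the number of elements
my $count = scalar @fields;
print "number of fields: $count\n";

# print each field next to its index
for my $i (0 .. $#fields) {
  print "$i: $fields[$i]\n";
}
```

For this record the script should report seven fields, with the username at index 0 and the shell at index 6.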

Perl split string - example #2

This second Perl split example, which also shows how to process a text file with variable-length, delimited fields, is almost identical to the first program. The only difference is how I treat each line after I read it. Instead of storing the fields in a Perl array, I assign them to a fixed list of variables. I can use this approach because I know each record in the /etc/passwd file has exactly seven fields.

The format of the /etc/passwd file is well-known, so hopefully the variable names I use here will make sense to you. The “junk” variables represent fields that I don’t care about.

#!/usr/bin/perl

# perl split function example #2
# purpose:   read the /etc/passwd file, whose columns are separated by ':'
# usage:     perl read-passwd-file.pl

# sample /etc/passwd record:
# nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

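use strict;
use warnings;

my $filename = '/etc/passwd';

# three-argument open with a lexical filehandle; $! holds the system error message
open(my $fh, '<', $filename) or die "Could not read from $filename, program halting: $!";
while (<$fh>)
{
  # get rid of the pesky newline character
  chomp;

  # skip comment and blank lines (some systems, macOS for one, start this file with # lines)
  next if /^\s*#/ or /^\s*$/;

  # read the fields in the current record as separate variables
  my ($username,$junk1,$junk2,$junk3,$description,$home,$shell) = split(/:/, $_);

  # print the interesting fields
  print "$username, $description, $home, $shell\n";
}
close $fh;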
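
Two details of this fixed-variable approach are worth a quick sketch. First, split drops trailing empty fields by default, so a record whose last field is empty comes back with fewer fields than expected unless you pass a negative limit as the third argument. Second, a list slice can pick out just the fields you want, as an alternative to the “junk” variables. The record below is a made-up example, not a line from a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical record whose last field (the shell) is empty
my $record = 'demo:x:1000:1000:Demo User:/home/demo:';

# by default split drops trailing empty fields
my @default = split(/:/, $record);

# a negative limit tells split to keep trailing empty fields
my @all = split(/:/, $record, -1);

print scalar(@default), " fields by default, ", scalar(@all), " with a limit of -1\n";

# a list slice picks fields 0, 4, 5 and 6, so no "junk" variables are needed
my ($username, $description, $home, $shell) = (split(/:/, $record, -1))[0, 4, 5, 6];

print "$username, $description, $home, '$shell'\n";
```

With the negative limit, $shell is an empty string rather than undef, which avoids an “uninitialized value” warning when it is printed.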

The Perl split function delimiter character

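As you can see from the Perl split function examples above, I split each record by using the : character as the field delimiter. This is what the /etc/passwd file uses as its delimiter, so my program also uses it. As mentioned, I’ve seen other file formats use the | character as a delimiter, and of course CSV files use the “,” character. Any of those characters can be specified with the split function, with one caveat: split treats its first argument as a regular expression, so a character that has a special meaning in a regex must be escaped. The : and , characters can be used as-is, but the | character needs a backslash, as in split(/\|/, $_). Also keep in mind that a plain split on “,” only works for simple CSV data; if fields can contain quoted commas, a CSV parsing module such as Text::CSV from CPAN is a safer choice.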
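
Here is a minimal sketch of the pipe-delimited case. The record is a made-up example, not a line from a real file. Because split treats its pattern as a regular expression, the | needs a backslash to be matched literally. The sketch also uses quotemeta, a Perl built-in, for the case where the delimiter is held in a variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical pipe-delimited record (not from a real file)
my $record = 'alice|Alice Smith|/home/alice';

# an unescaped | is regex alternation between two empty patterns,
# so this splits the record into single characters
my @wrong = split(/|/, $record);

# escaping the | makes it a literal delimiter
my @right = split(/\|/, $record);

print scalar(@wrong), " pieces without escaping\n";
print scalar(@right), " fields with escaping\n";

# quotemeta escapes any regex metacharacters in a delimiter stored in a variable
my $delimiter = '|';
my $pattern   = quotemeta($delimiter);
my @fields    = split(/$pattern/, $record);

# print the second field
print "$fields[1]\n";
```

The escaped version should give three fields, while the unescaped version returns one piece per character of the record.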