Tweet

Parsing CSV files

In this tutorial you will learn how to parse a simple CSV (Comma Separated Values) file. This is the sort of file produced by spreadsheets and other programs when a text-only, portable format is required for exporting data. The file we will use in the examples below is a text file called prospects.csv, which could have been produced by the OpenOffice.org 'calc' spreadsheet.

  "Name","Address","Floors","Donated last year","Contact"
  "Charlotte French Cakes","1179 Glenhuntly Rd",1,"Y","John"
  "Glenhuntly Pharmacy","1181 Glenhuntly Rd",1,"Y","Paul"
  "Dick Wicks Magnetic Pain Relief","1183-1185 Glenhuntly Rd",1,"Y","George"
  "Gilmour's Shoes","1187 Glenhuntly Rd",1,"Y","Ringo"

The format is the same as that produced by any program that claims to produce a valid CSV file, namely the field delimiter (the thing that separates the fields), is a comma, and textual data is surrounded by double quote characters. Numeric data is typically not quoted.

Note that data only has to be quoted if it contains anything special, like the separation character (the comma), the quotation character or anything binary, like a newline.

Text::CSV_XS and Text::CSV

The Text::CSV module provides functions for both parsing and producing CSV data. It is essentially a wrapper module over a pure-perl parser Text::CSV_PP (bundled with Text::CSV) and the widely used Text::CSV_XS, which will be used if installed as it is written in perl/XS (a C dialect) and about 100 times faster.

We'll focus on the parsing functionality here. The following code sample opens the prospects.csv file and parses each line in turn, printing out all the fields it finds.

  #!/usr/bin/perl

  use strict;
  use warnings;
  use Text::CSV;

  my $file = "prospects.csv";
  open my $fh, "<", $file or die "$file: $!";

  my $csv = Text::CSV->new ({
      binary    => 1, # Allow special character. Always set this
      auto_diag => 1, # Report irregularities immediately
      });
  while (my $row = $csv->getline ($fh)) {
      print "@$row\n";
      }
  close $fh;

Running the code produces the following output:

  Name Address Floors Donated last year Contact
  Charlotte French Cakes 1179 Glenhuntly Rd 1 Y John
  Glenhuntly Pharmacy 1181 Glenhuntly Rd 1 Y Paul
  Dick Wicks Magnetic Pain Relief 1183-1185 Glenhuntly Rd 1 Y George
  Gilmour's Shoes 1187 Glenhuntly Rd 1 Y Ringo

And by replacing the line:

      print "@$row\n";

with:

      print "Name: $row->[0]\n\tContact: $row->[4]\n";

we can get more particular about which fields we want to output. And while we're at it, let's skip past the first line of our csv file, since it's only a list of column names.

  my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
  $csv->getline ($fh); # skip header
  while (my $row = $csv->getline ($fh)) {
      print "Name:  $row->[0]\n\tContact: $row->[4]\n";
      }
  close $fh;

Running this code will give us the following output:

  Name: Charlotte French Cakes
	  Contact: John
  Name: Glenhuntly Pharmacy
	  Contact: Paul
  Name: Dick Wicks Magnetic Pain Relief
	  Contact: George
  Name: Gilmour's Shoes
	  Contact: Ringo

But we can make our lives still better. Why not use the header line to get named fields?

  my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
  $csv->column_names ($csv->getline ($fh)); # use header
  while (my $row = $csv->getline_hr ($fh)) {
      printf "Name: %-32s Contact: %s\n",
          $row->{Name}, $row->{Contact};
      }
  close $fh;

Resulting in

  Name: Charlotte French Cakes           Contact: John
  Name: Glenhuntly Pharmacy              Contact: Paul
  Name: Dick Wicks Magnetic Pain Relief  Contact: George
  Name: Gilmour's Shoes                  Contact: Ringo

The Text::CSV modules offer a plethora of features to be able to parse the weirdest forms of CSV. Read their manual pages to see if it fits your needs.

Tie::Handle::CSV

If you'd prefer to access the fields in your CSV file by name, use Tie:Handle::CSV (or the getline_hr () function of the Text::CSV modules). Here is some code that displays the Contact and Address field from each data row in the csv file.

  #!/usr/bin/perl

  use strict;
  use warnings;
  use Tie::Handle::CSV;

  my $file = "prospects.csv";
  my $fh = Tie::Handle::CSV->new ($file, header => 1);

  while (my $csv_line = <$fh>) {
      print $csv_line->{Contact} . ":\t" . $csv_line->{Address} . "\n";
      }

  close $fh;

This produces:

  John:   1179 Glenhuntly Rd
  Paul:   1181 Glenhuntly Rd
  George: 1183-1185 Glenhuntly Rd
  Ringo:  1187 Glenhuntly Rd

DBD::CSV

And now for a slightly surreal twist, (and a demonstration of some of the genuinely novel functionality you'll find on cpan), let's finish with an examination of the DBD::CSV module, which allows us to access data in a (set of) CSV file(s) with SQL statements. A significant subset of SQL is supported by this module by means of SQL::Statement, it's well worth looking at.

  #!/usr/bin/perl

  use strict;
  use warnings;
  use DBI;

  # Connect to the database, (the directory containing our csv file(s))
  my $dbh = DBI->connect ("dbi:CSV:", undef, undef, {
      f_dir   => ".",
      f_ext   => ".csv/r",
      f_enc   => "utf-8",
      });

  # Output the name and contact field from each row
  my $sth = $dbh->prepare ("select * from prospects where name like 'G%'");
  $sth->execute;
  while (my $row = $sth->fetchrow_hashref) {
      print "  name = $row->{name} \tcontact = $row->{contact}\n";
      }

When run, the code above produces this output:

  name = Glenhuntly Pharmacy    contact = Paul
  name = Gilmour's Shoes        contact = Ringo

Now for an example that manually overrides the columns names in our table with names of our choosing.

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect ("dbi:CSV:", undef, undef, {
      f_dir   => ".",
      f_ext   => ".csv/r",
      f_enc   => "utf-8",
      });

  # Associate our csv file with the table name "prospects" and
  # manually declare (Dutch) names for each of the columns
  $dbh->{csv_tables}{klanten} = {
      file      => "prospects.csv",
      col_names => [qw( naam adres verdiepingen gedoneerd contact )],
      };

  my $sth = $dbh->prepare ("select * from klanten where naam like 'G%'");
  $sth->execute;
  while (my $row = $sth->fetchrow_hashref) {
      print "  name = $row->{naam} \tverdiepingen = $row->{verdiepingen}\n";
      }

which produces:

  naam = Glenhuntly Pharmacy    verdiepingen = 1
  naam = Gilmour's Shoes        verdiepingen = 1

References

H.Merijn Brand (Tux), author/maintainer of Text::CSV_XS and DBD::CSV, has written a CSV checking utility called "csv-check" that comes with the Text::CSV_XS distribution (see the examples directory). It might help you analyzing not so obvious problems in CSV data.

See also

Text::CSV_XS
See the DESCRIPTION section at the start of the file for a working description of the comma separated values format. Several pitfalls are described where parsing CSV might go wrong.
Tie::Handle::CSV
"Easy access to CSV files". This module is a front end to Text::CSV_XS.
Text::CSV::Simple
"Simpler parsing of CSV files". Also a front end to Text::CSV_XS.
Parse::CSV
"Highly flexible CSV parser for large files". Also a front end to Text::CSV_XS.

[Top]