Home » Tutorials » Parsing csv |
In this tutorial you will learn how to parse a simple CSV (Comma Separated
Values) file. This is the sort of file produced by spreadsheets and other
programs when a text-only, portable format is required for exporting data.
The file we will use in the examples below is a text file called
prospects.csv
, which could have been produced by the OpenOffice.org
'calc' spreadsheet.
"Name","Address","Floors","Donated last year","Contact" "Charlotte French Cakes","1179 Glenhuntly Rd",1,"Y","John" "Glenhuntly Pharmacy","1181 Glenhuntly Rd",1,"Y","Paul" "Dick Wicks Magnetic Pain Relief","1183-1185 Glenhuntly Rd",1,"Y","George" "Gilmour's Shoes","1187 Glenhuntly Rd",1,"Y","Ringo"
The format is the same as that produced by any program that claims to produce a valid CSV file, namely the field delimiter (the thing that separates the fields), is a comma, and textual data is surrounded by double quote characters. Numeric data is typically not quoted.
Note that data only has to be quoted if it contains anything special, like the separation character (the comma), the quotation character or anything binary, like a newline.
The Text::CSV module provides functions for both parsing and producing CSV data. It is essentially a wrapper module over a pure-perl parser Text::CSV_PP (bundled with Text::CSV) and the widely used Text::CSV_XS, which will be used if installed as it is written in perl/XS (a C dialect) and about 100 times faster.
We'll focus on the parsing functionality here. The following code sample opens the prospects.csv file and parses each line in turn, printing out all the fields it finds.
#!/usr/bin/perl use strict; use warnings; use Text::CSV; my $file = "prospects.csv"; open my $fh, "<", $file or die "$file: $!"; my $csv = Text::CSV->new ({ binary => 1, # Allow special character. Always set this auto_diag => 1, # Report irregularities immediately }); while (my $row = $csv->getline ($fh)) { print "@$row\n"; } close $fh;
Running the code produces the following output:
Name Address Floors Donated last year Contact Charlotte French Cakes 1179 Glenhuntly Rd 1 Y John Glenhuntly Pharmacy 1181 Glenhuntly Rd 1 Y Paul Dick Wicks Magnetic Pain Relief 1183-1185 Glenhuntly Rd 1 Y George Gilmour's Shoes 1187 Glenhuntly Rd 1 Y Ringo
And by replacing the line:
print "@$row\n";
with:
print "Name: $row->[0]\n\tContact: $row->[4]\n";
we can get more particular about which fields we want to output. And while we're at it, let's skip past the first line of our csv file, since it's only a list of column names.
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 }); $csv->getline ($fh); # skip header while (my $row = $csv->getline ($fh)) { print "Name: $row->[0]\n\tContact: $row->[4]\n"; } close $fh;
Running this code will give us the following output:
Name: Charlotte French Cakes Contact: John Name: Glenhuntly Pharmacy Contact: Paul Name: Dick Wicks Magnetic Pain Relief Contact: George Name: Gilmour's Shoes Contact: Ringo
But we can make our lives still better. Why not use the header line to get named fields?
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 }); $csv->column_names ($csv->getline ($fh)); # use header while (my $row = $csv->getline_hr ($fh)) { printf "Name: %-32s Contact: %s\n", $row->{Name}, $row->{Contact}; } close $fh;
Resulting in
Name: Charlotte French Cakes Contact: John Name: Glenhuntly Pharmacy Contact: Paul Name: Dick Wicks Magnetic Pain Relief Contact: George Name: Gilmour's Shoes Contact: Ringo
The Text::CSV modules offer a plethora of features to be able to parse the weirdest forms of CSV. Read their manual pages to see if it fits your needs.
If you'd prefer to access the fields in your CSV file by name, use
Tie:Handle::CSV (or the getline_hr ()
function of the Text::CSV
modules). Here is some code that displays the Contact and Address field
from each data row in the csv file.
#!/usr/bin/perl use strict; use warnings; use Tie::Handle::CSV; my $file = "prospects.csv"; my $fh = Tie::Handle::CSV->new ($file, header => 1); while (my $csv_line = <$fh>) { print $csv_line->{Contact} . ":\t" . $csv_line->{Address} . "\n"; } close $fh;
This produces:
John: 1179 Glenhuntly Rd Paul: 1181 Glenhuntly Rd George: 1183-1185 Glenhuntly Rd Ringo: 1187 Glenhuntly Rd
And now for a slightly surreal twist, (and a demonstration of some of the genuinely novel functionality you'll find on cpan), let's finish with an examination of the DBD::CSV module, which allows us to access data in a (set of) CSV file(s) with SQL statements. A significant subset of SQL is supported by this module by means of SQL::Statement, it's well worth looking at.
#!/usr/bin/perl use strict; use warnings; use DBI; # Connect to the database, (the directory containing our csv file(s)) my $dbh = DBI->connect ("dbi:CSV:", undef, undef, { f_dir => ".", f_ext => ".csv/r", f_enc => "utf-8", }); # Output the name and contact field from each row my $sth = $dbh->prepare ("select * from prospects where name like 'G%'"); $sth->execute; while (my $row = $sth->fetchrow_hashref) { print " name = $row->{name} \tcontact = $row->{contact}\n"; }
When run, the code above produces this output:
name = Glenhuntly Pharmacy contact = Paul name = Gilmour's Shoes contact = Ringo
Now for an example that manually overrides the columns names in our table with names of our choosing.
#!/usr/bin/perl use strict; use warnings; use DBI; my $dbh = DBI->connect ("dbi:CSV:", undef, undef, { f_dir => ".", f_ext => ".csv/r", f_enc => "utf-8", }); # Associate our csv file with the table name "prospects" and # manually declare (Dutch) names for each of the columns $dbh->{csv_tables}{klanten} = { file => "prospects.csv", col_names => [qw( naam adres verdiepingen gedoneerd contact )], }; my $sth = $dbh->prepare ("select * from klanten where naam like 'G%'"); $sth->execute; while (my $row = $sth->fetchrow_hashref) { print " name = $row->{naam} \tverdiepingen = $row->{verdiepingen}\n"; }
which produces:
naam = Glenhuntly Pharmacy verdiepingen = 1 naam = Gilmour's Shoes verdiepingen = 1
H.Merijn Brand (Tux), author/maintainer of Text::CSV_XS and DBD::CSV, has written a CSV checking utility called "csv-check" that comes with the Text::CSV_XS distribution (see the examples directory). It might help you analyzing not so obvious problems in CSV data.