Suppose we want to read a CSV file into Prolog terms, where the field names of the terms are derived from the CSV file’s header row. We can achieve this using a Prolog engine to read the file in a single pass: first the header row, then the data rows.
As an example, consider a CSV file data.csv with the following
imaginary content:
Name,Age,Occupation
Alice,30,Mother
Bob,25,Designer
Charlie,35,Engineer
The goal is to read this file and produce Prolog term lists like:
[name('Alice'), age(30), occupation('Mother')].
[name('Bob'), age(25), occupation('Designer')].
[name('Charlie'), age(35), occupation('Engineer')].
Code the Solution
Read the first row of the CSV file to get the header; then repeatedly read each data row, mapping the header columns to field names.
%! csv_read_file_by_row(+Spec, -Row:list, +Options) is nondet.
%
% Extracts records from a CSV file, using the given read Options. The
% resulting Row terms have fields named after the CSV header columns
% like an options list.
%
% This predicate uses a Prolog engine to read the CSV file
% non-deterministically, yielding one Row term at a time. This is
% useful for processing large CSV files without loading the entire
% file into memory.
%
% @arg Spec specifies the CSV file.
% @arg Row is unified with each row.
% @arg Options are passed to csv_read_file_row/3.
csv_read_file_by_row(Spec, Row, Options) :-
    absolute_file_name(Spec, Path, [extensions([csv])]),
    engine_create(Row1, csv_read_file_row(Path, Row1, Options), Engine),
    option(functor(Functor), Options, row),
    engine_next(Engine, Row0),
    Row0 =.. [Functor|Columns0],
    maplist(restyle_identifier(one_two), Columns0, Columns1),
    repeat,
    engine_next_reified(Engine, Term),
    (   Term = the(Row_)
    ->  Row_ =.. [Functor|Columns_],
        maplist(csv_read_file_by_row_, Columns1, Columns_, Row)
    ;   Term == no
    ->  !, fail
    ;   Term = throw(Error)
    ->  throw(Error)
    ).

csv_read_file_by_row_(Name, Value, Row) :- Row =.. [Name, Value].
Explanation
First, the predicate resolves the file specification to an absolute
path, treating csv as the file's expected extension. Then, it
creates an engine that reads the CSV file row by row, using the
predicate csv_read_file_row/3. The engine yields each row on demand
as a term of the form row(Column1, Column2, ...).
Next, the predicate reads the first row from the engine, which is the
header row. It extracts the column names from the header row and
restyles them to be valid Prolog identifiers using
restyle_identifier/3 in order to convert the header columns’ names to
lowercase underscore-delimited Prolog atoms. Then the implementation
finally enters a loop that reads each subsequent row from the engine.
As a result, the predicate yields each data row non-deterministically as a list of single-argument terms, each named after its header column.
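The header-to-field mapping can be tried in isolation. The following is a minimal sketch, assuming SWI-Prolog with restyle_identifier/3 available; header_fields/3 and field/3 are hypothetical names used only for illustration.

```prolog
:- use_module(library(apply)).   % maplist/3, maplist/4

% field(+Name, +Value, -Field): build the single-argument term Name(Value).
field(Name, Value, Field) :-
    Field =.. [Name, Value].

% header_fields(+Header, +Values, -Row): restyle each header column
% name, then pair it with the value in the same column position.
% (Hypothetical helper, not part of the predicate above.)
header_fields(Header, Values, Row) :-
    maplist(restyle_identifier(one_two), Header, Names),
    maplist(field, Names, Values, Row).
```

For the sample file, the query header_fields(['Name', 'Age', 'Occupation'], ['Alice', 30, 'Mother'], Row) should produce the first list shown earlier. Because maplist/4 requires lists of equal length, a value list with a different number of columns than the header simply fails.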
Repeat until no more rows
The predicate uses repeat to create a backtracking point, allowing it
to non-deterministically yield each row to its caller. When it reaches
the end of the file, it cuts and fails in order to stop the iteration.
The cut removes the choice point created by repeat, ensuring that no
further backtracking occurs.
Note that the predicate’s implementation uses engine_next_reified/2 to
capture end-of-file and errors. Errors are rethrown in the caller’s
context.
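The repeat-and-cut pattern can be separated from the CSV details. This is a minimal sketch, assuming SWI-Prolog engines; engine_answer/2 and demo_answers/1 are hypothetical names, and the engine here enumerates a list with member/2 instead of reading a file.

```prolog
% engine_answer(+Engine, -Answer) is nondet.
% Yield each answer of Engine on backtracking. (Hypothetical helper.)
engine_answer(Engine, Answer) :-
    repeat,                              % choice point to come back to
    engine_next_reified(Engine, Term),
    (   Term = the(Answer0)              % the engine produced an answer
    ->  Answer = Answer0
    ;   Term == no                       % the engine is exhausted
    ->  !, fail                          % drop the repeat choice point
    ;   Term = throw(Error)              % the engine raised an error
    ->  throw(Error)                     % rethrow in the caller's context
    ).

% demo_answers(-Answers): collect everything a small engine yields.
demo_answers(Answers) :-
    engine_create(X, member(X, [a, b, c]), Engine),
    findall(A, engine_answer(Engine, A), Answers).
```

Calling demo_answers(Answers) collects the three list elements in order. Once the engine reports no, the cut discards the repeat choice point, so the enumeration ends instead of looping forever.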
Unification vs identity
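Note the difference between =/2 and ==/2 here; they are not the
same. =/2 unifies its arguments and may bind variables, which is how
Row_ and Error get their values; ==/2 tests whether two terms are
already identical and binds nothing.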
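A small sketch makes the distinction concrete. The two predicate names below are hypothetical and exist only for this illustration.

```prolog
% unifies_with_no(?X): succeeds if X can be made equal to no,
% binding X when it is an unbound variable.
unifies_with_no(X) :- X = no.

% identical_to_no(@X): succeeds only if X is already the atom no;
% it never binds X.
identical_to_no(X) :- X == no.
```

Called with an unbound variable, unifies_with_no/1 succeeds and binds it, whereas identical_to_no/1 fails. In the predicate above, Term is already bound by engine_next_reified/2, so ==/2 there acts as a pure test.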
Implicit assumptions
The implementation makes assumptions. It presumes that the reified term
is one of the(Row), no, or throw(Error). It does not allow for
any other possibilities, thus delegating responsibility to the engine’s
logic following the design-by-contract principle.
The implementation does not assume that the row functor is
row/arity. Instead, it allows the user to specify a custom functor
using the functor(Functor) option. If no functor is specified, it
defaults to row.
In practice, the row functor does not matter. The predicate constructs terms with fields named after the header columns, regardless of the functor used. The functor is just a container for the fields, and its name does not affect the functionality of the predicate. Still, it is worth accounting for it in case the caller wants to override the functor.
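The functor handling can be sketched on its own. This is a minimal illustration, assuming library(option); row_columns/3 is a hypothetical helper that mirrors the two relevant goals of the predicate above.

```prolog
:- use_module(library(option)).   % option/3

% row_columns(+Options, +Row, -Columns): take a row term apart using
% the functor named in Options, defaulting to row.
% (Hypothetical helper for illustration.)
row_columns(Options, Row, Columns) :-
    option(functor(Functor), Options, row),
    Row =.. [Functor|Columns].
```

With an empty option list, row_columns([], row(a, b), Columns) gives Columns = [a, b]. With [functor(person)], the same call expects a person(...) term and fails on row(a, b), which is why the implementation reads the functor from the same Options it passes to csv_read_file_row/3.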
Edge cases
The predicate fails if the CSV file is empty, since there is no header row to read. It also fails if the CSV file has only a header row, since there are no data rows to yield.
What if it has no header row? Then the first data row is treated as the header row regardless, which is probably not what we want. This is a limitation of the implementation above. Is there a way to detect this situation? Not easily, unless the use case has some prior knowledge of the expected columns.
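When the expected columns are known in advance, the first row can be checked before it is trusted as a header. The following is a minimal sketch under that assumption; has_expected_header/2 is a hypothetical helper, not part of the implementation above.

```prolog
% has_expected_header(+Expected, +HeaderRow): succeeds if the columns
% of HeaderRow are exactly the Expected names, in the same order.
% (Hypothetical helper; assumes the caller knows the column names.)
has_expected_header(Expected, HeaderRow) :-
    HeaderRow =.. [_Functor|Columns],
    Columns == Expected.
```

For the sample file, has_expected_header(['Name', 'Age', 'Occupation'], Row0) would succeed on the real header row and fail if the file started directly with Alice's record.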
Conclusions
Why does the predicate need an engine? We want to read the CSV file in one continuous pass: first the header row, quietly, then each data row in turn. The engine encapsulates the state of reading the file, so that we can resume reading where the previous row left off rather than reopening the file. The context switches from reading the header row to reading the data rows while the engine keeps its place.
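The way an engine keeps its place between calls can be seen without any file I/O. This is a minimal sketch, assuming SWI-Prolog engines and using member/2 over a list to stand in for the row reader; split_header/3 and remaining/2 are hypothetical names.

```prolog
% split_header(+Rows, -Header, -Data): take the first answer of the
% engine as the header, then drain the rest as data.
% (Hypothetical helper; a list stands in for the CSV file.)
split_header(Rows, Header, Data) :-
    engine_create(Row, member(Row, Rows), Engine),
    engine_next(Engine, Header),      % first answer: the header
    remaining(Engine, Data).          % the engine resumes after it

% remaining(+Engine, -Data): collect all answers the engine has left.
remaining(Engine, Data) :-
    (   engine_next(Engine, Row)
    ->  Data = [Row|Rest],
        remaining(Engine, Rest)
    ;   Data = []
    ).
```

The second engine_next/2 call continues from where the first one stopped; nothing is re-read. With an empty list the very first engine_next/2 fails, which mirrors how the predicate above fails on an empty file.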
Use of restyle_identifier/3 converts the header column names to
lowercase underscore-delimited atoms, so that the resulting row terms
can be written and matched without quoting. But it does make assumptions
about the format of the header names. It assumes that they can be
restyled without conflicts. If the header names contain special
characters or spaces, distinct headers may restyle to the same atom,
leaving the fields of the row terms ambiguous.
Overall, this implementation provides a flexible and efficient way to read CSV files into Prolog terms, leveraging the power of Prolog engines to manage the state of the file read.