CSV Reading Engine - Roy Ratcliffe

Suppose we want to read a CSV file into Prolog terms, where the field names of the terms are derived from the CSV file’s header row. We can achieve this using a Prolog engine that reads the file in two phases: first the header row, then the data rows.

As an example, consider a CSV file data.csv with the following imaginary content:

Name,Age,Occupation
Alice,30,Mother
Bob,25,Designer
Charlie,35,Engineer

The goal is to read this file and produce Prolog term lists like:

[name('Alice'), age(30), occupation('Mother')].
[name('Bob'), age(25), occupation('Designer')].
[name('Charlie'), age(35), occupation('Engineer')].

Code the Solution

Read the first row of the CSV file to get the header; then fetch each data row in turn, mapping the header columns to field names.

:- use_module(library(csv)).     % csv_read_file_row/3
:- use_module(library(option)).  % option/3
:- use_module(library(apply)).   % maplist/3, maplist/4

%!  csv_read_file_by_row(+Spec, -Row:list, +Options) is nondet.
%
%   Extracts records from a CSV file, using  the given read Options. The
%   resulting Row terms have fields named   after the CSV header columns
%   like an options list.
%
%   This  predicate  uses  a  Prolog  engine    to  read  the  CSV  file
%   non-deterministically, yielding one Row term  at   a  time.  This is
%   useful for processing large CSV  files   without  loading the entire
%   file into memory.
%
%   @arg Spec specifies the CSV file.
%   @arg Row is unified with each row.
%   @arg Options are passed to csv_read_file_row/3.

csv_read_file_by_row(Spec, Row, Options) :-
    absolute_file_name(Spec, Path, [extensions([csv])]),
    engine_create(Row1, csv_read_file_row(Path, Row1, Options), Engine),
    option(functor(Functor), Options, row),
    engine_next(Engine, Row0),
    Row0 =.. [Functor|Columns0],
    maplist(restyle_identifier(one_two), Columns0, Columns1),
    repeat,
    engine_next_reified(Engine, Term),
    (   Term = the(Row_)
    ->  Row_ =.. [Functor|Columns_],
        maplist(csv_read_file_by_row_, Columns1, Columns_, Row)
    ;   Term == no
    ->  !, fail
    ;   Term = throw(Error)
    ->  throw(Error)
    ).

csv_read_file_by_row_(Name, Value, Row) :- Row =.. [Name, Value].

Explanation

First, the predicate resolves the absolute file name of the CSV file, adding the .csv extension when Spec does not already carry it. Then it creates an engine that reads the CSV file row by row using csv_read_file_row/3. The engine yields rows on demand, one per request, each as a term of the form row(Column1, Column2, ...); nothing is read until the caller asks for the next answer.
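
To see the on-demand behaviour in isolation, here is a minimal sketch that swaps the CSV reader for member/2 over an in-memory list. The predicate name engine_demo/2 and the list contents are made up for illustration; only engine_create/3, engine_next/2 and engine_destroy/1 are real engine predicates.

```prolog
% Illustrative only: engine_demo/2 is a hypothetical name, and member/2
% stands in for csv_read_file_row/3 as the source of answers.
engine_demo(First, Second) :-
    engine_create(X, member(X, [header, row1, row2]), Engine),
    engine_next(Engine, First),     % runs the goal up to its first answer
    engine_next(Engine, Second),    % resumes from where the goal stopped
    engine_destroy(Engine).         % release the engine; row2 is never requested
```

Calling engine_demo(A, B) binds A to header and B to row1. The second engine_next/2 call picks up where the first left off, which is the property the CSV predicate relies on.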

Next, the predicate fetches the first row from the engine, which is the header row. It extracts the column names and restyles them with restyle_identifier/3, converting each header name to a lowercase, underscore-delimited Prolog atom. Finally, the implementation enters a loop that fetches each subsequent row from the engine.

As a result, the predicate yields the data rows non-deterministically, one per solution, each mapped to a list of Name(Value) terms whose names come from the header columns.
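
The mapping step can be tried on its own, without any file. The following is a sketch under the assumption that the header names have already been restyled; name_value/3 and label_row/3 are hypothetical names chosen for this example.

```prolog
:- use_module(library(apply)).   % maplist/4

% Illustrative only: name_value/3 and label_row/3 are hypothetical names.
% name_value(+Name, +Value, -Field) builds Field = Name(Value).
name_value(Name, Value, Field) :-
    Field =.. [Name, Value].

% label_row(+Names, +Row, -Fields) pairs each column value of Row with
% the field name in the same position, whatever the row functor is.
label_row(Names, Row, Fields) :-
    Row =.. [_Functor|Values],
    maplist(name_value, Names, Values, Fields).
```

For instance, label_row([name, age, occupation], row('Alice', 30, 'Mother'), Fields) gives Fields = [name('Alice'), age(30), occupation('Mother')]. Because maplist/4 needs lists of equal length, a row with more or fewer columns than the header makes the mapping fail.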

Repeat until no more rows

The predicate uses repeat to create a backtracking point, allowing it to non-deterministically yield each row to its caller. When it reaches the end of the file, it cuts and fails in order to stop the iteration. The cut removes the choice point created by repeat, ensuring that no further backtracking occurs.

Note that the predicate’s implementation uses engine_next_reified/2 to capture end-of-file and errors. Errors are rethrown in the caller’s context.
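
The same repeat-and-reify pattern works for any engine, not just a CSV reader. Here is a minimal sketch; engine_answer/2 and small_answers/1 are hypothetical names, and the engine goal is a small disjunction standing in for the file reader.

```prolog
% Illustrative only: engine_answer/2 is a hypothetical name.
% Enumerates the answers of Engine on backtracking.
engine_answer(Engine, Answer) :-
    repeat,                             % retry point for the next answer
    engine_next_reified(Engine, Term),
    (   Term = the(Answer)
    ->  true                            % hand this answer to the caller
    ;   Term == no
    ->  !, fail                         % drop the repeat choice point, then stop
    ;   Term = throw(Error)
    ->  throw(Error)                    % re-raise in the caller
    ).

% Illustrative only: the engine goal yields 1, then 2, then fails.
small_answers(Answers) :-
    engine_create(X, ( X = 1 ; X = 2 ; fail ), Engine),
    findall(A, engine_answer(Engine, A), Answers).
```

Here small_answers(Answers) gives Answers = [1, 2]. Without the cut, the fail after no would backtrack into repeat and ask the exhausted engine again.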

Unification vs identity

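Note the difference between =/2 and ==/2 here; they are not the same. Term = the(Row_) unifies, binding Row_ to the row that the engine produced. Term == no tests identity: it succeeds only if Term is already the atom no, and it never binds anything.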
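
A few one-line predicates make the contrast concrete. This is only a sketch, and all three predicate names are hypothetical.

```prolog
% Illustrative only: the predicate names below are hypothetical.

% =/2 unifies: it succeeds by binding the variable inside the(...).
unify_binds(Row) :-
    Term = the(row(alice, 30)),
    Term = the(Row).

% ==/2 compares without binding: an unbound variable is not identical
% to the atom no, so the comparison fails and nothing gets bound.
identity_leaves_unbound :-
    \+ _Unbound == no.

% =/2 applied to an unbound variable succeeds by binding it to no.
unify_binds_no(Term) :-
    Term = no.
```

So unify_binds(Row) gives Row = row(alice, 30), identity_leaves_unbound succeeds because the comparison inside it fails, and unify_binds_no(T) gives T = no.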

Implicit assumptions

The implementation makes assumptions. It presumes that the reified term is one of the(Row), no, or throw(Error). It does not allow for any other possibility, delegating that responsibility to the engine’s logic in keeping with the design-by-contract principle.

The implementation does not assume that the row functor is row/arity. Instead, it allows the user to specify a custom functor using the functor(Functor) option. If no functor is specified, it defaults to row.

In practice, the row functor does not matter. The predicate constructs terms with fields named after the header columns, regardless of the functor used. The functor is just a container for the fields, and its name does not affect the functionality of the predicate. Still, it is worth accounting for it in case the caller wants to override the functor.
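
The functor handling can be sketched separately. The helper row_columns/3 below is a hypothetical name; it only combines option/3 with =../2 in the same way as the main predicate.

```prolog
:- use_module(library(option)).   % option/3

% Illustrative only: row_columns/3 is a hypothetical name.
% Takes the functor name from Options, defaulting to row, and uses it
% to split Row into its column values.
row_columns(Row, Options, Columns) :-
    option(functor(Functor), Options, row),
    Row =.. [Functor|Columns].
```

With no options, row_columns(row(a, b), [], Columns) gives Columns = [a, b]. With [functor(record)], the term record(a, b) decomposes the same way, whereas a mismatch between the option and the actual functor makes the decomposition fail.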

Edge cases

The predicate fails if the CSV file is empty, because there is no header row for engine_next/2 to fetch. It also fails if the file has only a header row, because there are no data rows to yield.

What if it has no header row? Then the first data row is treated as the header row regardless, which is probably not what we want. This is a limitation of the implementation above. Is there a way to detect this situation? Not easily, unless the use case has some prior knowledge of the expected columns.
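
Where the expected columns are known in advance, one possible guard is to compare the restyled header against them before reading any data. This is a sketch only: check_header/2 is a hypothetical helper that is not part of the predicate above, and raising a domain error is just one choice of reaction.

```prolog
:- use_module(library(error)).    % domain_error/2

% Illustrative only: check_header/2 is a hypothetical helper.
% Expected and Actual are lists of restyled column names.
check_header(Expected, Actual) :-
    (   Actual == Expected
    ->  true
    ;   domain_error(header(Expected), Actual)
    ).
```

For example, check_header([name, age], [name, age]) succeeds, while check_header([name, age], [alice, 30]) raises a domain error because the first row looks like data.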

Conclusions

Why does the predicate need an engine? We want to read the CSV file in one continuous pass: first the header row, quietly, then each data row in turn. The engine encapsulates the state of reading the file, so that we can pause after the header and resume for each data row without reopening the file. That state persists across the switch from reading the header row to reading the data rows.

Use of restyle_identifier/3 converts the header column names to Prolog-style identifiers, so that typical headers become atoms that need no quoting as field names. But it does make assumptions about the format of the header names. It assumes that they convert without conflicts; two headers that restyle to the same atom would produce duplicate field names. If the header names contain special characters or spaces, the conversion may not yield the identifiers we expect, leading to potential issues when using the row terms.
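
A name conflict is easy to detect once the header has been restyled. The sketch below assumes the restyled names are already available as a list of atoms; duplicate_names/2 is a hypothetical helper that is not part of the predicate above.

```prolog
:- use_module(library(lists)).    % select/3

% Illustrative only: duplicate_names/2 is a hypothetical helper.
% Duplicates is the sorted list of names that occur more than once.
duplicate_names(Names, Duplicates) :-
    findall(Name,
            ( select(Name, Names, Rest),    % pick each name in turn
              memberchk(Name, Rest)         % keep it if it occurs again
            ),
            Found),
    sort(Found, Duplicates).                % sort/2 also removes repeats
```

For example, duplicate_names([name, age, name], Duplicates) gives Duplicates = [name], and a header without conflicts gives the empty list.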

Overall, this implementation provides a flexible and efficient way to read CSV files into Prolog terms, leveraging the power of Prolog engines to manage the state of the file read.