Writing a Compile-Time CSV Parser in C++, Part 2: The Parser

Introduction

In part 1 of this article I described the constexpr feature, a C++ facility to perform compile-time evaluation.

In this part, I describe the how to write a library for compile-time (and runtime) parsing of CSV files called Csv::Parser:

Writing a CSV Parser

Why a Compile-Time Parser?

When I started writing the CSV parser library, I was focused on implementing a runtime parser in modern C++. However, while working on the code, I realized that almost all of the code I was writing could also be used in constexpr context!

Some of the benefits of having a compile-time parser are:

  • You can embed the CSV data in your program and have the values immediately available at runtime without additional computational overhead.
  • You can perform tests on the embedded data, verifying its integrity at compile-time.
  • You can test the parser code at compile-time without running any tests.
  • Due to constexpr nature of the parser, you can still use it at runtime.

The CSV Format

CSV (comma-separated values) format, while very common, has quite a few variations. The format is formally documented in RFC 4180. With many variations available, the RFC states that when processing CSV files,

Implementors should "be conservative in what you do, be liberal in what you accept from others".

The general format is defined as follows:

  • The record lines are delimited by a line break (CRLF). Implementations may use other line break formats (LF, CR). The line break on the last line is optional. Example:

    aaa,bbb,ccc CRLF
    zzz,yyy,xxx CRLF
    
  • Fields are separated by commas. There is no trailing comma on a record line. Space is considered a part of a field. Each line must contain the same number of fields.

  • Each field may be enclosed by double-quotes. The enclosing double-quotes are mandatory if the field contains line breaks, commas, or double-quotes.

  • Double-quotes are escaped by repeating them:

    "this field has "" double "" quotes",second field CRLF
    "this field has , comma and CRLF
    newline","last field"
    

Our parser fully supports the RFC 4180, while accepting other line break formats, ignoring whitespace around quoted values, and allowing escaped quotes inside unquoted fields. This is done to ensure the ability to load files from different implementations.

How to Represent Input and Output Data

C++17 added a new type std::string_view which refers to a constant contiguous sequence of characters. A typical implementation holds only two members: const char* (a pointer to the start of a data chunk) and a size. Plainly speaking, it's a section of a string-like sequence without mandatory terminating null. Luckily for us, std::string_view fully supports constexpr!

Unlike std::string, std::string_view does not own the data it points to, so great care is required to ensure that the pointed data is not destroyed while operating on std::string_view. Of course, in compile-time context, we can point std::string_view to a static data (embedded in binary file) so the input data lifetime is no longer an issue.

Having the std::string_view facility, we can define the input:

using namespace std::string_view_literals;

// The "sv" user-defined literal specifies a string_view
constexpr auto input = "ab,cd\r\nef,gh"sv;

But how do we represent the individual cells in parser output? We cannot use constexpr std::string in C++17 and it would also be inefficient to store all these strings again. Once again, we can use a collection of std::string_view objects pointing to the original input!

Depending on whether we want the result at compile-time or runtime, we can use two-dimensional either std::array, or std::vector as output.

Notice that I didn't mention what happens when we have to collapse duplicate double-quotes inside the field data. I will discuss a method of dealing with this issue later in the article.

Let's Start With Tests

Starting with a "write test first" approach, we should define how we would like to use the CSV parser in compile-time and runtime contexts.

Note that I chose to use "a collection of columns" style of representing the 2D data (as opposed to a "a collection of rows"), due to the fact that large numeric CSV data is usually represented this way.

Keeping constexpr specifics in mind, a simple compile-time interface could look like:

using namespace std::string_view_literals;

constexpr auto data = "ab,cd\r\nef,gh" sv;
constexpr std::size_t columns = 2, rows = 2;

constexpr Parser parser;
// parse into std::array<std::array<std::string_view, rows>, columns>
constexpr auto matrix = parser.parseTo2DArray<columns, rows>(data);

// Verify the data at compile time.
static_assert(matrix[0][0] == "ab"sv);
// ...

On the other hand, at runtime we should be able to use dynamic features as well:

using namespace std::string_view_literals;

auto data = "ab,cd\r\nef,gh" sv;

// Let "cell_refs" be a vector of columns.
// After parsing, each element will be referencing
// a part of the original data.
std::vector<std::vector<std::string_view>> cell_refs;

Parser parser;

try {
    // parseTo() throws ParseError on error.
    parser.parseTo(data, cell_refs);
}
catch(ParseError& ex) {
    // ... handle the error
}

assert(cell_refs[0][0] == "ab"sv);  // Runtime assertion

The Parser Function Interface

Since we need the parser to be able to operate on both std::vector (at runtime) and std::array (at compile-time), we begin by writing a generic parse() function that can be used in both contexts. To do this, we can use a callback to store the parsed results.

class Parser {
    public:

    /// Parse CSV string data and store the results using a callback
    /// function.
    /// Callback function signature:
    /// void func(std::size_t row, std::size_t column, std::string_view cell_data).
    /// \throws ParseError
    template<typename StoreCellFunction>
    constexpr void parse(std::string_view data, StoreCellFunction func) const
    {
        // Iterate over data character by character,
        // invoking func(row, column, cell_data) each time a cell
        // is fully parsed.
    }

    // ...
};

A few benefits of this approach are:

  • parse() does not depend on any particular output container type.
  • By throwing an exception on error, we make sure that in compile-time context, parse error triggers a compile-time error if the input data is invalid.

The actual parse() function implementation can be seen in csv_parser.h at GutHub

Implementing parseTo2DArray() and parseTo()

Now that we have a constexpr parse() function, we can use it to define user-friendly methods.

A simple Parser::parseTo2DArray() implementation would look like:

/// Parse CSV string into an array of columns.
/// This method wraps parseTo() to simplify compile-time parsing.
/// \return std::array<std::array<std::string_view, rows>, columns>
/// \throws ParseError
template<std::size_t columns, std::size_t rows>
constexpr auto Parser::parseTo2DArray(std::string_view data) const
{
    std::array<std::array<std::string_view, rows>, columns> matrix;

    parse(data,
        [&matrix](std::size_t row, std::size_t column,
            std::string_view cell_data) constexpr
        {
            matrix[column][row] = cell_data;
        }
    );

    return matrix;
}

Note that we cannot pass columns and rows as ordinary arguments. Even though we're in constexpr context, constexpr functions can still be called at runtime and passing runtime values as (compile-time) template arguments to std::array is an error.

As for dynamic version, Parser::parseTo() function accepting various container types would look like:

/// Parse CSV string into a vector of columns.
/// \tparam Vector2D accepts 2D containers such as
/// std::vector<std::vector<std::string_view>>.
/// \throws ParseError
template<typename Vector2D>
void Parser::parseTo(std::string_view data, Vector2D& matrix) const
{
    Vector2D parsed_matrix;

    parse(data,
        [&](std::size_t row, std::size_t column,
            std::string_view cell_data)
        {
            if (parsed_matrix.size() < (column + 1)) {
                parsed_matrix.resize(column + 1);
            }
            if (parsed_matrix[column].size() < (row + 1)) {
                parsed_matrix[column].resize(row + 1);
            }
            parsed_matrix[column][row] = cell_data;
        }
    );

    // To avoid partial result, only touch the "matrix" parameter
    // after everything is parsed.
    std::swap(matrix, parsed_matrix);
}

Notice the exception safety - we avoid touching matrix unless all the data is parsed.

Escaped Double-Quotes

I already discussed how we could use std::string_view to point to the original data. However not every field is simply a substring of the original CSV data, as the CSV format has a feature of escaping double-quotes by repeating them.

In runtime environment we could easily solve this problem by introducing a cleanup function like std::string collapseQuotes(std::string_view view). However, at least for now, std::string is not usable in constexpr world. A simple solution for this is to introduce a new CellStringBuffer<BufferSize> class, a fixed-size buffer to contain the cleaned-up cell value:

/// A helper class for compile-time buffer construction
/// and string cleanup
template<std::size_t Size>
class CellStringBuffer {
    public:

        /// Constructor. Creates a buffer with cleaned-up
        /// cell contents.
        constexpr explicit CellStringBuffer(std::string_view cell)
                : buffer_(prepareBuffer(cell))
        { }


        /// Return string_view to a stored buffer.
        /// The returned view has collapsed consecutive
        /// double-quotes.
        /// \throw std::out_of_range if buffer is of insufficient size
        [[nodiscard]] constexpr std::string_view getStringView() const
        {
            return {buffer_.buffer.data(), buffer_.size};
        }


    private:

        struct Buffer {
            std::array<char, Size> buffer = { };
            std::size_t size = 0;
        };


        /// Create a buffer object with cleaned-up input in it
        constexpr static Buffer prepareBuffer(std::string_view input)
        {
            if (Size < input.size()) {
                throw std::logic_error("Insufficient buffer size");
            }
            std::array<char, Size> buffer = { };
            std::size_t output_pos = 0;
            for (std::size_t input_pos = 0; input_pos < std::min(Size, input.size()); ++input_pos) {
                char c = input[input_pos];
                buffer[output_pos] = c;
                ++output_pos;
                if (c == '\"' && (input_pos + 1) < input.size() && input[input_pos + 1] == '\"') {
                    ++input_pos;
                }
            }
            return {buffer, output_pos};
        }


        Buffer buffer_;
};

Notice that we have to store the cleaned up string length because it may be shorter than the buffer size. Having this data, we can easily construct std::string_view pointing to the buffer contents.

To create a better interface, we can wrap the std::string_views returned by parseTo2DVector() into a new CellStringReference class and store it in std::array<std::array<CellStringReference, rows>, columns> instead.

The CellStringReference class would look like this:

/// A value of a cell, referencing the data in original CSV text.
class CellStringReference {
    public:

        /// Constructor
        constexpr CellStringReference() = default;

        /// Constructor
        constexpr CellStringReference(std::string_view cell) noexcept
                : value_(cell)
        { }

        /// Get stored cell reference as string_view.
        /// The returned data may contain escaped consecutive
        /// double-quotes.
        [[nodiscard]] constexpr std::string_view getOriginalStringView() const noexcept
        {
            return value_;
        }

        /// Get a string buffer with collapsed consecutive
        /// double-quotes.
        /// This function is useful in constexpr context to
        /// retrieve cleaned-up cell data.
        /// \tparam BufferSize buffer size, has to be at least
        /// getOriginalStringView().size().
        /// \throw std::logic_error if BufferSize is too small.
        template<std::size_t BufferSize>
        [[nodiscard]] constexpr CellStringBuffer<BufferSize> getCleanStringBuffer() const
        {
            return CellStringBuffer<BufferSize>(value_);
        }


    private:

        /// Stored data
        std::string_view value_;

};

Putting it all together, we can now remove the escaped quotes at compile-time:

using namespace std::string_view_literals;

constexpr std::string_view data = R"(ab""cd,efgh)"sv;
constexpr std::size_t columns = 2, rows = 1;

constexpr Csv::Parser parser;

// parse into std::array<std::array<CellStringReference, rows>, columns>
constexpr auto matrix = parser.parseTo2DArray<columns, rows>(data);

// Verify the data at compile-time.
static_assert(matrix[0][0].getOriginalStringView() == "ab\"\"cd"sv);
static_assert(matrix[0][0]
    .getCleanStringBuffer<"ab\"\"cd"sv.size()>()
    .getStringView() == "with \"quote inside"sv);

Future Improvements

As you can see, with some work, even string processing can be done at compile-time.

The above code could be improved further by detecting and parsing a string representation of a IEEE 754 double value at compile-time. Unfortunately, std::strtod() and its family of C library functions cannot be used in constexpr context, so a similar constexpr function would have to be implemented.

Perhaps, with time, all the common algorithms and functions we're used to in C++ and C library will have constexpr versions.

Useful Links