Published February 20, 2022. Updated October 20, 2023.
Ruby CSV Parsing

Comma-Separated Values (CSV) is an old data format that came into use several decades ago and often shows up in low-tech solutions and legacy systems.

I’ve worked on too many teams that rely on complicated implementations: nested loops to iterate through each CSV row, no use of native CSV support for header key mapping, and so forth.

In this article, I’ll walk you through parsing CSVs, coupled with improved error handling, using Ruby’s native CSV gem and the Dry Schema gem. All of this will be done with minimal effort.

Quick Start

For those who would like to get started quickly, here’s the working implementation this article delves into.

#! /usr/bin/env ruby
# frozen_string_literal: true

# Save as `snippet`, then `chmod 755 snippet`, and run as `./snippet`.

require "bundler/inline"

gemfile true do
  source "https://rubygems.org"

  gem "amazing_print"
  gem "debug"
  gem "dry-schema"
  gem "dry-monads"
  gem "refinements"
end

require "csv"

Dry::Schema.load_extensions :monads

include Dry::Monads[:result]

using Refinements::Hashes

Schema = Dry::Schema.Params do
  before(:key_coercer) { |result| result.to_h.symbolize_keys! }

  required(:body).array(:hash) do
    required(:book).filled(:string)
    required(:author).filled(:string)
    required(:price).filled(:float)
    required(:created_at).filled(:date_time)
  end
end

class Parser
  HEADERS = {
    "Book" => :book,
    "Author" => :author,
    "Price" => :price,
    "CreatedAt" => :created_at
  }.freeze

  def initialize schema: Schema, headers: HEADERS, client: CSV
    @schema = schema
    @headers = headers
    @client = client
  end

  def call(body) = schema.call(body: csv(body)).to_monad

  private

  attr_reader :schema, :headers, :client

  def csv body
    client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
          .to_a
          .map(&:to_h)
  end
end

result = Parser.new.call <<~BODY
  Book,Author,Price,CreatedAt
  Mystics,urGoh,10.50,2022-01-01
  Skeksis,skekSil,20.75,2022-02-13
BODY

case result
  in Success(schema) then ap schema.to_h[:body]
  in Failure(schema) then ap schema.errors.to_h[:body]
end

When running the above script, you’ll get the following output:

[
  {
    :book => "Mystics",
    :author => "urGoh",
    :price => 10.5,
    :created_at => #<DateTime: 2022-01-01T00:00:00+00:00 ((2459581j,0s,0n),+0s,2299161j)>
  },
  {
    :book => "Skeksis",
    :author => "skekSil",
    :price => 20.75,
    :created_at => #<DateTime: 2022-02-13T00:00:00+00:00 ((2459624j,0s,0n),+0s,2299161j)>
  }
]

If you tweak the CSV body so it is malformed:

Book,Author,Price,CreatedAt
Mystics,,10.50,2022-01-01
Skeksis,skekSil,20.75,

…then you’ll get the following errors when running the script:

{
  0 => {
    :author => [
      "must be filled"
    ]
  },
  1 => {
    :created_at => [
      "must be filled"
    ]
  }
}

That’s a lot of power with only a little bit of code, but you might have questions about the implementation, so let’s break it down next.

Breakdown

We’ll start at the top and work our way down.

Pragmas

#! /usr/bin/env ruby
# frozen_string_literal: true

Pragmas, also known as magic comments, ensure the script runs as a Ruby program and that all strings are frozen for improved performance. You can learn more about pragmas via my Pragmater gem if you like.

Dependencies

Using a Bundler Inline script ensures dependencies are installed before the rest of the script executes. It’s definitely handy for small scripts like this, but you can always use my Rubysmith gem if you need more firepower.

As for the dependencies themselves, here are the details:

  • Amazing Print - I’m using this for pretty printing hashes at the end of the script via the ap message. I’ll touch upon this more later.

  • Debug - This is Ruby’s debug gem and is great for adding binding.break breakpoints to your code (see the brief example after this list).

  • Dry Schema - Provides a powerful DSL for analyzing and validating data structures. This is the primary power of the script, and I’ll expand upon it soon.

  • Dry Monads - Blends Functional Programming with our Object Oriented Design, which lends itself well to the pattern matching at the end of the script.

  • Refinements - This is my Ruby gem which refines core primitives and enhances the language without resorting to monkey patching.
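
If you haven’t used the debug gem before, here’s a tiny standalone illustration of a breakpoint (the surrounding code is arbitrary, not part of this script):

require "debug"

row = {book: "Mystics", price: "10.50"}
binding.break # Execution pauses here and opens an interactive debugging console.
puts row[:price]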

Setup

Once our dependencies are installed, there is a tiny bit of setup required:

require "csv"

Dry::Schema.load_extensions :monads

include Dry::Monads[:result]

using Refinements::Hashes

First, you’ll need to require the CSV gem so you can parse CSV content. Next, teach Dry Schema to use monads, and include the result monad, so you can pattern match later. Finally, use my Hashes refinement to symbolize the schema keys, since Ruby doesn’t natively support key symbolization.
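
To make the refinement concrete, here’s a small standalone illustration (the hash contents are arbitrary):

using Refinements::Hashes

{"book" => "Mystics", "author" => "urGoh"}.symbolize_keys!
# => {book: "Mystics", author: "urGoh"}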

Schema

Now that we understand the dependencies and the setup, we can talk about Dry Schema usage, which is the heart of our solution:

Schema = Dry::Schema.Params do
  before(:key_coercer) { |result| result.to_h.symbolize_keys! }

  required(:body).array(:hash) do
    required(:book).filled(:string)
    required(:author).filled(:string)
    required(:price).filled(:float)
    required(:created_at).filled(:date_time)
  end
end

Dry Schema provides params and JSON schema support by default. Even though we are dealing with a CSV, the difference between the two is which type coercion is applied; I’ll let you read the Dry Schema documentation to learn more. I do want to point out that, with both Params and JSON, all incoming keys are strings, which is why I use my Refinements gem to coerce the keys into symbols within the key_coercer before block. I prefer using symbols as keys when possible.
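
To see the coercion difference for yourself, here’s a minimal standalone sketch comparing the two (only Dry Schema is assumed):

params_schema = Dry::Schema.Params { required(:price).filled(:float) }
json_schema = Dry::Schema.JSON { required(:price).filled(:float) }

params_schema.call("price" => "10.5").to_h      # => {price: 10.5} (the string is coerced).
json_schema.call("price" => "10.5").errors.to_h # => {price: ["must be a float"]} (JSON expects typed values).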

Next up is the body of the CSV hash. Given the schema above, this equates to the following:

[
  {
    book: "Mystics",
    author: "urGoh",
    price: 10.5,
    created_at: #<DateTime: 2022-01-01T00:00:00+00:00 ((2459581j,0s,0n),+0s,2299161j)>
  }
]

Each element in the array is a CSV row converted to a hash. I also expect each CSV row to have certain columns, which are:

  • book: Must be filled as a string.

  • author: Must be filled as a string.

  • price: Must be filled as a float.

  • created_at: Must be filled as a date/time.

Dry Schema makes it convenient to define which keys and values are required as well as what you want the values coerced into. Normally, this coercion would require additional hand-rolled code, but here you avoid that work entirely.
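
For instance, here’s a quick sketch of the schema coercing a raw row on its own (using the article’s sample data with string values, as they arrive from a parsed CSV):

Schema.call(body: [{book: "Mystics", author: "urGoh", price: "10.50", created_at: "2022-01-01"}]).to_h
# => {body: [{book: "Mystics", author: "urGoh", price: 10.5, created_at: #<DateTime: 2022-01-01...>}]}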

Parser

With our schema defined, we can move on to the second part of this puzzle which is our CSV parser. Here’s the code for review:

class Parser
  HEADERS = {
    "Book" => :book,
    "Author" => :author,
    "Price" => :price,
    "CreatedAt" => :created_at
  }.freeze

  def initialize schema: Schema, headers: HEADERS, client: CSV
    @schema = schema
    @headers = headers
    @client = client
  end

  def call(body) = schema.call(body: csv(body)).to_monad

  private

  attr_reader :schema, :headers, :client

  def csv body
    client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
          .to_a
          .map(&:to_h)
  end
end

The core structure of this class is based on the Command and Barewords patterns which I’ve detailed before. Where things get interesting is with the initial parsing of the CSV and, later, when the body is consumed by the schema. Let’s start with the parsing of the CSV:

client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
      .to_a
      .map(&:to_h)

Here you’re asking the CSV client to build a CSV instance with headers enabled. The headers are important because you’ll need them to build each row as a hash of key/value pairs that can be handed off to the schema. The last step is to supply a header_converters closure that translates each header key into a symbol your schema understands. This means that if your header is "Book", it is looked up in the headers hash and translated to the :book symbol. The same goes for "Author" as :author, and so forth. Here’s a line-by-line breakdown so you can see how the CSV object evolves as it is transformed before being handed off to the schema:

client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
      .tap { |object| puts object.inspect }
      .to_a
      .tap { |object| puts object.inspect }
      .map(&:to_h)
      .tap { |object| puts object.inspect }

The above yields the following output; I’ve added comments to make each step’s output clearer:

# Step 1 - The CSV instance is initialized.
#<CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>

# Step 2 - The CSV instance is converted to an array of CSV rows.
[
  #<CSV::Row book:"Mystics" author:"urGoh" price:"10.50" created_at:"2022-01-01">,
  #<CSV::Row book:"Skeksis" author:"skekSil" price:"20.75" created_at:"2022-02-13">
]

# Step 3 - Each CSV row is converted into an array of hashes which Dry Schema can consume.
[
  {:book=>"Mystics", :author=>"urGoh", :price=>"10.50", :created_at=>"2022-01-01"},
  {:book=>"Skeksis", :author=>"skekSil", :price=>"20.75", :created_at=>"2022-02-13"}
]

That’s a lot of power from the CSV gem in only a few lines of code. 🎉

Now we can feed that information to our schema via this last line of code:

schema.call(body: csv(body)).to_monad

With the CSV parsed, all you have to do is message the schema with the CSV array and ask for the result to be converted into a monad for pattern matching later.
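
As a quick sanity check, you can see the monad wrapping directly (reusing the schema defined earlier):

Schema.call(body: [{book: "Mystics", author: "urGoh", price: "10.50", created_at: "2022-01-01"}]).to_monad
# => Success(#<Dry::Schema::Result ...>)

Schema.call(body: [{book: "Mystics"}]).to_monad
# => Failure(#<Dry::Schema::Result ...>) since author, price, and created_at are missing.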

Parsing

With the parser understood, you can now call it:

result = Parser.new.call <<~BODY
  Book,Author,Price,CreatedAt
  Mystics,urGoh,10.50,2022-01-01
  Skeksis,skekSil,20.75,2022-02-13
BODY

For illustration purposes, I’m inlining the CSV body via a heredoc, but you could also message the parser with contents read from a file.
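
For example, here’s a hedged sketch of feeding the parser a file instead (books.csv is a hypothetical path with the same columns):

result = Parser.new.call File.read("books.csv")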

Pattern Matching

At this point, we are at the end of the script where you can pattern match on the result monad as follows:

case result
  in Success(schema) then ap schema.to_h[:body]
  in Failure(schema) then ap schema.errors.to_h[:body]
end

The benefit of having the schema answer a monad is that you’ll always know the result is either a Success or a Failure. That’s it.

Normally, you’d use the success or failure to process the result such as messaging an API client or updating the UI. Instead, I’m using Amazing Print (i.e. ap) to print out the success or failure result for illustration purposes. I’ll leave it up to you to wire up whatever downstream processing you’d need next. 🚀
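
To sketch what that downstream wiring might look like (Book and logger are hypothetical stand-ins, not part of this script):

case result
  in Success(schema) then schema.to_h[:body].each { |row| Book.create! row } # Hypothetical persistence.
  in Failure(schema) then logger.error schema.errors.to_h[:body]             # Hypothetical logging.
end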

Next Steps

So far, I’ve focused on Dry Schema as the primary solution, but if you need richer error handling or customized rules, you’ll want to reach for Dry Validation, which is built on top of Dry Schema.
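
To give you a taste, here’s a minimal sketch of a Dry Validation contract with a custom rule (this assumes the dry-validation gem is installed; the contract and rule are purely illustrative):

require "dry/validation"

class BookContract < Dry::Validation::Contract
  params do
    required(:price).filled(:float)
  end

  # Custom rules express business logic that a schema alone can't.
  rule(:price) do
    key.failure("must be positive") if value <= 0
  end
end

BookContract.new.call(price: -1.0).errors.to_h
# => {price: ["must be positive"]}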

By the way, if it helps, both Dry Schema and Dry Validation are infinitely better than dealing with Active Model Validations or Action Controller Strong Parameters, so the sooner you can move away from using them, the happier and more efficient you’ll be.

Conclusion

I hope you’ve enjoyed learning how to parse CSVs, complete with error handling, in only a few lines of code by leveraging the Command Pattern along with native CSV and Dry Schema support.

Enjoy and may your CSV parsing implementations be fun to work with!