Switching CSV libraries

Julie Moronuki

Contents

Setup
The types
Decoders
Generating the histogram
- Finding “Julie days”
- Generating ranges
- Counting tweets within a range
- Printing the histogram
A difference in our histograms

Video

57 minutes

For our four-year “anniversary” of becoming Twitter pals, I decided to see how this program looks written with a different library. I chose the sv library (available on Hackage), which didn’t even exist when Chris wrote the first version of this program. Twitter no longer provides your archives in CSV format, and I don’t have an archive of my own tweets that is old enough to be in CSV format, so I couldn’t analyze my own tweets with this program. For those reasons, I used Chris’s old tweet archive, and so my goal with this program was to produce a histogram that matched Chris’s original.

This is my first time writing a CSV-processing program. I had, of course, read Chris’s original code before I started writing this, but to be quite honest, I found it extremely difficult to read and understand. I have never been a Scala or Java programmer, and I don’t think in terms of one big main where all the action happens. It’s difficult for me to read such programs, and it’s nearly impossible for me to write them. So, what I have done here has ended up being quite different from his original program, and even fairly different from his refactored program. I didn’t read his refactored program before I wrote this, at least not until near the end, and I was surprised to find that, despite how different so much of our program looks, some of it looks exactly the same.

I was very happy with my decision to use the sv library here. It’s not really a CSV-parsing library; it’s a set of combinators and wrappers around a CSV-parsing library. It uses a library called hw-dsv (available on Hackage) for the parsing, and there is an sv-cassava package (also on Hackage) that provides the sv set of combinators and types but uses cassava for the parsing. As such, I’m not going to discuss how the parsing gets done at all; instead, I’ll focus on using the sv package.

Setup

I started off the project with two modules. For a bigger project, I likely would have wanted more, but it’s not always clear to me how many I want from the start, so I usually divide things up later. However, at a minimum I want to follow Haskell’s example and keep my IO separate from my pure functions. I called my two modules Main and Parse, where Main has the main executable and imports Parse. The latter probably isn’t a very good name for this module, but it’s fine.

The Main module also contains a couple of supporting definitions for the main executable. The sv library supports other delimiter-separated value file types and can work with or without headers, so the first things I added to Main were those options, along with the necessary imports.

module Main where

import Parse

import System.Exit (exitFailure)

import qualified Data.Sv as SV
import qualified Data.Sv.Decode as D

import Data.ByteString (ByteString)


opts :: SV.ParseOptions
opts = SV.defaultParseOptions

The default parse options there are for comma-separated values with headers, which is perfect for the Twitter data we’re working with.

Next, I chose to define a variable for the filepath.

file :: FilePath
file = "chris__martin-2017-04.csv"

We’re not making that file available. Change the file path appropriately if you’re following along with some other Twitter data.

Then comes the main parsing function from sv: parseDecodeFromFile, which, according to the documentation, loads a file, parses it, and decodes it. By decode they seem to mean the process of turning the parsed CSV into “a list of your Haskell datatype.” Although sv offers some other parse functions for different situations, this one seems to have the basic functionality we’re looking for. It takes three arguments: a decoder, some parse options (defined above as opts), and a file path (defined above as file). It returns an m (DecodeValidation ByteString [a]); the m is constrained by MonadIO, and I made it concrete as IO. So, my main parsing function looks like this.

readTweets :: IO (D.DecodeValidation ByteString [Tweet])
readTweets = SV.parseDecodeFromFile tweetsDecoder opts file
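The step of making the MonadIO-constrained m concrete as IO can be seen in isolation, without sv. The following is only a sketch using base: loadLines and loadLinesIO are hypothetical names invented for this illustration, and loadLines merely stands in for a library function that, like parseDecodeFromFile, is polymorphic over any MonadIO.

```haskell
import Control.Monad.IO.Class (MonadIO, liftIO)

-- Hypothetical stand-in for a library function that works in any monad
-- supporting IO. It is not part of sv.
loadLines :: MonadIO m => FilePath -> m [String]
loadLines path = liftIO (lines <$> readFile path)

-- Writing a concrete signature is all it takes to pick m = IO,
-- just as the signature of readTweets does.
loadLinesIO :: FilePath -> IO [String]
loadLinesIO = loadLines
```

Nothing in the body changes when m is specialized; the type signature alone makes the choice.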

The decoder, here called tweetsDecoder, is something I have to provide, and it amounts to specific instructions for how to read each field into my Tweet datatype. I wrote that in the other module, along with the Tweet type that this will make a list of.

If you’re already familiar with the Validation type, then you may already wonder if DecodeValidation is a reference to that, and it is! DecodeValidation is a type synonym for Validation, so it shares the same Applicative instance. (We have written about the Validation type and its Applicative previously.) I love working with the Validation type, so I was pretty pleased with this. You can see in Chris’s writeup that the equivalent part of his program returns an Either, so in his main, he’s case matching on Right and Left, but since I’m using a Validation type, mine will have Success and Failure for its two cases. It ended up not having any practical ramifications, because I never ended up having any errors, but, nevertheless, I always appreciate that, if I did, Validation has the ability to tell me all of the errors in one error message, instead of only the first one it failed on.
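To make the difference in error reporting concrete, here is a minimal sketch that needs only base. The Validation type below is a local stand-in written for this example (the real one comes from a library), and Row, parseIntE, rowEither, and rowValidation are hypothetical names that have nothing to do with the Tweet decoder.

```haskell
-- A local stand-in for Validation, defined here so the example is self-contained.
data Validation e a = Failure e | Success a
  deriving (Eq, Show)

instance Functor (Validation e) where
  fmap _ (Failure e) = Failure e
  fmap f (Success a) = Success (f a)

-- Unlike Either, the Applicative combines the errors from both sides.
instance Semigroup e => Applicative (Validation e) where
  pure = Success
  Failure e1 <*> Failure e2 = Failure (e1 <> e2)
  Failure e1 <*> Success _  = Failure e1
  Success _  <*> Failure e2 = Failure e2
  Success f  <*> Success a  = Success (f a)

-- Hypothetical two-field record, used only for this illustration.
data Row = Row Int Int
  deriving (Eq, Show)

-- Parse one named field, reporting the field name on failure.
parseIntE :: String -> String -> Either [String] Int
parseIntE field s = case reads s of
  [(n, "")] -> Right n
  _         -> Left [field ++ ": not a number"]

toValidation :: Either e a -> Validation e a
toValidation = either Failure Success

-- With Either, building the row stops at the first bad field.
rowEither :: String -> String -> Either [String] Row
rowEither a b = Row <$> parseIntE "retweets" a <*> parseIntE "likes" b

-- With Validation, every bad field is reported.
rowValidation :: String -> String -> Validation [String] Row
rowValidation a b =
  Row <$> toValidation (parseIntE "retweets" a)
      <*> toValidation (parseIntE "likes" b)
```

In GHCi, rowEither "x" "y" gives Left ["retweets: not a number"], while rowValidation "x" "y" gives Failure ["retweets: not a number","likes: not a number"].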

OK, so my first iteration of main looked like this.

main :: IO ()
main =
  do
    tweetList <- readTweets
    case tweetList of
      SV.Success tweetList -> do
        formatHistogram (generateHistogram tweetList)   -- these functions are in Parse
      SV.Failure e -> do
        putStrLn "Failed to parse and decode ze file:"
        print e
        exitFailure

The Failure case won’t ever need to change, I think; all it does is tell me there was a failure and print a list of the errors and then exit. The Success case, on the other hand, changed often as I worked through the program because I changed it each time to “test” various functions that I wrote. Once I had some basic decoding functions in place and the tweetsDecoder function working, I could, for example, print a tweet record by indexing into the list:

main :: IO ()
main =
  do
    tweetList <- readTweets
    case tweetList of
      SV.Success tweetList -> do
        print (tweetList !! 4)
        -- formatHistogram (generateHistogram tweetList)
      SV.Failure e -> do
        putStrLn "Failed to parse and decode ze file:"
        print e
        exitFailure

I do this a lot, and then usually run main in GHCi, because it helps me see what I’m doing. I need to see what the outputs of different steps look like; I need to see what I’m working with. So, while ghcid is useful and I do keep it running to keep the fast typechecking going, I also run main in GHCi a lot. In this case, I haven’t told you what the tweetsDecoder looks like yet, so for now it doesn’t work, but it will soon!
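One caveat about the indexing used in that test version of main: the !! operator throws an exception if the list has fewer elements than the index requires. That is fine for a quick look at a known archive, but a total alternative is easy to sketch. The names atMay and printAt below are hypothetical helpers written for this illustration; they are not part of sv or of the program above.

```haskell
import Data.Maybe (listToMaybe)

-- Hypothetical helper: like (!!), but returns Nothing instead of
-- throwing when the index is negative or past the end of the list.
atMay :: [a] -> Int -> Maybe a
atMay xs n
  | n < 0     = Nothing
  | otherwise = listToMaybe (drop n xs)

-- Print the element at the given index, or a short note if there is none.
printAt :: Show a => [a] -> Int -> IO ()
printAt xs n =
  maybe (putStrLn ("no element at index " ++ show n)) print (atMay xs n)
```

With these in scope, the Success branch could call printAt tweetList 4 in place of print (tweetList !! 4).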
