Refactoring the CSV program

Contents
  • Semigroup in Prelude
  • Parts of main
  • Process-and-print
  • Specialization
  • Where
  • Count and filter
  • Range type
  • A list of ranges
  • Process, then print
  • With input
  • Read, decode, fail
  • Day parsing
  • Crashing concern
  • Interpret, filter, group
  • Zipping ranges with numbers
  • Error aggregation
  • Failure values
  • Using the header
  • Final result
Video
  • 20 videos, 126 minutes total

The original program was one of the first things I ever produced in Haskell, and it looks very different from something I’d write today. In this lesson, I walk through the process of cleaning it up, thinking carefully about what makes a well-designed program.

A few of the general ideas covered here:

  • Splitting up long definitions
  • Separating pure functions from I/O
  • Avoiding the use of partial functions
  • Printing aggregated error information

Semigroup in Prelude

This program was written in 2016 and it is now 2020, so I anticipate that it may need some small adjustments to bring it up to date with the latest libraries. Fortunately, it does all still compile.

The only thing I see when I load this code into GHCi with -Wall is one warning:

warning: [-Wunused-imports]
    The import of ‘Data.Monoid’ is redundant
   |
13 | import Data.Monoid ((<>))
   | ^^^^^^^^^^^^^^^^^^^^^^^^^

I had imported the <> operator because it was not yet in Prelude at the time. Since GHC 8.4, released in 2018, Prelude exports <> itself, so the import is redundant and we can remove it.
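As a quick sanity check, here is a minimal sketch (not taken from the original program) of a file that uses <> with no Data.Monoid import. It assumes GHC 8.4 or later; the name outputLine and the sample values are made up for illustration.

```haskell
-- No `import Data.Monoid ((<>))` here: on GHC 8.4 or later,
-- Prelude already exports (<>).

-- Hypothetical example value, shaped like one line of the program's output.
outputLine :: String
outputLine = "2016-01-01" <> " " <> show (3 :: Int)

main :: IO ()
main = putStrLn outputLine
```

On an older compiler the same file would fail with a "not in scope" error for <>, which is why the import was there in the first place.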

Parts of main

To be honest, I can’t immediately tell what this code is doing. I think the biggest problem is that nearly all of it is in one big main definition, which I attribute to the indiscretions of my youth. The first thing I do when I find something like this is start to break it up into smaller definitions.

I do at least remember that this program follows a classic three-step pattern: read some stuff from a file, interpret the data, and print the results. So this is what I want main to look like:

main :: IO ()
main = readInput >>= processData >>= printOutput
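For that chain to typecheck, each step has to hand a value to the next one. Here is a minimal sketch of the shape I have in mind. The Input and Output types, the summarize function, and the hard-coded data are all placeholders I made up for illustration; none of them come from the real program.

```haskell
-- Placeholder types for illustration only; not from the real program.
newtype Input = Input [Int]
    deriving (Eq, Show)

newtype Output = Output Int
    deriving (Eq, Show)

-- Stands in for reading and decoding a file.
readInput :: IO Input
readInput = pure (Input [3, 1, 2])

-- The interesting work is a pure function...
summarize :: Input -> Output
summarize (Input xs) = Output (sum xs)

-- ...lifted into IO only so that it can sit in the (>>=) chain.
processData :: Input -> IO Output
processData = pure . summarize

printOutput :: Output -> IO ()
printOutput (Output n) = print n

main :: IO ()
main = readInput >>= processData >>= printOutput
```

The important part is that processData returns a value describing the result, and only printOutput does any printing.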

Unfortunately, the program as I had written it doesn’t decompose this way. Look at what I had done:

    -- ...
    in  sequence_ $ do
            bin <- bins
            let count = julieDays
                      & mfilter (liftA2 (&&) (>= bin) (< nextBin bin))
                      & length
            return $ putStrLn $ show bin <> " " <> show count

This doesn’t ever produce a value that represents the output. Instead, what we have here is an imperative-style loop that prints each line of output as it goes. The data processing and the output printing are intertwined. So I’m going to abandon this attempt to simplify main for the moment, and hope I can come back around to it eventually.
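To make the contrast concrete, here is a rough sketch of the alternative: compute a list of (bin, count) pairs first, and print them in a separate step. This is not the real program. The bins are plain Ints, nextBin is a made-up stand-in that adds 10, and the names countPerBin and formatLine are hypothetical.

```haskell
import Data.Foldable (traverse_)

-- Hypothetical stand-in: each bin covers ten consecutive values.
nextBin :: Int -> Int
nextBin = (+ 10)

-- Pure step: for each bin, count the values x with bin <= x < nextBin bin.
countPerBin :: [Int] -> [Int] -> [(Int, Int)]
countPerBin bins xs =
    [ (bin, length (filter (\x -> x >= bin && x < nextBin bin) xs))
    | bin <- bins
    ]

-- Pure step: render one line of output.
formatLine :: (Int, Int) -> String
formatLine (bin, count) = show bin <> " " <> show count

-- I/O step: the only place where anything is printed.
main :: IO ()
main =
    traverse_ (putStrLn . formatLine)
        (countPerBin [0, 10, 20] [1, 5, 12, 25, 27, 29])
```

With the counting pulled out into countPerBin, the result can be inspected in GHCi or tested without printing anything.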

Process-and-print

I do think that everything that follows Right rows -> ... in the definition of main is begging to be written as its own top-level function.

main = do
    bs <- Bs.readFile "tweets.csv"
    let parsed = (Csv.decode Csv.HasHeader bs)
          :: Either String (Vector [Text])
    case parsed of
        Left err -> putStrLn err
        Right rows ->
            processDataAndPrintOutput rows

processDataAndPrintOutput rows =
    let julieDays = findJulieDays rows
        firstDay = minimum julieDays
    -- ...

I have given it an awkwardly long name to reflect my irritation that it does two things.

GHCi provides the type signature for the new function:

processDataAndPrintOutput ::
    (MonadPlus m, Foldable m) => m [Text] -> IO ()

But I’m going to simplify it, because I know that m is always Vector here: the rows are the Vector of tweets that we get from parsing the CSV file.

processDataAndPrintOutput ::
    Vector [Text] -> IO ()
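The general signature comes from the functions used in the body: mfilter requires MonadPlus, while length and minimum require Foldable. Here is a small sketch of the same situation, using lists instead of Vector so that it only needs base. The function names are made up for illustration.

```haskell
import Control.Monad (MonadPlus, mfilter)

-- General version: mfilter asks for MonadPlus, length asks for Foldable,
-- so the signature ends up with both constraints.
countAtLeast :: (MonadPlus m, Foldable m) => Int -> m Int -> Int
countAtLeast n xs = length (mfilter (>= n) xs)

-- Specialized version: the same function with a concrete container chosen.
-- (The real program would pick Vector here; a list is a stand-in.)
countAtLeastList :: Int -> [Int] -> Int
countAtLeastList = countAtLeast

main :: IO ()
main = do
    print (countAtLeastList 3 [1, 3, 5])
    -- The general version also accepts other containers, such as Maybe.
    print (countAtLeast 3 (Just 2))
```

The specialized signature gives up that flexibility, but it is easier to read and says exactly what the function is used for.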
