Refactoring the CSV program

Tweet history project

Switching CSV libraries

Chris Martin

Contents

Semigroup in Prelude
Parts of main
Process-and-print
Specialization
Where
Count and filter
Range type
A list of ranges
Process, then print
With input
Read, decode, fail
Day parsing
Crashing concern
Interpret, filter, group
Zipping ranges with numbers
Error aggregation
Failure values
Using the header
Final result

Video

20 videos, 126 minutes total

Semigroup in Prelude

This program was written in 2016 and it is now 2020, so I anticipate that it may need some small adjustments to bring it up to date with the latest libraries. Fortunately, it does all still compile.

The only thing I see when I load this code into GHCi with -Wall is one warning:

warning: [-Wunused-imports]
    The import of ‘Data.Monoid’ is redundant
   |
13 | import Data.Monoid ((<>))
   | ^^^^^^^^^^^^^^^^^^^^^^^^^

I had imported the <> operator because it was not yet in Prelude at the time. Since GHC 8.4, released in 2018, Prelude exports <> itself, so the import is redundant and we can remove it.
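As a minimal sketch of what that looks like after the change, here is a small module that uses <> with no Data.Monoid import at all. It assumes GHC 8.4 or later; formatLine and its arguments are hypothetical names chosen for illustration, not definitions from the original program.

```haskell
module Main (main) where

-- No `import Data.Monoid ((<>))` here: with base 4.11 (GHC 8.4) or later,
-- Prelude already exports (<>) as a method of the Semigroup class.

-- | Format one line of output. The name and argument types are
-- illustrative assumptions, not taken from the original program.
formatLine :: String -> Int -> String
formatLine label count = label <> " " <> show count

main :: IO ()
main = putStrLn (formatLine "2016-01-01" 3)
```

On a compiler older than 8.4 the same module would fail with a "not in scope" error for <>, which is why the import was needed in 2016.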

Parts of main

To be honest, I can’t immediately tell what this code is doing – I think the biggest problem is that nearly all of it is in one big main definition, which I attribute to the indiscretions of my youth. The first thing I do when I find something like this is start to break it up into smaller definitions.

I do at least remember that this program follows a classic three-step pattern: read some stuff from a file, interpret the data, and print the results. So this is what I want main to look like:

main :: IO ()
main = readInput >>= processData >>= printOutput

Unfortunately, the program as I had written it doesn’t decompose this way. Look at what I had done:

    -- ...
    in  sequence_ $ do
            bin <- bins
            let count = julieDays
                      & mfilter (liftA2 (&&) (>= bin) (< nextBin bin))
                      & length
            return $ putStrLn $ show bin <> " " <> show count

This doesn’t ever produce a value that represents the output. Instead what we have here is an imperative-style loop that prints each line of output as it goes. The data processing and the output printing are intertwined. So I’m going to abandon this attempt to simplify main for the moment, and hope I can come back around to it eventually.
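To make the distinction concrete, here is a rough sketch of the shape the loop would need before main could be decomposed: the same list-monad iteration, but returning a (bin, count) pair instead of an IO action, with printing as a separate step. Everything here is an assumption made for illustration: days are plain Int values, bins are 7 wide, and countPerBin, formatLine, and printOutput are hypothetical names. The predicate is written as a lambda rather than with liftA2 to keep the imports small.

```haskell
module Main (main) where

import Control.Monad (mfilter)
import Data.Function ((&))

-- Hypothetical stand-in: a "day" is just an Int in this sketch.
type Day = Int

-- Hypothetical stand-in: each bin covers 7 days.
nextBin :: Day -> Day
nextBin bin = bin + 7

-- | Pure step: for each bin, count the days in [bin, nextBin bin).
-- The result is an ordinary value that represents the output.
countPerBin :: [Day] -> [Day] -> [(Day, Int)]
countPerBin bins days = do
    bin <- bins
    let count = days
              & mfilter (\d -> d >= bin && d < nextBin bin)
              & length
    return (bin, count)

-- | Pure step: render one result as a line of text.
formatLine :: (Day, Int) -> String
formatLine (bin, count) = show bin <> " " <> show count

-- | Effectful step: print every line, and nothing else.
printOutput :: [(Day, Int)] -> IO ()
printOutput = mapM_ (putStrLn . formatLine)

main :: IO ()
main = printOutput (countPerBin [0, 7, 14] [1, 2, 8, 20, 6])
```

With the counting isolated in countPerBin, the result can be inspected or tested without running any IO, which the original sequence_ loop does not allow.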

Process-and-print

I do think that everything that follows Right rows -> ... in the definition of main is begging to be written as its own top-level function.

main = do
    bs <- Bs.readFile "tweets.csv"
    let parsed = (Csv.decode Csv.HasHeader bs)
          :: Either String (Vector [Text])
    case parsed of
        Left err -> putStrLn err
        Right rows ->
            processDataAndPrintOutput rows

processDataAndPrintOutput rows =
    let julieDays = findJulieDays rows
        firstDay = minimum julieDays
    -- ...

I have given it an awkwardly long name to reflect my irritation that it does two things.

GHCi provides the type signature for the new function:

processDataAndPrintOutput ::
    (MonadPlus m, Foldable m) => m [Text] -> IO ()

But I’m going to simplify it, because I know that m here is always Vector: the rows are the Vector of tweets that we get from parsing the CSV file.

processDataAndPrintOutput ::
    Vector [Text] -> IO ()
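The same kind of specialization can be tried out on a smaller scale. The sketch below is not from the original program: countMatching is a hypothetical function with the general (MonadPlus m, Foldable m) constraints, and countMatchingList pins m to the list type, just as the signature above pins it to Vector. Lists are used here only so the example needs nothing beyond base.

```haskell
module Main (main) where

import Control.Monad (MonadPlus, mfilter)

-- | General version: works for any container that is both
-- MonadPlus (for mfilter) and Foldable (for length).
countMatching :: (MonadPlus m, Foldable m) => (a -> Bool) -> m a -> Int
countMatching p = length . mfilter p

-- | Specialized version: the same definition with m fixed to lists.
-- Narrowing a signature like this never changes behavior; it only
-- restricts which types callers may pass in.
countMatchingList :: (a -> Bool) -> [a] -> Int
countMatchingList = countMatching

main :: IO ()
main = do
    print (countMatching even (Just (4 :: Int)))        -- prints 1
    print (countMatchingList even [1, 2, 3, 4 :: Int])  -- prints 2
```

The narrower signature gives up generality that this program never uses, and in exchange the type tells the reader exactly what is being passed around.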
