- What can stream
- Tweet type
- The fold
- Viewing CSV rows as tweets
- Regrouping the histogram
- Printing output
Performance: How fast our code runs, and how much memory it uses. I mostly refrain from writing about this topic, because it tends to be not only a distraction, but a source of anxiety – you’ve done something and accomplished your goal, but the nagging voices of real or imagined Internet commenters are still telling you that your code isn’t good enough.
“Performance anxiety can prevent you from doing what you enjoy and can affect your career.” (Stage Fright, WebMD)
The trouble with performance is that the goal can be nebulous. The code must be faster, faster, faster! Like a life lived in pursuit of money, performance anxiety raises the question: How fast is fast enough? How much money is enough? When can I stop and be happy?
The tweet histogram programs presented on the previous pages each take a few seconds to run. I suspect they could be rewritten to produce their results in an imperceptible amount of time, but I haven’t done this, because a few seconds was fast enough for me to stop and be happy with the result.
But I know that some of you are data scientists, and you have more than a few thousand tweets. You have genome sequences, astrological data, billions of financial transactions per second! You must accomplish tasks for which the speed is measured in hours, not seconds. This lesson is for you.
I’ll be starting with the Cassava-based histogram program and making two major changes.
- Regarding speed: I’ll be arranging the data into a
- Regarding memory: Instead of loading the whole file into memory at once, I’m going to consume the CSV file in a streaming way.
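To make the memory point concrete, here is a minimal sketch of what consuming rows in a streaming way can look like. It uses only `base` and `containers` and is a stand-in, not necessarily the libraries this lesson goes on to use. The names `firstField`, `countRows`, and `histogramFromFile` are hypothetical, and splitting on the first comma is a naive assumption that ignores CSV quoting.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for real CSV parsing: take the text before the
-- first comma. This ignores quoting and escaping.
firstField :: String -> String
firstField = takeWhile (/= ',')

-- A strict left fold: each row updates the running histogram and can then
-- be discarded, so memory use tracks the number of distinct keys rather
-- than the number of rows.
countRows :: [String] -> Map.Map String Int
countRows = foldl' step Map.empty
  where
    step acc row = Map.insertWith (+) (firstField row) 1 acc

-- readFile is lazy, so lines are read from disk as the fold demands them.
-- (Assumption: lazy I/O is acceptable for a sketch.)
histogramFromFile :: FilePath -> IO (Map.Map String Int)
histogramFromFile path = countRows . lines <$> readFile path

main :: IO ()
main = print (Map.toList (countRows sampleRows))
  where
    sampleRows = ["2018-01,hello", "2018-01,again", "2018-02,later"]
-- prints [("2018-01",2),("2018-02",1)]
```

The important design choice is the strict fold with `Data.Map.Strict`: a lazy fold or a lazy map would build up unevaluated thunks and lose the memory benefit of reading the file incrementally.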
Most of what we write for Type Classes uses a minimal set of libraries and extensions, because we’re usually trying to focus on a single topic in a way that’s accessible to a broad audience. This lesson strays from that approach. To show what my everyday code looks like, the code presented below is written using all the tools I use when I’m not writing specifically for learners.
In particular, this episode introduces several libraries without our usual completeness of explanation. Readers who desire to follow along thoroughly will need to be comfortable reading some API documentation independently.