The art of reducing pandas memory usage without losing data

When you work with large datasets, you run the risk of running out of memory.

An out-of-memory error is particularly frustrating: your programme crashes after you’ve patiently waited for it to load your data, only to reach the point where it can’t load any more.

Game over message, representing the frustration of running out of memory.
Out-of-memory errors can involve a lot of waiting, only to find out your programme has crashed. Photo by Sigmund on Unsplash.

Fortunately, there are plenty of best practices for overcoming this hurdle when working with Python and pandas, not least those in this excellent reference by Itamar Turner-Trauring.

This article focuses on the simple but effective technique of changing the data types of your pandas DataFrame to make it more memory…
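As a rough illustration of the data-type technique the teaser describes, the sketch below downcasts integer columns and converts a low-cardinality string column to a categorical. The DataFrame, its column names, and its values are made up for the example and do not come from the article; the actual savings depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: column names and values are illustrative only.
n_rows = 100_000
df = pd.DataFrame({
    "user_id": np.arange(n_rows, dtype="int64"),
    "rating": np.random.default_rng(0).integers(0, 11, size=n_rows).astype("int64"),
    "country": np.random.default_rng(1).choice(["GB", "FR", "DE"], size=n_rows),
})

# deep=True also counts the memory held by Python string objects.
before = df.memory_usage(deep=True).sum()

optimised = df.copy()
# Downcast each integer column to the smallest unsigned type that fits its values.
optimised["user_id"] = pd.to_numeric(optimised["user_id"], downcast="unsigned")
optimised["rating"] = pd.to_numeric(optimised["rating"], downcast="unsigned")
# Store each distinct string once and refer to it by a small integer code.
optimised["country"] = optimised["country"].astype("category")

after = optimised.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
print(optimised.dtypes)
```

No values change here: the integers still fit in the smaller types, and the categorical column holds the same strings.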

Get to know the techniques they don’t teach in the textbooks

When it comes to running a clustering/segmentation project, one of the most challenging tasks is determining how many clusters exist.

The good news is that there are plenty of statistical techniques to try to answer that question, ranging from the elbow method to t-SNE visualisation to the gap statistic.

The bad news is that these techniques are rarely conclusive. Machine learning courses use examples such as the Iris flower data set precisely because the number of clusters is known in advance and the clusters are quite easy to find.

When you finish studying and start working as a data…

How to get creative with your hypothesis tests

Hypothesis testing has been around for decades, with well-established methods of determining whether the results that have been observed are significant. Yet sometimes it can be easy to lose track of which testing approach to use or whether it can be reliably applied to your situation.

A real-world use case

Companies often use surveys to track their NPS (Net Promoter Score), a measure that’s designed to reflect customer loyalty and potential for growth.

The calculation is simple: ask customers how likely they are to recommend your business on a scale of 0–10, then subtract the percentage of negative responses from the percentage of positive ones.

How NPS is calculated
NPS calculation. Image by author.
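The calculation can be sketched in a few lines of Python. The teaser does not say which scores count as positive or negative, so this sketch assumes the conventional NPS cut-offs: promoters score 9 or 10, detractors score 0 to 6, and scores of 7 or 8 count as neutral. The function name and the sample responses are made up for illustration.

```python
def net_promoter_score(responses):
    """Return the NPS (between -100 and 100) for a list of 0-10 survey scores."""
    if not responses:
        raise ValueError("at least one response is required")
    # Assumed cut-offs (standard NPS convention, not stated in the text above).
    promoters = sum(1 for score in responses if score >= 9)
    detractors = sum(1 for score in responses if score <= 6)
    # Percentage of promoters minus percentage of detractors.
    return 100 * (promoters - detractors) / len(responses)


# Illustrative survey: 5 promoters, 3 detractors, 2 neutral responses.
sample_responses = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
nps = net_promoter_score(sample_responses)
print(nps)  # 20.0
```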

A personal reflection on keeping you and your team happy and productive

Do you ever wish that you could just manage a clone of yourself? Messages would never get lost in translation, you would know exactly what work your line report is capable of, and they would find your jokes hilarious.

Two identical twin toddlers wearing matching outfits
Photo by frank mckenna on Unsplash

But we can’t do that, and notwithstanding an inflated opinion of myself, it’s not particularly desirable. Teams instead work best when they have a diversity of background and thinking. Plus, it can be a useful exercise in humility to know that your manager is probably having the same thoughts.

So in this clone-less world, you need to embrace the challenges and…

Some errors are more costly than others; the way your model learns should reflect that

George Orwell’s novella Animal Farm includes the memorable line…

all animals are equal, but some animals are more equal than others ¹

Orwell may have been referring to hypocrisy, power, and privilege in society, but if you replace the word animals with errors, it starts to become very relevant to machine learning.

Now that I’ve finished pretending to be well-read, let’s get more specific.

Get to know your errors

Explaining the concept of false positives and negatives is a popular interview question because they are so important when applying a classification algorithm in practice.

I still sometimes find myself hastily consulting Wikipedia just before a…
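As a quick refresher, here is a minimal sketch that counts the four outcomes of a binary classifier and weights the two error types differently. The labels, predictions, and cost values are all hypothetical, chosen only to show that a false negative can be made to count for more than a false positive.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["tp"] += 1  # predicted positive, really positive
        elif predicted and not actual:
            counts["fp"] += 1  # false positive: predicted positive, really negative
        elif actual:
            counts["fn"] += 1  # false negative: predicted negative, really positive
        else:
            counts["tn"] += 1  # predicted negative, really negative
    return counts


def total_error_cost(counts, fp_cost=1.0, fn_cost=5.0):
    """Weight each error type by an assumed, purely illustrative cost."""
    return counts["fp"] * fp_cost + counts["fn"] * fn_cost


# Hypothetical labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)                    # {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 2}
print(total_error_cost(counts))  # 11.0
```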


