Data Awesome #12
The latest tips and tools to help you rock all the data things!
This is the twelfth issue of the Data Awesome newsletter! Thank you for being a subscriber! 👏 You might notice this newsletter looks a bit different than previous ones. I made the switch from Mailchimp to Substack! Please reply to this message if you see anything wonky.
Data Awesome is back with a bunch of awesome stuff, just for you! Let’s get to it! 🚀
Markus Schmitt wrote a nice article comparing data orchestration tools: Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow. He also has a helpful article on Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS. I find tool comparisons like these to be super valuable when deciding what to use and learn. 🎉
When looking at a new dataset I often make a correlation matrix of numeric columns using seaborn’s heatmap. Here’s an example image from the seaborn docs.
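Here is a minimal sketch of that workflow, assuming a small made-up DataFrame (the column names and numbers are hypothetical, not from any real dataset). The correlation matrix comes from pandas, and the plotting helper hands it to seaborn's `heatmap`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three numeric columns, two of them strongly related.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.9 * height - 85 + rng.normal(0, 5, size=200),
    "noise": rng.normal(0, 1, size=200),
})

# Pairwise Pearson correlations of the numeric columns only.
corr = df.select_dtypes("number").corr()


def plot_corr(corr_matrix):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    # Imported here so the numeric part above runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    plt.show()
    return ax
```

Calling `plot_corr(corr)` draws the figure. Pinning `vmin` and `vmax` to -1 and 1 keeps the color scale comparable from one dataset to the next.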
However, categorical features often get short shrift. If you want to evaluate how related your nominal or ordinal features are to each other and your target variable, check out An overview of correlation measures between categorical and continuous variables. You have a number of options for examining the relationships, depending upon your data types. 🔎
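As one example of such a measure, Cramér's V is commonly used for two nominal variables. The sketch below computes the plain, uncorrected version from a contingency table. The `cramers_v` helper and the tiny DataFrame are hypothetical illustrations, not code from the linked article.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical series: 0 = no association, 1 = perfect."""
    table = pd.crosstab(x, y).to_numpy().astype(float)
    n = table.sum()
    # Counts expected if the two variables were independent.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        return float("nan")  # undefined when either variable has a single level
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical data for illustration.
df = pd.DataFrame({
    "plan": ["free", "free", "paid", "paid"],
    "churned": ["yes", "yes", "no", "no"],
    "region": ["east", "west", "east", "west"],
})
v_related = cramers_v(df["plan"], df["churned"])    # plan fully determines churned
v_unrelated = cramers_v(df["plan"], df["region"])   # plan tells you nothing about region
```

The uncorrected statistic tends to overstate association in small samples, so treat values from tiny tables like this one as a demonstration only.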
If you know SQL and want to get into data engineering, I suggest you check out Nicholas Chammas’s A Data Pipeline is a Materialized View. Reading it made me think about data pipelines differently. Hat tip to Conor Dewey’s Newsletter, which is where I found this article. I know I’ve recommended it before, but if you like Data Awesome, I think you’ll like Conor’s Newsletter, too!
Awesome Visualization 🖼
Don’t you think that’s pretty cool?
Awesome Package 🔥
TabNet is a deep neural network architecture that applies an attention mechanism to tabular data problems. It appears to work quite well on many kinds of supervised learning problems. TabNet PyTorch is a PyTorch implementation of the architecture.
Awesome Book & Podcast 🎙
Emily Robinson and Jacqueline Nolis wrote the excellent Build a Career in Data Science book. They have a podcast of the same name that I recently binged. It’s filled with great tips for folks in the field or looking to join. 🎤
As a co-organizer of the Data Science DC Meetup, I am excited that Emily and Jacqueline will be recording a live episode of their podcast at our online Meetup on May 11, 2021. The topic is managing your manager. Come join us and bring your questions! 🎉
Awesome Trend in Cloud Computing ☁️
Have you been hearing about serverless and want to know what all the fuss is about? Here’s the 30-second lowdown. ⏳
Serverless doesn’t mean there is no server; it means you don’t need to worry about the server. 😁
Serverless is appealing because it can save you money. You pay per request rather than per minute. 💰
There are a lot of serverless options out there, including several offerings by AWS. Abhishek Ray’s LearnAWS blog has many helpful articles on AWS, including this guide: AWS Fargate Deep Dive: What it is, when to use it and comparison with AWS Lambda and ECS.
What I’ve been up to 🖊:
scikit-learn’s 0.24 update brings lots of great changes. Here’s my guide to the highlights. 🎆
Reshaping NumPy arrays trips up lots of folks using machine learning libraries. It’s not a glamorous topic, but my guide should save you some time and help you learn how to deal with shape errors. 🔺🔵
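A typical case, sketched here with a made-up array: scikit-learn estimators expect a 2D feature array shaped `(n_samples, n_features)`, so a 1D array has to be reshaped first. This is just the most common fix, not a summary of the whole guide.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,): a single dimension

# -1 tells NumPy to infer that dimension from the array's size.
X_col = x.reshape(-1, 1)   # shape (4, 1): four samples, one feature
X_row = x.reshape(1, -1)   # shape (1, 4): one sample, four features

flat = X_col.ravel()       # back to shape (4,)
```

Printing `.shape` before passing an array to a model is usually the quickest way to see which of these you actually have.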
Memory is a big deal when using the pandas library. Run out and you have to take evasive maneuvers. 😯 Unfortunately, the memory reports can easily mislead you. Here’s my guide to getting at the truth, plus what to do when your data won’t fit into memory.
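One way the reports mislead, sketched with made-up data: for object columns, the default `memory_usage()` counts only the pointers, not the Python strings they point to. Passing `deep=True` counts the strings too. Converting a low-cardinality column to the `category` dtype is one common way to shrink it. The city names below are hypothetical.

```python
import pandas as pd

# Hypothetical column: 3,000 rows but only three distinct strings.
cities = ["Washington", "Baltimore", "Richmond"]
df = pd.DataFrame({"city": pd.Series(cities * 1000, dtype=object)})

# Default report: size of the object pointers only.
shallow = df["city"].memory_usage(index=False)

# deep=True also counts the Python string objects themselves.
deep = df["city"].memory_usage(index=False, deep=True)

# The category dtype stores each distinct string once plus small integer codes.
as_category = df["city"].astype("category").memory_usage(index=False, deep=True)

print(f"shallow={shallow:,} bytes  deep={deep:,} bytes  category={as_category:,} bytes")
```

The exact byte counts depend on your Python and pandas versions, but the deep figure should be noticeably larger than the shallow one, and the categorical version far smaller than both.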
I asked Python data folks on Twitter where they do most of their coding. Small sample size and all kinds of potential selection bias issues, but interesting nonetheless.
That’s all the awesomeness for now! Until next time, stay awesome, data people! 🎉