This is the twelfth issue of the Data Awesome newsletter! Thank you for being a subscriber! You might notice this newsletter looks a bit different than previous ones. I made the switch from Mailchimp to Substack! Please reply to this message if you see anything wonky.
Data Awesome is back with a bunch of awesome stuff, just for you! Let’s get to it!
Awesome Articles
Markus Schmitt wrote a nice article comparing data orchestration tools: Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow. He also has a helpful article on Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS. I find tool comparisons like these to be super valuable when deciding what to use and learn.
When looking at a new dataset I often make a correlation matrix of numeric columns using seaborn’s heatmap. Here’s an example image from the seaborn docs.
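If you haven’t made one of these before, here’s a minimal sketch of the idea. The data is synthetic and the column names are made up for illustration; plot_corr is a hypothetical helper name, and it assumes seaborn and matplotlib are installed.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=200),  # strongly related to x
    "z": rng.normal(size=200),                     # unrelated noise
    "label": rng.choice(["a", "b"], size=200),     # non-numeric, excluded below
})

# Pearson correlation of the numeric columns only.
corr = df.select_dtypes(include="number").corr()


def plot_corr(corr_matrix):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1,
                     center=0, cmap="coolwarm")
    plt.show()
    return ax
```

Calling plot_corr(corr) in a notebook draws the heatmap. Fixing vmin and vmax at -1 and 1 keeps the color scale comparable from one dataset to the next.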
However, categorical features often get short shrift. If you want to evaluate how related your nominal or ordinal features are to each other and your target variable, check out An overview of correlation measures between categorical and continuous variables. You have a number of options for examining the relationships, depending upon your data types.
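As one example of such an option, Cramér’s V is a widely used measure of association between two nominal variables. The sketch below is a plain, uncorrected version built on pandas’ crosstab; the function name cramers_v and the toy data are my own, and I’m not claiming this is the exact formulation the linked article uses.

```python
import numpy as np
import pandas as pd


def cramers_v(a, b):
    """Uncorrected Cramér's V: 0 means no association, 1 means perfect."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0  # one variable has a single category, nothing to measure
    return float(np.sqrt(chi2 / (n * min_dim)))


# Toy data: each color always goes with the same size.
color = pd.Series(["red", "red", "blue", "blue", "green", "green"])
size = pd.Series(["S", "S", "M", "M", "L", "L"])
perfect = cramers_v(color, size)

# Toy data: every combination occurs equally often.
left = pd.Series(["x", "x", "y", "y"])
right = pd.Series(["p", "q", "p", "q"])
unrelated = cramers_v(left, right)
```

The first pair is perfectly associated and the second is independent, so the two results sit at the two ends of the scale. On small samples the uncorrected statistic tends to overstate association, so treat it as a rough guide.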
If you know SQL and want to get into data engineering, I suggest you check out Nicholas Chammas’s A Data Pipeline is a Materialized View. Reading it made me think about data pipelines differently. Hat tip to Conor Dewey’s Newsletter, which is where I found this article. I know I’ve recommended it before, but if you like Data Awesome, I think you’ll like Conor’s Newsletter, too!
Awesome Visualization
Andrei Kashcha made the vs app shown above that creates a graph with Google search results via a nifty animation. Here’s the app in action:
Don’t you think that’s pretty cool?
Awesome Package
TabNet is a deep neural network architecture that applies attention mechanisms to tabular data problems. It appears to work quite well on many kinds of supervised learning problems. Tabnet PyTorch is a PyTorch implementation of the architecture.
Awesome Book & Podcast
Emily Robinson and Jacqueline Nolis wrote the excellent Build a Career in Data Science book. They have a podcast of the same name that I recently binged on. It’s filled with great tips for folks in the field or looking to join.
As a co-organizer of the Data Science DC Meetup, I am excited that Emily and Jacqueline will be recording a live episode of their podcast at our online Meetup on May 11, 2021. The topic is managing your manager. Come join us and bring your questions!
Awesome Trend in Cloud Computing
Have you been hearing about serverless and want to know what all the fuss is about? Here’s the 30-second lowdown.
Serverless doesn’t mean there is no server; it means you don’t need to worry about the server.
Serverless is appealing because it can save you money. You pay per request rather than per minute.
There are a lot of serverless options out there, including several offerings by AWS. Abhishek Ray’s LearnAWS blog has many helpful articles on AWS, including this guide: AWS Fargate Deep Dive: What it is, when to use it and comparison with AWS Lambda and ECS.
What I’ve been up to:
scikit-learn’s 0.24 update brings lots of great changes. Here’s my guide to the highlights.
Reshaping NumPy arrays trips up lots of folks using machine learning libraries. It’s not a glamorous topic, but my guide should save you some time and help you learn how to deal with shape errors.
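To give a flavor of the kind of shape issue involved, here’s a small sketch of the most common case: a 1-D array that needs to become 2-D before a library such as scikit-learn will accept it as a feature matrix. The variable names are just for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # four samples of a single feature
print(x.shape)  # (4,)

# Many estimators expect a 2-D array shaped (n_samples, n_features).
X = x.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
print(X.shape)  # (4, 1)

# A single sample with four features goes the other way.
row = x.reshape(1, -1)
print(row.shape)  # (1, 4)

# ravel flattens back to one dimension.
flat = X.ravel()
print(flat.shape)  # (4,)
```

The -1 is a placeholder for “whatever size makes the total element count work,” so the same line keeps working when the number of samples changes.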
Memory is a big deal when using the pandas library. Run out and you have to take evasive maneuvers. Unfortunately, the memory reports can easily mislead you. Here’s my guide to getting at the truth, plus what to do when your data won’t fit into memory.
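One well-known way the reports mislead is with object (string) columns: by default pandas counts only the pointers, not the strings they point to. Here’s a minimal sketch with made-up data, assuming an object-dtype column, plus two common ways to shrink a frame. It illustrates the general idea and isn’t necessarily the approach the guide takes.

```python
import numpy as np
import pandas as pd

# Made-up data; the city column is forced to object dtype for the demo.
cities = pd.Series(["Washington", "Baltimore", "Richmond"] * 10_000, dtype=object)
df = pd.DataFrame({
    "city": cities,
    "count": np.arange(30_000, dtype="int64"),
})

shallow = df.memory_usage()        # object columns: pointers only
deep = df.memory_usage(deep=True)  # also counts the Python strings themselves
print(shallow["city"], deep["city"])

# Option 1: a repetitive string column is much smaller as a category.
as_category = df["city"].astype("category")

# Option 2: downcast integers to the smallest type that holds the values.
small_count = pd.to_numeric(df["count"], downcast="unsigned")

print(as_category.memory_usage(deep=True), small_count.memory_usage(deep=True))
```

Passing memory_usage="deep" to df.info() gives the same fuller accounting in the summary view.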
I asked Python data folks on Twitter where they do most of their coding. Small sample size and all kinds of potential selection bias issues, but interesting nonetheless.
That’s all the awesomeness for now! Until next time, stay awesome data people!
Data Awesome #12