This is the twelfth issue of the Data Awesome newsletter! Thank you for being a subscriber! You might notice this newsletter looks a bit different than previous ones. I made the switch from Mailchimp to Substack! Please reply to this message if you see anything wonky.
Data Awesome is back with a bunch of awesome stuff, just for you! Let's get to it!
Awesome Articles
Markus Schmitt wrote a nice article comparing data orchestration tools: Airflow vs. Luigi vs. Argo vs. MLFlow vs. KubeFlow. He also has a helpful article on Scaling Pandas: Comparing Dask, Ray, Modin, Vaex, and RAPIDS. I find tool comparisons like these to be super valuable when deciding what to use and learn.
When looking at a new dataset, I often make a correlation matrix of numeric columns using seaborn's heatmap. Here's an example image from the seaborn docs.
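If you want to try this yourself, here is a minimal sketch of the correlation-matrix heatmap described above. The DataFrame and its column names are made up for illustration, `numeric_corr` and `plot_corr` are hypothetical helper names, and the plotting step assumes seaborn and matplotlib are installed.

```python
import pandas as pd


def numeric_corr(df: pd.DataFrame) -> pd.DataFrame:
    """Return the Pearson correlation matrix of the numeric columns only."""
    return df.select_dtypes(include="number").corr()


def plot_corr(corr: pd.DataFrame):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import seaborn as sns  # imported lazily so the numeric part runs without it

    # Fix the color scale to [-1, 1] so colors are comparable across datasets.
    return sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")


# Made-up example data: two related numeric columns plus one text column.
df = pd.DataFrame(
    {
        "height_cm": [150, 160, 170, 180, 190],
        "weight_kg": [50, 58, 69, 77, 90],
        "city": ["A", "B", "A", "C", "B"],
    }
)
corr = numeric_corr(df)  # the "city" column is dropped automatically
```

Calling `plot_corr(corr)` in a notebook should then render the heatmap.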
However, categorical features often get short shrift. If you want to evaluate how related your nominal or ordinal features are to each other and your target variable, check out An overview of correlation measures between categorical and continuous variables. You have a number of options for examining the relationships, depending upon your data types.
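As one concrete example of such a measure, here is a sketch of Cramér's V, a common statistic for the association between two nominal features. Picking Cramér's V is my assumption for illustration, not necessarily the option the linked article recommends for your data. The function name `cramers_v` is hypothetical, and this is the plain version without a bias correction.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y) -> float:
    """Cramér's V between two categorical sequences (0 = none, 1 = perfect)."""
    # Contingency table of observed counts for every (x, y) category pair.
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        return 0.0  # one variable is constant, so there is nothing to measure
    return float(np.sqrt(chi2 / (n * min_dim)))


# Made-up data: the first pair moves together, the second pair does not.
related = cramers_v(list("aabb"), list("uuvv"))
unrelated = cramers_v(list("aabb"), list("uvuv"))
```

Values near 1 suggest the two features carry much of the same information. Values near 0 suggest they are close to independent.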
If you know SQL and want to get into data engineering, I suggest you check out Nicholas Chammas's A Data Pipeline is a Materialized View. Reading it made me think about data pipelines differently. Hat tip to Conor Dewey's Newsletter, which is where I found this article. I know I've recommended it before, but if you like Data Awesome, I think you'll like Conor's Newsletter, too!
Awesome Visualization
Andrei Kashcha made the vs app shown above that creates a graph with Google search results via a nifty animation. Here's the app in action:
Don't you think that's pretty cool?
Awesome Package
TabNet implements a deep neural network architecture that brings attention to tabular data problems. It appears to work quite well on many kinds of supervised learning problems. Tabnet PyTorch is a PyTorch implementation of the architecture.
Awesome Book & Podcast
Emily Robinson and Jacqueline Nolis wrote the excellent Build a Career in Data Science book. They have a podcast of the same name that I recently binged on. It's filled with great tips for folks in the field or looking to join.
As a co-organizer of the Data Science DC Meetup, I am excited that Emily and Jacqueline will be recording a live episode of their podcast at our online Meetup on May 11, 2021. The topic is managing your manager. Come join us and bring your questions!
Awesome Trend in Cloud Computing
Have you been hearing about serverless and want to know what all the fuss is about? Here's the 30-second low-down.
Serverless doesn't mean there is no server; it means you don't need to worry about the server.
Serverless is appealing because it can save you money. You pay per request rather than per minute.
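To make the pay-per-request point concrete, here is a back-of-the-envelope sketch. Both prices are hypothetical round numbers I picked for the arithmetic, not real quotes from any cloud provider, and the function names are made up too.

```python
# Hypothetical prices, chosen only to illustrate the arithmetic.
PRICE_PER_REQUEST = 0.00001  # dollars per request (serverless-style billing)
PRICE_PER_MINUTE = 0.001     # dollars per minute (always-on server billing)
MINUTES_PER_MONTH = 30 * 24 * 60


def per_request_cost(requests: int) -> float:
    """Monthly bill when you pay only for the requests you serve."""
    return requests * PRICE_PER_REQUEST


def always_on_cost() -> float:
    """Monthly bill when the server runs every minute, busy or idle."""
    return MINUTES_PER_MONTH * PRICE_PER_MINUTE


def break_even_requests() -> float:
    """Monthly request count at which the two bills are equal."""
    return always_on_cost() / PRICE_PER_REQUEST


low_traffic_bill = per_request_cost(100_000)
server_bill = always_on_cost()
```

Under these made-up prices, a low-traffic app is far cheaper on per-request billing, and the advantage shrinks as traffic approaches `break_even_requests()`.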
There are a lot of serverless options out there, including several offerings by AWS. Abhishek Ray's LearnAWS blog has many helpful articles on AWS, including this guide: AWS Fargate Deep Dive: What it is, when to use it and comparison with AWS Lambda and ECS.
What I've been up to:
scikit-learn's 0.24 update brings lots of great changes. Here's my guide to the highlights.
Reshaping NumPy arrays trips up lots of folks using machine learning libraries. It's not a glamorous topic, but my guide should save you some time and help you learn how to deal with shape errors.
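Here is a tiny sketch of the most common fix. Libraries such as scikit-learn generally expect a 2D array of shape (n_samples, n_features), so a single feature stored as a 1D array needs reshaping first. The variable names are just for illustration.

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 4.0])  # 1D array, shape (4,)

# -1 tells NumPy to infer that dimension from the array's size.
as_column = feature.reshape(-1, 1)  # 4 samples, 1 feature: shape (4, 1)
as_row = feature.reshape(1, -1)     # 1 sample, 4 features: shape (1, 4)

# ravel() flattens back to 1D, which is what most libraries want for targets.
flat_again = as_column.ravel()      # shape (4,)
```

When you hit a shape error, printing `.shape` on each array is usually the fastest way to see which of these forms you have.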
Memory is a big deal when using the pandas library. Run out and you have to take evasive maneuvers. Unfortunately, the memory reports can easily mislead you. Here's my guide to getting at the truth, plus what to do when your data won't fit into memory.
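A quick sketch of one way the report can mislead: for object (Python string) columns, `memory_usage()` by default counts only the pointers, not the strings themselves, so you need `deep=True` to see the real footprint. The data below is made up, and I force `dtype=object` so the example behaves the same regardless of your pandas version's default string type.

```python
import pandas as pd

# Made-up data: 30,000 rows drawn from only three distinct strings.
states = pd.Series(["Virginia", "Maryland", "Delaware"] * 10_000, dtype=object)
df = pd.DataFrame({"state": states})

# Default report: counts the 8-byte pointers, not the string contents.
shallow_bytes = df.memory_usage().sum()

# Deep report: also measures the Python string objects themselves.
deep_bytes = df.memory_usage(deep=True).sum()

# One evasive maneuver: store repeated strings as a categorical column.
as_category = df["state"].astype("category")
category_bytes = as_category.memory_usage(deep=True)
```

Comparing `shallow_bytes`, `deep_bytes`, and `category_bytes` on your own data shows how far off the default report is and how much a categorical column could save.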
I asked Python data folks on Twitter where they do most of their coding. Small sample size and all kinds of potential selection bias issues, but interesting nonetheless.
That's all the awesomeness for now! Until next time, stay awesome, data people!