Data Awesome #15
The SQL Murder Mystery, M6 competition, BERTopic and other great resources for data people
Welcome to Data Awesome, the newsletter where I share awesome resources for data folks and sound like a mix of a grandpa and a tech-bro. 😉 Let’s get to it! 🚀
Awesome game to work on your SQL skills ⬆️
The SQL Murder Mystery is fun SQL practice for folks who want reps joining and filtering data. I co-organize the Data Science DC Meetup and this cool tool came up at a recent gathering.
Awesome Article 🖋
There are many ways to measure distance in data land. Different measures come in handy in different data science contexts. This lovely discussion of 9 Distance Measures in Data Science by Maarten Grootendorst helps bring order to the space. 🚀
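To make the idea concrete, here is a minimal sketch of three widely used measures in plain Python. The two points `p` and `q` are made up for illustration, and the function names are my own rather than anything from the article.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity; ignores magnitude, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical points, chosen so the numbers are easy to check by hand.
p, q = (0.0, 3.0), (4.0, 0.0)
print(euclidean(p, q))        # 5.0
print(manhattan(p, q))        # 7.0
print(cosine_distance(p, q))  # 1.0, because the vectors are perpendicular
```

The same pair of points gets three different answers, which is the whole point: the right measure depends on whether you care about straight-line separation, axis-by-axis differences, or direction alone.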
Awesome Packages 📦
Maarten is also the author of the awesome BERTopic Python package. BERTopic uses BERT (Bidirectional Encoder Representations from Transformers) so you can create several types of topic models. The package provides handy visualizations, too! BERTopic works nicely with Hugging Face transformers and popular NLP libraries such as spaCy and Gensim. By the way, if you are into NLP and haven’t checked out spaCy, I highly suggest you give it a whirl. I taught a lesson on the awesome library yesterday. 🙂
The AutoScraper package by Alireza Mika is billed as a “smart, automatic, fast and lightweight web scraper for Python”. It uses a clever method for finding results that are similar to those you want. I suspected maybe a deep learning language embedding model was part of the secret sauce. 🌶 Nope, the code uses the Python standard library’s difflib module. difflib’s SequenceMatcher class computes the similarity after the requests and Beautiful Soup libraries grab the HTML and parse it. Like many other scrapers, AutoScraper isn’t built for dynamic JavaScript-based sites. For those situations you’ll want a browser automation tool such as Selenium, driving a headless browser, to get the HTML.
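If you haven’t played with difflib before, here is a small sketch of what SequenceMatcher does. This is not AutoScraper’s actual code. The `similarity` helper and the selector-like strings are hypothetical, made up just to show how a similarity ratio can rank candidates.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical strings standing in for things a scraper might compare.
wanted = "div.product > span.price"
candidates = [
    "footer > a.link",
    "div.product > span.title",
    "div.product > span.price",
]

# Rank candidates from most to least similar to the one we want.
ranked = sorted(candidates, key=lambda c: similarity(wanted, c), reverse=True)
for candidate in ranked:
    print(f"{similarity(wanted, candidate):.2f}  {candidate}")
```

No model downloads, no GPU, just the standard library. That is a nice reminder that string matching goes a long way before you need embeddings.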
Awesome Keyboard Shortcut ✂️
It’s a timesaver to be able to see a preview of your HTML or Markdown file in your code editor. In Visual Studio Code (VS Code) you can open a Markdown preview pane to the side, and you don’t even need an extension. cmd + k, v is the magic key combo (ctrl + k, v on Windows and Linux). 🧙‍♂️ 🧙‍♀️
Awesome VS Code Extension 💻
Speaking of using Markdown in VS Code, the Markdown All in One extension by Yu Zhang adds lots of nifty functionality for writing Markdown. The extension gives you conveniences such as the ability to create a table of contents, paste links, and turn the open preview command above into a toggle preview command. 👍
But don’t just take my word for it: the extension has over 3 million installs and five stars. ⭐️⭐️⭐️⭐️⭐️
Awesome Newsletter 📨
dbt Labs puts out a lot of great thought leadership pieces on data analytics engineering. This Analytics Engineering Roundup post by Anna Filippova is a recap of some recent posts with thoughtful commentary. dbt Labs co-founder Tristan Handy is the other newsletter author and I find his posts super educational, too. 🚀
Awesome Competition 🥇
Time series analysis is an important but often overlooked area of machine learning. The Makridakis Open Forecasting Center (MOFC) puts on forecasting competitions with cash prizes that help advance the state of the art. The recent M5 competition used Walmart sales data. In their recap paper, Makridakis et al. found:
LightGBM … was used in practice by all of the top 50 competitors, thereby indicating that this method can be adopted by retail firms to improve the accuracy of their sales predictions and daily operation. However, it was also found that simple to implement and computationally cheap methods such as exponential smoothing were still competitive, especially when used to produce forecasts at the product or product-store level.
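For a sense of how cheap exponential smoothing is, here is a minimal sketch of simple exponential smoothing in plain Python. The sales numbers and the `alpha` value are made up, and starting the level at the first observation is just one common choice. This is not the setup used by any M5 competitor.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Each new level is a weighted average of the latest observation and the
    previous level. The final level serves as the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # assumption: initialize with the first observation
    levels = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        levels.append(level)
    return levels

# Hypothetical daily unit sales for one product.
sales = [10, 12, 14, 12]
levels = simple_exponential_smoothing(sales, alpha=0.5)
print(levels)      # [10, 11.0, 12.5, 12.25]
print(levels[-1])  # 12.25 is the forecast for the next day
```

A higher `alpha` reacts faster to recent sales, and a lower one smooths out more noise. That single knob is a big part of why the method is so easy to run at the product or product-store level.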
The M6 competition is all about forecasting stock prices — and it’s getting underway now, so check it out if you’re into time series. ⏰
Awesome Conference 🐍
PyCon 2022 is coming up at the end of April through the start of May. The conference will be held in Salt Lake City, with an online option for the primary conference activities. I attended my first PyCon in 2019 and had a great time. Drop me a line at @discdiver on Twitter if you are attending and want to meet up! 👋
That’s all the awesomeness for now! Until next time, stay awesome data people! 🙂 + 📊