The Data Science Dossier

Unlocking Data: Change Data Capture, Llama2, Budget Airlines, and Similarity Matching - This Week's Data Dossier!

Tom Barber
October 23, 2023

Welcome to this week's Dossier! Brace yourself for a delightful dive into some captivating topics. On our LinkedIn channel, we've been stirring up an engaging SQL discussion. Did you know that SQL's first official standard was published way back in 1986? Talk about an ancient treasure! But it wasn't until SQL-92 that explicit JOIN syntax officially joined the party; before that, joins were written as comma-separated tables with the condition in the WHERE clause. Time sure knows how to fly, doesn't it? 😉

From the community

Dive deep into the inner workings of Stripe as they leverage Apache Flink for Change Data Capture. Explore the intricate mechanisms behind this process and gain a comprehensive understanding of how it all comes together.
Did you hear about Databricks Lakehouse AI? They've got Llama2 models that have reportedly outperformed GPT-3.5-turbo in recent tests! You can find these improved models in their marketplace, specifically designed for chatbots. And the best part? They come with MLflow, making it easy to use the Evaluation API and deploy to Databricks GPU model serving endpoints.
Now you can easily run Dataflow jobs right from your IntelliJ IDE with the new Cloud Code plugin by Google. It's a powerful tool that seamlessly guides you through project creation and publishing to Google's Dataflow. If you, like me, often work with Apache Beam projects on Dataflow, this plugin is a total game-changer. Say goodbye to complexities and embrace the convenience and efficiency it offers!
Continuing with Apache Beam, I came across a captivating blog post on their website. It delves into how LinkedIn handles a mind-boggling 4 trillion events every single day. Yes, you read that right: 4 trillion! If you're as intrigued as I am, the full post is on the Apache Beam blog.
Let's take a closer look at easyJet's strategic investment in Databricks through an interesting case study. With the power of this platform, combined with their vast data resources and AI capabilities, easyJet is aiming to unlock a wide range of optimizations in their operations. This integration has the potential to boost efficiency, streamline processes, and ultimately drive even more success.

On the blog

In today’s data-driven world, the ability to match and compare strings effectively is invaluable. Python offers powerful methods for similarity matching, including the Jaro-Winkler algorithm. This algorithm, commonly used in record linkage and data deduplication, counts characters that match within a sliding window, penalizes transpositions, and boosts the score for strings that share a common prefix. That makes it more tolerant of typos and swapped letters than exact comparison.

To use the Jaro-Winkler method in Python, we can turn to the recordlinkage library, which provides an implementation of this algorithm. We'll walk through data preparation, function implementation, and result interpretation, so you can perform similarity matching efficiently with Python and recordlinkage.
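
To make the idea concrete before the full post, here is a minimal from-scratch sketch of Jaro-Winkler in plain Python. It is not the recordlinkage implementation; the function names and the defaults (a prefix scale of 0.1 and a maximum prefix of 4 characters, the commonly quoted values) are assumptions for illustration only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0

    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity with a bonus for a shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


if __name__ == "__main__":
    # Hypothetical name pairs, purely for illustration.
    for left, right in [("MARTHA", "MARHTA"), ("Jon Smith", "John Smith")]:
        print(left, right, round(jaro_winkler(left, right), 4))
```

In a deduplication job you would typically normalize case and whitespace first, then treat pairs scoring above a threshold you choose as candidate matches for review.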

By combining a Jaro-Winkler implementation in Python with PostgreSQL’s pg_trgm extension, which scores strings by the three-character sequences (trigrams) they share, you can create powerful and efficient data processing pipelines. Stay tuned for a deep dive into these exciting topics.
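
For a feel of what trigram matching does, here is a rough Python approximation of the similarity score. It assumes the behavior described in the pg_trgm documentation (lowercase the text, pad each word with two leading spaces and one trailing space, then divide shared trigrams by total distinct trigrams); the real extension runs inside PostgreSQL and may treat non-alphanumeric or non-ASCII characters differently. The function names are hypothetical.

```python
import re


def trigrams(text: str) -> set:
    """Set of padded three-character sequences for each word in text."""
    grams = set()
    # Assumption: words are runs of ASCII letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(trigram_similarity("easyJet", "easy jet"))
```

One way to combine the two approaches is to let the database narrow down candidates cheaply with trigrams, then rescore the shortlist in Python with Jaro-Winkler.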

Jobs

Let's see what's open on the job market this week:

Open source company Canonical have an opening for a Kafka Data Infrastructure Engineer
UK TV company ITV are hiring a Head of Data Engineering
HubSpot are looking for a Senior Data Engineer
Container stalwarts Docker are hiring a Senior Data Engineer

I’m Tom Barber

I help businesses maximize the value of their data, improve efficiency and performance, and gain deeper insights. Find out more on my website.