From 10 Petabytes worth of data in Iceberg to LLMs, Cloud Migrations and Beyond!

The Data Science Dossier

Welcome to this week's Data Science Dossier. It's quite a large one this week; there's plenty going on as we get further into January and the vendors and suppliers get their collective acts together and spring into life.

From LLMs (I know, I try to mostly ignore them since so many folks cover them, but there's some interesting stuff here) through new database support in AWS, to some staggering stats from Bilibili and their Apache Iceberg OLAP Lakehouse, which truly demonstrates the capabilities of these open source platforms and how effective design can deliver such value to users.

YouTube Version

We now make a YouTube version of this roundup. If you'd prefer it in video form, enjoy!

From the community

  • First up this week we've got an interesting blog post from the folks over at Apache Beam about scaling workloads. Those of you who follow me will know I said at the start of the year that 2024 is going to be the real switchover to streaming systems for a lot of companies more used to processing in batch, and posts like this go a long way to proving that point: scaling Beam to one million events per second and beyond. A very cool and detailed blog post.

  • Next up is, once again, Databricks. This is actually a post from December, but worth pointing out to anyone in the AI space who missed it: Databricks now have Vector Search built into their platform. So if you're into LLMs and leverage Databricks, embed away!

  • Also out of Google Cloud this week is a blog post on how to use BigQuery data in LangChain. For those who don't know, LangChain is a framework that lets you connect LLMs to disparate data sources and control that pipeline, and it's pretty cool.

  • Sticking with Google, Google Cloud Next is on April 9th through 11th in Vegas, if that's your type of thing, mingling with other folks and sharing germs! On the plus side, there is early bird pricing through to the end of January.

  • Just to finish the LLM side of the house for today, the UK Government has released its GenAI guidance, and it makes for interesting reading if you're part of a large org trying to work out how to leverage AI in a sensitive space.

  • Moving on to Amazon: Postgres now supports the RDS Data API. For folks who leveraged the original API, it has been redesigned for scale.

  • Also, for Amazon users running Snowflake, Kinesis Data Firehose now supports Snowflake as a delivery destination, leveraging Snowpipe Streaming. Just because there wasn't enough data stored already!

  • Spotted on LinkedIn: a new startup has started rolling out access to their service in private beta. TinyAPI helps developers monetize their APIs, interesting stuff!

  • DARPA have recognised the PDF Association's contribution to SafeDocs and the great work done there.

  • And finally this week, a post I saw detailing how Bilibili built an OLAP Data Lakehouse with Apache Iceberg. Pretty interesting reading and some amazing stats: 1,000 tables, 10 PB of data, with an additional 75 TB landing each day, and Trino serving 200,000 queries each day with an average query time of 5 seconds. Great insight!
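Those Bilibili numbers are worth a quick sanity check. Here's a minimal back-of-the-envelope sketch: the raw figures (10 PB, 75 TB/day, 200,000 queries/day, 5 s average) come from the post, but the derived concurrency and accumulation figures are my own arithmetic, not numbers Bilibili published.

```python
# Back-of-the-envelope check of the stats quoted in the Bilibili post.
queries_per_day = 200_000
avg_query_seconds = 5
daily_ingest_tb = 75
total_pb = 10

# Total query time spent per day, and the average concurrency that implies.
query_seconds_per_day = queries_per_day * avg_query_seconds  # 1,000,000 s
seconds_per_day = 24 * 60 * 60                               # 86,400 s
avg_concurrent_queries = query_seconds_per_day / seconds_per_day

# How long 10 PB would take to accumulate at today's ingest rate.
days_to_10pb = total_pb * 1024 / daily_ingest_tb  # using 1 PB = 1024 TB

print(f"Average concurrent Trino queries: {avg_concurrent_queries:.1f}")  # ~11.6
print(f"Days of ingest to reach 10 PB: {days_to_10pb:.0f}")               # ~137
```

In other words, a sustained average of roughly a dozen concurrent queries, and well under half a year of ingest at current rates to accumulate the entire 10 PB. It really does show what an open source lakehouse stack can absorb.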

On the blog

Cloud migrations can be tricky, and there are a number of points during a migration where it can get bogged down, fall behind schedule, or worse, fail. So this week we've compiled 10 tips to ensure your cloud migration goes well, from planning to in-house experts and training. Make sure you get it right.

I’m Tom Barber

I assist businesses in maximizing the value of their data, enhancing efficiency, performance, and gaining deeper insights. Find out more on my website.
