The Data Science Dossier
A weekly roundup of data science news

From the community
AWS Announces Amazon DataZone GA to Simplify Data Discovery and Governance: Amazon DataZone is a new service from AWS that makes it easier for organizations to discover, share, and govern their data. It provides a central place for users to find and access data from a variety of sources, including on-premises systems, cloud services, and data lakes. DataZone also includes data governance features such as access control, lineage tracking, and auditing.
Revolutionizing Bioscience Research: Creating an Atlas of the Human Body: A team of researchers from HPE, NVIDIA, and Flywheel is working to create an atlas of the human body using high-performance computing (HPC), artificial intelligence (AI), and edge-to-cloud technologies. The goal of the project is to build a comprehensive map of the human body that can be used to accelerate biomedical research and improve healthcare outcomes.
Meta's Llama 2 Long: Longer Memory, Handles Heftier Tasks: Meta AI has released a new large language model (LLM) called Llama 2 Long. It supports a longer context window than previous Llama models, which helps it perform better on tasks such as summarization, translation, and question answering. It can also handle heftier tasks, such as generating code and writing different kinds of creative content.
Google AI researchers propose advanced long-context LLMs: Researchers at Google AI have proposed a new method for training LLMs that can handle longer contexts. The new method allows LLMs to learn from more data and to generate more accurate and informative responses.
Google's DeepMind is revolutionizing robotics: Google's DeepMind AI research team is developing new AI algorithms that are revolutionizing robotics. DeepMind's algorithms have enabled robots to learn new tasks quickly and to perform complex tasks in challenging environments.
Tom Hanks Warns Fans of AI-Generated Deepfake: Actor Tom Hanks has warned fans about the dangers of AI-generated deepfakes, which are videos or audio recordings manipulated to make it look or sound like someone said or did something they never did. He urged fans to be critical of the information they see online and to be aware of the potential for deepfakes to be used to spread misinformation.
Leveraging AI to Prevent Homelessness: A Game-Changer in Los Angeles: The city of Los Angeles is using AI to help prevent homelessness. The city has developed an AI-powered model that can predict which individuals are most at risk of becoming homeless. This information is then used to provide targeted support services to these individuals.
The 300% Problem: A Challenge for Infrastructure as Code: This one isn't data-specific, but it rang true from many of the places I've worked over the years. The 300% problem is a concept Lee Briggs introduced in his blog post "The 300% Production Problem": to use an abstraction successfully, you need to understand the problem the abstraction is trying to solve, understand how the abstraction solves it, and understand how the abstraction behaves in production. Briggs uses Terraform modules as his example. They are designed to solve specific problems in the cloud provider ecosystem, but to use them effectively you still need to understand the underlying infrastructure and how the module behaves once it is running in production. That's a challenge for organizations of all sizes, but especially for those new to infrastructure as code (IaC), where tools like Terraform are powerful yet complex and easy to misuse.
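To make the idea a little more concrete, here's a purely illustrative Python sketch (not Terraform, and every name and default in it is hypothetical) of a thin "helpful" wrapper. The caller sees one line, but safely operating it still demands Briggs's three levels of understanding:

```python
# Toy illustration of the "300% problem". The wrapper below looks like a
# one-liner to its caller, but debugging it in production still requires
# knowing what defaults it hides. All names/defaults are made up.

from dataclasses import dataclass

@dataclass
class Bucket:
    name: str
    region: str
    versioning: bool
    encryption: str

def create_standard_bucket(name: str, region: str = "us-east-1") -> Bucket:
    """An abstraction that bakes in 'sensible' defaults.

    100%: the problem it solves -- teams were creating storage buckets
          with inconsistent settings.
    200%: how it solves it -- versioning on, AES-256 encryption, a fixed
          default region unless overridden.
    300%: how it behaves in production -- callers still need to know these
          defaults exist before they can explain why a bucket 'mysteriously'
          keeps old object versions or lands in the wrong region.
    """
    return Bucket(name=name, region=region, versioning=True, encryption="AES256")

if __name__ == "__main__":
    print(create_standard_bucket("analytics-raw"))
```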
On the blog
Data pipelines are the superhighways of the modern data landscape. They consolidate information from disparate sources and deliver it to your analysis and reporting tools.
But building effective data pipelines is challenging. Without the right architecture and components, your pipelines get congested with bottlenecks and delays.
This post reveals key insights on constructing fast, reliable data pipelines:
Discover 3 main architectural styles for data pipelines: batch, streaming, and hybrid. Understand how each approach impacts factors like latency and flexibility.
Learn the core components like ingestion, transformations, storage, and orchestration. Master these building blocks to avoid issues (a minimal sketch of these components follows below).
Follow best practices like automation, monitoring, and modular design. Avoid common pitfalls to gain value from your data.
See why cloud-based data pipelines provide flexibility, scalability, and less maintenance. Major providers offer managed solutions.
Don't leave your data stuck in traffic. Read this post now to start building your data superhighway!
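To make those building blocks concrete, here's a deliberately minimal, hypothetical batch-pipeline sketch in Python (not code from the post): ingestion, transformation, and storage as separate functions, with a small orchestration function wiring them together. The file names and columns are invented for illustration.

```python
# Minimal batch pipeline sketch: ingest -> transform -> store,
# coordinated by a simple orchestration function.

import csv
from pathlib import Path

def ingest(source: Path) -> list[dict]:
    """Ingestion: pull raw rows from a source (here, a CSV file)."""
    with source.open(newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Transformation: clean and reshape the raw records."""
    return [
        {"user_id": r["user_id"], "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")  # drop rows with missing amounts
    ]

def store(rows: list[dict], target: Path) -> None:
    """Storage: write the cleaned data somewhere queryable."""
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)

def run_pipeline(source: Path, target: Path) -> None:
    """Orchestration: run the steps in order; a scheduler would call this."""
    store(transform(ingest(source)), target)

if __name__ == "__main__":
    run_pipeline(Path("raw_events.csv"), Path("clean_events.csv"))
```

In a real deployment, an orchestrator (a scheduler, cron job, or workflow tool) would trigger run_pipeline on a schedule, and storage would typically be a warehouse or data lake rather than a local CSV.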

I’m Tom Barber
I help businesses maximize the value of their data, improving efficiency and performance and uncovering deeper insights. Find out more on my website.