Many organizations rushing to implement AI in pharmaceutical and genomics research have their priorities reversed. Before moving forward with deep learning and large language models, you need not only a solid data infrastructure but also genuine knowledge of your data, argues Stavros Papadopoulos, CEO and founder of TileDB, which develops data platforms for scientific discovery, including a range of life science applications.
While the breakthroughs brought about by AI are understandably exciting, the underlying data challenges facing life sciences organizations run deep. According to Deloitte’s 2024 Global Life Sciences Sector Outlook, nearly 40% of the potential productivity gains in pharmaceuticals from AI will come from research and development. The report estimates that large pharmaceutical companies could save between $5 billion and $7 billion over five years if they close the gap in AI adoption. But these benefits depend on rigorous infrastructure, better data governance, and new approaches to collaboration. This is precisely the groundwork Papadopoulos insists on before fully embracing AI.
Currently, many research teams remain stuck in inefficiency and fragmentation. Despite triple-digit growth forecasts, only about 16% of drug discovery efforts use AI. Data scientists often spend up to 80% of their time preparing data rather than analyzing it. There is also a pay gap: according to Glassdoor, pharmaceutical data scientists earn an average annual salary of about $124,000, while top technology companies pay well over $200,000, and some of the highest-paying roles in the field come with compensation packages approaching $1 million. This environment makes it difficult to recruit and retain top data talent in life sciences, especially since only a handful of computational biologists at large pharmaceutical companies can bridge the gap between biology and data engineering.
Data reality check
Despite triple-digit growth forecasts, only about 16% of drug discovery activities currently use AI. — Deloitte [1]
In addition to talent shortages, the data itself often works against the traditional structure of many life sciences organizations. As Papadopoulos puts it: “99% of data is not tabular, but 99% of the research and solutions out there are focused on tables.” The standard suite of data science tools, from Pandas in Python (for flexible data manipulation and analysis) and SQL (for relational data) to R’s Tidyverse (for data “tidying” and analysis) and Tableau (for interactive visuals and dashboards), is table-centric. Although these tools serve a variety of purposes, they share a deep-rooted preference for data that can be flattened into rows and columns. The core problem is not the data, but poor architecture and mindset. To address this, Papadopoulos proposes three “radical” premises, each setting the stage for a more mature, infrastructure-first approach, with a fourth consideration that inevitably follows.
Premise 1: “Don’t touch AI without data infrastructure”
Case in point: In some cases, a calculator is better than an LLM.
A recent preprint, “From Good to Great: Improving Math Reasoning with Tool-Augmented Interleaf Prompting,” highlights how LLMs’ math problem-solving abilities can be enhanced by incorporating external tools such as calculators. Similarly, the “MathChat” framework, described in another preprint, uses conversational interactions between an LLM agent and a user proxy agent to tackle complex mathematical problems through code execution.
“If you ask the LLM to divide a decimal, it can give you the wrong answer because it’s probabilistic and not a processor,” Papadopoulos says. “But if you connect it to a calculator tool, you’ll see that the LLM says, ‘Oh, that’s a math question. You should ask the calculator.’” Needless to say, the calculator is a far more efficient tool for that job than the LLM.
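To make the routing idea concrete, here is a minimal, purely illustrative Python sketch of tool-augmented prompting, in which arithmetic questions are handed to a deterministic calculator function while everything else falls through to the model. The llm_answer stub and the regex-based routing stand in for a real model call and its tool-selection step; none of the names below reflect TileDB’s or any specific vendor’s API.

```python
# Illustrative sketch only: route arithmetic to a deterministic calculator tool
# instead of letting a probabilistic LLM guess the digits.
import re
from decimal import Decimal, getcontext

getcontext().prec = 28  # plenty of precision for decimal arithmetic


def calculator_tool(expression: str) -> str:
    """Deterministically evaluate a simple binary expression like '3.14 / 2.72'."""
    a, op, b = re.match(r"\s*([\d.]+)\s*([/+\-*])\s*([\d.]+)", expression).groups()
    a, b = Decimal(a), Decimal(b)
    ops = {"/": lambda: a / b, "+": lambda: a + b, "-": lambda: a - b, "*": lambda: a * b}
    return str(ops[op]())


def llm_answer(prompt: str) -> str:
    # Stand-in for a real model call; a real LLM might return a plausible
    # but wrong digit string for long-decimal division.
    return "free-text answer (unverified)"


def answer(question: str) -> str:
    # Routing step: if the question contains arithmetic, call the tool;
    # otherwise fall through to the LLM. A real system would let the model
    # itself decide when to invoke the tool.
    match = re.search(r"[\d.]+\s*[/+\-*]\s*[\d.]+", question)
    if match:
        return calculator_tool(match.group(0))
    return llm_answer(question)


print(answer("What is 3.14159 / 2.71828?"))    # exact quotient from the calculator tool
print(answer("Summarize this assay protocol"))  # falls through to the LLM stub
```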
Regarding the bright, shiny object syndrome associated with many AI projects, Papadopoulos is blunt: “You shouldn’t touch AI unless you have a data management infrastructure in place.” Without robust governance, security, cataloging, and unified access controls, AI adoption only amplifies the disorder in existing data. In contrast, starting with a disciplined, database-centric foundation ensures that when AI is eventually deployed, it acts as a force multiplier on a simple, well-structured system, rather than as a disruptive, costly, over-engineered machine learning model that often yields only slightly better results.
To make the AI dream a reality, Papadopoulos advocates a data-first mindset. “Focus your efforts on building the best data management system possible. Once you’ve done the work safely manually, bring in AI to automate the tasks.” With a strong infrastructure, AI becomes the final piece of the puzzle, a seamless layer over a well-organized and well-managed data store. AI systems can then act as natural language interfaces to complex data ecosystems: scientists simply converse with the system instead of wrangling query languages. “This is where AI comes in. AI is not going to give you something crazy,” Papadopoulos says. “You can interface with the system using natural language. This is the greatest value of AI for me: understand what your users want and execute their queries.”
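As a rough sketch of what “AI as a natural language interface” could look like in code, the example below puts a stand-in translate_to_query function (which an LLM would implement in practice) in front of an already curated table and runs the resulting structured filter against it. The dataset, column names, and functions are hypothetical.

```python
# Hypothetical sketch of a natural-language front end over a curated, governed table.
# translate_to_query() stands in for an LLM that emits a structured filter spec.
import pandas as pd

# A tidy table standing in for the well-managed data store.
assays = pd.DataFrame({
    "compound": ["A-101", "A-102", "A-103"],
    "target": ["EGFR", "EGFR", "KRAS"],
    "ic50_nM": [12.5, 340.0, 55.0],
})


def translate_to_query(question: str) -> dict:
    # Placeholder for the LLM step: map free text to a structured filter.
    spec = {}
    if "EGFR" in question:
        spec["target"] = "EGFR"
    if "potent" in question:
        spec["max_ic50_nM"] = 100.0
    return spec


def run_query(spec: dict) -> pd.DataFrame:
    # The query runs against the governed store; the LLM never touches raw files.
    result = assays
    if "target" in spec:
        result = result[result["target"] == spec["target"]]
    if "max_ic50_nM" in spec:
        result = result[result["ic50_nM"] <= spec["max_ic50_nM"]]
    return result


print(run_query(translate_to_query("Which potent compounds hit EGFR?")))
```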
Premise 2: “Unstructured data does not exist”
The idea that certain data types are inherently “unstructured” is a fundamental misconception. Papadopoulos argues that every dataset, no matter how complex, contains unique patterns.
“Unstructured data does not exist…White noise may be the closest thing to having no structure, but it still follows a uniform distribution.”
Papadopoulos continues: “We leave [data] ‘unstructured’ because we don’t have a proper system to structure it, and that’s what causes the problem.”
Finding order in chaos
“All data – tables, images, RNA, DNA, point clouds, satellite imagery – is essentially an array of values,” explains Papadopoulos. Even encrypted text and seemingly random signals can reveal patterns under the proper lens. The challenge is that current modeling approaches and SQL-centric tools often flatten multidimensional data into rows and columns, stripping away important context.
Implementation challenges
Relying on table-centric models forces rich genomic, imaging, and clinical data into rigid two-dimensional schemas. This mismatch leads to lost insight. Genome sequences, for example, require structures that preserve hierarchical and multidimensional relationships, while imaging and clinical data need formats that capture their complexity without artificially simplifying it.
Architectural approach

Employing architectures that recognize that all data is inherently structured, such as multidimensional arrays and schema-on-demand design, allows for more nuanced modeling. A domain-specific query language can reflect scientific workflows rather than forcing everything into a SQL bottleneck.
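Here is a conceptual NumPy sketch (not any particular vendor’s array engine) of that point: the same expression values held in their native multidimensional shape versus flattened into a table, and what each representation needs in order to answer the same question. The gene, sample, and timepoint labels are invented for illustration.

```python
# Conceptual sketch in plain NumPy (no specific array engine implied): the same
# expression data kept in its native multidimensional shape versus flattened
# into one row per value. All labels are invented for illustration.
import numpy as np
import pandas as pd

genes = ["TP53", "BRCA1", "EGFR"]
samples = ["patient_01", "patient_02"]
timepoints = ["baseline", "week_4"]

# Native representation: a genes x samples x timepoints array, sliced by axis.
rng = np.random.default_rng(0)
expression = rng.random((len(genes), len(samples), len(timepoints)))
week4_tp53 = expression[genes.index("TP53"), :, timepoints.index("week_4")]

# Table-centric representation: the same values flattened to rows and columns.
flat = pd.DataFrame([
    {"gene": g, "sample": s, "timepoint": t, "value": expression[i, j, k]}
    for i, g in enumerate(genes)
    for j, s in enumerate(samples)
    for k, t in enumerate(timepoints)
])

# The array answers "TP53 at week 4 across samples" with a direct index; the
# table has to reassemble that relationship with filters.
from_table = flat[(flat.gene == "TP53") & (flat.timepoint == "week_4")]["value"].to_numpy()
assert np.allclose(week4_tp53, from_table)
```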
Impact on life sciences
Recognizing that no data is truly “unstructured” frees researchers from traditional constraints. Life sciences organizations can uncover the natural patterns in their data rather than skewing it to fit outdated models. This change paves the way for more accurate insights and lays the foundation on which advanced analytics and AI can thrive.
Premise 3: “Currently, there are no best data practices, only bad ones”
In Papadopoulos’ view, current norms do not even rise to the level of best practice.
“Right now there are no best practices. We’re only seeing bad practices.”
As noted earlier, he warns that organizations that rush into AI without doing the groundwork are setting themselves up for failure. Instead, they must start by building a database-centric, secure, and discoverable data ecosystem before turning to AI. Once these foundations are in place, AI can serve as a query interface to the integrated system, allowing researchers to explore data through natural language rather than working against fragmented, ad hoc setups.
Chart a path to better data practices
Beyond RAG: Agents, Tools, and True Data Integration
Retrieval-augmented generation (RAG) helps large language models retrieve relevant text snippets (often from PDFs), but it is insufficient for complex scientific data. “RAG is almost exclusively PDF-only,” Papadopoulos says. In other words, it cannot handle the complexity of multidimensional data, genomic data, and the like.
The broader approach Papadopoulos advocates is to employ the LLM as an orchestrator that knows which specialized tools and databases to consult. Rather than just retrieving text, the LLM can interact with an integrated data infrastructure, query domain-specific APIs, perform calculations through a calculator, navigate multidimensional arrays, and more. This turns the LLM into a powerful agent that goes far beyond simple text search and asks the “right questions” of the data ecosystem.
Once disciplined data management is established, the LLM can act as an orchestrator, leveraging specialized tools, databases, or computational engines behind the scenes. Scientists don’t need to be API experts. Their focus remains on the research question at hand.
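One way to picture this orchestration is a small tool registry plus a routing step, sketched below. A real system would let the LLM choose the tool from its descriptions rather than rely on keyword matching, and the tools themselves (calculator, variant_lookup) are hypothetical placeholders for governed, domain-specific services.

```python
# Orchestration sketch: a tool registry plus a routing step. choose_tool() is a
# stand-in for the LLM's tool-selection call; the tools are hypothetical
# placeholders for governed, domain-specific services.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {}


def tool(name: str):
    """Register a function as a callable tool under the given name."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("calculator")
def calculate(query: str) -> str:
    # Toy arithmetic evaluation for illustration only; never eval untrusted input.
    return str(eval(query, {"__builtins__": {}}))


@tool("variant_lookup")
def variant_lookup(query: str) -> str:
    # Stand-in for a query against a governed genomic store or domain API.
    return f"3 annotated variants found for '{query}'"


def choose_tool(question: str) -> str:
    # Placeholder for the LLM's decision; a real model would pick from the
    # tools' descriptions rather than rely on keyword heuristics.
    return "calculator" if any(ch.isdigit() for ch in question) else "variant_lookup"


def orchestrate(question: str) -> str:
    name = choose_tool(question)
    return f"[{name}] {TOOLS[name](question)}"


print(orchestrate("123.4 / 7"))       # routed to the calculator
print(orchestrate("BRCA1 missense"))  # routed to the genomic lookup
```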
Papadopoulos’ roadmap challenges the status quo. He calls for a complete rethinking of data infrastructure, one that explicitly adheres to fundamental principles like FAIR (findable, accessible, interoperable, reusable) from the start, rather than settling for half-measures. “If you have the discipline from day one to think about discoverability, accessibility, and everything else, you have information security covered,” he explains. By building principles like FAIR into their architecture, organizations ensure that their data is not only well managed and protected, but also well positioned for future interoperability and reusability.
Start with a robust infrastructure: Treat all scientific information, including genome sequences, images, and clinical data, as inherently structured. “Focus your efforts on building the best data management system possible,” he says.
Prioritize governance and security: Authentication, authorization, and auditing should be designed in from day one (a minimal sketch follows this list). “If you start with a database approach and discipline, you have information security covered,” Papadopoulos says.
Establish discipline and consistency: Consolidate all data sources into a unified, discoverable repository that aligns with your research workflow. This lets you find “whatever you want.”
Master manual workflows before automating: Understand the patterns and bottlenecks first; only then leverage AI to automate repetitive tasks. Without a structured system for large language models (LLMs) to query, no amount of intelligence will yield meaningful insights.
Use AI as the final layer: With a solid infrastructure in place, AI can be a powerful ally and a natural language interface to a rich data ecosystem, working in harmony with the data rather than being forced onto messy data.
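As a loose illustration of designing governance and discoverability in from day one, here is a minimal, hypothetical dataset record with FAIR-style metadata, role-based access, and an audit trail. The field names and storage URL are invented and do not correspond to any particular catalog product.

```python
# Hypothetical illustration of governance designed in from day one: a minimal
# dataset record with FAIR-style metadata, role-based access, and an audit trail.
# Field names and the storage URL are invented, not tied to any catalog product.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    dataset_id: str              # findable: a stable, unique identifier
    title: str
    access_url: str              # accessible: where authorized users retrieve it
    data_format: str             # interoperable: a declared, documented format
    license: str                 # reusable: explicit terms of reuse
    owner: str
    allowed_roles: List[str] = field(default_factory=list)  # authorization
    audit_log: List[str] = field(default_factory=list)      # auditing


record = DatasetRecord(
    dataset_id="ds-000123",
    title="Tumor RNA-seq expression matrix, cohort A",
    access_url="s3://example-bucket/rnaseq/cohort_a/",  # hypothetical location
    data_format="gene x sample expression array",
    license="internal-research-only",
    owner="computational-biology",
    allowed_roles=["comp-bio", "biostatistics"],
)


def read_dataset(rec: DatasetRecord, user_role: str) -> str:
    # Authorization check plus an audit entry, enforced at the catalog layer.
    if user_role not in rec.allowed_roles:
        raise PermissionError(f"role '{user_role}' may not read {rec.dataset_id}")
    rec.audit_log.append(f"read by role={user_role}")
    return f"granted access to {rec.access_url}"


print(read_dataset(record, "comp-bio"))
```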
Essentially, these “radical” premises are not about adding complexity; they aim to remove barriers. When organizations realize that all data is structured and can be well managed, AI moves from hype to practical accelerator, and scientists and researchers can shift their focus from endlessly wrangling big data to true innovation.
“For the next five years as AI evolves, focus your efforts on understanding the data management story,” Papadopoulos said.
References:
[1] This data point comes from Deloitte’s 2024 Global Life Sciences Sector Outlook report, published on May 31, 2024, which notes that these gains are expected to materialize over the next three to five years.