Biology-Native Data Infrastructure Aims to Accelerate AI-Driven Drug Development
Drug discovery remains stubbornly slow and failure-prone, with ~90% of clinical candidates never reaching approval. A new push toward "biology-native" data infrastructure argues that the way biological data is stored and structured — not just the AI models on top — is a root cause of the bottleneck.
Explanation
Drug development is one of the most expensive and failure-prone endeavors in science. On average, it takes more than five years just to move from identifying a promising biological target to having a drug candidate ready for human trials — and even then, roughly nine out of ten drugs that enter those trials will ultimately fail. The costs, both financial and human, are enormous.
Artificial intelligence (AI) has been widely promoted as a solution to this problem. The idea is that machine learning models can sift through vast amounts of biological data — genomics, protein structures, clinical records — and find patterns that human researchers would miss. In practice, however, AI tools in drug discovery have so far delivered incremental rather than transformative gains.
One emerging argument is that the problem isn't just the AI models themselves, but the data they are trained on. Biological data is messy, siloed (stored in separate, incompatible systems), and often structured in ways that were designed for human readability rather than machine learning. "Biology-native" data infrastructure refers to the idea of building databases and data pipelines from the ground up with the specific structure and complexity of biological information in mind — so that AI systems can actually use the data effectively.
Think of it like trying to teach someone to cook using a recipe written in a foreign language with missing steps. Even a talented chef would struggle. Better-structured data is the equivalent of a clear, complete recipe.
This is an incremental development in the field — not a breakthrough announcement, but a conceptual and infrastructural shift that could have compounding effects over time. Whether it meaningfully shortens drug development timelines or reduces failure rates remains to be demonstrated at scale.
The core thesis here is that data architecture, not just algorithmic sophistication, is a binding constraint in AI-driven drug discovery. This is a well-recognized problem in the field but one that has received less public attention than model development. Biological data spans multiple modalities — genomic sequences, transcriptomic profiles, proteomic assays, phenotypic screens, electronic health records — each generated by different instruments, annotated with different ontologies, and stored in incompatible formats. The result is that significant ML engineering effort is spent on data wrangling rather than model training or biological insight.
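To make the wrangling burden concrete, here is a minimal sketch in Python of a single harmonization step that typically precedes any model training: reconciling gene identifiers between two datasets that annotate the same genes under different naming schemes. The column names, values, and mapping table are illustrative assumptions, not drawn from any specific platform.

```python
import pandas as pd

# Two datasets describing the same genes, annotated differently:
# one uses HGNC symbols, the other Ensembl gene IDs (values are invented).
expression = pd.DataFrame({
    "gene_symbol": ["TP53", "BRCA1", "EGFR"],
    "tpm": [12.4, 3.1, 57.8],
})
screen = pd.DataFrame({
    "ensembl_id": ["ENSG00000141510", "ENSG00000012048", "ENSG00000146648"],
    "phenotype_score": [-1.9, -0.4, 2.2],
})

# Identifier mapping (in practice pulled from a reference such as Ensembl BioMart).
id_map = pd.DataFrame({
    "gene_symbol": ["TP53", "BRCA1", "EGFR"],
    "ensembl_id": ["ENSG00000141510", "ENSG00000012048", "ENSG00000146648"],
})

# Harmonize: translate symbols to Ensembl IDs, then join the two modalities.
merged = (
    expression.merge(id_map, on="gene_symbol", how="inner")
              .merge(screen, on="ensembl_id", how="inner")
)
print(merged[["gene_symbol", "tpm", "phenotype_score"]])
```

Every additional modality, identifier system, or ontology release multiplies joins like this one, which is the engineering overhead described above.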
"Biology-native" infrastructure, as a concept, implies designing data schemas, storage systems, and query layers that natively represent biological entities (genes, proteins, pathways, cell types, disease states) and their relationships, rather than forcing biological data into generic relational or document-store paradigms. This is adjacent to, but distinct from, existing efforts like knowledge graphs (e.g., Open Targets, the Biomedical Data Translator) and multimodal foundation models (e.g., Geneformer, scGPT). The distinction lies in emphasis: those projects focus on model architecture, whereas biology-native infrastructure focuses on the upstream data layer.
The 90% clinical failure rate cited in the source is a well-established industry figure, though it aggregates across therapeutic areas and modalities with very different failure profiles. Oncology fails at higher rates; vaccines and some rare disease programs fare better. The five-year target-to-candidate timeline is similarly a rough industry average, with significant variance by modality (small molecules vs. biologics vs. cell therapies). These statistics frame the problem accurately but should not be taken as uniform across all drug development contexts.
From a methodology standpoint, the signal here is conceptual and infrastructural rather than empirical. There is no reported dataset, model benchmark, or clinical outcome tied to this specific framing. The claim that better data infrastructure will reduce failure rates is plausible and theoretically grounded, but it has not been validated with controlled evidence in this context. Prior art — such as the federated data efforts at the FDA's Sentinel System or the UK Biobank's structured multimodal data — suggests that well-curated biological data does improve downstream analytical power, lending indirect support.
Open questions are substantial. What specific data types or biological relationships are currently most poorly represented in existing infrastructure? How does biology-native design interact with data privacy constraints, particularly for clinical and patient-level data? Will the gains come primarily in early discovery (target identification, lead optimization) or extend into translational and clinical phases? And, critically, how will interoperability be maintained across institutions and platforms, a challenge that has historically undermined even well-funded data standardization efforts?
A falsifiable version of this claim would read: organizations adopting biology-native infrastructure demonstrate statistically significant reductions in time-to-candidate, or improvements in Phase II success rates, compared with matched controls using conventional data systems. Without such evidence, this remains a compelling architectural argument rather than a proven intervention.
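As a rough illustration of what such an evaluation could involve, the sketch below compares hypothetical time-to-candidate durations between adopters and matched controls with a Mann-Whitney U test. The numbers are invented, and a real study would need survival-analysis methods to handle censored (still-running) programs and careful matching to avoid confounding.

```python
from scipy.stats import mannwhitneyu

# Hypothetical time-to-candidate durations in months (invented numbers).
adopters = [38, 42, 35, 47, 40, 33, 45]   # programs on biology-native infrastructure
controls = [52, 61, 48, 57, 55, 63, 50]   # matched programs on conventional systems

# One-sided test: are adopter timelines stochastically shorter than controls?
stat, p_value = mannwhitneyu(adopters, controls, alternative="less")
print(f"U = {stat}, p = {p_value:.4f}")
```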
Glossary
- Data wrangling
- The process of cleaning, transforming, and organizing raw data into a usable format for analysis. In this context, it refers to the significant engineering effort spent preparing biological data from different sources before it can be used for machine learning model training.
- Biology-native infrastructure
- Data systems and schemas designed specifically to represent biological entities (genes, proteins, pathways, cell types) and their relationships natively, rather than forcing biological data into generic database structures that don't naturally fit biological concepts.
- Multimodal foundation models
- Large-scale machine learning models trained on multiple types of data (such as genomic sequences, protein structures, and clinical information) simultaneously, enabling them to learn relationships across different biological data types.
- Ontologies
- Standardized systems for organizing and categorizing information within a specific domain, defining how different concepts relate to each other. In biology, different instruments and databases use different ontologies, making data integration difficult.
- Federated data
- Data that remains stored across multiple independent institutions or systems but can be queried and analyzed collectively without centralizing all the data in one location, useful for maintaining privacy while enabling large-scale analysis; a minimal sketch of the pattern appears after this glossary.
- Phase II success rates
- The percentage of drug candidates that successfully advance from Phase II clinical trials (which test efficacy and side effects in larger patient groups) to the next stage of development, a key metric for drug development efficiency.
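To make the federated data entry above concrete, here is a minimal sketch of the pattern under simplified assumptions: each institution computes a local summary, and only those aggregates, never patient-level records, cross institutional boundaries. Site names and values are invented.

```python
# Minimal federated-analysis sketch: each institution computes a local
# summary statistic, and only those summaries (never raw records) are
# combined centrally. Data values are invented for illustration.

site_records = {
    "hospital_a": [61.2, 58.9, 70.4, 66.1],
    "hospital_b": [55.0, 62.3, 59.8],
    "hospital_c": [68.7, 64.2, 60.1, 71.5, 63.3],
}

def local_summary(values):
    """Runs inside each institution; raw values never leave the site."""
    return {"n": len(values), "total": sum(values)}

# Only the aggregates cross institutional boundaries.
summaries = [local_summary(v) for v in site_records.values()]
pooled_n = sum(s["n"] for s in summaries)
pooled_mean = sum(s["total"] for s in summaries) / pooled_n
print(f"pooled mean over {pooled_n} records: {pooled_mean:.2f}")
```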
Sources
No sources on file.
Prediction
Will a drug discovery platform citing biology-native data infrastructure demonstrate a statistically validated reduction in time-to-clinical-candidate by 2028?