GitHub - tirtharajdash/LMLFStar: Generating target-specific novel lead molecules using an LLM

LMLFStar: Generation of Target-specific Novel Lead Molecules using an LLM

LMLFStar is a molecular generation and optimization framework that uses a large language model (LLM) to discover potential lead compounds for target proteins. It employs an iterative search strategy using a Q-heuristic to explore the chemical property space efficiently.

Preprint and Publication

Submitted to bioRxiv. URL will be updated shortly.

Features

GenMol.py is an interleaved implementation of LMLFStar.py, and the results reported in the paper are based on it.

Multiple Search Pipelines:
- GenMol1F: Single-factor optimization based on CNNaffinity.
- GenMol1Fplus: Extended feasibility testing, including molecular weight and SAS constraints.
- GenMolMF: Multi-factor search considering multiple molecular properties.
Dynamic Search Process:
- Interval-based hypothesis exploration with adaptive refinement.
- Feasibility testing based on predefined constraints.
- Iterative selection based on Q-score calculations.
Result Visualization:
- Detailed logging of each iteration.
- Automated plotting of search progress (Q-score vs. iterations).
Customizable Parameters:
- Adjustable target size, model engine, feasibility thresholds, and search iterations.
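The dynamic search process listed above can be illustrated with a small, self-contained sketch. It is not the code in search.py: the `Interval` class, the feasibility thresholds, the sample molecules, and in particular `toy_q` (a simple stand-in, not the Q-heuristic used by LMLFStar) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothesis: promising molecules have a score inside [low, high]."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high

    def split(self) -> tuple[Interval, Interval]:
        """Refine the hypothesis by halving the interval."""
        mid = (self.low + self.high) / 2
        return Interval(self.low, mid), Interval(mid, self.high)


def is_feasible(mol: dict, max_mw: float = 500.0, max_sas: float = 5.0) -> bool:
    """Feasibility test; both thresholds are illustrative assumptions."""
    return mol["mw"] <= max_mw and mol["sas"] <= max_sas


def toy_q(h: Interval, mols: list[dict]) -> float:
    """Stand-in score: fraction of feasible molecules inside h that are active."""
    covered = [m for m in mols if is_feasible(m) and h.contains(m["affinity"])]
    if not covered:
        return 0.0
    return sum(m["active"] for m in covered) / len(covered)


def refine(start: Interval, mols: list[dict], iterations: int = 5):
    """Keep the better half of the interval while the score strictly improves."""
    best, best_q = start, toy_q(start, mols)
    history = [(best, best_q)]
    for _ in range(iterations):
        child = max(best.split(), key=lambda h: toy_q(h, mols))
        child_q = toy_q(child, mols)
        if child_q <= best_q:
            break  # no strict improvement: stop refining
        best, best_q = child, child_q
        history.append((best, best_q))
    return best, best_q, history


# Hypothetical molecules with precomputed properties.
MOLS = [
    {"affinity": 3.0, "mw": 300.0, "sas": 3.0, "active": False},
    {"affinity": 4.5, "mw": 350.0, "sas": 2.5, "active": False},
    {"affinity": 6.2, "mw": 420.0, "sas": 3.1, "active": True},
    {"affinity": 7.8, "mw": 450.0, "sas": 4.0, "active": True},
    {"affinity": 9.1, "mw": 650.0, "sas": 3.0, "active": True},  # fails the weight test
    {"affinity": 8.5, "mw": 480.0, "sas": 4.5, "active": True},
]

if __name__ == "__main__":
    best, score, history = refine(Interval(2.0, 10.0), MOLS)
    for step, (interval, q) in enumerate(history):
        print(step, interval, round(q, 2))
```

With this sample data the loop narrows the starting interval [2, 10] to [6, 10] and then stops, because halving it again does not raise the score any further.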

Repository Structure

LMLFStar/
├── data/             # Contains input molecule datasets
├── docking/          # Configuration files for molecular docking
├── env_utils.py      # Utility functions for environment setup
├── env.yml           # Conda environment file (make sure your machine has compatible hardware!)
├── GenMol.py         # Main script for running molecule generation pipelines
├── get_mol_prop.py   # Script for computing molecular properties
├── legacy/           # Older scripts and references
├── LICENSE           # License information
├── LMLFStar.py       # Core functions for molecule generation (using models from OpenAI and Anthropic)
├── mol_utils.py      # Utilities for molecular processing
├── README.md         # Documentation for the repository
├── results/          # Stores generated molecules and search results
├── run.sh            # Shell script to execute the pipeline
├── safe/             # Backup or checkpointed files
├── search.py         # Implements the hypothesis-driven search strategy
├── unit_test.ipynb   # Jupyter Notebook for testing components
├── Tree.ipynb        # Analysis of the generated molecule search tree
├── *_claude.*        # Added later for Anthropic models (Claude-*)
├── gen_GPT2_*.ipynb  # Added later for generating molecules from a domain-specific model (Molecule GPT2)
└── nohup.out         # Log output from nohup execution

Installation

conda env create -f env.yml
conda activate chem

Additionally, you will need the gnina software for docking. The current implementation uses the official GNINA v1.3 release.
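Before starting a long run, it can help to confirm that the docking executable is reachable. The check below is a minimal sketch that assumes the binary is called `gnina` and is on your `PATH`; the repository itself may locate gnina differently (for example through an explicit path), so adapt the name as needed.

```python
import shutil
from typing import Optional


def find_gnina(executable: str = "gnina") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_gnina()
    if path is None:
        print("gnina not found on PATH; install it before running docking.")
    else:
        print(f"gnina found at {path}")
```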

Usage

python GenMol.py --protein DBH --target_size 5 --choice mf --context True --model gpt-4o --final_k 100

Arguments:

| Argument | Description |
|---|---|
| `--protein` | Target protein name (e.g., `DBH`) |
| `--target_size` | Number of target molecules to reveal to LLM per iteration |
| `--choice` | Selects the pipeline (1: `1f`, 2: `1fplus`, 3: `mf`) |
| `--context` | Enables context-based molecule generation (True/False) |
| `--model` | LLM model used (e.g., `gpt-4o`) |
| `--final_k` | Number of molecules in the final generation step after search is complete |
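For readers who want to see how such a command line might be parsed, here is a minimal `argparse` sketch. It is not taken from GenMol.py: the types, defaults, `str2bool`, and `normalise_choice` are assumptions. Two details are worth noting. First, `--context True` cannot be parsed with `type=bool`, because any non-empty string (including `"False"`) is truthy, so an explicit converter is used. Second, the commands in this README pass `--choice` both as a name (`mf`) and as a number (`3`), so the sketch accepts either form.

```python
import argparse

# Assumed mapping, following the numbering given in the argument table.
CHOICE_ALIASES = {"1": "1f", "2": "1fplus", "3": "mf"}


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings to a real boolean."""
    v = value.strip().lower()
    if v in {"true", "1", "yes"}:
        return True
    if v in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def normalise_choice(choice: str) -> str:
    """Map '1'/'2'/'3' to pipeline names; pass names through unchanged."""
    name = CHOICE_ALIASES.get(choice, choice)
    if name not in CHOICE_ALIASES.values():
        raise ValueError(f"unknown pipeline choice: {choice!r}")
    return name


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a GenMol-style CLI")
    parser.add_argument("--protein", required=True, help="Target protein name")
    parser.add_argument("--target_size", type=int, default=5)
    parser.add_argument("--choice", default="mf")
    parser.add_argument("--context", type=str2bool, default=False)
    parser.add_argument("--model", default="gpt-4o")
    parser.add_argument("--final_k", type=int, default=100)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.protein, normalise_choice(args.choice), args.context)
```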

See run.sh for a batch run.

Example Runs

Run GenMol1F (Single-factor search):

python GenMol.py --protein DBH --target_size 5 --choice 1 --context False --model gpt-4o --final_k 10

Run GenMolMF (Multi-factor search):

python GenMol.py --protein DBH --target_size 5 --choice 3 --context True --model gpt-4o --final_k 10
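To sweep over several settings, the example commands above can also be generated programmatically. The helper below is a hypothetical sketch, not the contents of run.sh: `build_commands` and its defaults are assumptions, and it only prints the commands unless you uncomment the `subprocess.run` line.

```python
import itertools
import shlex
import subprocess  # only needed if you uncomment the run line below


def build_commands(proteins, choices, contexts,
                   model="gpt-4o", target_size=5, final_k=10):
    """Return one GenMol.py argument list per (protein, choice, context)."""
    commands = []
    for protein, choice, context in itertools.product(proteins, choices, contexts):
        commands.append([
            "python", "GenMol.py",
            "--protein", protein,
            "--target_size", str(target_size),
            "--choice", str(choice),
            "--context", str(context),
            "--model", model,
            "--final_k", str(final_k),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands(["DBH"], choices=[1, 3], contexts=[False, True]):
        print(shlex.join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to launch each run
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a protein or model name ever contains unusual characters.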

Function Call Structure

Below is a simplified structure of how the functions call each other:

GenMol.py
├── main()
    ├── Parses command-line arguments
    ├── Calls the appropriate pipeline:
        ├── GenMol1F()
        │   ├── setup_environment()
        │   ├── interleaved_LMLFStar()
        │       ├── Hypothesis()
        │       ├── compute_Q()
        │       ├── generate_molecules_for_protein()
        │       ├── generate_molecules_for_protein_with_context()
        │       ├── Logging and results handling
        ├── GenMolMF()
        │   ├── setup_environment()
        │   ├── interleaved_LMLFStar()
        │       ├── Hypothesis()
        │       ├── compute_Q()
        │       ├── generate_molecules_for_protein_multifactors()
        │       ├── generate_molecules_for_protein_multifactors_with_context()
        │       ├── Logging and results handling
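The call structure above can be mirrored by a small dispatch skeleton. The function names come from the tree, but everything else is an assumption: the signatures, the stub bodies, and the rule that the `_with_context` generators are selected when context is enabled. The real functions in GenMol.py do considerably more.

```python
def setup_environment(protein: str) -> dict:
    """Stub: the real function prepares the run environment."""
    return {"protein": protein}


def generator_name(multifactor: bool, context: bool) -> str:
    """Pick the generator named in the call tree (selection rule is assumed)."""
    name = "generate_molecules_for_protein"
    if multifactor:
        name += "_multifactors"
    if context:
        name += "_with_context"
    return name


def interleaved_LMLFStar(env: dict, multifactor: bool, context: bool) -> dict:
    """Stub: records which generator a real run would call."""
    return {"protein": env["protein"],
            "generator": generator_name(multifactor, context)}


def GenMol1F(protein: str, context: bool) -> dict:
    env = setup_environment(protein)
    return interleaved_LMLFStar(env, multifactor=False, context=context)


def GenMolMF(protein: str, context: bool) -> dict:
    env = setup_environment(protein)
    return interleaved_LMLFStar(env, multifactor=True, context=context)


# Dispatch table keyed by pipeline name.
PIPELINES = {"1f": GenMol1F, "mf": GenMolMF}


def main(choice: str, protein: str, context: bool) -> dict:
    return PIPELINES[choice](protein, context)


if __name__ == "__main__":
    print(main("mf", "DBH", True))
```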

[Aug 04, 2025] There is now a GenMol_claude.py that implements GenMol using Anthropic's Claude models.

Contact

For any questions or contributions, feel free to raise an issue or submit a pull request.

References

Some relevant sources and references for LMLF:

LMLF codebase: LMLF
LMLF paper: AAAI 2024
McCreath and Sharma, LIME: ALT 1998
