| .. | ||
| README.md | ||
Use Open Fold
Use OpenFold with TensorRT optimization
Table of Contents
Overview
What you'll accomplish
You'll set up a GPU-accelerated protein folding workflow on NVIDIA Spark devices using OpenFold with TensorRT optimization and MMseqs2-GPU. After completing this walkthrough, you'll be able to fold proteins in under 60 seconds using either NVIDIA's cloud UI or running locally on your RTX Pro 6000 or DGX Spark workstation.
What to know before starting
- Installing Python packages via pip
- Using Docker and the NVIDIA Container Toolkit for GPU workflows
- Running basic Linux commands and setting environment variables
- Understanding FASTA files and basics of protein structure workflows
- Working with CUDA-enabled applications
Prerequisites
- NVIDIA GPU (RTX Pro 6000 or DGX Spark recommended)
nvidia-smi # Should show GPU with CUDA ≥12.9 - NVIDIA drivers and CUDA toolkit installed
nvcc --version # Should show CUDA 12.9 or higher - Docker with NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.9.0-base-ubuntu22.04 nvidia-smi - Python 3.8+ environment
python3 --version # Should show 3.8 or higher - Sufficient disk space for databases (>3TB recommended)
df -h # Check available space
Ancillary files
- OpenFold parameters (
finetuning_ptm_2.pt) — pre-trained model weights for structure prediction - PDB70 database — template structures for homology modeling
- UniRef90 database — sequence database for MSA generation
- MGnify database — metagenomic sequences for MSA generation
- Uniclust30 database — clustered UniProt sequences for MSA generation
- BFD database — large sequence database for MSA generation
- MMCIF files — template structure files in mmCIF format
- py3Dmol package — Python library for 3D protein visualization
Time & risk
Duration: Initial setup takes 2-4 hours (mainly downloading databases). Each protein fold takes <60 seconds on GPU vs hours on CPU.
Risks:
- Database downloads may fail due to network interruptions
- Insufficient disk space for full databases
- GPU memory limitations for very large proteins (>2000 residues)
Rollback: All operations are read-only after setup. Remove downloaded databases and output directories to clean up.
Access through terminal
Step 1. Verify GPU and CUDA installation
Confirm your system has the required GPU and CUDA version for running OpenFold with TensorRT optimization.
nvidia-smi
Expected output should show an NVIDIA GPU with CUDA capability ≥12.9. For DGX Spark or RTX Pro 6000, you should see the appropriate GPU model listed.
nvcc --version
This should display CUDA compilation tools, release 12.9 or higher.
Step 2. Set up Python environment
Create a Python virtual environment and install the required packages for protein folding and visualization.
python3 -m venv openfold_env
source openfold_env/bin/activate
pip install --upgrade pip
Install the py3Dmol visualization package:
pip install py3Dmol
Step 3. Download OpenFold and databases
Download the OpenFold repository and required databases. This step requires significant disk space and network bandwidth.
TODO: Add specific download URLs for OpenFold repository from official GitHub
## Clone OpenFold repository
git clone https://gitlab.com/nvidia/dgx-spark/temp-external-playbook-assets/dgx-spark-playbook-assets
cd ${MODEL}/assets
pip install -e .
Download the model parameters:
TODO: Add direct download URL for finetuning_ptm_2.pt
mkdir -p openfold_params
wget -O openfold_params/finetuning_ptm_2.pt <PARAM_DOWNLOAD_URL>
Step 4. Download sequence databases
Download all required databases for MSA generation. Each database serves a specific purpose in the folding pipeline.
TODO: Add specific download URLs for each database from official sources
## Create database directory
mkdir -p databases
cd databases
## Download PDB70 (for template structures)
wget <PDB70_DOWNLOAD_URL>
tar -xzf pdb70.tar.gz
## Download UniRef90 (for MSA)
wget <UNIREF90_DOWNLOAD_URL>
tar -xzf uniref90.tar.gz
## Download MGnify (metagenomic sequences)
wget <MGNIFY_DOWNLOAD_URL>
tar -xzf mgnify.tar.gz
## Download Uniclust30 (clustered sequences)
wget <UNICLUST30_DOWNLOAD_URL>
tar -xzf uniclust30.tar.gz
## Download BFD (large sequence database)
wget <BFD_DOWNLOAD_URL>
tar -xzf bfd.tar.gz
## Download MMCIF files (structure templates)
wget <MMCIF_DOWNLOAD_URL>
tar -xzf mmcif.tar.gz
cd ..
Step 5. Configure environment variables
Set up environment variables pointing to your downloaded databases and parameters.
export OF_PARAM_PATH="$(pwd)/openfold_params/finetuning_ptm_2.pt"
export OF_DB_PDB70="$(pwd)/databases/pdb70"
export OF_DB_UNIREF90="$(pwd)/databases/uniref90"
export OF_DB_MGNIFY="$(pwd)/databases/mgnify"
export OF_DB_UNICLUST30="$(pwd)/databases/uniclust30"
export OF_DB_BFD="$(pwd)/databases/bfd"
export OF_DB_MMCIF="$(pwd)/databases/pdb_mmcif/mmcif_files"
export OF_DB_OBSOLETE="$(pwd)/databases/pdb_mmcif/obsolete.dat"
export OF_DEVICE="cuda:0"
export OF_OUTDIR="openfold_out"
export OF_JOB="demo"
Step 6. Option A - Use NVIDIA Build Portal (Cloud UI)
For quick testing without local setup, use NVIDIA's online demo interface.
-
Navigate to the OpenFold2 page on NVIDIA Build Portal
TODO: Add specific URL for NVIDIA Build Portal OpenFold2 demo
-
Paste your protein sequence in FASTA format
-
Click "Run" to execute the folding pipeline
-
View results in the integrated Mol* or py3Dmol viewer
Step 7. Option B - Run locally with demo script
Create and run the OpenFold demo script for local execution on your DGX Spark or RTX Pro 6000.
Create the demo script file:
cat > openfold_demo.py << 'EOF'
#!/usr/bin/env python3
"""
Single-file OpenFold runner + py3Dmol viewer.
"""
import os, subprocess as sp, glob, sys, tempfile, textwrap
## Paths (edit for your system)
PARAM = os.getenv("OF_PARAM_PATH", "/path/to/openfold_params/finetuning_ptm_2.pt")
PDB70 = os.getenv("OF_DB_PDB70", "/path/to/pdb70")
UNIREF90 = os.getenv("OF_DB_UNIREF90", "/path/to/uniref90")
MGNIFY = os.getenv("OF_DB_MGNIFY", "/path/to/mgnify")
UNICLUST30 = os.getenv("OF_DB_UNICLUST30", "/path/to/uniclust30")
BFD = os.getenv("OF_DB_BFD", "/path/to/bfd")
MMCIF = os.getenv("OF_DB_MMCIF", "/path/to/pdb_mmcif/mmcif_files")
OBSOLETE = os.getenv("OF_DB_OBSOLETE", "/path/to/pdb_mmcif/obsolete.dat")
DEVICE = os.getenv("OF_DEVICE", "cuda:0")
OUTDIR = os.getenv("OF_OUTDIR", "openfold_out")
JOB = os.getenv("OF_JOB", "demo")
SEQ = """>demo
MGSDKIHHHHHHENLYFQGAMASMTGGQQMGRGSMAAAAKKVVAGAAAAGGQAGD"""
def ensure_py3dmol():
try:
import py3Dmol
except ImportError:
sp.check_call([sys.executable, "-m", "pip", "install", "py3Dmol"])
def run_openfold(fasta_path):
cmd = [
sys.executable, "openfold/run_pretrained_openfold.py",
"--fasta_path", fasta_path,
"--job_name", JOB,
"--output_dir", OUTDIR,
"--model_device", DEVICE,
"--param_path", PARAM,
"--pdb70_database_path", PDB70,
"--uniref90_database_path", UNIREF90,
"--mgnify_database_path", MGNIFY,
"--uniclust30_database_path", UNICLUST30,
"--bfd_database_path", BFD,
"--template_mmcif_dir", MMCIF,
"--obsolete_pdbs_path", OBSOLETE,
"--skip_relaxation"
]
sp.check_call(cmd)
def visualize():
import py3Dmol
pdb = open(f"{OUTDIR}/{JOB}/ranked_0.pdb").read()
view = py3Dmol.view(width=800, height=520)
view.addModel(pdb, "pdb")
view.setStyle({"cartoon": {"arrows": True}})
view.zoomTo()
open(f"{OUTDIR}/{JOB}_view.html", "w").write(view._make_html())
print(f"Viewer written to {OUTDIR}/{JOB}_view.html")
def main():
ensure_py3dmol()
with tempfile.TemporaryDirectory() as td:
fasta_path = os.path.join(td, f"{JOB}.fasta")
open(fasta_path, "w").write(textwrap.dedent(SEQ).strip() + "\n")
run_openfold(fasta_path)
visualize()
if __name__ == "__main__":
main()
EOF
Make the script executable and run it:
chmod +x openfold_demo.py
python openfold_demo.py
Step 8. Validate the output
Check that the folding completed successfully and view the generated structure.
## Verify PDB file was created
ls -la openfold_out/demo/ranked_0.pdb
The file should exist and be non-empty (typically >10KB for a small protein).
## Check the HTML viewer was generated
ls -la openfold_out/demo_view.html
Open the HTML file in a web browser to visualize the folded protein structure:
## On Linux with GUI
xdg-open openfold_out/demo_view.html
## Or copy the full path and open in browser manually
realpath openfold_out/demo_view.html
Step 9. Run with custom sequences
To fold your own protein sequences, modify the demo script or create a new FASTA file.
Using a custom FASTA file
## Create your FASTA file
cat > my_protein.fasta << 'EOF'
>my_protein
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLPARTVETRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMNCKCVIS
EOF
## Run OpenFold directly
python openfold/run_pretrained_openfold.py \
--fasta_path my_protein.fasta \
--job_name my_protein \
--output_dir openfold_out \
--model_device cuda:0 \
--param_path $OF_PARAM_PATH \
--pdb70_database_path $OF_DB_PDB70 \
--uniref90_database_path $OF_DB_UNIREF90 \
--mgnify_database_path $OF_DB_MGNIFY \
--uniclust30_database_path $OF_DB_UNICLUST30 \
--bfd_database_path $OF_DB_BFD \
--template_mmcif_dir $OF_DB_MMCIF \
--obsolete_pdbs_path $OF_DB_OBSOLETE \
--skip_relaxation
Step 10. Troubleshooting common issues
| Symptom | Cause | Fix |
|---|---|---|
| CUDA out of memory error | Protein too large for GPU | Reduce max_template_date or use smaller sequence |
| Database file not found | Incomplete download or wrong path | Verify all databases downloaded and paths in env vars |
| ImportError: No module named 'openfold' | OpenFold not installed | Run pip install -e . in openfold directory |
| nvidia-smi command not found | NVIDIA drivers not installed | Install NVIDIA drivers for your GPU |
| Folding takes hours instead of minutes | Running on CPU instead of GPU | Check OF_DEVICE="cuda:0" and GPU availability |
| py3Dmol viewer shows blank page | JavaScript blocked or path issue | Use absolute path to HTML file or check browser console |
Step 11. Cleanup and rollback
Remove generated outputs and optionally remove downloaded databases.
## Remove output files only (safe)
rm -rf openfold_out/
## Remove virtual environment (reversible)
deactivate
rm -rf openfold_env/
Warning: The following will delete downloaded databases (>3TB). Only run if you need to free disk space and are willing to re-download.
## Remove all databases (requires re-download)
rm -rf databases/
## Remove OpenFold repository
rm -rf openfold/
Step 12. Next steps
Test the installation with a well-known protein structure to verify accuracy:
## Test with ubiquitin (PDB: 1UBQ)
cat > test_ubiquitin.fasta << 'EOF'
>1UBQ
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
EOF
python openfold/run_pretrained_openfold.py \
--fasta_path test_ubiquitin.fasta \
--job_name ubiquitin_test \
--output_dir openfold_out \
--model_device cuda:0 \
--param_path $OF_PARAM_PATH \
--pdb70_database_path $OF_DB_PDB70 \
--uniref90_database_path $OF_DB_UNIREF90 \
--mgnify_database_path $OF_DB_MGNIFY \
--uniclust30_database_path $OF_DB_UNICLUST30 \
--bfd_database_path $OF_DB_BFD \
--template_mmcif_dir $OF_DB_MMCIF \
--obsolete_pdbs_path $OF_DB_OBSOLETE \
--skip_relaxation
For production use, consider:
- Enabling structure relaxation for higher accuracy (remove
--skip_relaxation) - Setting up batch processing for multiple sequences
- Integrating with drug discovery pipelines
- Scaling to full proteomes using DGX Spark clusters