Embedding creation - danifilho/Evo2_BASF GitHub Wiki

Using the script "embed_windows_full.py" (I still need to improve it a bit, like the previous one):

import torch, pathlib
from evo2 import Evo2

# 1. Loading Evo 2 
evo2_model = Evo2("evo2_7b")           # auto-loads on GPU-0 if CUDA is visible (I was having problems with Slurm on the HPCC)
tok        = evo2_model.tokenizer

# 2. Reading the FASTA window 
fa_path = pathlib.Path("windows_full/NC_003075.7_13099008_13107199.fa")  # a random 8,192 bp window from chromosome 4
# join everything after the header line so wrapped FASTA records are handled too
sequence = "".join(fa_path.read_text().splitlines()[1:]).strip().upper()  # 8,192 bp

# 3. Tokenizing & moving input to GPU if available 
input_ids = torch.tensor(tok.tokenize(sequence), dtype=torch.int)[None]  # (1,8192)
if torch.cuda.is_available():
    input_ids = input_ids.cuda()

layer_name = "blocks.28.mlp.l3"  # the paper's BRCA1 notebook reports that blocks.27.mlp.l3 performed better
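# layer_names takes a list, so presumably both candidate layers could be pulled in a
# single forward pass, e.g. layer_names=["blocks.27.mlp.l3", "blocks.28.mlp.l3"],
# with the returned embeddings dict keyed by layer name (not tested here yet)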

# 4. Forward pass
with torch.inference_mode():           # no gradients, faster
    outputs, embeddings = evo2_model(
        input_ids,
        return_embeddings=True,
        layer_names=[layer_name],
    )

print("Embeddings shape:", embeddings[layer_name].shape)   # returned (1, 8192, 4096) = (batch, positions, hidden dim)

# 5. Saving to disk 
out_dir = pathlib.Path("embeddings_full")
out_dir.mkdir(exist_ok=True)
torch.save(
    embeddings[layer_name].squeeze(0).cpu(),               # (8192, 4096)
    out_dir / (fa_path.stem + ".pt")
)
print("Saved tensor →", out_dir / (fa_path.stem + ".pt"))
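
The path above is hard-coded to a single window. One obvious improvement would be looping over every FASTA in windows_full/; a rough sketch (reusing evo2_model, tok, layer_name and out_dir from the script above, and assuming each .fa holds exactly one record) could look like this:

# sketch only: embed every window in windows_full/, one .pt file per FASTA
for fa in sorted(pathlib.Path("windows_full").glob("*.fa")):
    seq = "".join(fa.read_text().splitlines()[1:]).strip().upper()
    ids = torch.tensor(tok.tokenize(seq), dtype=torch.int)[None]
    if torch.cuda.is_available():
        ids = ids.cuda()
    with torch.inference_mode():
        _, emb = evo2_model(ids, return_embeddings=True, layer_names=[layer_name])
    torch.save(emb[layer_name].squeeze(0).cpu(), out_dir / (fa.stem + ".pt"))
    print("saved", fa.stem)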

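Each saved .pt is just a (8192, 4096) tensor of per-nucleotide embeddings, so downstream code only needs torch.load. Purely as an illustration (not part of the pipeline yet), mean-pooling over positions gives one 4096-dim vector per window:

import torch

emb = torch.load("embeddings_full/NC_003075.7_13099008_13107199.pt")  # (8192, 4096)
window_vec = emb.float().mean(dim=0)   # cast in case it was saved in bf16; result is (4096,)
print(emb.shape, window_vec.shape)
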
To run it, I used the following command:

singularity exec --nv \
  -B "$PWD/huggingface:/root/.cache/huggingface" \
  -B "$PWD/windows_full:/workspace/windows_full" \
  -B "$PWD/embeddings_full:/workspace/embeddings_full" \
  -B "$PWD/embed_windows_full.py:/workspace/embed_windows_full.py" \
  evo2_latest.sif \
  python3 /workspace/embed_windows_full.py

This was run directly on the node; I was also trying it with the Slurm script, but that needs a bit more understanding on my side (Nick told me about salloc for running interactively, which I'll study later).
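
For when I get back to the Slurm side, here is a sketch of what the batch script could look like, just wrapping the same singularity command (the job name, GPU request, walltime and memory below are placeholders, not values the HPCC necessarily expects):

#!/bin/bash
#SBATCH --job-name=evo2_embed
#SBATCH --gres=gpu:1          # placeholder GPU request
#SBATCH --time=02:00:00       # placeholder walltime
#SBATCH --mem=64G             # placeholder memory

cd "$SLURM_SUBMIT_DIR"
singularity exec --nv \
  -B "$PWD/huggingface:/root/.cache/huggingface" \
  -B "$PWD/windows_full:/workspace/windows_full" \
  -B "$PWD/embeddings_full:/workspace/embeddings_full" \
  -B "$PWD/embed_windows_full.py:/workspace/embed_windows_full.py" \
  evo2_latest.sif \
  python3 /workspace/embed_windows_full.py

The interactive route Nick mentioned would be something like salloc --gres=gpu:1 --time=01:00:00 to get a GPU shell, and then running the same singularity command from there.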