# Embedding creation
This uses the script `embed_windows_full.py` (it still needs a bit of polishing, like the previous one):
```python
import pathlib

import torch
from evo2 import Evo2

# 1. Load Evo 2 (auto-loads on GPU 0 if CUDA is visible; I was having trouble with Slurm on the HPCC)
evo2_model = Evo2("evo2_7b")
tok = evo2_model.tokenizer

# 2. Read the FASTA window (a random 8,192 bp piece of chromosome 4)
fa_path = pathlib.Path("windows_full/NC_003075.7_13099008_13107199.fa")
sequence = fa_path.read_text().splitlines()[1].strip().upper()  # 8,192 bp

# 3. Tokenize and move the input to the GPU if available
input_ids = torch.tensor(tok.tokenize(sequence), dtype=torch.int)[None]  # shape (1, 8192)
if torch.cuda.is_available():
    input_ids = input_ids.cuda()

# Layer to extract; the BRCA1 notebook from the paper reports that blocks.27.mlp.l3 performed better
layer_name = "blocks.28.mlp.l3"

# 4. Forward pass (no gradients, faster)
with torch.inference_mode():
    outputs, embeddings = evo2_model(
        input_ids,
        return_embeddings=True,
        layer_names=[layer_name],
    )

print("Embeddings shape:", embeddings[layer_name].shape)  # returned (1, 8192, 4096)

# 5. Save to disk
out_dir = pathlib.Path("embeddings_full")
out_dir.mkdir(exist_ok=True)
torch.save(
    embeddings[layer_name].squeeze(0).cpu(),  # (8192, 4096)
    out_dir / (fa_path.stem + ".pt"),
)
print("Saved tensor →", out_dir / (fa_path.stem + ".pt"))
```
To run it, I executed this directly on the node:

```bash
singularity exec --nv \
  -B "$PWD/huggingface:/root/.cache/huggingface" \
  -B "$PWD/windows_full:/workspace/windows_full" \
  -B "$PWD/embeddings_full:/workspace/embeddings_full" \
  -B "$PWD/embed_windows_full.py:/workspace/embed_windows_full.py" \
  evo2_latest.sif python3 /workspace/embed_windows_full.py
```

I was also trying a Slurm batch script, but that still needs a bit more understanding on my part (Nick told me about `salloc` for running interactively; I'll study it later). A rough, untested sketch of that interactive route is below.
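Something along these lines should work for the interactive route on a Slurm cluster; this is only a sketch, and the GPU count, CPU, memory, and time values are placeholders that depend on the HPCC setup:

```bash
# Request an interactive allocation with one GPU (resource values are placeholders)
salloc --gres=gpu:1 --cpus-per-task=8 --mem=64G --time=02:00:00

# Once the allocation is granted, run the same container command on the allocated node
srun singularity exec --nv \
  -B "$PWD/huggingface:/root/.cache/huggingface" \
  -B "$PWD/windows_full:/workspace/windows_full" \
  -B "$PWD/embeddings_full:/workspace/embeddings_full" \
  -B "$PWD/embed_windows_full.py:/workspace/embed_windows_full.py" \
  evo2_latest.sif python3 /workspace/embed_windows_full.py
```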