|
18 | 18 | " generate many designs and filter them with a set of computational filters and\n", |
19 | 19 | " ranking mechanisms.\n", |
20 | 20 | "\n", |
21 | | - "And we validated a small number of the generated designs in a wet lab, which of course you can also do... but this notebook isn't very helpful with that!" |
| 21 | + "And we validated a small number of the generated designs in a wet lab, which of course you can also do... but this notebook isn't very helpful with that!\n" |
22 | 22 | ] |
23 | 23 | }, |
24 | 24 | { |
|
29 | 29 | "source": [ |
30 | 30 | "## Set up the notebook and model (via the Forge API).\n", |
31 | 31 | "\n", |
32 | | - "We begin by installing the [esm package](https://github.com/evolutionaryscale/esm) and py3Dmol, which will allow us to visualize our generations, and then importing necessary packages." |
| 32 | + "We begin by installing the [esm package](https://github.com/evolutionaryscale/esm) and py3Dmol, which will allow us to visualize our generations, and then importing necessary packages.\n" |
33 | 33 | ] |
34 | 34 | }, |
35 | 35 | { |
36 | 36 | "cell_type": "code", |
37 | | - "execution_count": null, |
| 37 | + "execution_count": 1, |
38 | 38 | "metadata": { |
39 | 39 | "id": "VgTZdaIMQ44H" |
40 | 40 | }, |
|
50 | 50 | }, |
51 | 51 | { |
52 | 52 | "cell_type": "code", |
53 | | - "execution_count": 13, |
| 53 | + "execution_count": 2, |
54 | 54 | "metadata": { |
55 | 55 | "id": "poK5NTaXRGcX" |
56 | 56 | }, |
|
75 | 75 | "id": "vmVYm2uQ7m-5" |
76 | 76 | }, |
77 | 77 | "source": [ |
78 | | - "\n", |
79 | | - "ESM3 is a frontier generative model for biology. It is scalable due to its ability to tokenize sequence, structure, and function and use a (nearly standard) transformer architecture while still being able to reason across all modalities simulateously.\n", |
| 78 | + "ESM3 is a frontier generative model for biology. It is scalable due to its ability to tokenize sequence, structure, and function, using a (nearly standard) transformer architecture while still being able to reason across all modalities simultaneously.\n", |
80 | 79 | "\n", |
81 | 80 | "The largest ESM3 (98 billion parameters) was trained with 1.07e24 FLOPs on 2.78 billion proteins and 771 billion unique tokens. To create esmGFP we used the 7 billion parameter variant of ESM3. We'll use this model via the [EvolutionaryScale Forge](https://forge.evolutionaryscale.ai) API.\n", |
82 | 81 | "\n", |
83 | | - "Grab a token from [the Forge console](https://forge.evolutionaryscale.ai/console) and add it below. Note that your token is like a password for your account and you should take care to protect it. For this reason it is recommended to frequently create a new token and delete old, unused ones. It is also recommended to paste the token directly into an environment variable or use a utility like `getpass` as shown below so tokens are not accidentally shared or checked into code repositories." |
| 82 | + "Grab a token from [the Forge console](https://forge.evolutionaryscale.ai/console) and add it below. Note that your token is like a password for your account and you should take care to protect it. For this reason it is recommended to frequently create a new token and delete old, unused ones. It is also recommended to paste the token directly into an environment variable or use a utility like `getpass` as shown below so tokens are not accidentally shared or checked into code repositories.\n" |
84 | 83 | ] |
85 | 84 | }, |
86 | 85 | { |
87 | 86 | "cell_type": "code", |
88 | | - "execution_count": 14, |
| 87 | + "execution_count": 3, |
89 | 88 | "metadata": { |
90 | 89 | "id": "zNrU9Q2SYonX" |
91 | 90 | }, |
|
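The token-hygiene advice above (keep the token out of notebook cells and repositories) can be sketched as a small helper. This is an illustrative sketch, not the notebook's actual cell; the environment variable name `ESM_API_KEY` and the helper function are hypothetical:

```python
import getpass
import os


def get_forge_token(env_var: str = "ESM_API_KEY") -> str:
    """Return a Forge API token, preferring an environment variable.

    Falls back to an interactive getpass prompt so the token is never
    typed directly into a cell (and thus never committed to a repo).
    """
    token = os.environ.get(env_var)
    if token:
        return token
    return getpass.getpass("Forge API token: ")
```

Because the token is read at call time rather than pasted into source, rotating it only requires updating the environment variable.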
100 | 99 | "id": "9jIc4OZyh2oE" |
101 | 100 | }, |
102 | 101 | "source": [ |
103 | | - "We then create a model stub that behaves somewhat like a PyTorch model, but under the hood it sends the inputs to the Forge server, runs the inputs through the neural network weights on that remote server, and then returns the output tensors here in this notebook. This stub can also be used in the EvolutionaryScale SDK to simplify a lot of the operations around generation, folding, and generally using the sampling. This is important because iterative sampling is key to getting the best performance from ESM3, and the SDK manages a lot of the complexity around implementing these standard routines." |
| 102 | + "We then create a model stub that behaves somewhat like a PyTorch model, but under the hood it sends the inputs to the Forge server, runs them through the neural network weights on that remote server, and then returns the output tensors to this notebook. This stub can also be used with the EvolutionaryScale SDK, which simplifies many of the operations around generation, folding, and sampling. This matters because iterative sampling is key to getting the best performance from ESM3, and the SDK manages a lot of the complexity of implementing these standard routines.\n" |
104 | 103 | ] |
105 | 104 | }, |
106 | 105 | { |
107 | 106 | "cell_type": "code", |
108 | | - "execution_count": 26, |
| 107 | + "execution_count": 4, |
109 | 108 | "metadata": { |
110 | 109 | "id": "Tna_mjGOjdXA" |
111 | 110 | }, |
|
138 | 137 | "\n", |
139 | 138 | "Prompt engineering is a bit of an art and a bit of a science, so one typically needs to experiment to get a prompt that produces a desired result. Also, because we use sampling to generate from the model, the results of different generations from the same prompt will vary. Some prompts tend to have higher success rates, requiring only a few generations to get a candidate protein design. Other, more difficult prompts may require thousands of generations! The models are more controllable with alignment.\n", |
140 | 139 | "\n", |
141 | | - "The model we will be using is the raw pretrained (unaligned) model, but we've worked a lot on this prompt so one can typically get an interesting design with only a few generations." |
| 140 | + "The model we will be using is the raw pretrained (unaligned) model, but we've worked a lot on this prompt, so one can typically get an interesting design with only a few generations.\n" |
142 | 141 | ] |
143 | 142 | }, |
144 | 143 | { |
|
147 | 146 | "id": "qtwnyA1BngWy" |
148 | 147 | }, |
149 | 148 | "source": [ |
150 | | - "We'll construct our prompt from fragments of the [1qy3](https://www.rcsb.org/structure/1qy3) sequence and structure from the PDB. The following code fetches data from the PDB and then uses ESM3's tokenizers to convert the sequence and structure to tokens that can be passed into the model. Once can see that both the amino acid type and the coordinates are converted into one discrete token per sequence position." |
| 149 | + "We'll construct our prompt from fragments of the [1qy3](https://www.rcsb.org/structure/1qy3) sequence and structure from the PDB. The following code fetches data from the PDB and then uses ESM3's tokenizers to convert the sequence and structure to tokens that can be passed into the model. One can see that both the amino acid type and the coordinates are converted into one discrete token per sequence position.\n" |
151 | 150 | ] |
152 | 151 | }, |
153 | 152 | { |
|
161 | 160 | "template_gfp = ESMProtein.from_protein_chain(\n", |
162 | 161 | " ProteinChain.from_rcsb(\"1qy3\", chain_id=\"A\")\n", |
163 | 162 | ")\n", |
164 | | - "\n", |
165 | 163 | "template_gfp_tokens = model.encode(template_gfp)\n", |
166 | 164 | "\n", |
167 | 165 | "print(\"Sequence tokens:\")\n", |
|
183 | 181 | "source": [ |
184 | 182 | "We'll now build a prompt. Specifically we'll specify 4 amino acid identities at positions near where we want the chromophore to form, and 2 amino acid identities on the beta barrel that are known to support chromophore formation.\n", |
185 | 183 | "\n", |
186 | | - "Furthermore we'll specify the structure should be similar to the 1qy3 structure at all these positions by adding tokens from the encoded 1qy3 structure to the structure track of our prompt. We'll also specify a few more positions (along the alpha helix kink)." |
| 184 | + "Furthermore we'll specify the structure should be similar to the 1qy3 structure at all these positions by adding tokens from the encoded 1qy3 structure to the structure track of our prompt. We'll also specify a few more positions (along the alpha helix kink).\n" |
187 | 185 | ] |
188 | 186 | }, |
189 | 187 | { |
|
229 | 227 | "source": [ |
230 | 228 | "The output shows the original 1qy3 sequence, the amino acid identities on our prompt's sequence track, and the positions that have a token on the structure track. ESM3 will then be tasked with filling in the structure and sequence at the remaining masked (underscore) positions.\n", |
231 | 229 | "\n", |
232 | | - "One small note, we introduced the mutation A93R in our prompt. This isn't a mistake. Using Alanine at this position causes the chromophore to mature extremely slowly (which is how we are able to measure the precyclized structure of GFP!). However we don't want to wait around for our GFPs to glow so we go with Arginine at this position." |
| 230 | + "One small note: we introduced the mutation A93R in our prompt. This isn't a mistake. Using Alanine at this position causes the chromophore to mature extremely slowly (which is how we are able to measure the precyclized structure of GFP!). However, we don't want to wait around for our GFPs to glow, so we go with Arginine at this position.\n" |
233 | 231 | ] |
234 | 232 | }, |
235 | 233 | { |
|
256 | 254 | "source": [ |
257 | 255 | "%%time\n", |
258 | 256 | "\n", |
259 | | - "num_tokens_to_decode = (prompt.structure == 4096).sum().item()\n", |
| 257 | + "num_tokens_to_decode = min((prompt.structure == 4096).sum().item(), 20)\n", |
260 | 259 | "\n", |
261 | 260 | "structure_generation = model.generate(\n", |
262 | 261 | " prompt,\n", |
|
276 | 275 | ")\n", |
277 | 276 | "\n", |
278 | 277 | "# Decodes structure tokens to backbone coordinates.\n", |
279 | | - "structure_generation_protein = model.decode(structure_generation)\n", |
280 | | - "\n", |
281 | | - "print(\"\")" |
| 278 | + "structure_generation_protein = model.decode(structure_generation)" |
282 | 279 | ] |
283 | 280 | }, |
284 | 281 | { |
|
287 | 284 | "id": "0HARel94tJfI" |
288 | 285 | }, |
289 | 286 | "source": [ |
290 | | - "Now let's visualize our generated structure. This will probably look like the familiar GFP beta barrel around an alpha helix." |
| 287 | + "Now let's visualize our generated structure. This will probably look like the familiar GFP beta barrel around an alpha helix.\n" |
291 | 288 | ] |
292 | 289 | }, |
293 | 290 | { |
|
316 | 313 | "source": [ |
317 | 314 | "At this point we only want to continue the generation if this design is a close match to a wildtype GFP at the active site, has some structural difference across the full protein (otherwise it would end up being very sequence-similar to wildtype GFP), and overall still looks like the classic GFP alpha helix in a beta barrel structure.\n", |
318 | 315 | "\n", |
319 | | - "Of course when generating many designs we cannot look at each one manually, so we adopt some automated rejection sampling criteria based on the overall structure RMSD and the constrained site RMSD for our generated structure being faithful to the prompt. If these checks pass then we'll try to design a sequence for this structure. If not, one should go back up a few cells and design another structure until it passes these computational screens. (Or not... this is your GFP design!)" |
| 316 | + "Of course, when generating many designs we cannot look at each one manually, so we adopt automated rejection-sampling criteria: the overall structure RMSD and the constrained-site RMSD, which measure how faithful our generated structure is to the prompt. If these checks pass, we'll try to design a sequence for this structure. If not, go back up a few cells and design another structure until it passes these computational screens. (Or not... this is your GFP design!)\n" |
320 | 317 | ] |
321 | 318 | }, |
322 | 319 | { |
|
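The rejection-sampling filter described above can be sketched as a simple accept/reject function. This is a minimal illustration, not the notebook's code: the cutoff values are hypothetical, and the coordinates are assumed to be already superimposed (in practice the SDK's alignment utilities would handle that first).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates, assumed to be already superimposed."""
    assert len(coords_a) == len(coords_b) and coords_a
    sq = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(sq / len(coords_a))


def accept_design(backbone_rmsd, site_rmsd,
                  site_cutoff=1.5, min_backbone_rmsd=1.0):
    """Keep a generation only if the constrained site stays faithful to
    the prompt (low site RMSD) while the overall backbone differs enough
    from the template to be interesting. Cutoffs here are hypothetical."""
    return site_rmsd < site_cutoff and backbone_rmsd > min_backbone_rmsd
```

Designs failing either check are discarded and a new structure is generated, which is exactly the loop the text describes.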
354 | 351 | "\n", |
355 | 352 | "Now we have a backbone with some structural variation but that also matches the GFP constrained site, and we want to design a sequence that folds to this structure. We can use the prior generation, which is exactly our original prompt plus the new structure tokens representing the backbone, to prompt ESM3 again.\n", |
356 | 353 | "\n", |
357 | | - "One we have designed a sequence we'll want to confirm that sequence is a match for our structure, so we'll remove all other conditioning from the prompt and fold the sequence. Conveniently with ESM3, folding a sequence is simply generating a set of structure tokens conditioned on the amino acid sequence. In this case we want the model's highest confidence generation (with no diversity) so we sample with a temperature of zero." |
| 354 | + "Once we have designed a sequence, we'll want to confirm that the sequence is a match for our structure, so we'll remove all other conditioning from the prompt and fold the sequence. Conveniently, with ESM3, folding a sequence is simply generating a set of structure tokens conditioned on the amino acid sequence. In this case we want the model's highest-confidence generation (with no diversity), so we sample with a temperature of zero.\n" |
358 | 355 | ] |
359 | 356 | }, |
360 | 357 | { |
|
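Temperature-zero sampling, as used above for folding, means taking the model's single most likely token at each step. A toy sketch over one logit vector (not the SDK's implementation) shows why zero temperature is deterministic while higher temperatures add diversity:

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from a logit vector. A temperature of zero
    degenerates to argmax: the single highest-confidence token."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature, shifted by the max to avoid overflow.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With `temperature=0.0` every call returns the same index, which is why the text describes it as the model's highest-confidence generation "with no diversity".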
394 | 391 | "id": "v_zK7TDCzEX3" |
395 | 392 | }, |
396 | 393 | "source": [ |
397 | | - "We now have a candidate GFP sequence!" |
| 394 | + "We now have a candidate GFP sequence!\n" |
398 | 395 | ] |
399 | 396 | }, |
400 | 397 | { |
|
414 | 411 | "id": "LBvQYpR_zQAK" |
415 | 412 | }, |
416 | 413 | "source": [ |
417 | | - "We can align this sequence against the original template to see how similar it is to avGFP. One might also want to search against all known fluorescent proteins to assess the novelty of this potential GFP." |
| 414 | + "We can align this sequence against the original template to see how similar it is to avGFP. One might also want to search against all known fluorescent proteins to assess the novelty of this potential GFP.\n" |
418 | 415 | ] |
419 | 416 | }, |
420 | 417 | { |
|
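Quantifying "how similar" the design is to avGFP usually comes down to percent identity over an alignment. This sketch assumes two already-aligned, equal-length strings (gaps as `-`) and is not the notebook's alignment code, which would come from a proper aligner:

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns, ignoring columns where
    both sequences have a gap."""
    assert len(aligned_a) == len(aligned_b)
    cols = [(a, b) for a, b in zip(aligned_a, aligned_b)
            if not (a == "-" and b == "-")]
    matches = sum(1 for a, b in cols if a == b and a != "-")
    return 100.0 * matches / len(cols)
```

A low identity to every known fluorescent protein is what would make a bright candidate genuinely novel.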
455 | 452 | "source": [ |
456 | 453 | "We now recheck our computational metrics for the constrained site. If we see the constrained site is not a match then we'd want to try designing the sequence again. If many attempts to design a sequence that matches the structure fail, then it's likely the structure is not easily designable and we may want to reject this structure generation as well!\n", |
457 | 454 | "\n", |
458 | | - "At this point the backbone RMSD doesn't matter very much to us, so long as the sequence is adequately distant to satisfy our scientific curiosity!" |
| 455 | + "At this point the backbone RMSD doesn't matter very much to us, so long as the sequence is adequately distant to satisfy our scientific curiosity!\n" |
459 | 456 | ] |
460 | 457 | }, |
461 | 458 | { |
|
487 | 484 | "id": "0cIeC4Lg1Bz9" |
488 | 485 | }, |
489 | 486 | "source": [ |
490 | | - "An now we can visualize the final structure prediction of our candidate GFP design." |
| 487 | + "And now we can visualize the final structure prediction of our candidate GFP design.\n" |
491 | 488 | ] |
492 | 489 | }, |
493 | 490 | { |
|
511 | 508 | "id": "VrNuZHeHRWuP" |
512 | 509 | }, |
513 | 510 | "source": [ |
514 | | - "Before considering this sequence for wet lab validation, we run a joint optimization of the sequence and structure. The outputs of that process are then passed through stringent computational filters and then many designs from many starting points are ranked by a number of computational scores to select the final designs sent for testing. We'll walk through that process in a different notebook." |
| 511 | + "Before considering this sequence for wet lab validation, we run a joint optimization of the sequence and structure. The outputs of that process are then passed through stringent computational filters and then many designs from many starting points are ranked by a number of computational scores to select the final designs sent for testing. We'll walk through that process in a different notebook.\n" |
515 | 512 | ] |
516 | 513 | }, |
517 | 514 | { |
|
520 | 517 | "id": "c3jSQrJa1Tfi" |
521 | 518 | }, |
522 | 519 | "source": [ |
| 520 | + "If you've made it this far, it's worth noting that this isn't the only way to prompt ESM3 to design a GFP; it's just the one we used to report the successful generation of esmGFP in our paper. We hope you'll try different techniques to generate from ESM3. We're interested to hear what works for you!\n" |
| 520 | + "If you've made it this far it's worth noting that this isn't the only method to prompt ESM3 to design a GFP, it's just the one we used to report the successful generation of esmGFP in our paper. We hope you'll try different techniques to generate from ESM3. We're interested to hear what works for you!\n" |
524 | 521 | ] |
525 | 522 | } |
526 | 523 | ], |
|
543 | 540 | "name": "python", |
544 | 541 | "nbconvert_exporter": "python", |
545 | 542 | "pygments_lexer": "ipython3", |
546 | | - "version": "3.10.0" |
| 543 | + "version": "3.11.11" |
547 | 544 | } |
548 | 545 | }, |
549 | 546 | "nbformat": 4, |
|