| 1 | +# How to update the Python parser |
| 2 | + |
| 3 | +## Step 1: Add an extractor test |
| 4 | + |
| 5 | +Extractor parser tests live in the `tests/parser` directory. There are two kinds of tests: |
| 6 | + |
| 7 | +- Tests that compare the behavior of the old (Python-based) and new (`tree-sitter`-based) parsers, and verify that they yield the same output on the given source files, and |
| 8 | +- Tests that compare the output of the parser (old or new) against a fixed `.expected` file. |
| 9 | + |
| 10 | +Which kind of test is run depends on the file name. |
| 11 | +If it ends in either `_new.py` or `_old.py`, then the test is run against an `.expected` file. If not, it is used to compare old against new. |
| 12 | + |
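The file-name rule above can be sketched as a small shell function. This is only an illustration of the rule as described here, not the logic the test harness itself uses; `test_kind` and the file name `strings.py` are hypothetical.

```sh
# Hypothetical helper: report which kind of parser test a file name selects,
# following the naming rule described above.
test_kind() {
    case "$1" in
        *_new.py | *_old.py) echo "expected-file" ;;  # compared against an .expected file
        *) echo "old-vs-new" ;;                       # old and new parser output compared
    esac
}

test_kind types_new.py
test_kind strings.py
```
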
| 13 | +In most cases when adding new features, you'll only be interested in modifying the new parser (the old one is mostly there for legacy reasons). |
| 14 | +Thus, you will almost certainly want to create a test that ends in `_new.py`. |
| 15 | + |
| 16 | +It's a good habit to start by adding the parser test, as this makes it easier to check that each part of the parser has been added or modified successfully. |
| 17 | + |
| 18 | +The rest of this document will only concern itself with the process of extending the _new_ parser. |
| 19 | + |
| 20 | +To actually _run_ the tests, the easiest way is to use `pytest`. |
| 21 | +In the main `extractor` directory (i.e. where this file is located) run |
| 22 | + |
| 23 | +```sh |
| 24 | +pytest tests/test_parser.py |
| 25 | +``` |
| 26 | + |
| 27 | +and wait for the tests to complete. It is normal and expected that the tests appear to freeze on the first run. |
| 28 | +This is simply because the `tsg-python` Rust binary is being built in the background. |
| 29 | + |
| 30 | +Once you have added a new test (or modified an old one) and start making modifications to the parser itself, it quickly becomes tedious to run _all_ the parser tests. |
| 31 | +To run just a single test using `pytest`, use the `::` syntax to specify specific tests. |
| 32 | +For instance, if you want to just run the tests associated with the file `types_new.py`, you would write |
| 33 | + |
| 34 | +```sh |
| 35 | +pytest tests/test_parser.py::ParserTest::test_types_new |
| 36 | +``` |
| 37 | + |
| 38 | +## Step 2: Extend the `tree-sitter-python` grammar |
| 39 | + |
| 40 | +The new parser is based on `tree-sitter`, so the first task is to extend the existing `tree-sitter-python` grammar. |
| 41 | +This grammar can be found in the `grammar.js` file in the `tsg-python/tsp` subdirectory of the extractor directory. |
| 42 | + |
| 43 | +Note that whenever changes are made to `grammar.js`, you must regenerate the parser files by running |
| 44 | + |
| 45 | +```sh |
| 46 | +tree-sitter generate |
| 47 | +``` |
| 48 | + |
| 49 | +inside the `tsp` directory. |
| 50 | +You'll need to install the `tree-sitter` CLI in order to run this command. |
| 51 | +One way to install it is to use `cargo`: |
| 52 | + |
| 53 | +```sh |
| 54 | +cargo install tree-sitter-cli |
| 55 | +``` |
| 56 | + |
| 57 | +(This presupposes you have `cargo` available, but you'll need this anyway when compiling `tsg-python`.) |
| 58 | + |
| 59 | +Once the parser files have been regenerated, they'll get picked up automatically when `tsg-python` is rebuilt. |
| 60 | + |
| 61 | +> Pro-tip: When you're done with your parser changes, and go to commit these to a branch, put the autogenerated files in their own commit. |
| 62 | +> This makes it easier to review the changes, and if you need to go back and regenerate the files again, it's easy to modify just that commit. |
| 63 | + |
| 64 | +Once you have extended `grammar.js` and regenerated the parser files, you should be able to check that the grammar changes are sufficient by rerunning the parser test using `pytest`. If it fails while producing an AST that doesn't make sense, then you're probably on the right track. If it fails _without_ producing an AST, then something went wrong with the actual `tree-sitter` parse. To check if this is the case, you can run |
| 65 | + |
| 66 | +```sh |
| 67 | +tree-sitter parse path/to/test.py |
| 68 | +``` |
| 69 | + |
| 70 | +and see what kind of errors are emitted (possibly as `ERROR` or `MISSING` nodes in the AST that is output). |
| 71 | + |
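For larger files, it may help to filter the parse output down to the lines that mention `ERROR` or `MISSING`. The following is a minimal sketch; `parse_problems` is a hypothetical helper, and the sample text only approximates what `tree-sitter parse` prints.

```sh
# Hypothetical helper: read a parse tree on stdin and print, with line
# numbers, only the lines that mention ERROR or MISSING nodes.
parse_problems() {
    grep -nE 'ERROR|MISSING'
}

# Made-up sample output, used here only to demonstrate the filter.
sample='(module [0, 0] - [1, 0]
  (ERROR [0, 0] - [0, 5]
    (identifier [0, 0] - [0, 3])))'

printf '%s\n' "$sample" | parse_problems
```

In practice you would pipe the real output through the filter, for example `tree-sitter parse path/to/test.py | parse_problems`.
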
| 72 | +## Step 3: Extend `python.tsg` |
| 73 | + |
| 74 | +Once the grammar has been extended, we need to also tell `tsg-python` how to turn the `tree-sitter-python` AST into something that better matches the AST structure that we use in the Python extractor. |
| 75 | + |
| 76 | +For an introduction to the language of `tree-sitter-graph` (and in particular how we use it in the Python extractor), see the `README.md` file in the `tsg-python` directory. |
| 77 | + |
| 78 | +## Step 4: Extend the set of known AST nodes |
| 79 | + |
| 80 | +If you added new node types, or added fields to known node types, then you'll need to update a few files in the Python extractor before it is able to reconstruct the output from `tsg-python`. |
| 81 | + |
| 82 | +New AST nodes should be added in two places: `master.py` and `semmle/python/ast.py`. The former of these is used to automatically generate the Python dbscheme and `AstGenerated.qll`. The latter is what the parser actually uses as its internal representation of the AST. |
| 83 | + |
| 84 | +## Step 5: Rebuild the autogenerated AST and database scheme |
| 85 | + |
| 86 | +If you made changes to `master.py`, you'll need to regenerate a couple of files. This can be done from within the `extractor` directory using the `make dbscheme` and `make ast` commands. Note that for the latter, you need a copy of the CodeQL CLI present, as it is used to autoformat the `AstGenerated.qll` file. |
| 87 | + |
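Since `make ast` depends on the CodeQL CLI, it can be worth checking for it before regenerating. The sketch below assumes the CLI is reachable as `codeql` on your `PATH`, which this guide does not state; `require_tool` is a hypothetical helper.

```sh
# Hypothetical helper: succeed if the named tool is on PATH; otherwise
# report it on stderr and fail.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

require_tool sh
```

Under that assumption, `require_tool codeql && make ast` would only start the regeneration when the CLI is available.
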
| 88 | +## Step 6: Add dbscheme upgrade and downgrade scripts |
| 89 | + |
| 90 | +If you ended up making changes to the database scheme in step 5, then you'll need to add an appropriate pair of upgrade and downgrade scripts to handle the changes between the different versions of the dbscheme. |
| 91 | + |
| 92 | +This can be a bit fiddly, but luckily there are [tools](https://github.com/github/codeql/blob/main/misc/scripts/prepare-db-upgrade.sh) that can help set up some of the necessary files for you. |
| 93 | + |
| 94 | +See also the [guide](https://github.com/github/codeql/blob/main/docs/prepare-db-upgrade.md) for preparing database upgrades. |