Commit f2d457d

Merge pull request #18145 from github/tausbn/python-add-guide-for-extending-the-parser
Python: Add guide describing how to extend the parser
2 parents b7792d6 + a9817a0 commit f2d457d

1 file changed: 94 additions, 0 deletions
@@ -0,0 +1,94 @@
# How to update the Python parser

## Step 1: Add an extractor test

Extractor parser tests live in the `tests/parser` directory. There are two different kinds of tests:

- Tests that compare the behavior of the old (Python-based) and new (`tree-sitter`-based) parsers, and verify that they yield the same output on the given source files, and
- Tests that compare the output of the parser (old or new) against a fixed `.expected` file.

Which kind of test is run is determined by the file name.
If it ends in either `_new.py` or `_old.py`, then the test is run against an `.expected` file. If not, it is used to compare old against new.
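
The naming rule can be sketched as a small shell function. This only illustrates the rule as stated above and is not the logic that `tests/test_parser.py` actually uses; the function name `parser_test_kind`, the labels it prints, and the file name `strings.py` are made up for this example.

```sh
# Hypothetical helper: report which kind of parser test a file name selects.
# Mirrors the rule described above; it is not part of the extractor.
parser_test_kind() {
    case "$1" in
        *_new.py|*_old.py) echo "expected-file" ;;
        *)                 echo "old-vs-new" ;;
    esac
}

parser_test_kind types_new.py    # prints "expected-file"
parser_test_kind strings.py      # prints "old-vs-new"
```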

In most cases when adding new features, you'll only be interested in modifying the new parser (the old one is mostly there for legacy reasons).
Thus, you will almost certainly want to create a test whose file name ends in `_new.py`.

It's a good habit to start by adding the parser test, as this makes it easier to check when various bits of the parser have been added or modified successfully.

The rest of this document concerns itself only with the process of extending the _new_ parser.

To actually _run_ the tests, the easiest way is to use `pytest`.
In the main `extractor` directory (i.e. where this file is located), run

```sh
pytest tests/test_parser.py
```

and wait for the tests to complete. It is normal and expected that the test run seemingly freezes the first time.
This is simply because the `tsg-python` Rust binary is being built in the background.

Once you have added a new test (or modified an old one) and start making modifications to the parser itself, it quickly becomes tedious to run _all_ the parser tests.
To run just a single test using `pytest`, use the `::` syntax to select specific tests.
For instance, if you want to run just the tests associated with the file `types_new.py`, you would write

```sh
pytest tests/test_parser.py::ParserTest::test_types_new
```
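
If you run single tests often, a tiny helper can build the test ID for you. This sketch assumes that the test method is always named `test_` followed by the file name without its `.py` extension, which is inferred from the `types_new.py` example above rather than confirmed; `parser_test_id` is a hypothetical name.

```sh
# Hypothetical helper: build the pytest ID for a parser test source file.
# Assumes the method is named test_<file name without .py>.
parser_test_id() {
    name=$(basename "$1" .py)
    echo "tests/test_parser.py::ParserTest::test_${name}"
}

# Usage, from the extractor directory:
#   pytest "$(parser_test_id types_new.py)"
```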

## Step 2: Extend the `tree-sitter-python` grammar

The new parser is based on `tree-sitter`, so the first task is to extend the existing `tree-sitter-python` grammar.
This grammar can be found in the `grammar.js` file in the `tsg-python/tsp` subdirectory of the extractor directory.

Note that whenever changes are made to `grammar.js`, you must regenerate the parser files by running

```sh
tree-sitter generate
```

inside the `tsp` directory.
You'll need to install the `tree-sitter` CLI in order to run this command.
One way to install it is to use `cargo`:

```sh
cargo install tree-sitter-cli
```

(This presupposes you have `cargo` available, but you'll need it anyway to compile `tsg-python`.)

Once the parser files have been regenerated, they'll get picked up automatically when `tsg-python` is rebuilt.

> Pro-tip: When you're done with your parser changes and go to commit them to a branch, put the autogenerated files in their own commit.
> This makes it easier to review the changes, and if you need to go back and regenerate the files again, it's easy to modify just that commit.

Once you have extended `grammar.js` and regenerated the parser files, you should be able to check that the grammar changes are sufficient by rerunning the parser test using `pytest`. If it fails while producing an AST that doesn't make sense, then you're probably on the right track. If it fails _without_ producing an AST, then something went wrong with the actual `tree-sitter` parse. To check if this is the case, you can run

```sh
tree-sitter parse path/to/test.py
```

and see what kind of errors are emitted (possibly as `ERROR` or `MISSING` nodes in the AST that is output).
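
For larger test files it can help to filter the output down to just the problem nodes. The following is a minimal sketch, assuming the tree is printed as S-expressions in which such nodes start with `(ERROR` or `(MISSING`; the helper name `find_parse_errors` is hypothetical.

```sh
# Hypothetical helper: list ERROR and MISSING nodes in `tree-sitter parse` output.
# Reads the printed tree on stdin and prefixes each hit with its output line number.
# Exits non-zero when no such node is found.
find_parse_errors() {
    grep -nE '\((ERROR|MISSING)'
}

# Usage:
#   tree-sitter parse path/to/test.py | find_parse_errors
```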

## Step 3: Extend `python.tsg`

Once the grammar has been extended, we also need to tell `tsg-python` how to turn the `tree-sitter-python` AST into something that better matches the AST structure that we use in the Python extractor.

For an introduction to the language of `tree-sitter-graph` (and in particular how we use it in the Python extractor), see the `README.md` file in the `tsg-python` directory.

## Step 4: Extend the set of known AST nodes

If you added new node types, or added fields to known node types, then you'll need to update a few files in the Python extractor before it is able to reconstruct the output from `tsg-python`.

New AST nodes should be added in two places: `master.py` and `semmle/python/ast.py`. The former is used to automatically generate the Python dbscheme and `AstGenerated.qll`. The latter is what the parser actually uses as its internal representation of the AST.

## Step 5: Rebuild the autogenerated AST and database scheme

If you made changes to `master.py`, you'll need to regenerate a couple of files. This can be done from within the `extractor` directory using the `make dbscheme` and `make ast` commands. Note that `make ast` requires a copy of the CodeQL CLI, as it is used to autoformat the `AstGenerated.qll` file.
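
The two commands can be wrapped in a small function so that they always run together. This is only a sketch: the function name `regenerate_ast_files` is hypothetical, and the check assumes that the CodeQL CLI is found on your `PATH` as `codeql`, which may not match how your checkout locates it.

```sh
# Hypothetical wrapper around the two make targets named above.
# Run from the extractor directory after editing master.py.
regenerate_ast_files() {
    # Assumption: the CodeQL CLI is available on PATH as `codeql`.
    if ! command -v codeql >/dev/null 2>&1; then
        echo "warning: codeql not found on PATH; 'make ast' may fail" >&2
    fi
    # Only run `make ast` if `make dbscheme` succeeded.
    make dbscheme && make ast
}
```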

## Step 6: Add dbscheme upgrade and downgrade scripts

If you ended up making changes to the database scheme in step 5, then you'll need to add an appropriate pair of upgrade and downgrade scripts to handle any changes between the different versions of the dbscheme.

This can be a bit fiddly, but luckily there are [tools](https://github.com/github/codeql/blob/main/misc/scripts/prepare-db-upgrade.sh) that can help set up some of the necessary files for you.

See also the [guide](https://github.com/github/codeql/blob/main/docs/prepare-db-upgrade.md) for preparing database upgrades.
