This is a major release that drops the C++ implementation in favor of a
Rust implementation. Using this library in production for more than a
year has raised multiple concerns. The C++ concurrency model proved hard
to get right when using libgit2 and led to many exceptions and race
conditions. In addition, the C++ implementation showed problems with
Unicode strings and performance degradation. Rust ended up being more
performant, safer, and easier to develop and maintain.
- Replaced the C++ implementation with Rust
- Tiny changes in the API. Rule-adding functions dropped `regex` from
their parameter names (for example, `regex_pattern` is now `pattern`).
- The package now ships binary packages (wheels)
- More performance improvements like avoiding scanning empty files.
- File path and extension skipping rules are now compared in lowercase
-A Git Repository Leaks Scanner Python library written in C++
+A Git Repository Secrets Scanner written in Rust
 </h3>
 </p>

@@ -19,7 +19,6 @@
 - [Built With](#built-with)
 - [Performance](#performance)
 - [CPU](#cpu)
-- [Prerequisites](#prerequisites)
 - [Installation](#installation)
 - [Documentation](#documentation)
 - [Usage](#usage)
@@ -29,37 +28,25 @@

 ## About The Project

-PyRepScan is a python library written in C++. The library uses [libgit2](https://github.com/libgit2/libgit2) for repository parsing and traversing, [re2](https://github.com/google/re2) for regex pattern matching and [taskflow](https://github.com/taskflow/taskflow) for concurrency. The library was written to achieve high performance and python bindings.
+PyRepScan is a python library written in Rust. The library uses [git2-rs](https://github.com/rust-lang/git2-rs) for repository parsing and traversing, [regex](https://github.com/rust-lang/regex) for regex pattern matching and [rayon](https://github.com/rayon-rs/rayon) for concurrency. The library was written to achieve high performance and python bindings.
@@ -82,62 +69,62 @@ This class holds all the added rules for fast reuse.
 def add_content_rule(
     self,
     name: str,
-    regex_pattern: str,
-    whitelist_regex_patterns: typing.List[str],
-    blacklist_regex_patterns: typing.List[str],
+    pattern: str,
+    whitelist_patterns: typing.List[str],
+    blacklist_patterns: typing.List[str],
 ) -> None
 ```
 The `add_content_rule` function adds a new rule to an internal list of rules that could be reused multiple times against different repositories. The same name can be used multiple times and would lead to results which can hold the same name. Content rule means that the regex pattern would be tested against the content of the files.
 - `name` - The name of the rule so it can be identified.
-- `regex_pattern` - The regex pattern (RE2 syntax) to match against the content of the committed files.
-- `whitelist_regex_patterns` - A list of regex patterns (RE2 syntax) to match against the content of the committed file to filter in results. Only one of the patterns should be matched to pass through the result. There is an OR relation between the patterns.
-- `blacklist_regex_patterns` - A list of regex patterns (RE2 syntax) to match against the content of the committed file to filter out results. Only one of the patterns should be matched to omit the result. There is an OR relation between the patterns.
+- `pattern` - The regex pattern (Rust Regex syntax) to match against the content of the committed files.
+- `whitelist_patterns` - A list of regex patterns (Rust Regex syntax) to match against the content of the committed file to filter in results. Only one of the patterns should be matched to pass through the result. There is an OR relation between the patterns.
+- `blacklist_patterns` - A list of regex patterns (Rust Regex syntax) to match against the content of the committed file to filter out results. Only one of the patterns should be matched to omit the result. There is an OR relation between the patterns.
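
The whitelist and blacklist semantics described for `add_content_rule` can be sketched in plain Python. This is only an illustration, not the library's implementation: it uses Python's `re` module in place of the Rust regex engine, the helper name `content_rule_matches` is hypothetical, and it assumes that an empty whitelist lets every match through and that blacklist patterns are checked first. Neither assumption is stated in the text.

```python
import re
import typing


def content_rule_matches(
    content: str,
    pattern: str,
    whitelist_patterns: typing.List[str],
    blacklist_patterns: typing.List[str],
) -> typing.List[str]:
    """Return the matches of `pattern` in `content` that survive filtering."""
    # OR relation: a single matching blacklist pattern omits the result.
    if any(re.search(p, content) for p in blacklist_patterns):
        return []

    # OR relation: a single matching whitelist pattern is enough to pass.
    # Assumption: an empty whitelist does not filter anything.
    if whitelist_patterns and not any(
        re.search(p, content) for p in whitelist_patterns
    ):
        return []

    return [match.group(0) for match in re.finditer(pattern, content)]


secret_pattern = r"password\s*=\s*\S+"
kept = content_rule_matches("password = hunter2", secret_pattern, [], [])
omitted = content_rule_matches(
    "# example config\npassword = hunter2", secret_pattern, [], [r"example"]
)
```

Here `kept` holds the single match, while `omitted` is empty because the blacklist pattern `example` matches the file content.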


 ```python
-def add_file_name_rule(
+def add_file_path_rule(
     self,
     name: str,
-    regex_pattern: str,
+    pattern: str,
 ) -> None
 ```
-The `add_file_name_rule` function adds a new rule to an internal list of rules that could be reused multiple times against different repositories. The same name can be used multiple times and would lead to results which can hold the same name. File name rule means that the regex pattern would be tested against the filenames.
+The `add_file_path_rule` function adds a new rule to an internal list of rules that could be reused multiple times against different repositories. The same name can be used multiple times and would lead to results which can hold the same name. File path rule means that the regex pattern would be tested against the filepaths.
 - `name` - The name of the rule so it can be identified.
-- `regex_pattern` - The regex pattern (RE2 syntax) to match against the filenames of the committed files.
+- `pattern` - The regex pattern (Rust Regex syntax) to match against the filepaths of the committed files.
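
As a rough illustration of what a file path rule does, the sketch below tests a regex against a list of paths. The helper `match_file_paths` is hypothetical, Python's `re` stands in for the Rust regex engine, and the unanchored search (a match anywhere in the path) is an assumption rather than documented behavior.

```python
import re
import typing


def match_file_paths(
    file_paths: typing.List[str],
    pattern: str,
) -> typing.List[str]:
    """Return the paths in which `pattern` matches."""
    # Assumption: the pattern may match anywhere in the path (unanchored).
    return [path for path in file_paths if re.search(pattern, path)]


paths = ["src/main.py", "deploy/keys/server.pem", "docs/readme.md"]
suspicious = match_file_paths(paths, r"\.(pem|key)$")
```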


 ```python
-def add_ignored_file_extension(
+def add_file_extension_to_skip(
     self,
     file_extension: str,
 ) -> None
 ```
-The `add_ignored_file_extension` function adds a new file extension to the filtering phase to reduce the amount of inspected files and to increase the performance of the scan.
+The `add_file_extension_to_skip` function adds a new file extension to the filtering phase to reduce the amount of inspected files and to increase the performance of the scan.
 - `file_extension` - A file extension, without a leading dot, to filter out from the scan.


 ```python
-def add_ignored_file_path(
+def add_file_path_to_skip(
     self,
     file_path: str,
 ) -> None
 ```
-The `add_ignored_file_path` function adds a new file pattern to the filtering phase to reduce the amount of inspected files and to increase the performance of the scan. Every file path that would include the `file_path` substring would be left out of the scanned files.
+The `add_file_path_to_skip` function adds a new file path pattern to the filtering phase to reduce the amount of inspected files and to increase the performance of the scan. Every file path that would include the `file_path` substring would be left out of the scanned files.
 - `file_path` - If the inspected file path would include this substring, it won't be scanned. This parameter is a free text.
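
The two skip rules (`add_file_extension_to_skip` and `add_file_path_to_skip`) can be pictured as a single predicate applied before a file is inspected. The sketch below is an assumption-laden illustration, not the library's code: `should_skip` is a hypothetical helper, and lowercasing both the inspected path and the configured values is one reading of the commit message's note that these rules are compared lowercased.

```python
import posixpath
import typing


def should_skip(
    file_path: str,
    extensions_to_skip: typing.Iterable[str],
    paths_to_skip: typing.Iterable[str],
) -> bool:
    """Decide whether a file is left out of the scan."""
    lowered_path = file_path.lower()

    # Extensions are configured without a leading dot, e.g. "png".
    extension = posixpath.splitext(lowered_path)[1].lstrip(".")
    if extension and extension in {ext.lower() for ext in extensions_to_skip}:
        return True

    # Any configured path that appears as a substring excludes the file.
    return any(part.lower() in lowered_path for part in paths_to_skip)


skip_image = should_skip("Assets/Logo.PNG", ["png"], [])
skip_vendor = should_skip("src/node_modules/x.js", [], ["node_modules"])
scan_source = should_skip("src/main.py", ["png"], ["node_modules"])
```

Filtering of this kind runs before any regex is evaluated, which is why the text frames it as a performance measure.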


 ```python
 def scan(
     self,
     repository_path: str,
-    branch_glob_pattern: '*',
-    from_timestamp: int=0,
+    branch_glob_pattern: typing.Optional[str],
+    from_timestamp: typing.Optional[int],
 ) -> typing.List[typing.Dict[str, str]]
 ```
 The `scan` function is the main function in the library. Calling this function would trigger a new scan that would return a list of matches. The scan function is a multithreaded operation that would utilize all the available cores in the system. The results would not include the file content but only the regex matching group. To retrieve the full file content one should take the `results['oid']` and call the `get_file_content` function.
 - `repository_path` - The git repository folder path.
-- `branch_glob_pattern` - A glob pattern to filter branches for the scan.
-- `from_timestamp` - A UTC timestamp (Int) that only commits that were created after this timestamp would be included in the scan.
+- `branch_glob_pattern` - A glob pattern to filter branches for the scan. If `None` is sent, defaults to `*`.
+- `from_timestamp` - A UTC timestamp (Int) that only commits that were created after this timestamp would be included in the scan. If `None` is sent, defaults to `0`.
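
The `None` defaults and the two filters of `scan` can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `select_commits` and the `commits_by_branch` mapping are made up for the example, Python's `fnmatch` stands in for whatever glob matching the library uses, and treating the timestamp boundary as exclusive is an assumption based on the word "after".

```python
import fnmatch
import typing


def select_commits(
    commits_by_branch: typing.Dict[str, typing.List[int]],
    branch_glob_pattern: typing.Optional[str],
    from_timestamp: typing.Optional[int],
) -> typing.Dict[str, typing.List[int]]:
    """Keep commit timestamps on matching branches newer than the cutoff."""
    # None falls back to the documented defaults: "*" and 0.
    glob_pattern = "*" if branch_glob_pattern is None else branch_glob_pattern
    min_timestamp = 0 if from_timestamp is None else from_timestamp

    selected = {}
    for branch, timestamps in commits_by_branch.items():
        if not fnmatch.fnmatchcase(branch, glob_pattern):
            continue
        # Assumption: the boundary is exclusive ("after this timestamp").
        selected[branch] = [ts for ts in timestamps if ts > min_timestamp]
    return selected


history = {
    "master": [1577836800, 1609459200],
    "feature/login": [1612137600],
}
everything = select_commits(history, None, None)
recent_master = select_commits(history, "mast*", 1600000000)
```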

 A sample result would look like this:
 ```python
@@ -157,6 +144,7 @@ A sample result would look like this: