Commit a3f0625

Author: Gal Ben David

- Added the `from_timestamp` parameter to allow incremental scanning as needed in CI pipelines. It reduces scan time by avoiding a rescan of the whole repository.
- The README.md file now includes full documentation.
1 parent 897afa4 commit a3f0625

4 files changed

+163
-6
lines changed

README.md

Lines changed: 89 additions & 0 deletions
@@ -21,6 +21,7 @@
 - [CPU](#cpu)
 - [Prerequisites](#prerequisites)
 - [Installation](#installation)
+- [Documentation](#documentation)
 - [Usage](#usage)
 - [License](#license)
 - [Contact](#contact)
@@ -66,6 +67,93 @@ pip3 install PyRepScan
 ```
 
 
+## Documentation
+
+```python
+class GitRepositoryScanner:
+    def __init__(
+        self,
+    ) -> None
+```
+This class holds all the added rules for fast reuse.
+
+
81+
```python
82+
def add_rule(
83+
self,
84+
name: str,
85+
match_pattern: str,
86+
match_whitelist_patterns: typing.List[str],
87+
match_blacklist_patterns: typing.List[str],
88+
) -> None
89+
```
90+
The `add_rule` function adds a new rule to an internal list of rules that could be reused multiple times against different repositories. The same name can be used multiple times and would lead to results which can hold the same name.
91+
- `name` - The name of the rule so it can be identified.
92+
- `match_pattern` - The regex pattern (RE2 syntax) to match against the content of the commited files.
93+
- `match_whitelist_patterns` - A list of regex patterns (RE2 syntax) to match against the content of the committed file to filter in results. Only one of the patterns should be matched to pass through the result. There is an OR relation between the patterns.
94+
- `match_blacklist_patterns` - A list of regex patterns (RE2 syntax) to match against the content of the committed file to filter out results. Only one of the patterns should be matched to omit the result. There is an OR relation between the patterns.
95+
96+
97+
```python
98+
def add_ignored_file_extension(
99+
self,
100+
file_extension: str,
101+
) -> None
102+
```
103+
The `add_ignored_file_extension` function adds a new file extension to the filtering phase to reduce the amount of inspected files and to increase the performance of the scan.
104+
- `file_extension` - A file extension, without a leading dot, to filter out from the scan.
105+
106+
107+
```python
108+
def add_ignored_file_path(
109+
self,
110+
file_path: str,
111+
) -> None
112+
```
113+
The `add_ignored_file_path` function adds a new file pattern to the filtering phase to reduce the amount of inspected files and to increase the performance of the scan. Every file path that would include the `file_path` substring would be left out of the scanned files.
114+
- `file_path` - If the inspected file path would include this substring, it won't be scanned. This parameter is a free text.
115+
116+
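To make the two ignore filters concrete, here is a pure-Python model of the behaviour the text describes. It does not call the library. The function name `is_ignored` is hypothetical, and two details are assumptions: the extension is derived with `os.path.splitext`, and comparisons are exact and case-sensitive. This commit does not show how the C++ side does either.

```python
import os
import typing


def is_ignored(
    file_path: str,
    ignored_extensions: typing.Collection[str],
    ignored_path_substrings: typing.Collection[str],
) -> bool:
    """Model of the filtering phase: True means the file would be skipped."""
    # Extensions are registered without a leading dot, so strip it here.
    extension = os.path.splitext(file_path)[1].lstrip('.')
    if extension and extension in ignored_extensions:
        return True

    # A path is ignored when it contains any registered substring.
    return any(
        substring in file_path
        for substring in ignored_path_substrings
    )
```

Under this model, registering `png` skips `assets/logo.png`, and registering `node_modules/` skips every path that contains that directory name.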
+```python
+def scan(
+    self,
+    repository_path: str,
+    branch_glob_pattern: str,
+    from_timestamp: int,
+) -> typing.List[typing.Dict[str, str]]
+```
+The `scan` function is the main function in the library. Calling it triggers a new scan and returns a list of matches. The scan is multithreaded and uses all the available cores in the system. The results do not include the file content, only the regex matching group. To retrieve the full file content, pass a result's `file_oid` to the `get_file_content` function.
+- `repository_path` - The git repository folder path.
+- `branch_glob_pattern` - A glob pattern to filter branches for the scan.
+- `from_timestamp` - A UTC timestamp (int). Only commits created at or after this timestamp are included in the scan.
+
+A sample result would look like this:
+```python
+{
+    'rule_name': 'First Rule',
+    'author_email': '[email protected]',
+    'author_name': 'Author Name',
+    'commit_id': '1111111111111111111111111111111111111111',
+    'commit_message': 'The commit message',
+    'commit_time': '2020-01-01T00:00:00',
+    'file_path': 'full/file/path',
+    'file_oid': '47d2739ba2c34690248c8f91b84bb54e8936899a',
+    'match': 'The matched group',
+}
+```
+
+
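For the incremental use case named in the commit message, a minimal sketch of deriving `from_timestamp` from a cutoff moment could look like the following. The helper names `to_utc_timestamp` and `scan_since` are hypothetical. The scanner is passed in, so any object with the documented `scan` keyword arguments would fit. The C++ change in this commit keeps commits where `commit_time >= from_timestamp`, so a commit made exactly at the cutoff is still included.

```python
import datetime
import typing


def to_utc_timestamp(moment: datetime.datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted in local time, which makes
        # the cutoff depend on the machine running the scan.
        raise ValueError('expected a timezone-aware datetime')
    return int(moment.timestamp())


def scan_since(
    scanner,
    repository_path: str,
    since: datetime.datetime,
    branch_glob_pattern: str = '*',
) -> typing.List[typing.Dict[str, str]]:
    """Scan only commits created at or after `since`."""
    return scanner.scan(
        repository_path=repository_path,
        branch_glob_pattern=branch_glob_pattern,
        from_timestamp=to_utc_timestamp(since),
    )
```

In a CI job, `since` might be the time of the last successful scan, stored wherever the pipeline keeps its state (an assumption about the pipeline, not something the library provides). For 2004-01-01T00:00:00 UTC, the cutoff used by the test in this commit, `to_utc_timestamp` returns `1072915200`.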
+```python
+def get_file_content(
+    repository_path: str,
+    file_oid: str,
+) -> bytes
+```
+The `get_file_content` function retrieves the content of a file that was previously matched. The full file content is omitted from the scan results to keep the results list small and the scan fast.
+- `repository_path` - The git repository folder path.
+- `file_oid` - A string representing the file oid. This value is available under the `file_oid` key of each result dictionary returned by the `scan` function.
+
+
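One possible way to combine `scan` results with `get_file_content` is sketched below. The helper `collect_matched_files` is hypothetical. It assumes each result carries a `file_oid` key, as in the sample result, and it fetches each blob once even when several results point at the same file.

```python
import typing


def collect_matched_files(
    scanner,
    repository_path: str,
    results: typing.Iterable[typing.Mapping[str, str]],
) -> typing.Dict[str, bytes]:
    """Map each distinct file_oid in the results to its raw file content."""
    contents: typing.Dict[str, bytes] = {}
    for result in results:
        file_oid = result['file_oid']
        if file_oid in contents:
            # Several matches can share one blob; fetch it only once.
            continue
        contents[file_oid] = scanner.get_file_content(
            repository_path=repository_path,
            file_oid=file_oid,
        )
    return contents
```

The values are `bytes`, matching the documented return type, so decoding (for example with `errors='replace'`) is left to the caller.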
 ## Usage
 
 ```python
@@ -101,6 +189,7 @@ grs.add_ignored_file_path(
 results = grs.scan(
     repository_path='/repository/path',
     branch_glob_pattern='*',
+    from_timestamp=0,
 )
 
 # Results is a list of dicts. Each dict is in the following format:

setup.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name='PyRepScan',
-    version='0.4.1',
+    version='0.5.0',
     author='Gal Ben David',
     author_email='[email protected]',
     url='https://github.com/intsights/PyRepScan',

src/git_repository_scanner.cpp

Lines changed: 15 additions & 5 deletions
@@ -47,7 +47,8 @@ class GitRepositoryScanner {
 
     std::vector<git_oid> get_oids(
         git_repository * git_repo,
-        std::string branch_glob_pattern
+        std::string branch_glob_pattern,
+        std::int64_t from_timestamp
     ) {
         git_revwalk * repo_revwalk = nullptr;
         git_oid oid;
@@ -57,8 +58,14 @@ class GitRepositoryScanner {
         git_revwalk_sorting(repo_revwalk, GIT_SORT_TIME);
         git_revwalk_push_glob(repo_revwalk, branch_glob_pattern.c_str());
 
+        git_commit * current_commit = nullptr;
         while (git_revwalk_next(&oid, repo_revwalk) == 0) {
-            oids.push_back(oid);
+            git_commit_lookup(&current_commit, git_repo, &oid);
+            git_time_t commit_time = git_commit_time(current_commit);
+            if (commit_time >= from_timestamp) {
+                oids.push_back(oid);
+            }
+            git_commit_free(current_commit);
         }
 
         git_revwalk_free(repo_revwalk);
@@ -175,7 +182,8 @@ class GitRepositoryScanner {
 
     std::vector<std::map<std::string, std::string>> scan(
         std::string repository_path,
-        std::string branch_glob_pattern
+        std::string branch_glob_pattern,
+        std::int64_t from_timestamp
     ) {
         git_repository * git_repo = nullptr;
         if (0 != git_repository_open(&git_repo, repository_path.c_str())) {
@@ -185,7 +193,8 @@ class GitRepositoryScanner {
         std::vector<std::map<std::string, std::string>> results;
         std::vector<git_oid> oids = this->get_oids(
             git_repo,
-            branch_glob_pattern
+            branch_glob_pattern,
+            from_timestamp
         );
 
         tf::Taskflow taskflow;
@@ -302,7 +311,8 @@ PYBIND11_MODULE(pyrepscan, m) {
             &GitRepositoryScanner::scan,
             "Scan a repository for secrets",
             pybind11::arg("repository_path"),
-            pybind11::arg("branch_glob_pattern")
+            pybind11::arg("branch_glob_pattern"),
+            pybind11::arg("from_timestamp")
         )
         .def(
             "add_rule",

tests/test_pyrepscan.py

Lines changed: 58 additions & 0 deletions
@@ -1,6 +1,7 @@
 import unittest
 import tempfile
 import git
+import datetime
 
 import pyrepscan
 
@@ -468,6 +469,7 @@ def test_scan(
             results = grs.scan(
                 repository_path=tmpdir,
                 branch_glob_pattern='*master',
+                from_timestamp=0,
             )
             for result in results:
                 result.pop('commit_id')
@@ -510,6 +512,7 @@ def test_scan(
             results = grs.scan(
                 repository_path=tmpdir,
                 branch_glob_pattern='*',
+                from_timestamp=0,
             )
             for result in results:
                 result.pop('commit_id')
@@ -580,3 +583,58 @@ def test_scan(
                     file_oid='6b584e8ece562ebffc15d38808cd6b98fc3d97ea',
                 ),
             )
+
+            results = grs.scan(
+                repository_path=tmpdir,
+                branch_glob_pattern='*',
+                from_timestamp=int(
+                    datetime.datetime(
+                        year=2004,
+                        month=1,
+                        day=1,
+                        hour=0,
+                        minute=0,
+                        second=0,
+                        tzinfo=datetime.timezone.utc,
+                    ).timestamp()
+                ),
+            )
+            for result in results:
+                result.pop('commit_id')
+            self.assertCountEqual(
+                first=results,
+                second=[
+                    {
+                        'author_email': '[email protected]',
+                        'author_name': 'Author Name',
+                        'commit_message': 'edited file in non_merged_branch',
+                        'commit_time': '2004-01-01T00:00:00',
+                        'file_oid': '057032a2108721ad1de6a9240fd1a8f45bc3f2ef',
+                        'file_path': 'file.txt',
+                        'match': 'content',
+                        'rule_name': 'First Rule'
+                    },
+                ],
+            )
+
+            results = grs.scan(
+                repository_path=tmpdir,
+                branch_glob_pattern='*',
+                from_timestamp=int(
+                    datetime.datetime(
+                        year=2004,
+                        month=1,
+                        day=1,
+                        hour=0,
+                        minute=0,
+                        second=1,
+                        tzinfo=datetime.timezone.utc,
+                    ).timestamp()
+                ),
+            )
+            for result in results:
+                result.pop('commit_id')
+            self.assertListEqual(
+                list1=results,
+                list2=[],
+            )
