Gal Ben David · Gal Ben David · commit a3f062523a79 · 2020-08-30T23:20:50.000+03:00

- added from_timestamp parameter to allow incremental scanning as needed
  in CI pipelines. this parameter helps to reduce the time it takes
  to scan the whole repository.
- README.md file now includes full documentation.
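
For incremental CI scans, `from_timestamp` has to be an integer UTC epoch timestamp. A minimal sketch of how a pipeline might derive one is shown here; `timestamp_hours_ago` is a hypothetical helper and not part of PyRepScan, and the 24-hour window is only an assumption for illustration.

```python
import datetime


def timestamp_hours_ago(hours, now=None):
    """Return an int UTC epoch timestamp for `hours` before `now`.

    Hypothetical helper for illustration; not part of PyRepScan.
    """
    if now is None:
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    return int((now - datetime.timedelta(hours=hours)).timestamp())


# Assumed CI usage: only scan commits from the last 24 hours.
# import pyrepscan
# grs = pyrepscan.GitRepositoryScanner()
# results = grs.scan(
#     repository_path='/repository/path',
#     branch_glob_pattern='*',
#     from_timestamp=timestamp_hours_ago(hours=24),
# )
```

Passing `from_timestamp=0` keeps the previous behaviour of scanning the whole history.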
diff --git a/README.md b/README.md
@@ -21,6 +21,7 @@
     - [CPU](#cpu)
   - [Prerequisites](#prerequisites)
   - [Installation](#installation)
+- [Documentation](#documentation)
 - [Usage](#usage)
 - [License](#license)
 - [Contact](#contact)
@@ -66,6 +67,93 @@ pip3 install PyRepScan
 ```
 
 
+## Documentation
+
+```python
+class GitRepositoryScanner:
+    def __init__(
+        self,
+    ) -> None
+```
+This class holds all the added rules for fast reuse.
+
+
+```python
+def add_rule(
+    self,
+    name: str,
+    match_pattern: str,
+    match_whitelist_patterns: typing.List[str],
+    match_blacklist_patterns: typing.List[str],
+) -> None
+```
+The `add_rule` function adds a new rule to an internal list of rules that can be reused across scans of different repositories. Rule names do not have to be unique: if the same name is added more than once, the results of each of those rules carry that name.
+- `name` - The name of the rule so it can be identified.
+- `match_pattern` - The regex pattern (RE2 syntax) to match against the content of the committed files.
+- `match_whitelist_patterns` - A list of regex patterns (RE2 syntax) to match against the content of the committed file to filter in results. A result passes through if at least one of the patterns matches. There is an OR relation between the patterns.
+- `match_blacklist_patterns` - A list of regex patterns (RE2 syntax) to match against the content of the committed file to filter out results. A result is omitted if at least one of the patterns matches. There is an OR relation between the patterns.
+
+
+```python
+def add_ignored_file_extension(
+    self,
+    file_extension: str,
+) -> None
+```
+The `add_ignored_file_extension` function adds a file extension to the filtering phase, which reduces the number of inspected files and speeds up the scan.
+- `file_extension` - A file extension, without a leading dot, to filter out from the scan.
+
+
+```python
+def add_ignored_file_path(
+    self,
+    file_path: str,
+) -> None
+```
+The `add_ignored_file_path` function adds a file path pattern to the filtering phase, which reduces the number of inspected files and speeds up the scan. Any file whose path contains the `file_path` substring is left out of the scan.
+- `file_path` - A free-text substring. An inspected file whose path contains it is not scanned.
+
+
+```python
+def scan(
+    self,
+    repository_path: str,
+    branch_glob_pattern: str,
+    from_timestamp: int,
+) -> typing.List[typing.Dict[str, str]]
+```
+The `scan` function is the main function in the library. Calling it triggers a new scan and returns a list of matches. The scan is a multithreaded operation that utilizes all the available cores in the system. The results do not include the file content, only the regex matching group. To retrieve the full file content, take the `file_oid` value from a result and pass it to the `get_file_content` function.
+- `repository_path` - The git repository folder path.
+- `branch_glob_pattern` - A glob pattern to filter branches for the scan.
+- `from_timestamp` - A UTC timestamp (int, seconds since the Unix epoch). Only commits created at or after this timestamp are included in the scan.
+
+A sample result would look like this:
+```python
+{
+    'rule_name': 'First Rule',
+    'author_email': 'author@email.email',
+    'author_name': 'Author Name',
+    'commit_id': '1111111111111111111111111111111111111111',
+    'commit_message': 'The commit message',
+    'commit_time': '2020-01-01T00:00:00',
+    'file_path': 'full/file/path',
+    'file_oid': '47d2739ba2c34690248c8f91b84bb54e8936899a',
+    'match': 'The matched group',
+}
+```
+
+
+```python
+def get_file_content(
+    repository_path: str,
+    file_oid: str,
+) -> bytes
+```
+The `get_file_content` function exists to retrieve the content of a file that was previously matched. The full file content is omitted from the results to reduce the results list size and to deliver better performance.
+- `repository_path` - The git repository folder path.
+- `file_oid` - A string representing the file oid. This parameter exists in the results dictionary returned by the `scan` function.
+
+
 ## Usage
 
 ```python
@@ -101,6 +189,7 @@ grs.add_ignored_file_path(
 results = grs.scan(
     repository_path='/repository/path',
     branch_glob_pattern='*',
+    from_timestamp=0,
 )
 
 # Results is a list of dicts. Each dict is in the following format:
diff --git a/setup.py b/setup.py
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name='PyRepScan',
-    version='0.4.1',
+    version='0.5.0',
     author='Gal Ben David',
     author_email='gal@intsights.com',
     url='https://github.com/intsights/PyRepScan',
diff --git a/src/git_repository_scanner.cpp b/src/git_repository_scanner.cpp
@@ -47,7 +47,8 @@ class GitRepositoryScanner {
 
     std::vector<git_oid> get_oids(
         git_repository * git_repo,
-        std::string branch_glob_pattern
+        std::string branch_glob_pattern,
+        std::int64_t from_timestamp
     ) {
         git_revwalk * repo_revwalk = nullptr;
         git_oid oid;
@@ -57,8 +58,14 @@ class GitRepositoryScanner {
         git_revwalk_sorting(repo_revwalk, GIT_SORT_TIME);
         git_revwalk_push_glob(repo_revwalk, branch_glob_pattern.c_str());
 
+        git_commit * current_commit = nullptr;
         while (git_revwalk_next(&oid, repo_revwalk) == 0) {
-            oids.push_back(oid);
+            git_commit_lookup(&current_commit, git_repo, &oid);
+            git_time_t commit_time = git_commit_time(current_commit);
+            if (commit_time >= from_timestamp) {
+                oids.push_back(oid);
+            }
+            git_commit_free(current_commit);
         }
 
         git_revwalk_free(repo_revwalk);
@@ -175,7 +182,8 @@ class GitRepositoryScanner {
 
     std::vector<std::map<std::string, std::string>> scan(
         std::string repository_path,
-        std::string branch_glob_pattern
+        std::string branch_glob_pattern,
+        std::int64_t from_timestamp
     ) {
         git_repository * git_repo = nullptr;
         if (0 != git_repository_open(&git_repo, repository_path.c_str())) {
@@ -185,7 +193,8 @@ class GitRepositoryScanner {
         std::vector<std::map<std::string, std::string>> results;
         std::vector<git_oid> oids = this->get_oids(
             git_repo,
-            branch_glob_pattern
+            branch_glob_pattern,
+            from_timestamp
         );
 
         tf::Taskflow taskflow;
@@ -302,7 +311,8 @@ PYBIND11_MODULE(pyrepscan, m) {
             &GitRepositoryScanner::scan,
             "Scan a repository for secrets",
             pybind11::arg("repository_path"),
-            pybind11::arg("branch_glob_pattern")
+            pybind11::arg("branch_glob_pattern"),
+            pybind11::arg("from_timestamp")
         )
         .def(
             "add_rule",
diff --git a/tests/test_pyrepscan.py b/tests/test_pyrepscan.py
@@ -1,6 +1,7 @@
 import unittest
 import tempfile
 import git
+import datetime
 
 import pyrepscan
 
@@ -468,6 +469,7 @@ def test_scan(
             results = grs.scan(
                 repository_path=tmpdir,
                 branch_glob_pattern='*master',
+                from_timestamp=0,
             )
             for result in results:
                 result.pop('commit_id')
@@ -510,6 +512,7 @@ def test_scan(
             results = grs.scan(
                 repository_path=tmpdir,
                 branch_glob_pattern='*',
+                from_timestamp=0,
             )
             for result in results:
                 result.pop('commit_id')
@@ -580,3 +583,58 @@ def test_scan(
                     file_oid='6b584e8ece562ebffc15d38808cd6b98fc3d97ea',
                 ),
             )
+
+            results = grs.scan(
+                repository_path=tmpdir,
+                branch_glob_pattern='*',
+                from_timestamp=int(
+                    datetime.datetime(
+                        year=2004,
+                        month=1,
+                        day=1,
+                        hour=0,
+                        minute=0,
+                        second=0,
+                        tzinfo=datetime.timezone.utc,
+                    ).timestamp()
+                ),
+            )
+            for result in results:
+                result.pop('commit_id')
+            self.assertCountEqual(
+                first=results,
+                second=[
+                    {
+                        'author_email': 'test@author.email',
+                        'author_name': 'Author Name',
+                        'commit_message': 'edited file in non_merged_branch',
+                        'commit_time': '2004-01-01T00:00:00',
+                        'file_oid': '057032a2108721ad1de6a9240fd1a8f45bc3f2ef',
+                        'file_path': 'file.txt',
+                        'match': 'content',
+                        'rule_name': 'First Rule'
+                    },
+                ],
+            )
+
+            results = grs.scan(
+                repository_path=tmpdir,
+                branch_glob_pattern='*',
+                from_timestamp=int(
+                    datetime.datetime(
+                        year=2004,
+                        month=1,
+                        day=1,
+                        hour=0,
+                        minute=0,
+                        second=1,
+                        tzinfo=datetime.timezone.utc,
+                    ).timestamp()
+                ),
+            )
+            for result in results:
+                result.pop('commit_id')
+            self.assertListEqual(
+                list1=results,
+                list2=[],
+            )
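
The two test cases added above pin down the boundary behaviour of `from_timestamp`: a commit stamped exactly at the cutoff is kept, and a cutoff one second later excludes it. A minimal Python model of that comparison, assuming it mirrors the `commit_time >= from_timestamp` check in `get_oids`, could look like this; `keep_commit` and `utc_timestamp` are hypothetical names used only for illustration.

```python
import datetime


def keep_commit(commit_time, from_timestamp):
    """Hypothetical model of the inclusive check done in get_oids."""
    return commit_time >= from_timestamp


def utc_timestamp(**kwargs):
    """Build an int UTC epoch timestamp the same way the test does."""
    moment = datetime.datetime(tzinfo=datetime.timezone.utc, **kwargs)
    return int(moment.timestamp())


# The commit in the test is stamped 2004-01-01T00:00:00 UTC.
commit_time = utc_timestamp(year=2004, month=1, day=1)

# Cutoff equal to the commit time: the commit is still included.
at_cutoff = keep_commit(
    commit_time,
    utc_timestamp(year=2004, month=1, day=1, second=0),
)

# Cutoff one second later: the commit is filtered out.
past_cutoff = keep_commit(
    commit_time,
    utc_timestamp(year=2004, month=1, day=1, second=1),
)
```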