From d4d05d10453992bc5028d37af9eeed5fe13d6a55 Mon Sep 17 00:00:00 2001 From: Alok Kumar Date: Wed, 27 Aug 2025 18:55:43 +0530 Subject: [PATCH 1/3] add gsoc25 report Signed-off-by: Alok Kumar --- docs/source/archive/gsoc-toc.rst | 8 + .../reports/2025/scancode_toolkit_alok.rst | 162 ++++++++++++++++++ 2 files changed, 170 insertions(+) create mode 100644 docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst diff --git a/docs/source/archive/gsoc-toc.rst b/docs/source/archive/gsoc-toc.rst index 421be09..f29f060 100755 --- a/docs/source/archive/gsoc-toc.rst +++ b/docs/source/archive/gsoc-toc.rst @@ -8,6 +8,14 @@ designed to encourage university student participation in open source software development. It was started by Google in 2005. More about GSoC - ``_ +GSoC 2025 +--------- + +.. toctree:: + :maxdepth: 2 + + gsoc/reports/2025/scancode_toolkit_alok + GSoC 2024 --------- diff --git a/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst new file mode 100644 index 0000000..dd573ee --- /dev/null +++ b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst @@ -0,0 +1,162 @@ +======================================================================== +Have variable license sections in license rules +======================================================================== + +**Organization:** `AboutCode `_ + +**Projects:** `Scancode Toolkit `_ + +**Mentee:** `Alok Kumar (alok1304) `_ + +**Mentors:** + +- `Philippe Ombredanne `_ +- `Ayan Sinha Mahapatra `_ + +Overview +-------- +This project aims to enhance the `detection_log` by clearly indicating when `extra-words` +are detected. These `extra-words` represent variable parts in the license rules, which +previously caused the match score to fall below 100. + +To address this issue, the implementation now verifies whether the `extra-words` +appear in the correct position within the license text. 
If they do, the score is +adjusted and improved accordingly, resulting in more accurate license rule matching. + +-------------------------------------------------------------------------------- + +Implementation +-------------- + +- **Enhanced the detection_log:** + + - Display `extra-words` when they are detected. + +- **Added extra-phrase marker like [[n]] for the extra-words:** + + - The `extra-phrase` is denoted by double opening square brackets ``[[`` + and double closing square brackets ``]]``. + - Here, `n` represents the maximum number of allowable `extra-words`. + - The `extra-phrase` ``[[n]]`` is inserted in license rules at positions + where `extra-words` may appear. + - The value of `n` specifies how many `extra-words` are permitted + at that location. + +- **Improve Score:** + + - Check whether `extra-words` appear in the correct position as defined by + the `extra-phrase`, and ensure they do not exceed the maximum allowable limit. + - If the conditions are satisfied, increase the match score to ``100``. + +- **Shows in detection_log:** + + - If the score is increased that means `extra-words` are in the correct + position, then show ``extra-words-permitted-in-rule`` in the `detection_log`. + - If the `extra-words` are at wrong place or exceed the maximum allowable limit, + then show ``extra-words`` in the `detection_log`. + +- **Testing:** + + - Added tests for the `extra-phrase` functionality, such as + `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that + phrases are correctly identified and processed. + - Implemented multiple tests to verify that `extra-words` appear in the correct + position according to the rules and that the match score is updated correctly + when they are within the allowable limit. + - Covered various edge cases where `extra-words` might be misplaced or exceed + the maximum allowable count, ensuring the scoring and logging behave as expected. + +Linked Pull Requests +-------------------- + +.. 
list-table:: + :widths: 10 60 30 10 + :header-rows: 1 + + * - Sr. no + - Name + - Link + - Status + * - 1 + - Display `extra-words` in `detection_log` if present + - `aboutcode.org/scancode-toolkit#4402 + `_ + - Merged + * - 2 + - Improve score by supporting `extra_phrase` for `extra-words` in rules + - `aboutcode.org/scancode-toolkit#4432 + `_ + - Open + +Related Issues +-------------- + +.. list-table:: + :widths: 10 60 30 + :header-rows: 1 + + * - Sr. no + - Name + - Link + * - 1 + - `extra-words` does not show up in detection_log properly + - `#4400 + `_ + * - 2 + - Improve score when `extra-words`` are found in the correct position + - `#4420 + `_ + +Pre GSoC Work +------------- + +Before GSoC, I had contributed the following PRs: + +- `Renaming the dependency attribute is_resolved to is_pinned + `_ +- `Add test for all PyPI METADATA versions + `_ +- `Add test for false positive GPL3 license + `_ +- `Add new rules for EUPL license + `_ +- `Add DUMB License and detection rule + `_ +- `Fixing the dead link by cross-reference in the documentation + `_ + +Post GSoC +--------- + +I plan to continue contributing by adding `extra-phrase` support across many +license rules. This will strengthen license detection by making it more accurate +and flexible in handling variations within the rules. + +Links +----- + +* `Project Idea + `_ + +* `Official GSoC project page + `_ + +* `GSoC Proposal + `_ + +* `Project Board `_ + +Acknowledgements +---------------- + +I would like to thank my mentors: + +- `Philippe Ombredanne`_ +- `Ayan Sinha Mahapatra`_ + +A special thanks to my mentors who always supported me throughout this journey. Whenever +I faced a problem, we discussed it in depth during our weekly status calls. Without +their guidance and constant help, completing this project would not have been possible. + +I also plan to explore more projects in AboutCode and contribute whenever I get +time, because I would love to remain a part of this wonderful organization. 
From 81212f32a4ad359bb344514494f575f431df1037 Mon Sep 17 00:00:00 2001 From: Alok Kumar Date: Wed, 27 Aug 2025 22:23:00 +0530 Subject: [PATCH 2/3] Update scancode_toolkit_alok.rst Signed-off-by: Alok Kumar --- .../reports/2025/scancode_toolkit_alok.rst | 324 +++++++++--------- 1 file changed, 162 insertions(+), 162 deletions(-) diff --git a/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst index dd573ee..b58f417 100644 --- a/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst +++ b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst @@ -1,162 +1,162 @@ -======================================================================== -Have variable license sections in license rules -======================================================================== - -**Organization:** `AboutCode `_ - -**Projects:** `Scancode Toolkit `_ - -**Mentee:** `Alok Kumar (alok1304) `_ - -**Mentors:** - -- `Philippe Ombredanne `_ -- `Ayan Sinha Mahapatra `_ - -Overview --------- -This project aims to enhance the `detection_log` by clearly indicating when `extra-words` -are detected. These `extra-words` represent variable parts in the license rules, which -previously caused the match score to fall below 100. - -To address this issue, the implementation now verifies whether the `extra-words` -appear in the correct position within the license text. If they do, the score is -adjusted and improved accordingly, resulting in more accurate license rule matching. - --------------------------------------------------------------------------------- - -Implementation --------------- - -- **Enhanced the detection_log:** - - - Display `extra-words` when they are detected. - -- **Added extra-phrase marker like [[n]] for the extra-words:** - - - The `extra-phrase` is denoted by double opening square brackets ``[[`` - and double closing square brackets ``]]``. 
- - Here, `n` represents the maximum number of allowable `extra-words`. - - The `extra-phrase` ``[[n]]`` is inserted in license rules at positions - where `extra-words` may appear. - - The value of `n` specifies how many `extra-words` are permitted - at that location. - -- **Improve Score:** - - - Check whether `extra-words` appear in the correct position as defined by - the `extra-phrase`, and ensure they do not exceed the maximum allowable limit. - - If the conditions are satisfied, increase the match score to ``100``. - -- **Shows in detection_log:** - - - If the score is increased that means `extra-words` are in the correct - position, then show ``extra-words-permitted-in-rule`` in the `detection_log`. - - If the `extra-words` are at wrong place or exceed the maximum allowable limit, - then show ``extra-words`` in the `detection_log`. - -- **Testing:** - - - Added tests for the `extra-phrase` functionality, such as - `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that - phrases are correctly identified and processed. - - Implemented multiple tests to verify that `extra-words` appear in the correct - position according to the rules and that the match score is updated correctly - when they are within the allowable limit. - - Covered various edge cases where `extra-words` might be misplaced or exceed - the maximum allowable count, ensuring the scoring and logging behave as expected. - -Linked Pull Requests --------------------- - -.. list-table:: - :widths: 10 60 30 10 - :header-rows: 1 - - * - Sr. no - - Name - - Link - - Status - * - 1 - - Display `extra-words` in `detection_log` if present - - `aboutcode.org/scancode-toolkit#4402 - `_ - - Merged - * - 2 - - Improve score by supporting `extra_phrase` for `extra-words` in rules - - `aboutcode.org/scancode-toolkit#4432 - `_ - - Open - -Related Issues --------------- - -.. list-table:: - :widths: 10 60 30 - :header-rows: 1 - - * - Sr. 
no - - Name - - Link - * - 1 - - `extra-words` does not show up in detection_log properly - - `#4400 - `_ - * - 2 - - Improve score when `extra-words`` are found in the correct position - - `#4420 - `_ - -Pre GSoC Work -------------- - -Before GSoC, I had contributed the following PRs: - -- `Renaming the dependency attribute is_resolved to is_pinned - `_ -- `Add test for all PyPI METADATA versions - `_ -- `Add test for false positive GPL3 license - `_ -- `Add new rules for EUPL license - `_ -- `Add DUMB License and detection rule - `_ -- `Fixing the dead link by cross-reference in the documentation - `_ - -Post GSoC ---------- - -I plan to continue contributing by adding `extra-phrase` support across many -license rules. This will strengthen license detection by making it more accurate -and flexible in handling variations within the rules. - -Links ------ - -* `Project Idea - `_ - -* `Official GSoC project page - `_ - -* `GSoC Proposal - `_ - -* `Project Board `_ - -Acknowledgements ----------------- - -I would like to thank my mentors: - -- `Philippe Ombredanne`_ -- `Ayan Sinha Mahapatra`_ - -A special thanks to my mentors who always supported me throughout this journey. Whenever -I faced a problem, we discussed it in depth during our weekly status calls. Without -their guidance and constant help, completing this project would not have been possible. - -I also plan to explore more projects in AboutCode and contribute whenever I get -time, because I would love to remain a part of this wonderful organization. 
+======================================================================== +Have variable license sections in license rules +======================================================================== + +**Organization:** `AboutCode `_ + +**Projects:** `Scancode Toolkit `_ + +**Mentee:** `Alok Kumar (alok1304) `_ + +**Mentors:** + +- `Philippe Ombredanne `_ +- `Ayan Sinha Mahapatra `_ + +Overview +-------- +This project aims to enhance the `detection_log` by clearly indicating when `extra-words` +are detected. These `extra-words` represent variable parts in the license rules, which +previously caused the match score to fall below 100. + +To address this issue, the implementation now verifies whether the `extra-words` +appear in the correct position within the license text. If they do, the score is +adjusted and improved accordingly, resulting in more accurate license rule matching. + +-------------------------------------------------------------------------------- + +Implementation +-------------- + +- **Enhanced the detection_log:** + + - Display `extra-words` when they are detected. + +- **Added extra-phrase marker like [[n]] for the extra-words:** + + - The `extra-phrase` is denoted by double opening square brackets ``[[`` + and double closing square brackets ``]]``. + - Here, `n` represents the maximum number of allowable `extra-words`. + - The `extra-phrase` ``[[n]]`` is inserted in license rules at positions + where `extra-words` may appear. + - The value of `n` specifies how many `extra-words` are permitted + at that location. + +- **Improve Score:** + + - Check whether `extra-words` appear in the correct position as defined by + the `extra-phrase`, and ensure they do not exceed the maximum allowable limit. + - If the conditions are satisfied, increase the match score to ``100``. 
+ +- **Shown in detection_log:** + + - If the score is increased, meaning the `extra-words` are in the correct + position, show ``extra-words-permitted-in-rule`` in the `detection_log`. + - If the `extra-words` are in the wrong place or exceed the maximum allowable limit, + show ``extra-words`` in the `detection_log`. + +- **Testing:** + + - Added tests for the `extra-phrase` functionality, such as + `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that + phrases are correctly identified and processed. + - Implemented multiple tests to verify that `extra-words` appear in the correct + position according to the rules and that the match score is updated correctly + when they are within the allowable limit. + - Covered various edge cases where `extra-words` might be misplaced or exceed + the maximum allowable count, ensuring the scoring and logging behave as expected. + +Linked Pull Requests +-------------------- + +.. list-table:: + :widths: 10 60 30 10 + :header-rows: 1 + + * - Sr. no + - Name + - Link + - Status + * - 1 + - Display `extra-words` in `detection_log` if present + - `aboutcode.org/scancode-toolkit#4402 + `_ + - Merged + * - 2 + - Improve score by supporting `extra_phrase` for `extra-words` in rules + - `aboutcode.org/scancode-toolkit#4432 + `_ + - Open + +Related Issues +-------------- + +.. list-table:: + :widths: 10 60 30 + :header-rows: 1 + + * - Sr. 
no + - Name + - Link + * - 1 + - `extra-words` does not show up in detection_log properly + - `#4400 + `_ + * - 2 + - Improve score when `extra-words` are found in the correct position + - `#4420 + `_ + +Pre GSoC Work +------------- + +Before GSoC, I had contributed the following PRs: + +- `Renaming the dependency attribute is_resolved to is_pinned + `_ +- `Add test for all PyPI METADATA versions + `_ +- `Add test for false positive GPL3 license + `_ +- `Add new rules for EUPL license + `_ +- `Add DUMB License and detection rule + `_ +- `Fixing the dead link by cross-reference in the documentation + `_ + +Post GSoC +--------- + +I plan to continue contributing by adding `extra-phrase` support across many +license rules. This will strengthen license detection by making it more accurate +and flexible in handling variations within the rules. + +Links +----- + +* `Project Idea + `_ + +* `Official GSoC project page + `_ + +* `GSoC Proposal + `_ + +* `Project Board `_ + +Acknowledgements +---------------- + +I would like to thank my mentors: + +- `Philippe Ombredanne`_ +- `Ayan Sinha Mahapatra`_ + +A special thanks to my mentors who always supported me throughout this journey. Whenever +I faced a problem, we discussed it in depth during our weekly status calls. Without +their guidance and constant help, completing this project would not have been possible. + +I also plan to explore more projects in AboutCode and contribute whenever I get +time, because I would love to remain a part of this wonderful organization. 
From ab0eb747c9e6a8b3e7c36c390ecb8968b3a6243e Mon Sep 17 00:00:00 2001 From: Alok Kumar Date: Thu, 28 Aug 2025 19:05:23 +0530 Subject: [PATCH 3/3] updated gsoc report Signed-off-by: Alok Kumar --- .../reports/2025/scancode_toolkit_alok.rst | 63 +++++++++++++++---- 1 file changed, 51 insertions(+), 12 deletions(-) diff --git a/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst index b58f417..1694a3b 100644 --- a/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst +++ b/docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst @@ -66,6 +66,8 @@ Implementation - Covered various edge cases where `extra-words` might be misplaced or exceed the maximum allowable count, ensuring the scoring and logging behave as expected. +-------------------------------------------------------------------------------- + Linked Pull Requests -------------------- @@ -87,6 +89,11 @@ Linked Pull Requests - `aboutcode.org/scancode-toolkit#4432 `_ - Open + * - 3 + - Add extra-phrase in rules + - `aboutcode.org/scancode-toolkit#4518 + `_ + - Open Related Issues -------------- @@ -112,18 +119,45 @@ Pre GSoC Work Before GSoC, I had contributed the following PRs: -- `Renaming the dependency attribute is_resolved to is_pinned - `_ -- `Add test for all PyPI METADATA versions - `_ -- `Add test for false positive GPL3 license - `_ -- `Add new rules for EUPL license - `_ -- `Add DUMB License and detection rule - `_ -- `Fixing the dead link by cross-reference in the documentation - `_ +.. list-table:: + :widths: 10 60 30 + :header-rows: 1 + + * - Sr. 
no + - Name + - Link + * - 1 + - Renaming the dependency attribute `is_resolved` to `is_pinned` + - `aboutcode-org/scancode-workbench#638 + `_ + * - 2 + - Add test for all PyPI METADATA versions + - `aboutcode-org/scancode-toolkit#4180 + `_ + * - 3 + - Add test for false positive GPL3 license + - `aboutcode-org/scancode-toolkit#4106 + `_ + * - 4 + - Add new rules for EUPL license + - `aboutcode-org/scancode-toolkit#4204 + `_ + * - 5 + - Add DUMB License and detection rule + - `aboutcode-org/scancode-toolkit#4400 + `_ + * - 6 + - Fixing the dead link by cross-reference in the documentation + - `aboutcode-org/purldb#550 + `_ + * - 7 + - Add test for equivalent word + - `aboutcode-org/scancode-toolkit#4305 + `_ + * - 8 + - Enhance code visibility in dark mode + - `aboutcode-org/scancode-workbench#637 + `_ Post GSoC --------- @@ -132,6 +166,11 @@ I plan to continue contributing by adding `extra-phrase` support across many license rules. This will strengthen license detection by making it more accurate and flexible in handling variations within the rules. +To identify named entities in rules, I created a new repository, +`named-entity-utils `_, which I am +currently working on. This utility is used to add `extra-phrase` markers in rules +at positions where named entities are present. + Links -----