1 change: 1 addition & 0 deletions docs/source/archive/gsoc-toc.rst
@@ -14,6 +14,7 @@ GSoC 2025
.. toctree::
   :maxdepth: 2

   gsoc/reports/2025/scancode_toolkit_alok
   gsoc/reports/2025/vulnerablecode_michael

GSoC 2024
201 changes: 201 additions & 0 deletions docs/source/archive/gsoc/reports/2025/scancode_toolkit_alok.rst
@@ -0,0 +1,201 @@
========================================================================
Have variable license sections in license rules
========================================================================

**Organization:** `AboutCode <https://aboutcode.org>`_

**Projects:** `ScanCode Toolkit <https://github.com/aboutcode-org/scancode-toolkit>`_

**Mentee:** `Alok Kumar (alok1304) <https://github.com/alok1304>`_

**Mentors:**

- `Philippe Ombredanne <https://github.com/pombredanne>`_
- `Ayan Sinha Mahapatra <https://github.com/AyanSinhaMahapatra>`_

Overview
--------
This project aims to enhance the `detection_log` by clearly indicating when `extra-words`
are detected. These `extra-words` represent variable parts in the license rules, which
previously caused the match score to fall below 100.

To address this issue, the implementation now verifies whether the `extra-words`
appear in the correct position within the license text. If they do, the score is
adjusted and improved accordingly, resulting in more accurate license rule matching.

--------------------------------------------------------------------------------

Implementation
--------------

- **Enhanced the detection_log:**

  - Display `extra-words` when they are detected.

- **Added an extra-phrase marker, written [[n]], for the extra-words:**

  - The `extra-phrase` is delimited by double opening square brackets ``[[``
    and double closing square brackets ``]]``.
  - The `extra-phrase` ``[[n]]`` is inserted in license rules at positions
    where `extra-words` may appear.
  - The value of `n` is the maximum number of `extra-words` permitted
    at that location.

- **Improved the score:**

  - Check whether `extra-words` appear in the correct position as defined by
    the `extra-phrase`, and ensure they do not exceed the maximum allowable limit.
  - If both conditions are satisfied, increase the match score to ``100``.

- **Reported the outcome in the detection_log:**

  - If the score was increased, meaning the `extra-words` are in the correct
    position and within the limit, show ``extra-words-permitted-in-rule`` in
    the `detection_log`.
  - If the `extra-words` are in the wrong place or exceed the maximum allowable
    limit, show ``extra-words`` in the `detection_log`.

- **Testing:**

  - Added tests for the `extra-phrase` functionality, such as
    `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that
    phrases are correctly identified and processed.
  - Implemented multiple tests to verify that `extra-words` appear in the correct
    position according to the rules and that the match score is updated correctly
    when they are within the allowable limit.
  - Covered edge cases where `extra-words` are misplaced or exceed the maximum
    allowable count, ensuring that scoring and logging behave as expected.
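
The checks described above can be illustrated with a small, self-contained
sketch. This is not the ScanCode Toolkit implementation: the function names
(``parse_rule``, ``find_extra_words``, ``evaluate``), the greedy word alignment,
and the fallback score formula are all assumptions made for illustration. Only
the ``[[n]]`` marker syntax, the score of ``100``, and the two `detection_log`
entries come from this report.

```python
import re

# Marker syntax from the report: [[n]] permits up to n extra words at that spot.
MARKER = re.compile(r"\[\[(\d+)\]\]")
# Tokenizer for this sketch only: either a marker or a run of word characters.
TOKEN = re.compile(r"\[\[\d+\]\]|\w+")


def parse_rule(rule_text):
    """Split a rule into its words and a {gap_index: max_extra_words} map.

    Gap index i is the position just before rule word i.
    """
    words, allowed = [], {}
    for token in TOKEN.findall(rule_text.lower()):
        marker = MARKER.fullmatch(token)
        if marker:
            gap = len(words)
            allowed[gap] = allowed.get(gap, 0) + int(marker.group(1))
        else:
            words.append(token)
    return words, allowed


def find_extra_words(rule_words, query_words):
    """Greedily align the query to the rule (a simplification).

    Returns {gap_index: [extra words]}, or None if the rule words
    do not all occur in order in the query.
    """
    extras, pos = {}, 0
    for word in query_words:
        if pos < len(rule_words) and word == rule_words[pos]:
            pos += 1
        else:
            extras.setdefault(pos, []).append(word)
    return extras if pos == len(rule_words) else None


def evaluate(rule_text, query_text):
    """Return (score, detection_log) for a query matched against one rule."""
    rule_words, allowed = parse_rule(rule_text)
    query_words = re.findall(r"\w+", query_text.lower())
    extras = find_extra_words(rule_words, query_words)
    if extras is None:
        return None, []  # not a match at all in this sketch
    if not extras:
        return 100.0, []  # exact match, nothing to report
    permitted = all(
        len(words) <= allowed.get(gap, 0) for gap, words in extras.items()
    )
    if permitted:
        return 100.0, ["extra-words-permitted-in-rule"]
    # Placeholder penalty (an assumption): share of query words found in the rule.
    score = round(100.0 * len(rule_words) / len(query_words), 2)
    return score, ["extra-words"]


rule = "Redistribution and use [[2]] are permitted"
print(evaluate(rule, "Redistribution and use in source are permitted"))
print(evaluate(rule, "Redistribution and freely use are permitted"))
```

In the first call the two extra words fall at the marked position and within
the limit, so the sketch reports ``extra-words-permitted-in-rule`` with a score
of ``100``. In the second call the extra word sits where no marker exists, so
the score stays below ``100`` and the log shows ``extra-words``.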

--------------------------------------------------------------------------------

Linked Pull Requests
--------------------

.. list-table::
   :widths: 10 60 30 10
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
     - Status
   * - 1
     - Display `extra-words` in `detection_log` if present
     - `aboutcode-org/scancode-toolkit#4402
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4402>`_
     - Merged
   * - 2
     - Improve score by supporting `extra_phrase` for `extra-words` in rules
     - `aboutcode-org/scancode-toolkit#4432
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4432>`_
     - Open
   * - 3
     - Add extra-phrase in rules
     - `aboutcode-org/scancode-toolkit#4518
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4518>`_
     - Open

**Member:** Also add aboutcode-org/scancode-toolkit#4518 and any other repos you might have created, even if they are not ready/mergeable.

**Contributor Author:** Sure.

**Contributor Author (@alok1304, Aug 28, 2025):** I only added https://github.com/alok1304/named-entity-utils, because I am using this repo to identify named entities and mark `extra-phrase`. I added it under Post GSoC.

Related Issues
--------------

.. list-table::
   :widths: 10 60 30
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
   * - 1
     - `extra-words` does not show up in detection_log properly
     - `#4400
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4400>`_
   * - 2
     - Improve score when `extra-words` are found in the correct position
     - `#4420
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4420>`_

Pre GSoC Work
-------------

Before GSoC, I had contributed the following PRs:

.. list-table::
   :widths: 10 60 30
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
   * - 1
     - Renaming the dependency attribute `is_resolved` to `is_pinned`
     - `aboutcode-org/scancode-workbench#638
       <https://github.com/aboutcode-org/scancode-workbench/pull/638>`_
   * - 2
     - Add test for all PyPI METADATA versions
     - `aboutcode-org/scancode-toolkit#4180
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4180>`_
   * - 3
     - Add test for false positive GPL3 license
     - `aboutcode-org/scancode-toolkit#4106
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4106>`_
   * - 4
     - Add new rules for EUPL license
     - `aboutcode-org/scancode-toolkit#4204
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4204>`_
   * - 5
     - Add DUMB License and detection rule
     - `aboutcode-org/scancode-toolkit#4400
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4400>`_
   * - 6
     - Fixing the dead link by cross-reference in the documentation
     - `aboutcode-org/purldb#550
       <https://github.com/aboutcode-org/purldb/pull/550>`_
   * - 7
     - Add test for equivalent word
     - `aboutcode-org/scancode-toolkit#4305
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4305>`_
   * - 8
     - Enhance code visibility in dark mode
     - `aboutcode-org/scancode-workbench#637
       <https://github.com/aboutcode-org/scancode-workbench/pull/637>`_

Post GSoC
---------

I plan to continue contributing by adding `extra-phrase` support across many
license rules. This will strengthen license detection by making it more accurate
and flexible in handling variations within the rules.

To identify named entities in rules, I created a new repository,
`named-entity-utils <https://github.com/alok1304/named-entity-utils>`_, which I am
currently working on. This utility is used to add `extra-phrase` markers in rules
at positions where named entities are present.

Links
-----

* `Project Idea
<https://github.com/aboutcode-org/aboutcode/wiki/GSOC-2025-project-ideas#have-variable-license-sections-in-license-rules>`_

* `Official GSoC project page
<https://summerofcode.withgoogle.com/programs/2025/projects/EvCogGhq>`_

* `GSoC Proposal
<https://docs.google.com/document/d/1vNgiO8g1RiKVym4qK_jVFsiUH2z5ztaz8Q5lW6NkRK0/edit?tab=t.0>`_

* `Project Board <https://github.com/orgs/aboutcode-org/projects/28>`_

Acknowledgements
----------------

I would like to thank my mentors:

- `Philippe Ombredanne`_
- `Ayan Sinha Mahapatra`_

Special thanks to them for supporting me throughout this journey. Whenever
I faced a problem, we discussed it in depth during our weekly status calls. Without
their guidance and constant help, completing this project would not have been possible.

I also plan to explore more projects in AboutCode and contribute whenever I get
time, because I would love to remain a part of this wonderful organization.