Merge pull request #211 from alok1304/gsoc-report-25
Add gsoc25 report - Alok Kumar
2 parents bfbe7b3 + 20aee02 commit 96e824d

2 files changed: 202 additions, 0 deletions

docs/source/archive/gsoc-toc.rst

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ GSoC 2025
 .. toctree::
    :maxdepth: 2

+   gsoc/reports/2025/scancode_toolkit_alok
    gsoc/reports/2025/vulnerablecode_michael

 GSoC 2024
Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
========================================================================
Have variable license sections in license rules
========================================================================

**Organization:** `AboutCode <https://aboutcode.org>`_

**Projects:** `ScanCode Toolkit <https://github.com/aboutcode-org/scancode-toolkit>`_

**Mentee:** `Alok Kumar (alok1304) <https://github.com/alok1304>`_

**Mentors:**

- `Philippe Ombredanne <https://github.com/pombredanne>`_
- `Ayan Sinha Mahapatra <https://github.com/AyanSinhaMahapatra>`_

Overview
--------
This project enhances the `detection_log` so that it clearly reports when
`extra-words` are detected. These `extra-words` correspond to variable parts of a
license rule, and previously they caused the match score to fall below 100.

To address this, the implementation now checks whether the `extra-words` appear
at a permitted position in the license text. If they do, the score is raised to
100, which makes license rule matching more accurate.

--------------------------------------------------------------------------------

Implementation
--------------

- **Enhanced the detection_log:**

  - Display `extra-words` when they are detected.

- **Added an extra-phrase marker of the form [[n]] for the extra-words:**

  - An `extra-phrase` is written as ``[[n]]``: double opening square brackets
    ``[[``, a number `n`, and double closing square brackets ``]]``.
  - The marker is inserted in a license rule at each position where
    `extra-words` may appear.
  - `n` is the maximum number of `extra-words` permitted at that position.

- **Improved the score:**

  - Check whether the `extra-words` appear at a position marked by an
    `extra-phrase` and do not exceed its limit `n`.
  - If both conditions hold, raise the match score to ``100``.

- **Reported the result in the detection_log:**

  - If the score was raised because the `extra-words` are in a permitted
    position, show ``extra-words-permitted-in-rule`` in the `detection_log`.
  - If the `extra-words` are in the wrong place or exceed the maximum allowed,
    show ``extra-words`` in the `detection_log`.

- **Testing:**

  - Added tests for the `extra-phrase` functionality, such as
    `test_extra_phrase_tokenizer` and `test_extra_phrase_spans`, to ensure that
    phrases are correctly identified and processed.
  - Added tests verifying that the match score is updated when `extra-words`
    appear in a permitted position and stay within the limit.
  - Covered edge cases where `extra-words` are misplaced or exceed the maximum
    count, ensuring that scoring and logging behave as expected.

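The checks described above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not the code from the linked pull
requests: the names ``parse_rule`` and ``rescore`` are hypothetical, the texts are
split into plain words rather than ScanCode tokens, and the alignment is a simple
greedy walk that expects every rule word to appear, in order, in the matched text.
Only the ``[[n]]`` syntax and the two log entries are taken from this report.

```python
import re

# Matches either an extra-phrase marker such as [[3]] or a plain word.
TOKEN = re.compile(r"\[\[\d+\]\]|\w+")


def parse_rule(rule_text):
    """Split a rule into words and a {position: max_extra_words} mapping.

    Hypothetical helper: a marker at position p allows extra words
    just before rule word p.
    """
    words, allowed = [], {}
    for token in TOKEN.findall(rule_text.lower()):
        if token.startswith("[["):
            allowed[len(words)] = int(token[2:-2])
        else:
            words.append(token)
    return words, allowed


def rescore(rule_text, matched_text, score):
    """Return (score, log_entry) after checking extra words against markers."""
    rule_words, allowed = parse_rule(rule_text)
    extras = {}  # rule position -> number of extra words found there
    pos = 0
    for word in TOKEN.findall(matched_text.lower()):
        if pos < len(rule_words) and word == rule_words[pos]:
            pos += 1  # the next rule word was found
        else:
            extras[pos] = extras.get(pos, 0) + 1  # anything else is extra
    if pos != len(rule_words):
        raise ValueError("matched text does not contain every rule word in order")
    if not extras:
        return score, None
    if all(count <= allowed.get(p, 0) for p, count in extras.items()):
        return 100.0, "extra-words-permitted-in-rule"
    return score, "extra-words"


RULE = "Neither the name of [[3]] nor the names of its contributors may be used"
TEXT = "Neither the name of Example Corp nor the names of its contributors may be used"
print(rescore(RULE, TEXT, 86.0))  # (100.0, 'extra-words-permitted-in-rule')
```

The starting score passed in (``86.0`` here) is an arbitrary illustrative value.
The sketch leaves it unchanged and returns ``extra-words`` when the extra words
sit at an unmarked position or exceed `n`.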

--------------------------------------------------------------------------------

Linked Pull Requests
--------------------

.. list-table::
   :widths: 10 60 30 10
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
     - Status
   * - 1
     - Display `extra-words` in `detection_log` if present
     - `aboutcode-org/scancode-toolkit#4402
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4402>`_
     - Merged
   * - 2
     - Improve score by supporting `extra_phrase` for `extra-words` in rules
     - `aboutcode-org/scancode-toolkit#4432
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4432>`_
     - Open
   * - 3
     - Add extra-phrase in rules
     - `aboutcode-org/scancode-toolkit#4518
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4518>`_
     - Open

Related Issues
--------------

.. list-table::
   :widths: 10 60 30
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
   * - 1
     - `extra-words` does not show up in `detection_log` properly
     - `#4400
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4400>`_
   * - 2
     - Improve score when `extra-words` are found in the correct position
     - `#4420
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4420>`_

Pre GSoC Work
-------------

Before GSoC, I contributed the following PRs:

.. list-table::
   :widths: 10 60 30
   :header-rows: 1

   * - Sr. no
     - Name
     - Link
   * - 1
     - Renaming the dependency attribute `is_resolved` to `is_pinned`
     - `aboutcode-org/scancode-workbench#638
       <https://github.com/aboutcode-org/scancode-workbench/pull/638>`_
   * - 2
     - Add test for all PyPI METADATA versions
     - `aboutcode-org/scancode-toolkit#4180
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4180>`_
   * - 3
     - Add test for false positive GPL3 license
     - `aboutcode-org/scancode-toolkit#4106
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4106>`_
   * - 4
     - Add new rules for EUPL license
     - `aboutcode-org/scancode-toolkit#4204
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4204>`_
   * - 5
     - Add DUMB License and detection rule
     - `aboutcode-org/scancode-toolkit#4400
       <https://github.com/aboutcode-org/scancode-toolkit/issues/4400>`_
   * - 6
     - Fixing the dead link by cross-reference in the documentation
     - `aboutcode-org/purldb#550
       <https://github.com/aboutcode-org/purldb/pull/550>`_
   * - 7
     - Add test for equivalent word
     - `aboutcode-org/scancode-toolkit#4305
       <https://github.com/aboutcode-org/scancode-toolkit/pull/4305>`_
   * - 8
     - Enhance code visibility in dark mode
     - `aboutcode-org/scancode-workbench#637
       <https://github.com/aboutcode-org/scancode-workbench/pull/637>`_

Post GSoC
---------

I plan to continue contributing by adding `extra-phrase` support across many
license rules. This will strengthen license detection by making it more accurate
and more flexible in handling variations within the rules.

To identify named entities in rules, I created a new repository,
`named-entity-utils <https://github.com/alok1304/named-entity-utils>`_, which I am
currently developing. This utility adds `extra-phrase` markers to rules
at positions where named entities are present.

Links
-----

* `Project Idea
  <https://github.com/aboutcode-org/aboutcode/wiki/GSOC-2025-project-ideas#have-variable-license-sections-in-license-rules>`_

* `Official GSoC project page
  <https://summerofcode.withgoogle.com/programs/2025/projects/EvCogGhq>`_

* `GSoC Proposal
  <https://docs.google.com/document/d/1vNgiO8g1RiKVym4qK_jVFsiUH2z5ztaz8Q5lW6NkRK0/edit?tab=t.0>`_

* `Project Board <https://github.com/orgs/aboutcode-org/projects/28>`_

Acknowledgements
----------------

I would like to thank my mentors:

- `Philippe Ombredanne`_
- `Ayan Sinha Mahapatra`_

Special thanks to my mentors, who supported me throughout this journey. Whenever
I faced a problem, we discussed it in depth during our weekly status calls. Without
their guidance and constant help, completing this project would not have been possible.

I also plan to explore more AboutCode projects and contribute whenever I have
time, because I would love to remain a part of this wonderful organization.
