Clean projects #731
base: master
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
silnlp/common/clean_projects.py
line 412 at r1 (raw file):
all_folders.append(item)
test = True
Was this supposed to be included? Or was it left over from debugging/testing?
I've removed the test code and also put "TermRenderings.xml" in lower case.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @davidbaines and @mshannon-sil)
silnlp/common/clean_projects.py
line 412 at r1 (raw file):
Previously, benjaminking (Ben King) wrote…
Was this supposed to be included? Or was it left over from debugging/testing?
I believe the "test" variable should also be removed since it's no longer used.
Thanks, Ben, good catch.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
silnlp/common/clean_projects.py
line 344 at r3 (raw file):
# --- Configure Logging ---
#log_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
log_formatter = logging.Formatter("2025-05-29 14:30:00,000 - %(levelname)s - %(message)s")
One last thing: the hard-coded date in the format string here will cause all of the log messages to print "2025-05-29".
Thanks, Ben. I've reinstated the correct log Formatter.
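For reference, a minimal sketch of why the format string matters. The logger name and message are hypothetical; only the format string comes from the snippet above. `%(asctime)s` is filled in from each record's creation time, whereas a literal date in the format string is copied verbatim into every line.

```python
import logging

# %(asctime)s is replaced with the record's creation time when it is formatted.
log_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

# Hypothetical record, built by hand so the example needs no handlers.
record = logging.LogRecord(
    "clean_projects_demo",  # logger name (assumption)
    logging.INFO,
    "demo.py",              # pathname, illustrative only
    1,                      # line number, illustrative only
    "Scanning projects",
    None,
    None,
)

formatted_line = log_formatter.format(record)
print(formatted_line)
```

The printed timestamp reflects when the record was created, so it changes from run to run.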
@mshannon-sil I think that I've made all the requested changes - are you able to review this too while Ben is away?
Yes, I'll add my review shortly.
I added some comments reviewing the file as a whole rather than just your changes, since I found a few issues and this PR is aiming to improve the clean_projects module.
Reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @benjaminking and @davidbaines)
silnlp/common/clean_projects.py
line 77 at r4 (raw file):
"frtbak.sty", "wordanalyses.xml", "bookNames.xml",
Shouldn't this be all lower case to match the other filenames here?
silnlp/common/clean_projects.py
line 87 at r4 (raw file):
".dic", ".ldml", ".lds",
This is in both extensions to keep and delete. Is that intended?
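Both of the points above (mixed-case names and an extension listed under keep and delete) could be caught by a small consistency check. The following is only a sketch: the constant names and their contents are hypothetical stand-ins, not the actual lists in clean_projects.py.

```python
# Hypothetical excerpts; the real constants in clean_projects.py are longer
# and may be named differently.
FILES_TO_KEEP = {"frtbak.sty", "wordanalyses.xml", "booknames.xml"}
EXTENSIONS_TO_KEEP = {".ldml", ".lds", ".ssf"}
EXTENSIONS_TO_DELETE = {".bak", ".dic", ".tmp"}


def find_constant_problems(names_to_keep, extensions_to_keep, extensions_to_delete):
    """Return a list of human-readable problems found in the constants."""
    problems = []
    # Every entry should already be lower case, since lookups use lowered names.
    for value in sorted(names_to_keep | extensions_to_keep | extensions_to_delete):
        if value != value.lower():
            problems.append(f"not lower case: {value}")
    # An extension cannot sensibly be both kept and deleted.
    for ext in sorted(extensions_to_keep & extensions_to_delete):
        problems.append(f"in both keep and delete: {ext}")
    return problems


print(find_constant_problems(FILES_TO_KEEP, EXTENSIONS_TO_KEEP, EXTENSIONS_TO_DELETE))
```

A check like this could run once at start-up or in a unit test, so that a new entry added in the wrong case or to both lists is reported immediately.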
silnlp/common/clean_projects.py
line 167 at r4 (raw file):
if self.args.verbose > 0:  # Condition to buffer this warning
    self._log_info(warning_msg)
self.parsing_errors.append(f"BiblicalTermsListSetting file not found: {self.project_settings.biblical_terms_file_name})")
There's an extra parenthesis at the end of the string.
silnlp/common/clean_projects.py
line 223 at r4 (raw file):
    delete_file = True
    reason = "specific name"
elif any(item_path.match(pattern) for pattern in FILES_TO_DELETE_BY_PATTERN):
I think this should also compare against the lower case version of the item_path
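One way to make the pattern check case-insensitive is sketched below. Note that it swaps `item_path.match` for `fnmatch.fnmatchcase` applied to the lower-cased file name only, so it is not a drop-in equivalent when patterns contain directory components. The patterns shown are hypothetical and assumed to be written in lower case.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Hypothetical patterns; the real FILES_TO_DELETE_BY_PATTERN may differ.
FILES_TO_DELETE_BY_PATTERN = ["*.bak", "backup*", "~*"]


def matches_delete_pattern(item_path: Path, patterns=FILES_TO_DELETE_BY_PATTERN) -> bool:
    """Compare the lower-cased file name against lower-case patterns."""
    lowered_name = item_path.name.lower()
    # fnmatchcase never normalises case itself, so the result is the same on every platform.
    return any(fnmatchcase(lowered_name, pattern) for pattern in patterns)


print(matches_delete_pattern(Path("Project/NOTES.BAK")))
print(matches_delete_pattern(Path("Project/01GEN.SFM")))
```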
silnlp/common/clean_projects.py
line 350 at r4 (raw file):
if args.verbose == 0:
    console_handler.setLevel(logging.CRITICAL + 1)
elif args.verbose == 1:
The elif and else statement both do the same thing here.
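If the two remaining branches really are identical, they can collapse into a single fallback. A minimal sketch follows; the quoted snippet does not show what the `elif`/`else` bodies set, so `logging.INFO` here is an assumption, as is the helper name `console_level_for`.

```python
import logging


def console_level_for(verbose: int) -> int:
    """Map the verbosity count to a console handler level."""
    if verbose == 0:
        # One above CRITICAL: nothing is shown on the console.
        return logging.CRITICAL + 1
    # Assumed level for any verbosity of 1 or more.
    return logging.INFO


console_handler = logging.StreamHandler()
console_handler.setLevel(console_level_for(1))
```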
silnlp/common/clean_projects.py
line 375 at r4 (raw file):
# Initial scan for all items to determine directories
initial_items = list(projects_root_path.glob("*"))
glob("*") doesn't include folders/files that start with a dot e.g. .cache. You might want to do projects_root_path.listdir() to get a list of all items instead.
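A small sketch of listing every top-level entry, dot-prefixed ones included. Note that `pathlib.Path` provides `iterdir()`; `listdir()` is a function in the `os` module. Whether `glob("*")` skips dot-prefixed names depends on the API: the standalone `glob` module does so by default, and it is worth verifying how `Path.glob` behaves on the Python version in use. The helper name and the folder layout below are hypothetical.

```python
import tempfile
from pathlib import Path


def list_top_level_items(projects_root_path: Path) -> list:
    """Return every entry directly under the root, regardless of its name."""
    return sorted(projects_root_path.iterdir())


# Build a throwaway folder tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".cache").mkdir()
    (root / "ProjectA").mkdir()
    (root / "notes.txt").write_text("demo")
    item_names = [p.name for p in list_top_level_items(root)]

print(item_names)
```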
silnlp/common/clean_projects.py
line 388 at r4 (raw file):
found_total_msg = f"Found {len(all_folders)} total directories in {args.projects_root}."
logger.info(found_total_msg)
if args.verbose > 0:
found_total_msg is being logged/printed twice here.
silnlp/common/clean_projects.py
line 423 at r4 (raw file):
found_msg = f"Found {len(project_folders)} project folders."
logger.info(found_msg)
if args.verbose > 0:
found_msg is also logged/printed twice here. And there are multiple other occurrences of the same issue in this file so it would be good to look at every case of args.verbose > 0 to check.
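One way to avoid the duplicates is to let the handlers decide where a message goes, so each message is logged exactly once and no separate `print` under `args.verbose > 0` is needed. The sketch below uses in-memory streams in place of a real log file and console. The function name, logger name, and levels are assumptions for illustration.

```python
import io
import logging


def build_logger(name, verbose, log_stream, console_stream):
    """Create a logger whose handlers, not the call sites, apply the verbosity."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False   # keep the root logger from emitting a second copy
    logger.handlers.clear()    # avoid stacking handlers on repeated calls

    # Stand-in for a file handler: always records INFO and above.
    logger.addHandler(logging.StreamHandler(log_stream))

    # Console handler: silent unless verbose > 0.
    console_handler = logging.StreamHandler(console_stream)
    console_handler.setLevel(logging.INFO if verbose > 0 else logging.CRITICAL + 1)
    logger.addHandler(console_handler)
    return logger


log_stream, console_stream = io.StringIO(), io.StringIO()
logger = build_logger("clean_projects_demo_once", 1, log_stream, console_stream)
logger.info("Found 3 project folders.")  # single call, no extra print
```

With `verbose=0` the same call would still reach the log stream but leave the console stream empty.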
Reviewable status: 0 of 1 files reviewed, 9 unresolved discussions (waiting on @davidbaines and @mshannon-sil)
silnlp/common/clean_projects.py
line 324 at r4 (raw file):
"projects_root", nargs="?", default=PROJECTS_FOLDER_DEFAULT,
I think it would be better for now to remove this as a default value to avoid a situation where identical inputs to a script result in different behavior on different platforms.
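A minimal sketch of the suggested change: without `nargs="?"` and `default=...`, the positional argument becomes required, so the script fails fast instead of silently falling back to a platform-dependent folder. The description and help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Clean project folders (illustrative).")
    # No nargs="?" and no default: the caller must always name the folder.
    parser.add_argument("projects_root", help="Folder containing the project folders to clean.")
    return parser


args = build_parser().parse_args(["/data/projects"])
print(args.projects_root)
```

Calling `parse_args([])` on this parser exits with a usage error rather than choosing a folder on the user's behalf.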
Thanks for checking the whole script. I hope that I've addressed all the issues and not introduced new ones.
@mshannon-sil reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @benjaminking and @davidbaines)
@benjaminking reviewed 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @davidbaines)
silnlp/common/clean_projects.py
line 4 at r7 (raw file):
import argparse
from ast import Raise
This looks like it might have been imported by mistake.
silnlp/common/clean_projects.py
line 13 at r7 (raw file):
from typing import Optional
# from silnlp.common.environment import SIL_NLP_ENV
This comment can be removed.
silnlp/common/clean_projects.py
line 19 at r7 (raw file):
# --- Global Constants ---
# PROJECTS_FOLDER_DEFAULT = SIL_NLP_ENV.pt_projects_dir
Same with this one.
silnlp/common/clean_projects.py
line 468 at r7 (raw file):
try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
I am curious about using threads for this task. Have you seen it improve the throughput of the script? I would have guessed that cleaning up these projects would be IO-limited, rather than CPU-limited. Multi-threading often introduces difficult-to-spot errors, so I think it's usually worth avoiding unless the benefits are clear.
Thanks for the review, @benjaminking. I had Gemini create a benchmarking test, and the results showed that multithreading was almost twice as fast on my local machine. That makes a difference for me when I'm running this locally prior to uploading a project.
F:\GitHub\silnlp>poetry run python -m silnlp.common.clean_projects_benchmarking --num-projects 1000 --num-files-per-project 100 --dry-run
--- Starting Multithreaded Benchmark ---
--- Starting Single-threaded Benchmark ---
--- Benchmark Results ---
@benjaminking reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @davidbaines)
silnlp/common/clean_projects.py
line 468 at r7 (raw file):
Previously, benjaminking (Ben King) wrote…
I am curious about using threads for this task. Have you seen it improve the throughput of the script? I would have guessed that cleaning up these projects would be IO-limited, rather than CPU-limited. Multi-threading often introduces difficult-to-spot errors, so I think it's usually worth avoiding unless the benefits are clear.
Ok. It seems like you are seeing benefits from the multi-threading.
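For anyone wanting to repeat the comparison, a rough sketch of such a benchmark is shown below. It is not the `clean_projects_benchmarking` module mentioned above: the per-project work is a hypothetical stand-in that only counts files, and the folder sizes and worker count are arbitrary. Timings will vary with disk, OS, and cache state, so no particular speed-up is implied.

```python
import concurrent.futures
import tempfile
import time
from pathlib import Path


def count_files(project_dir: Path) -> int:
    """Stand-in for per-project work: walk the folder and count its files."""
    return sum(1 for p in project_dir.rglob("*") if p.is_file())


def run_sequential(project_dirs):
    return [count_files(d) for d in project_dirs]


def run_threaded(project_dirs, max_workers=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with run_sequential.
        return list(executor.map(count_files, project_dirs))


def timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    project_dirs = []
    for i in range(20):
        project_dir = Path(tmp) / f"project_{i}"
        project_dir.mkdir()
        for j in range(10):
            (project_dir / f"file_{j}.txt").write_text("x")
        project_dirs.append(project_dir)

    sequential_counts, sequential_seconds = timed(run_sequential, project_dirs)
    threaded_counts, threaded_seconds = timed(run_threaded, project_dirs)

print(f"sequential: {sequential_seconds:.4f}s, threaded: {threaded_seconds:.4f}s")
```

Checking that both runs return the same results, as well as comparing their timings, gives some reassurance that the threaded path does the same work.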
Updates to common.clean_projects to use multithreading, plus improvements to the logging.