Our goal is to continue unified structure learning on the DocOwl1.5-stage1 model with some private data, and then apply LoRA fine-tuning for the Document Parsing task. Our procedure was:

1. Following the original paper, we modified the parameters in the 'finetune-docowl.sh' script, setting tune_vision2text=True, freeze_vision_model=False, and freeze_base_model=True, and ran continued unified structure learning. After this stage, the model performed inference normally.
2. We then fine-tuned the resulting checkpoint for the Document Parsing task with the 'finetune-docowl_lora.sh' script, aiming to further improve its performance. The training loss decreased as expected, but after applying the LoRA weights, the model's inference output became incoherent.

Notably, applying LoRA fine-tuning directly to the DocOwl1.5-stage1 model (skipping step 1) does achieve the desired results.
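To make our understanding of the step-1 flags concrete, here is a minimal sketch of how we expect them to map onto parameter freezing. The module name prefixes (vision_model, vision2text) are our assumptions about the DocOwl1.5 architecture, not verbatim attribute names from the repo:

```python
# Sketch only: prefixes below are assumed, not taken from the DocOwl1.5 code.
import torch.nn as nn

def apply_freeze_flags(model: nn.Module,
                       tune_vision2text: bool = True,
                       freeze_vision_model: bool = False,
                       freeze_base_model: bool = True) -> None:
    """Toggle requires_grad per the unified structure learning recipe."""
    for name, param in model.named_parameters():
        if name.startswith("vision_model."):
            # Vision encoder is trained when freeze_vision_model=False.
            param.requires_grad = not freeze_vision_model
        elif name.startswith("vision2text."):
            # Vision-to-text module (H-Reducer) is trained when
            # tune_vision2text=True.
            param.requires_grad = tune_vision2text
        else:
            # Remaining parameters are treated as the base LLM and stay
            # frozen when freeze_base_model=True.
            param.requires_grad = not freeze_base_model
```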
We would appreciate any suggestions regarding our experimental design that could help us achieve the expected results. For reference, a rough sketch of our LoRA stage is included below.
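This is only an approximation of what 'finetune-docowl_lora.sh' does, assuming a standard Hugging Face transformers/peft setup; the checkpoint path, model-loading class, and LoRA hyperparameters are placeholders rather than the exact values from the script:

```python
# Rough sketch of the LoRA stage; all concrete values are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the checkpoint produced by the continued structure learning run
# (placeholder path; DocOwl1.5 may require its own loading class).
model = AutoModelForCausalLM.from_pretrained(
    "path/to/stage1-continued-checkpoint",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=8,                                  # rank: assumed, not from the script
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# Wrap the base model with trainable LoRA adapters and verify which
# parameters will actually receive gradients.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

One thing we are unsure about is whether the LoRA adapters in step 2 attach to the correct, fully merged step-1 weights, since that is the main difference from the direct-LoRA run that works.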