Description
I'm trying to reimplement the "Text to Robot Pose with CLIP" experiment from the paper, but I haven't achieved the same results.
I've been attempting to match the conditions outlined in the paper. When training the shadow hand model, I used:
python generate_robot_data.py --model_xml_dir mujoco_menagerie/shadow_hand --camera_distance_factor 0.4
python train.py --dataset_path data/shadow_hand --experiment_name shadow_hand --canonical_training_iterations 5000 --pose_conditioned_training_iterations 30_000
I also wrote a script for aligning CLIP features using the 🤗 openai/clip-vit-base-patch32 encoder.
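For concreteness, this is roughly how my script extracts the text features with the Hugging Face `transformers` API (the prompt text here is just an illustrative placeholder, not one of the paper's prompts):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Encode the prompt once; the text embedding stays fixed during optimization.
text_inputs = processor(
    text=["a robot hand reaching forward"],  # illustrative prompt
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```

The rendered image is re-encoded with `model.get_image_features` on every optimization step, normalized the same way.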
The initial pose comes from get_canonical_pose in utils.mujoco_utils, and I optimize it with Adam.
The loss function is the negative dot product between the image and text embeddings:
loss = -torch.matmul(image_embedding, text_features.T.detach())
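Put together, my optimization loop looks roughly like the sketch below. Here `embed_fn` is a stand-in for my differentiable render-plus-CLIP-encode step, and the step count and learning rate are my own guesses, since the paper does not state them:

```python
import torch

def optimize_pose(initial_pose, text_features, embed_fn, steps=500, lr=1e-2):
    """Gradient-descend the pose to maximize CLIP image-text similarity.

    embed_fn: a differentiable map from pose parameters to a (1, D)
    L2-normalized image embedding (render + CLIP encode in my setup).
    text_features: a (1, D) L2-normalized text embedding, held fixed.
    """
    pose = initial_pose.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        image_embedding = embed_fn(pose)
        # Negative dot product: minimizing this maximizes similarity.
        loss = -torch.matmul(image_embedding, text_features.T.detach()).mean()
        loss.backward()
        optimizer.step()
    return pose.detach()
```

With a toy `embed_fn` that just normalizes the pose vector, the loop does align the pose with the target embedding, so the basic gradient flow seems fine; the trouble appears only with the real renderer in the loop.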
I've noticed two oddities. First, the loss starts off much lower than what was reported in the paper: the project webpage shows an initial value of around -24, but my reproduction starts below -30. I suspect this is related to the prompts. Second, it is difficult to optimize toward the desired pose.


Therefore, I would like to know more about the implementation details for this part, such as the optimizer settings and any additional tricks that were used. Could you please share them?
I'm looking forward to your reply.