Object mask #18331
Conversation
…of the marker to get the correct position of the mask"
I had a look at your video some days ago and just now took a quick glance at the code provided here; it is not easy to read. As the integration into the dt pipe code might be tricky, I would be willing to provide that code so you could concentrate on the AI model and UI once things become clear to me. From what I understand so far: […]
Could you comment on whether my above understanding is correct or provide some explanation?
Thanks for giving it a look @jenshannoschwalm!
Definitely, AI code in C is kind of "cursed", as it seems everyone expects it to be executed in a Python environment. I understand there is no plan to add Python as a dependency to dt, which is why I decided to try to keep this an entirely C-based approach.
That totally aligns with the idea I had for the project: proof of concept, call it whatever you want :).
Perfect! The cache system, the worker system, and the multiple dev + pipe instances got me slightly confused about how to implement the feature, so help in that regard is all I can ask for. Maybe in future iterations I'll also have to read up on the build system so I can make a build that is less dependent on my environment.
Indeed, the idea would be that the first time you create an AI mask, some task generates a full-resolution RGB image of the original image. This image gets resized to the model's input size and the model returns the object masks. It would be great for these masks to be cached once they've been created, and even stored on disk (as the thumbnails are), so they don't need to be created again. Avoiding recreation is also important because the AI model is not deterministic: the masks could change on each run, which would break the "designer" workflow for the user. Further implementations could include a GPU mode if the user has cuDNN installed, but it should not be mandatory.
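As an illustration of the caching idea (not code from this PR), here is a minimal C sketch of a per-image mask cache. All names such as `dt_ai_mask_entry_t` and `dt_ai_mask_get()` are hypothetical and not part of darktable's API:

```c
// Hypothetical sketch only: cache the AI masks per image id so they are
// computed once and reused, keeping results stable across edits.
#include <stdint.h>
#include <stddef.h>

typedef struct dt_ai_mask_entry_t
{
  int32_t imgid;      // image the masks belong to
  int num_masks;      // number of object masks the model produced
  int width, height;  // mask resolution
  float *masks;       // num_masks * width * height values in [0,1]
} dt_ai_mask_entry_t;

#define AI_MASK_CACHE_SIZE 64
static dt_ai_mask_entry_t _cache[AI_MASK_CACHE_SIZE];
static int _cache_used = 0;

// Return cached masks if present; otherwise run inference once and store them.
static dt_ai_mask_entry_t *dt_ai_mask_get(int32_t imgid)
{
  for(int i = 0; i < _cache_used; i++)
    if(_cache[i].imgid == imgid) return &_cache[i];

  if(_cache_used >= AI_MASK_CACHE_SIZE) return NULL; // a real cache would evict or spill to disk

  dt_ai_mask_entry_t *e = &_cache[_cache_used++];
  e->imgid = imgid;
  // ... run the ONNX model here and fill e->masks / e->num_masks ...
  return e;
}
```

Persisting the entries to disk, similar to how thumbnails are handled, would cover the "stored" part mentioned above.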
Yes, I would need to access the masks in some way from the "object" mask so the blending can be performed.
Do you want me to also give a further explanation of how the masks are selected and computed?
Yes please.
Okay! Let's walk through how the code works then. I currently detect that the inference must be done once the […] I added to the […].

Now, let's go to line 1222 of develop/pixelpipe_hb.c. There, the inference of the model is prepared. It starts by checking in the pipe whether there is an image of at least a certain size and whether the masks have not been generated yet. I had some segfaults during testing and thought some concurrency problems might be happening, so I created a "local" copy of the back buffer of the pipe (assumed RGBA) and copied only the RGB values. Then the image is swapped from XYC (coordinate x, coordinate y, color) to CXY, as the input of the NN requires that format. Then the ONNX runtime is prepared; I am not 100% sure what all the options do, as I just took them from the ONNX Runtime examples. Then we enter the […]
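To make the preprocessing easier to follow, here is a small self-contained sketch of the two buffer conversions described above (dropping alpha from the RGBA back buffer and reordering XYC/HWC to CXY/CHW), plus a minimal ONNX Runtime C-API session setup along the lines of the official examples. The buffer names and the omission of error checking are my own simplifications, not the actual code in pixelpipe_hb.c:

```c
#include <onnxruntime_c_api.h>
#include <stdlib.h>

// Drop alpha from the pipe's RGBA back buffer into a local RGB copy
// (hypothetical buffer names, not the actual pixelpipe variables).
static float *copy_rgb(const float *rgba, int w, int h)
{
  float *rgb = malloc(sizeof(float) * 3 * w * h);
  for(int p = 0; p < w * h; p++)
    for(int c = 0; c < 3; c++)
      rgb[3 * p + c] = rgba[4 * p + c];
  return rgb;
}

// Reorder from interleaved XYC (HWC) to planar CXY (CHW) as the network expects.
static float *hwc_to_chw(const float *hwc, int w, int h)
{
  float *chw = malloc(sizeof(float) * 3 * w * h);
  for(int c = 0; c < 3; c++)
    for(int p = 0; p < w * h; p++)
      chw[c * w * h + p] = hwc[3 * p + c];
  return chw;
}

// Minimal ONNX Runtime (C API) session setup, roughly as in the official
// examples; error checks omitted for brevity.
static OrtSession *create_session(const char *model_path)
{
  const OrtApi *ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  OrtEnv *env = NULL;
  OrtSessionOptions *opts = NULL;
  OrtSession *session = NULL;
  ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "dt-ai-mask", &env);
  ort->CreateSessionOptions(&opts);
  ort->CreateSession(env, model_path, opts, &session); // CPU execution provider by default
  return session;
}
```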
These layers are sent to the […]. The dimension of 37 has the following indexes: X, Y, W and H are the coordinates of the position and size of the bounding box (centered at X and Y), C is the confidence for the bounding box, and M1 to M32 are multipliers that will make sense later.

From this resulting list we keep the bounding boxes whose confidence is greater than the given threshold; 0.3 is a typical value but can probably be raised for "more stable" masks. Then another filter called "Non-Max Suppression" is applied: it takes two bounding boxes that are similar in terms of the area they occupy (using the Intersection-over-Union metric) and discards the one with the lower confidence.

Finally, from each "surviving" bounding box we get the mask it represents. These masks are encoded in the M1 to M32 multipliers. The last output layer has the shape 32 x 256 x 256. Each of the 256 x 256 "images" is called a proto; when each proto is multiplied by its Mx value and the results are added up for a given bounding box (that's what is being executed in […]), we obtain that box's mask.

And there we have it! From the 21500+ bounding boxes, the system filters by confidence and NMS and finally computes the mask for the […]. These masks are stored on the struct and are used by the mask operator.
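For concreteness, here is a hedged sketch of the decoding described above: the Intersection-over-Union used by Non-Max Suppression, and the assembly of one mask from the 32 prototype planes for a surviving detection. The exact tensor layout, the helper names, and the final sigmoid are assumptions based on common YOLO/FastSAM-style segmentation decoding, not necessarily what the PR does:

```c
#include <math.h>

// Intersection-over-Union of two boxes given as centre x/y, width, height.
static float iou(const float a[4], const float b[4])
{
  const float ax0 = a[0] - a[2] / 2, ax1 = a[0] + a[2] / 2;
  const float ay0 = a[1] - a[3] / 2, ay1 = a[1] + a[3] / 2;
  const float bx0 = b[0] - b[2] / 2, bx1 = b[0] + b[2] / 2;
  const float by0 = b[1] - b[3] / 2, by1 = b[1] + b[3] / 2;
  const float iw = fmaxf(0.f, fminf(ax1, bx1) - fmaxf(ax0, bx0));
  const float ih = fmaxf(0.f, fminf(ay1, by1) - fmaxf(ay0, by0));
  const float inter = iw * ih;
  const float uni = a[2] * a[3] + b[2] * b[3] - inter;
  return uni > 0.f ? inter / uni : 0.f;
}

// Build the mask for one surviving detection:
// det = [x, y, w, h, conf, m1..m32], protos = 32 planes of 256 x 256.
// Per pixel: weighted sum of the protos with the M1..M32 multipliers;
// the sigmoid at the end is the usual YOLO-seg convention (assumption,
// not stated explicitly above).
static void decode_mask(const float det[37],
                        const float *protos, /* 32 * 256 * 256 */
                        float *mask /* 256 * 256, output in [0,1] */)
{
  for(int p = 0; p < 256 * 256; p++)
  {
    float v = 0.f;
    for(int k = 0; k < 32; k++)
      v += det[5 + k] * protos[k * 256 * 256 + p];
    mask[p] = 1.f / (1.f + expf(-v)); // cropping to the bounding box usually follows
  }
}
```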
I created a "cleaner" branch and pull request with only the changes that are relevant for the feature itself: #18356
Created https://github.com/darktable-org/darktable/pull/18331 so people can try the tool. It is not finished at all and is very much a WIP.
As I mentioned before, object detection is not correctly integrated with the processing pipeline. To create the ONNX model you can use the Colab notebook at https://github.com/ChuRuaNh0/FastSam_Awsome_TensorRT. The ONNX Runtime library can be downloaded from the GitHub releases: https://github.com/microsoft/onnxruntime/releases/tag/v1.20.1