
Conversation

@MikoMikarro commented Feb 1, 2025

Created https://github.com/darktable-org/darktable/pull/18331 so people can try the tool. It is not finished at all and is still very, very much a WIP.

As I mentioned before, object detection is not correctly integrated with the processing pipeline. To create the ONNX model, you can use the Colab notebook at https://github.com/ChuRuaNh0/FastSam_Awsome_TensorRT. The ONNX Runtime library can be downloaded from its GitHub releases page: https://github.com/microsoft/onnxruntime/releases/tag/v1.20.1

@MikoMikarro mentioned this pull request Feb 1, 2025
@MikoMikarro marked this pull request as draft February 1, 2025 12:18
@jenshannoschwalm (Collaborator) commented

I had a look at your video some days ago, and just now took a quick glance at the code provided here; it is not easy to read.
Please understand that I have never touched any AI code before, I don't know anything about the models, and I have not even been inclined to use any AI at all, but I do have a solid idea of what we can do and how things can be done in darktable code. I'm not sure yet whether we want such AI masks at all, but I think it's worthwhile to build a fully working "prototype" that can be merged into master for broader testing; we might later decide to drop it if the quality/results are not good enough. OK?

As the integration into the dt pipe code might be tricky, I would be willing to provide that code so you could concentrate on the AI model and the UI, once things become clear to me.

From what I understand so far:

  1. You require full-image RGB data, pass that data to an AI algorithm that does some segmentation analysis, and get back an array of the same size as the image, where every pixel location holds some sort of segment identifier. As this analysis is costly, we would want that segment array to be cached.
  2. You take a location/area from the UI and use the segment array to calculate a darktable mask, which is used for blending "as usual".

Could you comment on whether my above understanding is correct, or provide some explanation?

@MikoMikarro (Author) commented

Thanks for giving it a look, @jenshannoschwalm!

> I had a look at your video some days ago, and just now took a quick glance at the code provided here; it is not easy to read.

Definitely, AI code in C is kind of "cursed", since it seems everyone prefers to run it in Python environments. I understand there is no plan to add Python as a dependency to dt, which is why I decided to keep the whole approach entirely C-based.

> I'm not sure yet whether we want such AI masks at all, but I think it's worthwhile to build a fully working "prototype" that can be merged into master for broader testing; we might later decide to drop it if the quality/results are not good enough. OK?

That totally aligns with the idea I had for the project: a proof of concept, call it whatever name you want :).

> As the integration into the dt pipe code might be tricky, I would be willing to provide that code so you could concentrate on the AI model and the UI, once things become clear to me.

Perfect! The cache system, the worker system, and the multiple dev + pipes got me slightly confused about how to implement the feature, so help in that regard is all I can ask for. Maybe in future iterations I'll also have to read through the build system so I can make the build less dependent on my environment.

> From what I understand so far:

> 1. You require full-image RGB data, pass that data to an AI algorithm that does some segmentation analysis, and get back an array of the same size as the image, where every pixel location holds some sort of segment identifier. As this analysis is costly, we would want that segment array to be cached.

Indeed. The idea is that the first time you create an AI mask, a task generates a full-resolution RGB copy of the original image. This image is then resized to 1024x1024 and split into its RGB channels to be fed as input to the FastSAM model.

The model returns $N$ masks, each with a real resolution of 256x256. My implementation currently does a bilinear interpolation of each 256x256 mask up to the original size, but I think that part should not be necessary, since I could always map back from the original resolution to the 256x256 array and thereby save a lot of memory.
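A rough sketch of what I mean by mapping back (illustrative only, not the code in the PR): instead of storing an upscaled copy, sample the 256x256 mask bilinearly at the full-resolution coordinate whenever the blending code needs a value.

```c
// Bilinear sample of a low-resolution mask at a full-resolution pixel.
// Function and parameter names are illustrative, not taken from the PR.
static inline float mask_sample_bilinear(const float *mask,   // 256*256 floats
                                         const int mask_size, // 256
                                         const float x,       // pixel x in the full-res image
                                         const float y,       // pixel y in the full-res image
                                         const int img_w,
                                         const int img_h)
{
  // map full-resolution coordinates into mask coordinates
  const float mx = x / (float)img_w * (float)(mask_size - 1);
  const float my = y / (float)img_h * (float)(mask_size - 1);

  const int x0 = (int)mx, y0 = (int)my;
  const int x1 = (x0 + 1 < mask_size) ? x0 + 1 : x0;
  const int y1 = (y0 + 1 < mask_size) ? y0 + 1 : y0;
  const float fx = mx - (float)x0, fy = my - (float)y0;

  // blend the four surrounding mask samples
  const float top    = mask[y0 * mask_size + x0] * (1.0f - fx) + mask[y0 * mask_size + x1] * fx;
  const float bottom = mask[y1 * mask_size + x0] * (1.0f - fx) + mask[y1 * mask_size + x1] * fx;
  return top * (1.0f - fy) + bottom * fy;
}
```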

It would be great if these masks were cached once they've been created, and even stored on disk (as the thumbnails are) so they don't need to be generated again. Avoiding regeneration also matters because the AI model is not deterministic: the masks could change on each run, which would break the "designer" workflow for the user. Further iterations could include a GPU mode if the user has cuDNN installed, but it should not be mandatory.

> 2. You take a location/area from the UI and use the segment array to calculate a darktable mask, which is used for blending "as usual".

Yes, I would need to access the masks in some way from the "object" mask so the blending can be performed.

> Could you comment on whether my above understanding is correct, or provide some explanation?

Do you want me to also give a further explanation on how the masks are selected and computed?

@jenshannoschwalm (Collaborator) commented

Yes please.

@MikoMikarro (Author) commented

Okay! Let's walk through how the code works then. Currently, I detect that the inference must be done when the _pixelpipe_process_on_CPU function is executed. After re-reading the pull request, it seems some unrelated changes got mixed in by a merge I did with the upstream master, but I'll ignore those for now. (Maybe recreating the branch with only the changes for this feature would be advisable 😅)

I added some extra fields to the dt_dev_pixelpipe_t struct inside develop/pixelpipe_hb.h to hold the mask data. I called it 'proxy' because my first idea was to store just the RGB proxy I mentioned before; in the end I ended up storing the masks there.
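The added fields look roughly like this (the names here are illustrative, not the exact ones in the diff; in the PR they live directly inside dt_dev_pixelpipe_t, grouping them in a small struct is just for readability):

```c
#include <glib.h>

// Sketch of the segmentation "proxy" state added to the pipe struct.
typedef struct dt_dev_segmentation_proxy_t
{
  float *masks;        // N masks stored back to back
  int num_masks;       // number of masks returned by the model
  int mask_width;      // per-mask width (256, or full resolution for now)
  int mask_height;     // per-mask height
  gboolean valid;      // TRUE once inference has run for the current image
} dt_dev_segmentation_proxy_t;
```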

Now, let's go to line 1222 of develop/pixelpipe_hb.c. That is where the model inference is prepared.

It starts by checking in the pipe whether there is an image of at least a certain size and whether the masks have not been generated yet. I had some segfaults during testing and suspected concurrency problems, so I create a "local" copy of the pipe's back buffer (assumed RGBA) and copy only the RGB values. Then the image is swapped from XYC (coordinate x, coordinate y, color) to CXY, since the input of the NN requires that format. Then the ONNX runtime is prepared; I'm not 100% sure what all the options do, as I just took them from the ONNX Runtime examples.
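The layout shuffling amounts to something like this (illustrative sketch, simplified from the PR): copy the RGB channels out of the RGBA back buffer into a planar, channel-major float array.

```c
#include <stdlib.h>

// Convert an interleaved RGBA float buffer into a planar CHW RGB buffer.
static float *rgba_to_chw(const float *rgba, const int width, const int height)
{
  const size_t npix = (size_t)width * height;
  float *chw = malloc(sizeof(float) * 3 * npix);
  if(!chw) return NULL;

  for(size_t i = 0; i < npix; i++)
  {
    chw[0 * npix + i] = rgba[4 * i + 0]; // R plane
    chw[1 * npix + i] = rgba[4 * i + 1]; // G plane
    chw[2 * npix + i] = rgba[4 * i + 2]; // B plane
    // alpha is dropped
  }
  return chw;
}
```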

Then we enter the run_inference function, which fills the masks at the desired size using interpolated scaling (right now, as I said, the resize is done here, but we could just store the masks at 256x256 and perform the interpolation from the blending operator).

run_inference resizes the image to 1024x1024, which is the input size for the model. It is also important to note that the RGB values need to be in the range 0.0 to 1.0 for the inference. Then the model is executed with the selected input and output layers. From the 'reverse engineering' I did of the FastSAM example, the important output layers are the first and the last (from the official paper it seems the others can also provide some information, but that's not important for now).
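Condensed, the ONNX Runtime C API calls involved look like this. This is a sketch with error handling omitted, and the tensor names ("images", "output0", "output1") are placeholders for whatever the actual FastSAM export uses; check them with a tool like netron if your export differs.

```c
#include <onnxruntime_c_api.h>

// outputs[0] and outputs[1] must be NULL on entry; ORT allocates them.
// Every OrtApi call returns an OrtStatus* that should be checked (skipped here).
static void run_fastsam(const char *model_path,
                        float *chw,          // 3*1024*1024 floats, values in [0,1]
                        OrtValue *outputs[2])
{
  const OrtApi *ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  OrtEnv *env = NULL;
  ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "dt-fastsam", &env);

  OrtSessionOptions *opts = NULL;
  ort->CreateSessionOptions(&opts);

  OrtSession *session = NULL;
  ort->CreateSession(env, model_path, opts, &session);

  // wrap the CHW buffer as a 1x3x1024x1024 tensor without copying it
  OrtMemoryInfo *meminfo = NULL;
  ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &meminfo);
  const int64_t dims[4] = { 1, 3, 1024, 1024 };
  OrtValue *input = NULL;
  ort->CreateTensorWithDataAsOrtValue(meminfo, chw, sizeof(float) * 3 * 1024 * 1024,
                                      dims, 4, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, &input);

  const char *input_names[]  = { "images" };
  const char *output_names[] = { "output0", "output1" }; // boxes+coefficients, protos
  ort->Run(session, NULL, input_names, (const OrtValue *const *)&input, 1,
           output_names, 2, outputs);

  // the output tensors can now be read with GetTensorMutableData()
  ort->ReleaseValue(input);
  ort->ReleaseMemoryInfo(meminfo);
  ort->ReleaseSession(session);
  ort->ReleaseSessionOptions(opts);
  ort->ReleaseEnv(env);
}
```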

These layers are passed to prep_out_data, which takes the layer outputs and computes the masks themselves. The first layer has shape 1 x 37 x 21500+ (give or take on the last value, but it is a huge number) and contains the 21500+ bounding-box mask candidates. The leading 1 is there because we run inference on only one image, but technically you should be able to run inference on multiple images at the same time.

The dimension of size 37 has the following indices:
x y w h c m1 m2 m3 ... m32

X, Y, W, and H are the position and size of the bounding box (centered at X, Y). C is the confidence of the bounding box. M1 to M32 are multipliers that will make sense later.
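Reading one candidate out of that tensor looks roughly like this (illustrative sketch, assuming the usual row-major layout of a 1 x 37 x N tensor, so element k of candidate i sits at k * N + i):

```c
#include <stddef.h>

typedef struct
{
  float x, y, w, h;  // bounding box, center + size, in 1024x1024 input coordinates
  float conf;        // confidence
  float coeff[32];   // mask multipliers m1..m32
} candidate_t;

// out0 points at the data of the 1 x 37 x num_candidates output tensor
static candidate_t read_candidate(const float *out0, const size_t num_candidates, const size_t i)
{
  candidate_t c;
  c.x    = out0[0 * num_candidates + i];
  c.y    = out0[1 * num_candidates + i];
  c.w    = out0[2 * num_candidates + i];
  c.h    = out0[3 * num_candidates + i];
  c.conf = out0[4 * num_candidates + i];
  for(int k = 0; k < 32; k++)
    c.coeff[k] = out0[(size_t)(5 + k) * num_candidates + i];
  return c;
}
```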

From this resulting list, we keep the bounding boxes whose confidence is greater than the given threshold; 0.3 is a typical value, but it can probably be raised for "more stable" masks. Then another filter called "non-max suppression" (NMS) is applied to the bounding boxes: it takes two bounding boxes that occupy a similar area (using the intersection-over-union metric) and discards the one with the lower confidence.
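Those two filters boil down to something like this greedy sketch (reusing the candidate_t struct from the sketch above; not necessarily exactly what the PR does):

```c
#include <math.h>
#include <glib.h>

// intersection-over-union of two (center, size) boxes
static float box_iou(const candidate_t *a, const candidate_t *b)
{
  const float ax0 = a->x - a->w / 2, ax1 = a->x + a->w / 2;
  const float ay0 = a->y - a->h / 2, ay1 = a->y + a->h / 2;
  const float bx0 = b->x - b->w / 2, bx1 = b->x + b->w / 2;
  const float by0 = b->y - b->h / 2, by1 = b->y + b->h / 2;

  const float iw = fminf(ax1, bx1) - fmaxf(ax0, bx0);
  const float ih = fminf(ay1, by1) - fmaxf(ay0, by0);
  if(iw <= 0.0f || ih <= 0.0f) return 0.0f;

  const float inter = iw * ih;
  const float uni = a->w * a->h + b->w * b->h - inter;
  return inter / uni;
}

// keep[] is preinitialized to TRUE for every candidate above the confidence
// threshold; candidates are assumed sorted by descending confidence
static void nms(const candidate_t *c, const int n, const float iou_thresh, gboolean *keep)
{
  for(int i = 0; i < n; i++)
  {
    if(!keep[i]) continue;
    for(int j = i + 1; j < n; j++)
      if(keep[j] && box_iou(&c[i], &c[j]) > iou_thresh)
        keep[j] = FALSE; // overlapping box with lower confidence is dropped
  }
}
```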

Finally, from each "surviving" bounding box we get the mask it represents. These masks are encoded in the M1 to M32 multipliers. The last output layer has shape 32 x 256 x 256; each of the 256x256 "images" is called a proto, and when each proto is multiplied by its Mx value and the results are summed for a given bounding box (that's what process_mask_native does), the resulting image is the mask you want (you also need to filter out everything outside the bounding box itself).
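Per surviving box, that computation looks roughly like the following (simplified sketch: the box is assumed to be already scaled into 256x256 proto coordinates, and I squash the sum through a sigmoid, which is what the reference post-processing does before thresholding):

```c
#include <math.h>

#define PROTO_SIZE 256
#define NUM_PROTOS 32

// protos: NUM_PROTOS x PROTO_SIZE x PROTO_SIZE floats, laid out plane by plane
// coeff:  the 32 multipliers m1..m32 of one surviving bounding box
// box_*:  bounding box corners in 256x256 proto coordinates
// mask:   output, PROTO_SIZE x PROTO_SIZE floats in [0,1]
static void build_mask(const float *protos, const float coeff[NUM_PROTOS],
                       const int box_x0, const int box_y0,
                       const int box_x1, const int box_y1,
                       float *mask)
{
  for(int y = 0; y < PROTO_SIZE; y++)
    for(int x = 0; x < PROTO_SIZE; x++)
    {
      const int idx = y * PROTO_SIZE + x;

      // everything outside the bounding box is discarded
      if(x < box_x0 || x > box_x1 || y < box_y0 || y > box_y1)
      {
        mask[idx] = 0.0f;
        continue;
      }

      // weighted sum of the protos at this pixel
      float v = 0.0f;
      for(int p = 0; p < NUM_PROTOS; p++)
        v += coeff[p] * protos[p * PROTO_SIZE * PROTO_SIZE + idx];

      mask[idx] = 1.0f / (1.0f + expf(-v)); // sigmoid
    }
}
```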

And there we have it! From the 21500+ bounding boxes, the system filters by confidence and NMS and finally computes the mask for the surviving $N$ bounding boxes using the 32 protos given by the final layer. The other layers in principle also produce bounding boxes, but I haven't explored those yet. The infographic from the original FastSAM repository gives a visual representation of what I've tried to describe in text here.

These masks are stored in the struct and used by the mask operator.

@MikoMikarro (Author) commented

I created a "cleaner" branch and pull request with only the changes relevant to the feature itself: #18356

@MikoMikarro closed this Feb 4, 2025