
Conversation

@MikoMikarro commented Feb 1, 2025

Created https://github.com/darktable-org/darktable/pull/18331 so people can try the tool. It is not finished at all and is still very, very much a WIP.

As I mentioned before, object detection is not correctly integrated with the processing pipeline. To create the ONNX model, you can use the Colab notebook at https://github.com/ChuRuaNh0/FastSam_Awsome_TensorRT. The ONNX Runtime library can be downloaded from its GitHub releases page: https://github.com/microsoft/onnxruntime/releases/tag/v1.20.1

@MikoMikarro mentioned this pull request Feb 1, 2025
@MikoMikarro marked this pull request as draft February 1, 2025 12:18
@jenshannoschwalm (Collaborator) commented

I had a look at your video some days ago, and just now took a quick glance at the code provided here; it is not easy to read.
Please understand that I have never touched any AI code before, I don't know anything about the models, and I have not even been inclined to use any AI at all, but I do have a solid idea of what we can do and how things can be done in darktable code. I'm not sure yet whether we want such AI masks at all, but I think it's worthwhile to build a fully working "prototype" that can be merged into master for broader testing; we might later decide to drop it if the quality/results are not good enough. OK?

As the integration into the dt pipe code might be tricky, I would be willing to provide that code so you could concentrate on the AI model and the UI, once things become clear to me.

From what I understand so far:

  1. You require full-image RGB data, pass that data to an AI algorithm that does some segmentation analysis, and get back an array of the same size as the image, where every pixel location holds some sort of segment identifier. As this analysis is costly, we would want that segment array to be cached.
  2. You take a location/area from the UI and use the segment array to calculate a darktable mask, which is used for blending "as usual".

Could you comment on whether my above understanding is correct, or provide some explanation?

@MikoMikarro (Author) commented

Thanks for giving it a look, @jenshannoschwalm!

> I had a look at your video some days ago, and just now took a quick glance at the code provided here; it is not easy to read.

Definitely, AI code in C is kind of "cursed", since it seems everyone prefers to run it in Python environments. I understand there is no plan to add Python as a dependency to dt, which is why I decided to keep the whole approach entirely C-based.

> I'm not sure yet whether we want such AI masks at all, but I think it's worthwhile to build a fully working "prototype" that can be merged into master for broader testing; we might later decide to drop it if the quality/results are not good enough. OK?

That totally aligns with the idea I had for the project: a proof of concept, call it whatever name you want :).

> As the integration into the dt pipe code might be tricky, I would be willing to provide that code so you could concentrate on the AI model and the UI, once things become clear to me.

Perfect! The cache system, the worker system, and the multiple dev + pipes got me slightly confused about how to implement the feature, so help in that regard is all I can ask for. Maybe in future iterations I'll also have to read through the build system so I can make the build less dependent on my environment.

> From what I understand so far:

> 1. You require full-image RGB data, pass that data to an AI algorithm that does some segmentation analysis, and get back an array of the same size as the image, where every pixel location holds some sort of segment identifier. As this analysis is costly, we would want that segment array to be cached.

Indeed. The idea is that the first time you create an AI mask, a task generates a full-resolution RGB copy of the original image. This image is then resized to 1024x1024 and split into its RGB channels to be fed as input to the FastSAM model.

The model returns $N$ masks, each with a real resolution of 256x256. My implementation currently does a bilinear interpolation of each 256x256 mask up to the original size, but I think that part should not be necessary, since I could always map back from the original resolution to the 256x256 array and thereby save a lot of memory.
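A rough sketch of what I mean by mapping back (illustrative only, not the code in the PR): instead of storing an upscaled copy, sample the 256x256 mask bilinearly at the full-resolution coordinate whenever the blending code needs a value.

```c
// Bilinear sample of a low-resolution mask at a full-resolution pixel.
// Function and parameter names are illustrative, not taken from the PR.
static inline float mask_sample_bilinear(const float *mask,   // 256*256 floats
                                         const int mask_size, // 256
                                         const float x,       // pixel x in the full-res image
                                         const float y,       // pixel y in the full-res image
                                         const int img_w,
                                         const int img_h)
{
  // map full-resolution coordinates into mask coordinates
  const float mx = x / (float)img_w * (float)(mask_size - 1);
  const float my = y / (float)img_h * (float)(mask_size - 1);

  const int x0 = (int)mx, y0 = (int)my;
  const int x1 = (x0 + 1 < mask_size) ? x0 + 1 : x0;
  const int y1 = (y0 + 1 < mask_size) ? y0 + 1 : y0;
  const float fx = mx - (float)x0, fy = my - (float)y0;

  // blend the four surrounding mask samples
  const float top    = mask[y0 * mask_size + x0] * (1.0f - fx) + mask[y0 * mask_size + x1] * fx;
  const float bottom = mask[y1 * mask_size + x0] * (1.0f - fx) + mask[y1 * mask_size + x1] * fx;
  return top * (1.0f - fy) + bottom * fy;
}
```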

It would be great if these masks were cached once they've been created, and even stored on disk (as the thumbnails are) so they don't need to be generated again. Avoiding regeneration also matters because the AI model is not deterministic: the masks could change on each run, which would break the "designer" workflow for the user. Further iterations could include a GPU mode if the user has cuDNN installed, but it should not be mandatory.

> 2. You take a location/area from the UI and use the segment array to calculate a darktable mask, which is used for blending "as usual".

Yes, I would need to access the masks in some way from the "object" mask so the blending can be performed.

> Could you comment on whether my above understanding is correct, or provide some explanation?

Do you want me to also give a further explanation on how the masks are selected and computed?

@jenshannoschwalm (Collaborator) commented

Yes please.

@MikoMikarro (Author) commented

Okay! Let's walk through how the code works then. Currently, I detect that the inference must be done when the _pixelpipe_process_on_CPU function is executed. After re-reading the pull request, it seems some unrelated changes got mixed in by a merge I did with the upstream master, but I'll ignore those for now. (Maybe recreating the branch with only the changes for this feature would be advisable 😅)

I added some extra fields to the dt_dev_pixelpipe_t struct inside develop/pixelpipe_hb.h to hold the mask data. I called it 'proxy' because my first idea was to store just the RGB proxy I mentioned before; in the end I ended up storing the masks there.
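The added fields look roughly like this (the names here are illustrative, not the exact ones in the diff; in the PR they live directly inside dt_dev_pixelpipe_t, grouping them in a small struct is just for readability):

```c
#include <glib.h>

// Sketch of the segmentation "proxy" state added to the pipe struct.
typedef struct dt_dev_segmentation_proxy_t
{
  float *masks;        // N masks stored back to back
  int num_masks;       // number of masks returned by the model
  int mask_width;      // per-mask width (256, or full resolution for now)
  int mask_height;     // per-mask height
  gboolean valid;      // TRUE once inference has run for the current image
} dt_dev_segmentation_proxy_t;
```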

Now, let's go to line 1222 of develop/pixelpipe_hb.c. That is where the model inference is prepared.

It starts by checking in the pipe whether there is an image of at least a certain size and whether the masks have not been generated yet. I had some segfaults during testing and suspected concurrency problems, so I create a "local" copy of the pipe's back buffer (assumed RGBA) and copy only the RGB values. Then the image is swapped from XYC (coordinate x, coordinate y, color) to CXY, since the input of the NN requires that format. Then the ONNX runtime is prepared; I'm not 100% sure what all the options do, as I just took them from the ONNX Runtime examples.
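The layout shuffling amounts to something like this (illustrative sketch, simplified from the PR): copy the RGB channels out of the RGBA back buffer into a planar, channel-major float array.

```c
#include <stdlib.h>

// Convert an interleaved RGBA float buffer into a planar CHW RGB buffer.
static float *rgba_to_chw(const float *rgba, const int width, const int height)
{
  const size_t npix = (size_t)width * height;
  float *chw = malloc(sizeof(float) * 3 * npix);
  if(!chw) return NULL;

  for(size_t i = 0; i < npix; i++)
  {
    chw[0 * npix + i] = rgba[4 * i + 0]; // R plane
    chw[1 * npix + i] = rgba[4 * i + 1]; // G plane
    chw[2 * npix + i] = rgba[4 * i + 2]; // B plane
    // alpha is dropped
  }
  return chw;
}
```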

Then we enter the run_inference function, which fills the masks at the desired size using interpolated scaling (right now, as I said, the resize is done here, but we could just store the masks at 256x256 and perform the interpolation from the blending operator).

run_inference resizes the image to 1024x1024, which is the input size for the model. It is also important to note that the RGB values need to be in the range 0.0 to 1.0 for the inference. Then the model is executed with the selected input and output layers. From the 'reverse engineering' I did of the FastSAM example, the important output layers are the first and the last (from the official paper it seems the others can also provide some information, but that's not important for now).
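Condensed, the ONNX Runtime C API calls involved look like this. This is a sketch with error handling omitted, and the tensor names ("images", "output0", "output1") are placeholders for whatever the actual FastSAM export uses; check them with a tool like netron if your export differs.

```c
#include <onnxruntime_c_api.h>

// outputs[0] and outputs[1] must be NULL on entry; ORT allocates them.
// Every OrtApi call returns an OrtStatus* that should be checked (skipped here).
static void run_fastsam(const char *model_path,
                        float *chw,          // 3*1024*1024 floats, values in [0,1]
                        OrtValue *outputs[2])
{
  const OrtApi *ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

  OrtEnv *env = NULL;
  ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "dt-fastsam", &env);

  OrtSessionOptions *opts = NULL;
  ort->CreateSessionOptions(&opts);

  OrtSession *session = NULL;
  ort->CreateSession(env, model_path, opts, &session);

  // wrap the CHW buffer as a 1x3x1024x1024 tensor without copying it
  OrtMemoryInfo *meminfo = NULL;
  ort->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &meminfo);
  const int64_t dims[4] = { 1, 3, 1024, 1024 };
  OrtValue *input = NULL;
  ort->CreateTensorWithDataAsOrtValue(meminfo, chw, sizeof(float) * 3 * 1024 * 1024,
                                      dims, 4, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, &input);

  const char *input_names[]  = { "images" };
  const char *output_names[] = { "output0", "output1" }; // boxes+coefficients, protos
  ort->Run(session, NULL, input_names, (const OrtValue *const *)&input, 1,
           output_names, 2, outputs);

  // the output tensors can now be read with GetTensorMutableData()
  ort->ReleaseValue(input);
  ort->ReleaseMemoryInfo(meminfo);
  ort->ReleaseSession(session);
  ort->ReleaseSessionOptions(opts);
  ort->ReleaseEnv(env);
}
```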

These layers are passed to prep_out_data, which takes the layer outputs and computes the masks themselves. The first layer has shape 1 x 37 x 21500+ (give or take on the last value, but it is a huge number) and contains the 21500+ bounding-box mask candidates. The leading 1 is there because we run inference on only one image, but technically you should be able to run inference on multiple images at the same time.

The dimension of size 37 has the following indices:
x y w h c m1 m2 m3 ... m32

X, Y, W, and H are the position and size of the bounding box (centered at X, Y). C is the confidence of the bounding box. M1 to M32 are multipliers that will make sense later.
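Reading one candidate out of that tensor looks roughly like this (illustrative sketch, assuming the usual row-major layout of a 1 x 37 x N tensor, so element k of candidate i sits at k * N + i):

```c
#include <stddef.h>

typedef struct
{
  float x, y, w, h;  // bounding box, center + size, in 1024x1024 input coordinates
  float conf;        // confidence
  float coeff[32];   // mask multipliers m1..m32
} candidate_t;

// out0 points at the data of the 1 x 37 x num_candidates output tensor
static candidate_t read_candidate(const float *out0, const size_t num_candidates, const size_t i)
{
  candidate_t c;
  c.x    = out0[0 * num_candidates + i];
  c.y    = out0[1 * num_candidates + i];
  c.w    = out0[2 * num_candidates + i];
  c.h    = out0[3 * num_candidates + i];
  c.conf = out0[4 * num_candidates + i];
  for(int k = 0; k < 32; k++)
    c.coeff[k] = out0[(size_t)(5 + k) * num_candidates + i];
  return c;
}
```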

From this resulting list, we keep the bounding boxes whose confidence is greater than the given threshold; 0.3 is a typical value, but it can probably be raised for "more stable" masks. Then another filter called "non-max suppression" (NMS) is applied to the bounding boxes: it takes two bounding boxes that occupy a similar area (using the intersection-over-union metric) and discards the one with the lower confidence.
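Those two filters boil down to something like this greedy sketch (reusing the candidate_t struct from the sketch above; not necessarily exactly what the PR does):

```c
#include <math.h>
#include <glib.h>

// intersection-over-union of two (center, size) boxes
static float box_iou(const candidate_t *a, const candidate_t *b)
{
  const float ax0 = a->x - a->w / 2, ax1 = a->x + a->w / 2;
  const float ay0 = a->y - a->h / 2, ay1 = a->y + a->h / 2;
  const float bx0 = b->x - b->w / 2, bx1 = b->x + b->w / 2;
  const float by0 = b->y - b->h / 2, by1 = b->y + b->h / 2;

  const float iw = fminf(ax1, bx1) - fmaxf(ax0, bx0);
  const float ih = fminf(ay1, by1) - fmaxf(ay0, by0);
  if(iw <= 0.0f || ih <= 0.0f) return 0.0f;

  const float inter = iw * ih;
  const float uni = a->w * a->h + b->w * b->h - inter;
  return inter / uni;
}

// keep[] is preinitialized to TRUE for every candidate above the confidence
// threshold; candidates are assumed sorted by descending confidence
static void nms(const candidate_t *c, const int n, const float iou_thresh, gboolean *keep)
{
  for(int i = 0; i < n; i++)
  {
    if(!keep[i]) continue;
    for(int j = i + 1; j < n; j++)
      if(keep[j] && box_iou(&c[i], &c[j]) > iou_thresh)
        keep[j] = FALSE; // overlapping box with lower confidence is dropped
  }
}
```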

Finally, from each "surviving" bounding box we get the mask it represents. These masks are encoded in the M1 to M32 multipliers. The last output layer has shape 32 x 256 x 256; each of the 256x256 "images" is called a proto, and when each proto is multiplied by its Mx value and the results are summed for a given bounding box (that's what process_mask_native does), the resulting image is the mask you want (you also need to filter out everything outside the bounding box itself).
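Per surviving box, that computation looks roughly like the following (simplified sketch: the box is assumed to be already scaled into 256x256 proto coordinates, and I squash the sum through a sigmoid, which is what the reference post-processing does before thresholding):

```c
#include <math.h>

#define PROTO_SIZE 256
#define NUM_PROTOS 32

// protos: NUM_PROTOS x PROTO_SIZE x PROTO_SIZE floats, laid out plane by plane
// coeff:  the 32 multipliers m1..m32 of one surviving bounding box
// box_*:  bounding box corners in 256x256 proto coordinates
// mask:   output, PROTO_SIZE x PROTO_SIZE floats in [0,1]
static void build_mask(const float *protos, const float coeff[NUM_PROTOS],
                       const int box_x0, const int box_y0,
                       const int box_x1, const int box_y1,
                       float *mask)
{
  for(int y = 0; y < PROTO_SIZE; y++)
    for(int x = 0; x < PROTO_SIZE; x++)
    {
      const int idx = y * PROTO_SIZE + x;

      // everything outside the bounding box is discarded
      if(x < box_x0 || x > box_x1 || y < box_y0 || y > box_y1)
      {
        mask[idx] = 0.0f;
        continue;
      }

      // weighted sum of the protos at this pixel
      float v = 0.0f;
      for(int p = 0; p < NUM_PROTOS; p++)
        v += coeff[p] * protos[p * PROTO_SIZE * PROTO_SIZE + idx];

      mask[idx] = 1.0f / (1.0f + expf(-v)); // sigmoid
    }
}
```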

And there we have it! From the 21500+ bounding boxes, the system filters by confidence and NMS and finally computes the mask for the surviving $N$ bounding boxes using the 32 protos given by the final layer. The other layers in principle also produce bounding boxes, but I haven't explored those yet. The infographic from the original FastSAM repository gives a visual representation of what I've tried to describe in text here.

These masks are stored in the struct and used by the mask operator.

@MikoMikarro (Author) commented

I created a "cleaner" branch and pull request with only the changes relevant to the feature itself: #18356

@MikoMikarro closed this Feb 4, 2025