Operators
These are modules that operate on media items and help analyse text, images, video, audio, etc. They act as plugin code that is loaded only if specified in the config.yml file. Operators define the ways in which you can manipulate the data that your search engine operates on.
- This wiki page briefly describes each operator.
- Each operator has a unit test file, a requirements.in file, and a requirements.txt file. The requirements files list all the packages required to run the operator.
Audio Vec Embedding
Given an audio file, this operator computes a 2048-dimensional embedding vector using PANNs (Pre-trained Audio Neural Networks). PANNs are CNNs pre-trained on a large corpus of audio files. They have been used for audio tagging and sound event detection, have been fine-tuned for several audio pattern recognition tasks, and have outperformed several state-of-the-art systems.
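As an illustration, such an embedding can be computed with the panns_inference package (an assumption for this sketch; the operator's own implementation may differ, and the file name is a placeholder):

import librosa
from panns_inference import AudioTagging

# Load and resample the audio to 32 kHz, the rate the pretrained PANN expects
audio, _ = librosa.load('sample.wav', sr=32000, mono=True)
audio = audio[None, :]  # add a batch dimension: (1, num_samples)

# checkpoint_path=None downloads the default pretrained checkpoint
at = AudioTagging(checkpoint_path=None, device='cpu')
clipwise_output, embedding = at.inference(audio)
print(embedding.shape)  # (1, 2048)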
Audio embeddings are often generated from spectrograms or other audio signal features. Spectrograms are visual representations of the frequency content of audio signals over time, and extracting features from them is a crucial step in audio signal processing for machine learning. Three specific types of features are relevant here:
- Mel-frequency cepstral coefficients (MFCCs): MFCCs describe the short-term power spectrum of a sound on the perceptually motivated mel scale and are widely used in speech and audio analysis.
- Chroma features: Chroma features represent the 12 distinct pitch classes of the musical octave and are particularly useful in music-related tasks.
- Spectral contrast: Spectral contrast focuses on the perceptual brightness of different frequency bands within an audio signal.
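For illustration, these three feature types can be computed with librosa (which may differ from the tooling the operator itself uses; the file name is a placeholder):

import librosa

# Load audio; sr=None keeps the file's native sampling rate
y, sr = librosa.load('sample.wav', sr=None)

# Mel-frequency cepstral coefficients (MFCCs)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Chroma features: energy in each of the 12 pitch classes
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Spectral contrast: peak-to-valley energy difference per frequency band
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)

print(mfccs.shape, chroma.shape, contrast.shape)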
The operator and the test file can be found in the src/core/operators folder of the codebase. The operator is named audio_vec_embedding.py and the test file is named test_audio_vec_embedding.py.
To run the test, simply run the test file:
python -m unittest test_audio_vec_embedding.py
Detect Objects
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection system. It is trained on the COCO dataset.
We use the segmentation variant of YOLO, YOLOv8-seg (https://docs.ultralytics.com/tasks/segment/#models). A code example of YOLO object detection:
from ultralytics import YOLO

# Load the pretrained YOLOv8 nano segmentation model (downloaded on first use)
model = YOLO('yolov8n-seg.pt')
# Run prediction and save the annotated image under sample_data/output
result = model.predict('path/to/your/image', save=True, imgsz=1024, conf=0.5, project='sample_data', name='output')
The output image will be saved in the sample_data/output folder and will be titled output.png. This image will have bounding boxes around the detected objects and will also show the segmented areas.
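If you need the detections programmatically rather than only as a saved image, the returned Results object exposes them. A minimal sketch based on the ultralytics Results API:

# 'result' is the list returned by model.predict() above
for box in result[0].boxes:
    cls_id = int(box.cls)            # predicted class index
    label = result[0].names[cls_id]  # human-readable class name
    print(label, float(box.conf), box.xyxy.tolist())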
The operator and the test file can be found in the src/core/operators folder of the codebase. The operator is named detect_objects.py and the test file is named test_detect_objects.py.
To run the test, simply run the test file:
python -m unittest test_detect_objects.py
This will initiate the test; the YOLO model's .pt file will be downloaded first. After the test runs, you should get an OK message in the terminal indicating that the test ran successfully.
Detect Text in Image (Tesseract)
Tesseract needs a separate language data file installed for each language it should recognise. Currently the operator supports only the English and Hindi languages.
For Linux, these links explain how to install the language data and list the ISO 639-2 language codes:
- https://tesseract-ocr.github.io/tessdoc/Installation.html
- https://www.loc.gov/standards/iso639-2/php/code_list.php
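For example, on Debian/Ubuntu systems the English and Hindi language data can typically be installed through the package manager (package names are an assumption and may vary by distribution):

sudo apt-get install tesseract-ocr tesseract-ocr-eng tesseract-ocr-hin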
To extract text from an image, we pass it through a pytesseract function like this:
data = pytesseract.image_to_string(image, lang='eng+hin', config='--psm 6 --oem 1')
Here the config settings give Tesseract more information about the image: --psm 6 tells it to assume a single uniform block of text, and --oem 1 selects the LSTM-based OCR engine.
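A self-contained sketch of the same call (the image path is a placeholder):

import pytesseract
from PIL import Image

image = Image.open('sample_data/text_image.png')  # hypothetical sample image
# lang='eng+hin' runs the English and Hindi models together
data = pytesseract.image_to_string(image, lang='eng+hin', config='--psm 6 --oem 1')
print(data)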
You can take a look at the operator and its test for the entire code.
The operator and the test file can be found in the src/core/operators folder of the codebase. The operator is named detect_text_in_image_tesseract.py and the test file is named test_detect_text_in_image_tesseract.py.
To run the test, simply run the test file:
python -m unittest test_detect_text_in_image_tesseract.py
The test checks whether text was extracted correctly; it fetches a sample image from the sample_data folder. You should get an OK message in the terminal indicating that the test ran successfully.
Audio Vec Embedding (CLAP)
The LAION CLAP (Contrastive Language-Audio Pretraining) model is a language-audio model trained using contrastive learning. This approach allows the model to learn a joint representation of the audio and text modalities, enabling seamless interaction between the two.
- Audio Encoder:
  - The audio encoding process is handled by a Hierarchical Token-Semantic Audio Transformer (HTSAT) model, which is composed of four Swin-Transformer blocks.
  - The output of the audio encoder is a 768-dimensional vector, capturing essential audio features.
- Text Encoder:
  - For text encoding, the model employs the RoBERTa model, which is widely recognized for its robust natural language processing capabilities.
- Projection Layers:
  - The penultimate layer of the architecture includes two Multi-Layer Perceptron (MLP) layers with ReLU activation. These layers project the audio and text embeddings to a common 512-dimensional space, which serves as the final representation during training.
- Input Specifications:
  - Each audio input is 10 seconds long, processed with a hop size of 480 and a window size of 1024.
  - The Short-Time Fourier Transform (STFT) and mel-spectrograms are computed using 64 mel-bins.
  - This preprocessing results in an audio input shape of (T = 1024, F = 64) before it is passed to the audio encoder.
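For illustration, the 512-dimensional embeddings can be obtained with the laion_clap package (an assumption for this sketch; the file name and text prompt are placeholders):

import laion_clap

# Load the pretrained CLAP checkpoint (downloaded on first use)
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Embed audio files; returns an array of 512-dimensional vectors
audio_embed = model.get_audio_embedding_from_filelist(x=['sample.wav'], use_tensor=False)
print(audio_embed.shape)  # (1, 512)

# Embed text into the same 512-dimensional space
text_embed = model.get_text_embedding(['a dog barking'], use_tensor=False)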
The LAION CLAP operator and test files are located in the src/core/operators folder within the codebase. The operator is named audio_vec_embedding_clap.py, and the corresponding test file is test_audio_vec_embedding_clap.py.
To run the test, use the following command:
python -m unittest test_audio_vec_embedding_clap.py