All, or almost all, of the projects carried out during the university course of Data Science at Milano-Bicocca University.
- Audio classification and image classification and matching
- Collecting user annotations for training neural network-based object detectors
- Content relations between google news and twitter posts
- Do Cov-19 mobility restrictions affect your music and movie choices?
- Europe air quality exploration
- Exploration of drugs' reviews
- Football teams performance indicators clustering
- Fruit classification and segmentation with transfer-learning neural network approach
- Fundus image classification and processing for diabetic retinopathy
- Hourly energy consumption time series prediction
- Survey on job and contractual opportunities for graduates using ISTAT data
- Multi-domain claim detection: a coreset and an external feature based approach for automated fact-checking
- Turntable 3D modelling
- Understanding semantic perception of cities in society
- Unimib energy consumption descriptive and predictive analysis
- Course: Digital signal and image preprocessing
- Tools and techniques: Python, Tensorflow, convolutional neural network, image features descriptors
- Worked on: Whole project, especially in image matching and retrieval task
- Description: This project explores different image matching and retrieval techniques to identify correspondences between historical photographs of Amsterdam dating back to 1900 and modern images of the same places. Images are analysed using a range of representation techniques, including SIFT descriptors, VLADs and convolutional neural networks (CNNs), with a focus on the use of siamese architectures for visual similarity-based training. The comparison of the methods highlights the advantages and limitations of each approach, providing useful insights for historical reconstruction and visual heritage conservation applications.
- Open PDF
- Course: Bachelor's degree in Computer Science (Milano-Bicocca University)
- Tools and techniques: Python, Matlab, Flask, MongoDB, Android Studio
- Description: The following report describes the creation of a client-server system for requesting the detection of objects on images provided by the user. The primary goal was to create a system capable of improving or creating new detectors after the use of users, whose results must then be collected and stored. The technology of the YOLO neural network, used for object recognition, and its implementation will be analyzed. Project structure and development components will be described in the course of the report in parallel with the different problems each of them faces. Finally, the realization of an interface intended for the user, of which are described the different features that together lead to the achievement of the initial intent. Lastly, the results achieved are presented, in terms of time and computational resources, and also of accuracy degree and potential offered.
- Open PDF
- Course: Master's degree stage
- Tools and techniques: Python, Tensorflow, web scraping, text representation techniques (TF-IDF, LLMs, Word2Vec), PlotLy
- Description: This study presents a descriptive analysis aimed at identifying semantic relationships between contents published on Google News and trending topics on Twitter. Data were collected through scraping from the two platforms and three different text representation approaches were used: TF-IDF, Word2Vec and BERT. Relationships are extracted by calculating cosine similarity, both between individual texts and by exploiting the aggregation provided by the platforms. The visualization of the links was achieved through a Sankey diagram, which is effective for highlighting the connections between contents. Finally, the work discusses the results obtained for the different experiments, describing the limitations of the approaches used and suggesting possible improvements. This study represents an attempt to describe the relationships between traditional media and social media and offers a possible starting point for subsequent analyses of information flows between different platforms.
- Open PDF
- Course: Data management and visualization
- Tools and techniques: Python, web scraping, Spotify API, IMDB API, Apache Kafka, MongoDB, Tableau
- Worked on: Retrieval, processing, db management and integration + final visualization for Netflix data
- Description: This study analyzes the impact of mobility restrictions, introduced during the COVID-19 lockdown, on emotional and sentimental preferences of content searched on platforms such as Spotify and Netflix. The data, collected from different sources and integrated into a MongoDB database in deferred streaming, are processed to extract statistical indicators that highlight any correlations between the intensity of restrictions and the choice of content. The resulting information is displayed in interactive dashboards, which allow to monitor and analyze the phenomena by country, offering a detailed view of the aggregated emotional responses.
- Open PDF
- Course: Big data in geographic information systems
- Tools and techniques: Python, Geo-Pandas, Self-Organizing-Map
- Description: This project examines the levels of air pollutants (PM-10, NO, CO, SO₂) in Europe, using data collected from monitoring stations from 1998 to date. The analysis includes an assessment of overall trends and seasonal and weekly variations for each pollutant, providing a detailed view of temporal concentration patterns. To identify geographical areas with similar trends, a clustering technique based on Self Organizing Maps (SOM) is applied, which groups locations with common concentration trajectories of the substances under consideration. This approach allows to identify similarities based on environmental exposure dynamics and shared seasonal behaviors, highlighting regions with comparable risk profiles and pollution characteristics.
- Open PDF
- Course: Text mining and search
- Tools and techniques: Python, Word2Vec, text classification and text clustering, strong text preprocesing techniques
- Worked on: Text processing, cluster analysis
- Description: In this study, clustering techniques are applied to a dataset of textual reviews to group drugs based on the meaning of terms present in user comments. The clustering process uses semantic analysis to identify similarities in feedback, allowing to aggregate drugs that share similarly perceived characteristics. Once the clusters are formed, the peculiarities of each group are examined, with a particular focus on the most representative terms and adjectives, which emerge as key indicators of users' experiences and opinions. This approach allows a structured view of common perceptions about drugs, offering relevant insights for qualitative analysis.
- Open PDF
- Course: Machine learning
- Tools and techniques: Python, KNIME, KPIs construction, cluster analysis
- Worked on: KPIs construction, cluster analysis
- Description: Using a dataset that details the performance of each player in each match, this project develops a set of indicators based on the success of specific actions in different areas of the pitch. These indicators are used in a clustering analysis to identify distinct levels of skill and tactical contribution for each role, allowing players to be segmented based on their relative effectiveness. The clustering results are visualized through interactive dashboards that facilitate the analysis of the potential strengths and weaknesses of opposing teams, supporting strategic decisions based on the strengths and vulnerabilities of the various positions on the pitch.
- Open PDF
- Course: Advanced machine learning
- Tools and techniques: Python, Tensorflow, image preprocessing, convolutional neural network
- Worked on: Whole project
- Description: The project focuses on the automatic classification of a large set of images of different types of fruit, using advanced transfer learning techniques applied to convolutional neural networks (CNN). To improve the accuracy and generalization of the model, data augmentation techniques are adopted, which increase the variety and robustness of the training data. To complete the classification, image segmentation techniques are implemented, to accurately identify and distinguish fruits in realistic and complex contexts.
- Open PDF
- Course: Data science lab in biosciences
- Tools and techniques: Python, Tensorflow, text processing and information retrieval, convolutional neural networks, strong image elaboration techniques
- Description: The project explores methods for the identification of diabetic retinopathy using a dataset of fundus images. Two main approaches are compared: the first is based on standard convolutional neural networks (CNNs), while the second integrates a specific image preprocessing, aimed at emphasizing the retinal areas where the symptoms of the pathology are more common. This preprocessing has been designed based on information extracted from a textual analysis of specific scientific articles, obtained from PubMed, describing the main features of diabetic retinopathy. The targeted approach aims to improve diagnostic accuracy and to focus the model attention on the most relevant regions for the early identification of symptoms.
- Open PDF
- Course: Time series analysis
- Tools and techniques: Python, Statsmodel, time series decomposition analysis and SARIMAX-based model prediction
- Description: This project focuses on the analysis of a time series of hourly energy consumption, using decomposition techniques to isolate the different seasonal components and understand the periodic behavior of energy consumption. The decomposition allows to break down the series into different levels of seasonality and trend, providing an in-depth description of the consumption patterns. Based on the autocorrelation plots, a SARIMAX model is developed, integrating multi-level seasonal components and a regression on the overall trend, to obtain accurate forecasts of future energy consumption levels.
- Open PDF
- Course: Statistics
- Tools and techniques: R Studio
- Description: This study uses ISTAT data to examine Italian students' choices of educational paths and their subsequent occupations. The analysis focuses on differences in university choices and job opportunities, exploring variations across different Italian regions and research fields. Through comparisons based on statistical tests, the project aims to identify trends in educational preferences and subsequent careers, providing insight into local dynamics and regional influences on students' career paths.
- Open PDF
Multi-domain claim detection: a coreset and an external feature based approach for automated fact-checking
- Course: Master thesis work
- Tools and techniques: Python, Tensorflow, Large Language Models (LLMs), strong text preprocessing techniques for feature extraction
- Description: This study focuses on the initial phase of automated fact-checking, the claim detection task, with the aim of developing a model capable of effectively generalizing across different linguistic styles and thematic domains while maintaining a sustainable computational complexity. By exploiting the CheckThat! competition datasets published in the last five years, we propose a coreset-based approach to reduce the size of the datasets without compromising their representativeness. This approach allows combining information from different domains, creating a smaller but equally representative training dataset. The use of coresets then allows the adoption of more complex models, integrating not only the representations generated by Large Language Models (LLM), but also external features, such as key concept definitions and syntactic structures. The results demonstrate the effectiveness of coresets in maintaining the representativeness of the original data, reducing noise and allowing an average performance equivalent (or even better) than the one obtained with the full datasets. Furthermore, an increase in the average performance on different datasets is observed when the model incorporates external features, improving the claim detection capabilities compared to the exclusive use of LLM representations.
- Open PDF (Slides)
- Open PDF (Thesis)
- Course: Computer graphics
- Tools and techniques: Blender
- Description: This project aims to create a detailed 3D model of a turntable and a sound system using Blender software. The modeling involves creating shapes and aggregating various 3D objects into a single cohesive scene. Advanced shaders are used to develop specific textures for each component, contributing to a visually realistic representation. Additionally, the project includes the creation of animations that will simulate the movements of the turntable and the sound system.
- Open PDF
- Course: Data semantics
- Tools and techniques: Python, Word2Vec, strong text preprocessing techniques, Bokeh interactive visualization
- Worked on: Unsupervised apporaches for key-phrases extraction and graph visualization development
- Description: This project uses a large corpus of texts from different sources to analyze and represent the perception of various cities around the world, focusing on the extraction of thematic keywords. Both supervised and unsupervised approaches are employed, using textual representations based on Word2Vec for the extraction of keyphrases. The city representations are linked based on the similarity of the terms and the context in which they are used, allowing for an in-depth comparison between the different perceptions. Finally, an interactive interface is developed to visualize the results, highlighting the most common meanings associated with each city, their number and their similarity to the perceptions of other cities. This approach offers a new way to explore urban narratives and global cultural interconnections.
- Open PDF
- Course: Data science lab
- Tools and techniques: R Studio, time series forcasting models, Self-Organizing-Maps
- Worked on: Self-Organizing-Map to indetify different consumption behaviour patterns
- Description: This study analyzes a time series related to the energy consumption of a building at the University of Bicocca, applying an autoregressive model for the forecasting of future consumption. To understand consumption variations, a Self-Organizing Map (SOM) is used to observe the differences in consumption trends based on variables such as percentage consumption and the increasing or decreasing intensity of hourly consumption. The analysis also integrates external meteorological data, such as temperature and humidity, to evaluate their impact on energy consumption. This approach allows to distinguish different periods of the year in which consumption levels present similar or divergent patterns, highlighting critical moments and offering ideas for energy efficiency and sustainable management interventions.
- Open PDF














