Alexandre De Zotti

Data Scientist

 

I am a data scientist with a PhD in Mathematics who focuses on solving problems using natural language processing. I mainly use Python along with data science and machine learning libraries.

This portfolio comprises 9 data science projects covering a wide spectrum of the data science process.

| Name | Project Description | Project Type | Personal Contributions | Activities | Tools | Data Type | Project Page | Project Link | Completion | Tags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reproduction of the results of the article 'Analysing Transformers In Embedding Space' for a small GPT-2 model (see original research article linked) | Understanding the methods presented in the article and applying them to a local version of a small instance of GPT-2 | Personal project | Understanding ML research paper, implementation of methods | LLM interpretation | GPT-2, Hugging Face | machine learning model | page | | 2023-11 | explainable AI, model interpretation, transformer |
| Toki Pona with nanoGPT | A tiny language model for Toki Pona using a custom tokenizer | Personal project | Gathering and cleaning of training data, designing a custom tokenizer specific to Toki Pona, training nanoGPT | text generation | nanoGPT, ETL, tokenizer | text | | | 2023-11 | NLP, tokenizer, text generation |
| Climate Battle Cards - web scraping | Using NLP to power a fact checking app | Team project | Scraping text data from different online sources related to climate change for the Climate Battle Cards project | web scraping | Pandas, Selenium, BeautifulSoup, html2text, Python | text, web page | | | 2021 | web scraping, Pandas, Selenium, BeautifulSoup, Python |
| Hateful Memes: Phase 1 | Classify memes | Competition personal submission | Classifying memes (image + caption) by concatenating the output features of BERT and InceptionV3 and training a classification layer on top of these features; obtained an AUC-ROC of 0.6 (ranked 238/3279) | exploratory data analysis, multimodal ML, classification layer training | Pandas, Python, Matplotlib, Seaborn, TensorFlow, BERT, InceptionV3, Jupyter | images, text, memes, supervised | page | code | 2020-10-27 | Python, EDA, classification, BERT, computer vision, NLP, TensorFlow, Matplotlib, Seaborn, Visualization, sentiment analysis |
| Smartify Legal Docs | Add relevant additional information on legal documents | Hackathon team work | Extraction of text from PDF files, then application of Named Entity Recognition tools to find address, currency and organisation references, with output in JSON format (for backend processing) | PDF to text conversion, NER, regexp | JSON, NLTK, RE, PDFMiner, Python | text, pdf | page | code | 2020-09-20 | NER, NLTK, PDF, Python |
| COVIDTracker | Study of social media posts (tweets) related to COVID-19 in order to do localised prediction of the number of cases | Hackathon team work | Analysis of social media posts: EDA including sentiment analysis, k-means clustering and visualization of TF-IDF using PCA, extraction of topics using LDA, hand labelling, classification using FastText, attempt at fine tuning BERT | exploratory data analysis, clustering, k-means clustering, visualization, LDA, hand labelling, classification, text classification, fine tuning, sentiment analysis | PCA, Python, FastText, TextBlob, JSON, NLTK, Scikit-learn, Pandas, PyTorch, Transformers | text, social media, supervised, unsupervised | page | code | 2020-04 | Python, JSON, latent Dirichlet allocation, sentiment analysis, fine tuning, transformer, classification, text classification, k-means, clustering, exploratory data analysis, PCA, TF-IDF, BERT, Scikit-learn, FastText, TextBlob, NLTK, sklearn |
| Covics-19 | Organise supply exchange between hospitals during the pandemic | Hackathon team work | Optimization of MongoDB process; creation, training and evaluation of a recurrent neural network model | querying NoSQL database, updating NoSQL database, checking database content, designing an RNN, training an RNN, evaluating an RNN, LSTM | JSON, MongoDB, Python, TensorFlow | numerical values, time series, tabular | page | code | 2020-04-26 | RNN, LSTM, JSON, NoSQL, Python, TensorFlow |
| Spread Modelling | Centralization of COVID-19 case data and global modelling of the spread of the disease | Hackathon team work | Downloaded and cleaned data about COVID-19 cases from official sources (UK and Switzerland) | data preparation | Pandas, Python | tabular | page | code | 2020-03-30 | Pandas, Python, data preparation |
| GPT-2 As A Universal Library | | Personal project | Tweaking of the generator loop of the language model GPT-2 in order to control text generation | text generation | TensorFlow, GPT-2, Python | text | page | code | 2019-10-13 | TensorFlow, NLP, transformer, text generation |
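
As an illustration of the regexp and JSON step listed for Smartify Legal Docs, here is a minimal sketch in Python. It is not the project's actual code: the pattern, the function name `extract_currency_mentions` and the output fields are assumptions chosen for this example, and it only covers currency references (address and organisation references would need NER tooling such as NLTK).

```python
import json
import re

# Hypothetical pattern (an assumption for this sketch): a currency symbol or
# ISO code, an optional space, then an amount with optional thousands
# separators and decimals.
CURRENCY_PATTERN = re.compile(
    r"(?P<currency>[£$€]|\b(?:GBP|USD|EUR|CHF)\b)\s?"
    r"(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)"
)


def extract_currency_mentions(text: str) -> list:
    """Return one dictionary per currency reference found in the text."""
    mentions = []
    for match in CURRENCY_PATTERN.finditer(text):
        mentions.append(
            {
                "type": "currency",
                "text": match.group(0),
                "currency": match.group("currency"),
                # Drop thousands separators before converting to a number.
                "amount": float(match.group("amount").replace(",", "")),
                "start": match.start(),
                "end": match.end(),
            }
        )
    return mentions


if __name__ == "__main__":
    sample = "Rent is £1,250.50 and the deposit is USD 3000."
    # Serialise the mentions as JSON, e.g. for a backend to consume.
    print(json.dumps(extract_currency_mentions(sample), ensure_ascii=False, indent=2))
```

Keeping the character offsets (`start`, `end`) in each record is a design choice that would let a consumer highlight the reference in the original text.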

Other projects

| Name | Description | Link | Completion |
| --- | --- | --- | --- |
| Python/fractals/julia_sets.py | A Python program that draws Julia sets, added to The Algorithms - Python educational repository. | link | Code merged |
| mathvsg | A Python module to draw mathematical diagrams into SVG files, fully tested and documented: saves time and gives more control and accuracy | link | Maintained |
| Hupacman | A wrapper for Arch Linux's Pacman package manager with easy-to-remember action names (saves me time) | link | Maintained |
| Programs for complex dynamics | Python and Cython programs that draw representations of phase locking diagrams and transcendental dynamics | link | 2020-01-09 |
| My portfolio web site | Flask + HTML + CSS + JavaScript | link | Maintained |
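
Julia set drawings are commonly based on an escape-time iteration. The sketch below is not the code merged into The Algorithms - Python: it assumes the quadratic map z → z² + c, an escape radius of 2 (valid when |c| ≤ 2), and hypothetical function names. It renders a coarse ASCII picture using only the standard library.

```python
def escape_time(z: complex, c: complex, max_iter: int = 50, radius: float = 2.0) -> int:
    """Count iterations of z -> z*z + c before |z| exceeds the radius.

    Returns max_iter when the orbit stays bounded for the whole run, which
    is taken here as "the point belongs to the filled Julia set".
    """
    for n in range(max_iter):
        if abs(z) > radius:
            return n
        z = z * z + c
    return max_iter


def render_ascii(c: complex, width: int = 60, height: int = 24, max_iter: int = 50) -> str:
    """Render the filled Julia set of z -> z*z + c on [-2, 2] x [-1.5, 1.5]."""
    rows = []
    for j in range(height):
        y = 1.5 - 3.0 * j / (height - 1)  # top row is y = 1.5
        row = []
        for i in range(width):
            x = -2.0 + 4.0 * i / (width - 1)  # left column is x = -2
            bounded = escape_time(complex(x, y), c, max_iter) == max_iter
            row.append("#" if bounded else ".")
        rows.append("".join(row))
    return "\n".join(rows)


if __name__ == "__main__":
    # The parameter value is an arbitrary choice for illustration.
    print(render_ascii(c=complex(-0.4, 0.6)))
```

For c = 0 the filled Julia set is the closed unit disk, which makes a convenient sanity check for the iteration.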

My ORCID record (research in mathematical sciences)