Alexandre De Zotti
Data Scientist
I am a data scientist with a PhD in Mathematics who focuses on solving problems using natural language processing. I mainly use Python and its data science and machine learning libraries.
This data science project portfolio comprises 9 projects covering a wide spectrum of the data science process.
Name | Project Description | Project Type | Personal Contributions | Activities | Tools | Data Type | Project Page | Project Link | Completion |
---|---|---|---|---|---|---|---|---|---|
Reproduction of the results of the article 'Analysing Transformers In Embedding Space' for a small GPT-2 model (see original research article linked) | Understanding the methods presented in the article and applying them to a local version of a small instance of GPT-2 | Personal project | Understanding ML research paper, implementation of methods | LLM interpretation | GPT-2, Hugging Face | machine learning model | page | | 2023-11 |
Toki Pona with nanoGPT | A tiny language model for Toki Pona using a custom tokenizer | Personal project | Gathering and cleaning of training data, designing a custom tokenizer specific to Toki Pona, training nanoGPT | text generation | nanoGPT, ETL, tokenizer | text | | | 2023-11 |
Climate Battle Cards - web scraping | Using NLP to power a fact checking app | Team project | Scraping text data from different online sources related to climate change for the Climate Battle Cards project | web scraping | Pandas, Selenium, BeautifulSoup, html2text, Python | text, web | page | | 2021 |
Hateful Memes: Phase 1 | Classify memes | Competition personal submission | Classifying memes (image + caption) by concatenating the output features of BERT and InceptionV3 and training a classification layer on top of them; obtained an AUC-ROC of 0.6 (ranked 238/3279) | exploratory data analysis, multimodal ML, classification layer training | Pandas, Python, Matplotlib, Seaborn, TensorFlow, BERT, InceptionV3, Jupyter | images, text, memes, supervised | page | code | 2020-10-27 |
Smartify Legal Docs | Add relevant additional information to legal documents | Hackathon team work | Extracted text from PDF files, then applied Named Entity Recognition tools to find address, currency and organisation references; output in JSON format (for backend processing) | PDF to text conversion, NER, regexp | JSON, NLTK, RE, PDFMiner, Python | text, pdf | page | code | 2020-09-20 |
COVIDTracker | Study of social media posts (tweets) related to Covid-19 in order to make localised predictions of the number of cases | Hackathon team work | Analysis of social media posts: EDA including sentiment analysis, k-means clustering and visualization of TF-IDF using PCA, extraction of topics using LDA, hand labelling, classification using FastText, attempt at fine-tuning BERT | exploratory data analysis, clustering, k-means clustering, visualization, LDA, hand labelling, classification, text classification, fine-tuning, sentiment analysis | PCA, Python, FastText, TextBlob, JSON, NLTK, Scikit-learn, Pandas, PyTorch, Transformers | text, social media, supervised, unsupervised | page | code | 2020-04 |
Covics-19 | Organise supply exchange between hospitals during the pandemic | Hackathon team work | Optimization of MongoDB process; creation, training and evaluation of a recurrent network model | querying NoSQL database, updating NoSQL database, checking database content, designing an RNN, training an RNN, evaluating an RNN, LSTM | JSON, MongoDB, Python, TensorFlow | numerical values, time series, tabular | page | code | 2020-04-26 |
Spread Modelling | Centralization of COVID-19 case data and global modelling of the spread of the disease | Hackathon team work | Downloaded and cleaned data about the cases of Covid-19 from official sources (UK and Switzerland) | data preparation | Pandas, Python | tabular | page | code | 2020-03-30 |
GPT-2 As A Universal Library | | Personal project | Tweaking of the generator loop of the language model GPT-2 in order to control text generation | text generation | TensorFlow, GPT-2, Python | text | page | code | 2019-10-13 |
Name | Description | Link | Completion |
---|---|---|---|
Python/fractals/julia_sets.py | A Python program that draws Julia sets, added to The Algorithms - Python educational repository. | link | Code merged |
mathvsg | A Python module to draw mathematical diagrams into SVG files, fully tested and documented: saves time, is more controllable and accurate | link | Maintained |
Hupacman | A wrapper of Arch Linux's Pacman package manager with easy-to-remember action names (saves me time) | link | Maintained |
Programs for complex dynamics | Python and Cython programs that draw representations of phase locking diagrams and transcendental dynamics | link | 2020-01-09 |
My portfolio web site | Flask+HTML+CSS+js | link | Maintained |