
⚡ Energy Consumption & CO2 Emissions Prediction for the City of Seattle

🏢 Anticipating Building Consumption Needs (Modeling) - OpenClassrooms Training
🔍 Data Exploration Project - Prediction of Emissions and Energy Consumption for Non-Residential Buildings in Seattle

🏢 CONTEXT 🌍
I work for the City of Seattle, which aims to become a carbon-neutral city by 2050. My team focuses on the consumption and emissions of non-residential buildings. In this project, we use detailed surveys conducted by city agents in 2016 to gather data on building consumption and emissions. The objective is to predict CO2 emissions and total energy consumption for non-residential buildings that have not yet been measured.

🚀 DELIVERY 🚀

📓 Exploratory Analysis Notebook: This notebook will contain a brief exploratory data analysis highlighting the key features of the studied non-residential buildings.
📓 Prediction Notebooks: We will test different prediction models for the CO2 emissions and energy consumption of non-residential buildings. Each notebook will describe the modeling steps in detail and clearly identify the final model chosen for each prediction.
📊 Presentation Slides: They will present the problem, the dataset used, the feature engineering and modeling approaches, and the achieved results. A discussion of the choices made and a rigorous performance evaluation will also be included.

🔧 SKILLS 🔧
In this project, I honed the following data science skills:
πŸ” Data Cleaning: Expertise in cleaning and preprocessing raw data for accurate analysis.
πŸ“Š Data Exploration: Proficiency in exploring data to uncover insights and patterns.
βš™οΈ Feature Engineering: Skill in creating meaningful features to improve model performance.
πŸ§ͺ Testing Different Modeling Approaches: Experience in evaluating and comparing various modeling techniques.
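As an illustration of testing different modeling approaches, here is a minimal sketch of comparing several regressors with cross-validation. The data is synthetic (not the Seattle dataset) and the two models are examples, not necessarily the ones retained in the project:

```python
# Compare two candidate regressors with 5-fold cross-validation.
# Synthetic stand-in data: 300 buildings, 8 numeric features.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 across folds for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}

# Keep the model with the best cross-validated score
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The same pattern extends to more candidates (gradient boosting, SVR, etc.) and to other scoring metrics such as MAE or RMSE.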



πŸ›οΈ Customer Segmentation for Olist E-commerce Platform


🏢 CONTEXT 🌍
I am a consultant for Olist, a Brazilian company specializing in online marketplace sales. The goal of this project is to provide Olist's e-commerce team with customer segmentation based on their behavior and personal data. This segmentation will be used to optimize communication campaigns and gain better insights into different user types.

🚀 DELIVERY 🚀

📓 Exploratory Analysis Notebook: This notebook will contain an exploratory analysis of the data provided by Olist, highlighting key customer characteristics and purchasing behavior.
📓 Modeling Notebook: In this notebook, we will experiment with different unsupervised modeling approaches to segment Olist's customers into similar groups. We will evaluate model performance and select the best segmentation model.
📓 Simulation Notebook: We will conduct a simulation to determine how frequently the segmentation model must be updated to remain relevant over time. This simulation will help us propose an appropriate maintenance contract.
📊 Presentation Slides: We will prepare a presentation for a colleague to gather feedback on our approach. It will cover the problem statement, data cleaning, feature engineering, data exploration, the different modeling approaches, the final selected model, and the maintenance simulation.

🔧 SKILLS 🔧
In this project, I honed the following data science skills:
🔍 Data Cleaning: Expertise in cleaning and preprocessing raw data for accurate analysis.
🤖 Unsupervised Modeling: Proficiency in applying various clustering techniques to segment customers into meaningful groups (K-Means, DBSCAN, agglomerative hierarchical clustering).
⚙️ Simulation and Maintenance: Skill in simulating model stability over time and proposing maintenance strategies, using the Adjusted Rand Index (ARI).



🌐 Classification of Consumer Goods for a Marketplace



🏢 CONTEXT 🌍
I am working on a text and image classification project in the field of artificial intelligence. The goal of this project is to develop machine learning models capable of automatically classifying text documents and images into different predefined categories. This classification will be valuable in various domains such as information retrieval, content management, spam detection, and more.

🚀 DELIVERY 🚀
📝 Text Preprocessing and Feature Extraction Notebook: This notebook will contain the preprocessing of textual data and the extraction of features relevant for classification. Techniques such as tokenization, normalization, text cleaning, and feature extraction will be applied.
🖼️ Image Preprocessing and Feature Extraction Notebook: In this notebook, we will preprocess the image data using techniques such as resizing, normalization, and feature extraction (e.g., with pre-trained CNNs). We will explore the feasibility of image classification within the project context.
🧠 Supervised Image Classification Notebook: We will develop and evaluate supervised classification models for the image data. Different convolutional neural network (CNN) architectures will be experimented with. We will tune hyperparameters, perform transfer learning, and assess model performance.
📥 Import and Integration of Additional Data: We will import and leverage additional data from the "champagne" source as part of the project. This data will be preprocessed and integrated to enrich our existing dataset.
🎤 Project Presentation: We will prepare a presentation for the project team, covering the project context, data preprocessing, modeling techniques, obtained results, API usage, importation of the "champagne" data, and future prospects.

🔧 SKILLS & TOOLS 🔧
In this project, I honed the following data science skills:
📊 Text Preprocessing: Tokenization, stemming, etc.
🖼️ Image Preprocessing: Skill in preprocessing image data to enhance model input quality.
🧠 Feature Extraction for Text: BoW, TF-IDF, word vectors, BERT, USE
🧠 Feature Extraction for Images: SIFT, pre-trained VGG16
📈 Supervised Classification: Expertise in developing and evaluating supervised classification models.
🔄 Data Integration: Proficiency in importing and integrating additional data sources to enrich existing datasets.
🎙️ Project Presentation: Ability to communicate project details, methodologies, and results effectively through presentations.



💰 Scoring Model Development for a Financial Institution

πŸ§‘β€πŸ’ΌMY ROLEπŸ§‘β€πŸ’Ό 
I am a Data Scientist at Ready-To-Spend, a company that offers consumer credits to individuals with limited or no credit history.

🏢 CONTEXT 🌍
The company wants to implement a "credit scoring" tool that calculates the probability of a client repaying their credit and then classifies the application as approved or declined. The goal is therefore to develop a classification algorithm leveraging various data sources (behavioral data, data from other financial institutions, etc.). In addition, client relationship managers have reported a growing demand for transparency about credit approval decisions, a demand that aligns with the company's core values.

πŸ› οΈSOLUTIONπŸ› οΈ
Ready-To-Spend decides to create an interactive dashboard for relationship managers. This dashboard aims to transparently explain credit approval decisions to clients and also provide easy exploration of their personal information.

📊 AVAILABLE DATA 📊
Anonymized client data including:
- TARGET: 1 for payment difficulties, or 0 for no difficulties (the prediction target).
- A wealth of client data: e.g., whether the client owns a car, real estate, number of children, income, income source, type of residence, education, and details about their loan application process.

🎯 MISSION 🎯
- Build a scoring model that automatically predicts a client's probability of default.
- Develop an interactive dashboard for relationship managers to interpret model predictions and enhance their understanding of client behavior.
- Deploy the predictive scoring model using an API, along with the interactive dashboard that interfaces with the API for predictions.

🚀 DELIVERY 🚀

📊 Data Drift Report: A report on data drift, made with Evidently, showcasing changes in the raw data over time.
📝 Modeling Notebooks:
- Guarneri_Naomi_1_modelisation_062023.ipynb: Notebook for data modeling with initial attempts.
- Guarneri_Naomi_2_modelisation_balanced_062023.ipynb: Similar to the previous one, focusing on addressing class imbalance.
- Guarneri_Naomi_3_modelisation_balanced_LR_seuil_cost_score_fonctions_062023.ipynb: Selection of the best model, definition of the cost function, and identification of the optimal threshold.
📑 Methodological Note: Details the model selection process and training methodology.
📑 Presentation: An overview of the work accomplished within the project.

📚 README: README.md, a descriptive file.

🔧 SKILLS 🔧
In this project, I honed the following data science skills:
📈 Data Modeling: Proficiency in developing predictive models for classification tasks.
📑 Methodology Documentation: Skill in documenting the methodology followed in model selection and training.
📊 Data Drift Monitoring: Proficiency in tracking and reporting data drift over time.
🌐 API Deployment: Skill in deploying models into production using APIs for real-time predictions.
🌐 Dashboard: Building an interactive dashboard with Streamlit.
📈 Model Evaluation: Expertise in evaluating model performance and designing a customized cost function.
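The customized cost function and optimal-threshold idea can be sketched as follows. Data, model, and the 10:1 cost ratio are invented for the example (the project's actual costs and features are not reproduced); the principle is that a missed default (false negative) costs the lender more than a wrongly refused client (false positive):

```python
# Pick the probability threshold that minimizes an asymmetric business cost.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COST_FN, COST_FP = 10, 1  # assumed costs: missed default vs wrongly refused client

# Imbalanced synthetic data: ~10% of clients in difficulty (class 1)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def business_cost(threshold):
    """Total cost of classifying at the given probability threshold."""
    pred = (proba >= threshold).astype(int)
    fn = ((pred == 0) & (y_te == 1)).sum()  # defaults we approved
    fp = ((pred == 1) & (y_te == 0)).sum()  # good clients we refused
    return COST_FN * fn + COST_FP * fp

# Scan a grid of thresholds and keep the cheapest one
thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=business_cost)
print(round(float(best), 2))
```

Because false negatives are penalized more heavily, the chosen threshold typically lands below the default 0.5, flagging more borderline applications for refusal.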



☁️ Cloud-Based Model Deployment for an AgriTech Startup


🏢 CONTEXT 🌍
Fruits! is an AgriTech startup that aims to offer innovative solutions for fruit harvesting. Their concept: Develop smart harvesting robots to preserve fruit biodiversity by enabling specific treatments.

🎯 OBJECTIVE & SOLUTION 🎯
Objective: Gain recognition by providing a mobile app that lets users (the general public) obtain information about a fruit captured in a photo.
Solution: Implement the initial version of the fruit image classification engine, along with the initial version of the required Big Data architecture.

📊 DATA 📊: Kaggle Fruits 360 Dataset
Dataset Source: Kaggle
Data Type: Images of fruits and vegetables on a white background, from various angles, extracted from timelapse videos
Folder Configuration: Each variety is a folder containing photos of that specific variety
Image Type: 100×100 JPG
Dataset Size Used: 22,688 images

🎯 MY MISSION 🎯
Create a Big Data environment on Amazon Web Services using S3, EMR, and IAM
Run an image-processing pipeline in this environment using PySpark

🚧 THE WORK 🚧

1. 🏗️ Creating the environment
   📁 Creating an S3 bucket for image storage
   ⚙️ Provisioning an EMR instance with Spark, JupyterHub, and TensorFlow
2. 📝 Notebook for processing and analysis
   📥 Uploading the images to the S3 bucket
   💼 Creating a Spark session
   📥 Loading the images
   🔨 Preprocessing the images
   📲 Feature extraction using a pre-trained MobileNet model
   📊 Training and fitting a PCA on the 22,688 images, then saving the fitted PCA
3. 🧪 Testing on 10 images, reusing the saved PCA
4. 📋 Final presentation
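The "fit PCA, save it, reuse it on new images" core of steps 2 and 3 can be sketched locally with scikit-learn and pickle standing in for the PySpark pipeline on EMR (feature vectors are random stand-ins for the CNN features, and the file path is illustrative; on EMR the artifact would live in S3):

```python
# Fit a PCA on "training" features, persist it, then reload and apply it
# to a small batch of new images, as in steps 2-3 of the pipeline.
import pickle
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))  # stand-in for CNN feature vectors

# Fit the PCA on the training images and persist the fitted model
pca = PCA(n_components=16).fit(features)
with open("pca.pkl", "wb") as f:
    pickle.dump(pca, f)

# Later (or elsewhere): reload the saved PCA and transform new images
with open("pca.pkl", "rb") as f:
    saved_pca = pickle.load(f)
reduced = saved_pca.transform(rng.normal(size=(10, 64)))
print(reduced.shape)
```

Saving the fitted PCA is what makes step 3 possible: the 10 test images are projected with the exact components learned on the full training set, rather than refitting on the small batch.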

Send me a message 😊

© Copyright 2023 Naomi Guarneri - All Rights Reserved