Projects
Academic Projects
These projects were done as part of coursework or school projects during my undergraduate studies (Manipal Institute of Technology) and graduate school (Duke University). Click on a project title to view its GitHub repository (code, outputs and other details).
Ongoing Capstone Project: Participatory agent-based Bayesian modeling for community hunting management
Creating a participatory agent-based Bayesian model (PABBM) that combines hierarchical Bayesian modelling with agent-based modelling to simulate community behaviour for bushmeat hunting and predict the results of changes to these patterns.
Microsoft Malware Prediction
Used a generalized linear model to classify cases of malware detection on Windows machines using variables such as machine-specific traits, user traits and other factors like demographic traits.
Deep learning Approach for Question and Answering Systems
Developed a Question Answering framework on the Facebook bAbI dataset using N-gram classifier, LSTM (Long Short-term Memory), End-to-end Memory networks, Seq2Seq (sequence-to-sequence) model and LSTM + attention models.
TDA: Segmentation of LEGO Parts using Topological and Conventional Features
Used Topological Data Analysis to extract topological features from 2D images of LEGO parts and built an image classification model to compare its performance using conventional versus topological features.
Generative Adversarial Networks and their Applications
Created an extensive tutorial and whitepaper on GANs (Generative Adversarial Networks), covering the general GAN algorithm, DCGAN (Deep Convolutional GANs), SAGAN (Self-Attention GANs) and CGAN (Conditional GANs, e.g. the pix2pix model).
Solar PV in Aerial Imagery
Created and compared classification models to detect solar panels in aerial imagery data. We built a conventional ML model (LightGBM) to establish a baseline and an EfficientNet-B4 model trained under the AdvProp training scheme as a high-performing alternative.
Image Caption Generator
Built a Bahdanau attention-based recurrent neural network model that takes images as input and generates captions for them. Built a web app with a simple UI for the model and deployed it using Kubernetes on Google Cloud Platform. Used Locust on Google Kubernetes Engine to load test the app.
Makeover Monday: Tableau Dashboard - "Who could make the next acceptable James Bond?"
Participated in a fun Makeover Monday dashboarding exercise to build an interactive Tableau dashboard to visualize what traits are preferred by the audience in the next actor who plays James Bond.
Text Analysis: Deconstructing TED Talks
Used part-of-speech tagging, sentiment analysis, named entity recognition and topic modelling to extract features from TED talk transcripts and build a predictive ML model to estimate popularity (number of views). Additionally, built a demo LSTM model to generate the transcript for an effective TED talk.
Gender classification of Speech Signals using ANFIS
Extracted features such as pitch, amplitude and entropy from speech signals to build an ANFIS (adaptive neuro-fuzzy inference system) based classification model in MATLAB for identifying the gender of the speaker.
Professional Projects
These projects were executed either individually or as part of a team during my stints as a Quantitative Analytics Intern at Wells Fargo, a Data Science Summer Project Member at Fleet Management Ltd. (contract via Duke University) and a Decision Scientist at Mu Sigma.
Internship: Study the impact of language-drift on Classification Models (in Banking) built using that vocabulary
Contributed to Wells Fargo's research repository by evaluating the impact of evolution of vocabulary in a text corpus (customer complaints) on classification models that use this textual data. For this internship, I quantified the correlation between language complexity across open-source customer complaints data and model evaluation parameters for models built for classifying mortgage payment failures.
Summer Project: Study and prediction of on-ship incidents using manual inspection reports
Used time-series analysis and NLP (TF-IDF and topic modelling) to predict near-misses (incidents that were highly probable but were averted) and incidents on ships, using manual inspection reports created by Fleet Management teams.
New Product recommendation for UK Retailer
Based on the attributes and market performance of competitor products, new products were identified for inclusion in the client's portfolio. The performance (impact of introduction) of these products was predicted from the historical performance of similar products in the client's product portfolio. The results were presented in a prescriptive decision board to help purchase managers make product sourcing decisions across 13 countries.
Pricing Toolkit for automotive aftermarket parts
Built an RShiny dashboard for the aftermarket sector of a leading automotive giant with 3 modules: (1) Reporting Module: automated reports for spare-parts pricing anomalies; (2) Pricing Post-Audit Module: used A/B testing and lift analysis to study the revenue, margin and volume impact of recent pricing decisions; (3) Strategic Pricing Module: used regression models to evaluate the price elasticity of auto parts, and did exploratory analysis of competitive pricing and product lifecycle to provide pricing recommendations.
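As an illustration of the price elasticity idea (not the toolkit's actual code, which was built in R), elasticity can be estimated as the slope of a log-log regression of quantity on price. The function name and the sample data below are hypothetical; this is a minimal sketch that assumes a single product with strictly positive prices and quantities.

```python
import math

def price_elasticity(prices, quantities):
    """Estimate elasticity as the OLS slope of ln(quantity) on ln(price)."""
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data generated with a known elasticity of -1.5
prices = [10.0, 12.0, 15.0, 18.0, 20.0]
quantities = [1000.0 * p ** -1.5 for p in prices]
elasticity = price_elasticity(prices, quantities)
```

An elasticity below -1 suggests demand is price-sensitive, so a price increase would be expected to reduce revenue; values between -1 and 0 suggest the opposite.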
Emerging Issue Analytics- Vehicle Recall Prediction
Customer complaints from NHTSA were cleaned with text processing methods: stop-word removal, bigram treatment, stemming, word-frequency treatment and word-length treatment. The treated complaint phrases were then analyzed using STM (structural topic modelling) to derive key themes and vehicle failure patterns. Combined with other data sources like warranty data and customer transaction data, these were used to predict a vehicle recall at least 3 years before the actual recall was initiated (i.e. 7 years).
Machine Failure Prediction
Built and compared several predictive models like gradient boosting, logistic regression and random forest to classify a rare machine failure event (in injection molding machines for an energy firm) with and without oversampling methods (SMOTE, ADASYN). Also built a demo model using a deep learning autoencoder to compare performance on rare event prediction.
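For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea, not the project's code or a library implementation; the function name, parameters and sample points are hypothetical, and it assumes at least two minority samples.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples (two features each)
failures = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(failures, n_new=10)
```

Oversampling of this kind should be applied to the training split only, so that synthetic points do not leak into the evaluation data.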
Identify misrepresented catastrophic insurance claims
Claim descriptions were analyzed using TF-IDF to extract keywords, which were used to create features along with temporal and geographical attributes. Multiple classification models were built and ensembled using these features, and the best performing model was selected using cross-validation and hyperparameter tuning.
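As a rough illustration of TF-IDF keyword extraction (a simplified sketch, not the project's code), each term is scored by its frequency within a document multiplied by the log of its inverse document frequency, so terms that appear in every document score zero. The function name and the sample claim texts are hypothetical, and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Sort by descending score, breaking ties alphabetically
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results

# Hypothetical claim descriptions
claims = ["claim flood flood", "claim fire", "claim theft"]
keywords = tfidf_keywords(claims, top_k=1)
```

Here the word "claim" appears in every description, so it receives a score of zero and never surfaces as a keyword.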
Dealer Fraud Identification in Automotive Dealerships
Used Benford's Law to detect anomalous purchase and sales trends among dealers for an automotive giant. We then created a classification model to help flag suspected dealer fraud cases based on these anomalous trends.
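Benford's Law states that in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d). The sketch below illustrates one simple way to measure how far a set of amounts deviates from that distribution; it is not the project's code, and the function names and the choice of mean absolute deviation as the score are assumptions made for illustration.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit proportions under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """Return the leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '1.234e+03'
    return int(f"{abs(value):.15e}"[0])

def benford_deviation(amounts):
    """Mean absolute deviation between observed and expected first-digit shares."""
    digits = [first_digit(a) for a in amounts if a != 0]
    if not digits:
        raise ValueError("need at least one non-zero amount")
    counts = Counter(digits)
    expected = benford_expected()
    total = len(digits)
    return sum(abs(counts.get(d, 0) / total - expected[d]) for d in range(1, 10)) / 9

# Powers of 2 follow Benford's Law closely; evenly spread leading digits do not
natural_like = [2 ** k for k in range(1, 200)]
uniform_like = [d * 100 for d in range(1, 10)] * 10
```

A dealer whose transaction amounts produce a markedly higher deviation than their peers would be a candidate for closer review, not proof of fraud on its own.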
Personal Projects
These personal projects were carried out as experiments, for datathons and Kaggle competitions, and to understand theoretical concepts. Some of them have only their basic skeleton completed and are yet to be improved. The list includes projects executed before graduate school and during my tenure at graduate school.
TAMU Datathon: Walmart Product Search Engine
Wrote a parallelized web scraper to pull product information from Walmart.com. Built a classification model to find the best suited category for a particular product. Used graph analysis to build a similar-product recommender based on a given product's traits.
MIT COVID-19 Datathon: Evolution of Public Concerns over the course of the pandemic
Used news articles published in New York to extract key themes of public concern using topic modelling. Studied the trends of these themes over the course of the pandemic to investigate what the key public concerns were.
Kaggle- Kickstarter Project performance prediction
Built a classification model to predict the success of a Kickstarter project using historical Kickstarter projects data with attributes like name of the project, duration, category, goal and other derived metrics.
Kaggle- Prediction of House Prices
Using over 75 predictors, a regression model was created after extensive data exploration, feature engineering and data treatment. The results are currently in the top 31% of the leaderboard.
Kaggle- Google Analytics Customer Revenue Prediction
On the basis of the data on a customer's visit to a Google Merchandise Store (or a GStore), the revenue per customer was predicted. Features like the channel and device used to access the GStore among others were used for data preprocessing and feature engineering before passing them through a regression model.
Analytics Vidhya- Prediction of user performance in a programming problem
This multiclass classification problem uses the attributes of a problem and of a programmer to predict the range of attempts it will take the programmer to solve the problem. The output is meant to serve as an input to a problem recommendation engine and to provide hints on problems users are likely to get stuck on.
Other Projects
These projects were done in my capacity as a research assistant at Duke University. They were carried out in a team under the guidance of a professor or a team lead, and generally lasted 5-6 months.
Ongoing: Detection of text-recycling in STEM research reports
Using text analysis to detect the presence and amount of text recycling (reusing one's own previously written text) in STEM research reports. I am doing this research under the guidance of Dr. Cary Moskovitz.
Study and Analysis of co-authorships and research patterns at Duke Global Health Institute
Designed metrics to evaluate the number of past publications of DGHI researchers as first and last authors and their past international collaborations (co-authorships) to further analyze research trends at DGHI.