Projects

Here are several course projects~

Airbnb Analysis

2024/5

Abstract: We first conducted an exploratory analysis of the data, then ran a time series analysis that revealed quarterly changes in housing popularity. We built and tested multiple regression models, built a recommendation system based on the data, and drew a word cloud from the reviews.

Keywords: EDA, time series analysis, regression, recommendation system

You can find our project website here

You can find our code & slides here

Bilibili Analysis

2023/10

Utilized techniques such as ANOVA, two-way ANOVA, and ANCOVA to quantify the significant impact of factors such as tags, gender, video length, and their interactions on the number of followers.
Employed GMM fitting to determine the boundaries between short, medium, and long videos on Bilibili (6.17 and 18 minutes), and explored the relationship between video length and the number of followers.
Conducted variable selection using methods such as random forest, Lasso, XGBoost, and HDCSIS to identify the variables most relevant to the number of followers for further analysis.
Used weighted least squares, generalized linear models, and five other models for regression analysis, and provided model interpretations.
Used K-medoids to classify content creators and designed an underestimation mechanism to identify the most promising and cost-effective categories and video formats. Provided recommendations for content creator positioning.
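
As a rough illustration of the GMM step above, the sketch below fits a three-component Gaussian mixture to one-dimensional video lengths and reads off the points where the most likely component changes. It is not the project code: the data are synthetic, the helper name `gmm_boundaries` is hypothetical, and the component means and spreads are assumptions chosen only so the example runs. It does not reproduce the 6.17 and 18 minute boundaries reported above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_boundaries(values, n_components=3, seed=0):
    """Return the points where the most likely mixture component changes."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, n_init=5, random_state=seed)
    gmm.fit(x)
    # Scan a fine grid over the observed range and label each grid point.
    grid = np.linspace(x.min(), x.max(), 5000).reshape(-1, 1)
    labels = gmm.predict(grid)
    # A boundary sits between two neighbouring grid points with different labels.
    change = np.flatnonzero(labels[1:] != labels[:-1])
    return (grid[change, 0] + grid[change + 1, 0]) / 2


# Synthetic video lengths in minutes (assumed values, not the Bilibili data).
rng = np.random.default_rng(0)
durations = np.concatenate([
    rng.normal(3, 1, 500),    # "short" videos
    rng.normal(11, 2, 500),   # "medium" videos
    rng.normal(30, 5, 500),   # "long" videos
])
boundaries = gmm_boundaries(durations)
print(boundaries)
```

Reading boundaries from the predicted labels rather than from the component means keeps the cut points consistent with how the fitted mixture would actually assign a video of a given length.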

You can find our project paper here

You can find our code & slides here

Exploratory Analysis on Customer Segmentation Data

2023/5

Abstract: In this post, we use data from Kaggle on credit card customers. We first carry out thorough data preprocessing, including data cleaning, data description, and brief descriptive statistics. After a series of variable adjustments and principal component analysis, we divide users into four categories, each showing distinct consumption habits and credit behavior. We also select variables related to credit limit using distance correlation, cluster customers on these variables with the KMeans algorithm, and obtain low-credit, medium-credit, and high-credit customer groups. We then perform regression analysis on the credit limit and give a classification-based random forest regression method that performs better than ordinary random forest regression. Finally, we discuss the mainstream clustering methods and the reasons for choosing KMeans for this dataset.

Keywords: PCA, KMeans, regression, clustering, dimension reduction, feature selection
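
The PCA and KMeans steps mentioned in the abstract can be sketched roughly as follows. This is a minimal example under stated assumptions, not the project code: the data come from scikit-learn's `make_blobs` rather than the Kaggle dataset, and the choice of two principal components and standard scaling is an assumption made for illustration. Only the number of clusters (four) is taken from the abstract.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature table (assumed shape: 400 x 8).
X, _ = make_blobs(n_samples=400, centers=4, n_features=8, random_state=0)

# Scale the features, project onto two principal components (assumed),
# then cluster the projected points into four groups.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

# Count how many customers fall into each cluster.
cluster_sizes = {int(k): int((labels == k).sum()) for k in sorted(set(labels.tolist()))}
print(cluster_sizes)
```

Scaling before PCA matters here because credit-related variables usually sit on very different scales, and unscaled features would otherwise dominate both the components and the KMeans distances.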