Using the `.info` method, we see that the data has no null values and the features are set to appropriate data types. Moreover, the `purpose` column needs no improvement as it contains unique and justified values. For the classifier, the target variable is the `not.fully.paid` feature. Note that there is class imbalance, as there are fewer examples of loans not fully paid. Specifically, eight thousand forty-five (8,045) loans are fully paid while one thousand five hundred thirty-three (1,533) are not. This is important to note since machine learning classifiers tend to underperform when class imbalance exists.
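A minimal sketch of this inspection, assuming the data sits in a CSV file named `loan_data.csv` (the file name is an assumption):

```python
import pandas as pd

# Load the loan data (file name is assumed for illustration)
loans = pd.read_csv("loan_data.csv")

# Inspect dtypes and confirm there are no null values
loans.info()

# Check the class balance of the target variable
print(loans["not.fully.paid"].value_counts())
# Expected counts based on the text: 0 -> 8045, 1 -> 1533
```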
Feature Engineering
The heatmap of the correlation matrix between the numeric features is shown below. It shows that the FICO credit score and the interest rate are highly correlated. This is expected since lenders use FICO credit scores to determine the interest rate on a loan. We opt to retain the weakly to moderately correlated features. It is also possible that there are too many features, so the classifier is additionally fitted on training data of lower dimension. Dimensionality reduction is implemented using Principal Component Analysis (PCA). By reducing the dimension to two (2), the components explain 99.9% of the variance.
For the categorical variable `purpose`, one-hot encoding is applied to create a binary matrix.
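A hedged sketch of the encoding and PCA steps, continuing from the `loans` dataframe above (the exact columns retained may differ in the notebook):

```python
from sklearn.decomposition import PCA

# One-hot encode the categorical `purpose` column into a binary matrix
loans_encoded = pd.get_dummies(loans, columns=["purpose"])

# Separate features and target
X = loans_encoded.drop(columns=["not.fully.paid"])
y = loans_encoded["not.fully.paid"]

# Reduce the feature space to two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())  # ~0.999 according to the text
```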
Classifier
We use an extreme gradient-boosted (XGBoost) tree classifier for this situation. A gradient-boosted tree classifier utilizes an ensemble of decision trees to make predictions, and XGBoost is a specific implementation that is popular due to its speed, scalability, and accuracy. Note that the manager wants to accurately predict if a loan will not be paid back. Since 1 is the value for a loan not getting paid back, the recall score is the metric that the manager prefers to see. For this model, the recall on the testing set is 10%, which means the model performs poorly at identifying loans that will not be paid back. When the model is trained on data whose dimension is reduced using PCA, the recall drops significantly. Hence, PCA does not help in this case. The best pipeline found is also a gradient-boosted tree classification model applied on features scaled to a particular range. When this pipeline is fitted on the original data, the recall scores are 100% and 10% on the training and testing sets respectively, which is clearly a case of overfitting. Applying the pipeline on the synthetically balanced data, the recall scores are 100% and 86% on the training and testing sets respectively. Clearly, the model performs better on the balanced data than on the original data.
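The write-up does not state how the data was synthetically balanced; the sketch below assumes SMOTE from the imbalanced-learn package purely for illustration, continuing from the encoded features above (the split ratio is also an assumption):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE  # assumed balancing method

# Split the encoded data (75/25 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the XGBoost classifier on the original (imbalanced) training data
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print("Recall (original):", recall_score(y_test, model.predict(X_test)))

# Synthetically balance the training data and refit
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
model_bal = XGBClassifier(eval_metric="logloss", random_state=42)
model_bal.fit(X_bal, y_bal)
print("Recall (balanced):", recall_score(y_test, model_bal.predict(X_test)))
```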
Conclusion
In this case, due to its running time and its close recall scores on the training and testing sets, the XGBoost classifier is a good choice for finding loan applications that will not be paid back. (The associated code for this project is located in the Jupyter notebook.)
The dataset consists of the number of public bikes rented in Seoul's bike sharing system at each hour. It also includes information about the weather and the time, such as whether it was a public holiday.
For data preprocessing, some of the column names of the dataset need renaming as they are lengthy. For instance, the 'Dew point temperature(C)' column is renamed 'Dew Temp' and the 'Solar Radiation (MJ/m2)' column is renamed 'Solar Rad'. The dataset contains no missing values, with six columns of floating point numbers, four columns of signed 64-bit integers, and four columns of string datatypes.
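A minimal sketch of this renaming step, where the file name and encoding are assumptions:

```python
import pandas as pd

# Load the Seoul bike sharing data (file name and encoding are assumed)
bikes = pd.read_csv("SeoulBikeData.csv", encoding="latin-1")

# Shorten lengthy column names as described in the text
bikes = bikes.rename(columns={
    "Dew point temperature(C)": "Dew Temp",
    "Solar Radiation (MJ/m2)": "Solar Rad",
})

bikes.info()  # confirms there are no missing values and shows the dtypes
```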
Observe the barplot shown below. Summer is the season when most bikes are rented. Also, non-holidays have a slightly higher number of rented bikes compared to holidays. A similar observation holds if hours are considered instead of seasons: rentals are higher during the late afternoon to early evening hours than in any other time window.
Temperature and dew point temperature are the most highly correlated features in the dataset. A principal component analysis can help mitigate this correlation, although decision tree-based models are largely insensitive to multicollinearity.
Now, we proceed to the machine learning (ML) model. ML requires numerical values for training, and one-hot encoding is a way to turn categorical variables into numerical ones. We first use a decision tree to predict the number of bikes rented; its coefficient of determination score is 0.79. We also utilize a model made up of multiple decision trees, called a Random Forest model. Its coefficient of determination score is slightly better than the previous model's, so we can use this model for predictions instead. The model considers the hour of the day and the temperature at the time of rental as the most important predictors. On the other hand, the amount of snowfall is considered the least important.
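A hedged sketch of the regression step, continuing from the `bikes` dataframe above; the target and categorical column names are assumptions about the Seoul bike sharing dataset:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# One-hot encode the categorical columns (column names are assumptions)
data = pd.get_dummies(bikes, columns=["Seasons", "Holiday", "Functioning Day"])

X = data.drop(columns=["Rented Bike Count", "Date"])
y = data["Rented Bike Count"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a random forest and report the coefficient of determination (R^2)
forest = RandomForestRegressor(random_state=42)
forest.fit(X_train, y_train)
print("R^2:", forest.score(X_test, y_test))

# Inspect feature importances (hour and temperature rank highest per the text)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())
```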
Source of dataset.
Citations:
- Sathishkumar V E, Jangwoo Park, and Yongyun Cho. 'Using data mining techniques for bike sharing demand prediction in metropolitan city.' Computer Communications, Vol. 153, pp. 353-366, March 2020.
- Sathishkumar V E and Yongyun Cho. 'A rule-based model for Seoul Bike sharing demand prediction using weather data.' European Journal of Remote Sensing, pp. 1-18, February 2020.
(The associated code and implementation for this project are located in the Jupyter notebook.)
Ahoy! Kaggle is hosting a Titanic machine learning competition where the goal is to classify whether a passenger survives or not.
For each passenger, the features include the following:
- Ticket class `pclass`: 1 = 1st, 2 = 2nd, 3 = 3rd
- Sex `sex`
- Age in years `Age`
- Number of siblings or spouses aboard `sibsp`
- Number of parents or children aboard `parch`
- Ticket number `ticket`
- Fare `fare`
- Cabin number `cabin`
- Port of embarkation `embarked`: C = Cherbourg, Q = Queenstown, S = Southampton
- Passenger ID `PassengerId`
- Passenger name `name`
To start with the classification, we preprocess the dataset. We drop the `PassengerId`, `ticket`, and `name` columns as they contain unique values. We replace the `male` value with 0 and the `female` value with 1 in the `sex` column so that it contains numerical values. Observe that the `age`, `cabin`, and `embarked` columns have missing values. By setting a threshold of 30% missing values for dropping features, we drop the `cabin` column. The mode is used to impute the categorical feature `embarked`, while the mean is used for the numerical feature `age`. In this case, mean imputation is justified since the ages are not highly skewed, as shown in the histogram below. Lastly, a one-hot encoder is applied to the categorical variables `pclass` and `embarked` for training purposes.
None of the features are highly correlated to each other, as presented in the correlation heatmap below. However, the ages and the fares have high variances. Hence, we standardize these features by removing the mean and scaling to unit variance. Next, we split the data, with 75% comprising the training data. Now, we are ready for model training.
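A hedged sketch of the preprocessing, scaling, and splitting described above, assuming the standard Kaggle column names (capitalization in the notebook may differ):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")  # Kaggle Titanic training file

# Drop identifier-like columns and the sparsely populated cabin column
train = train.drop(columns=["PassengerId", "Ticket", "Name", "Cabin"])

# Encode sex numerically and impute missing values as described above
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
train["Age"] = train["Age"].fillna(train["Age"].mean())

# One-hot encode the remaining categorical columns
train = pd.get_dummies(train, columns=["Pclass", "Embarked"])

# Standardize the high-variance features and split 75/25
train[["Age", "Fare"]] = StandardScaler().fit_transform(train[["Age", "Fare"]])
X = train.drop(columns=["Survived"])
y = train["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=42
)
```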
We choose among logistic regression, k-nearest neighbors (KNN), and gradient-boosted decision tree (GBDT) models for binary classification. Note that decision trees are usually insensitive to scaling, which means we can use the scaled data for fitting all of the models. Logistic regression is suited to linearly separable problems, while KNN can capture non-linear decision boundaries. All the models with default hyperparameters give an accuracy of around 75%. To improve the models, we employ stratified k-fold cross-validation and randomized search cross-validation for hyperparameter tuning. Both the logistic regression and the KNN models produce an accuracy close to 85%, while the GBDT model produces an accuracy of 88%. Although this is only a slight advantage over the other models, we choose the GBDT model for the predictions.
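A sketch of the tuning step for the GBDT model, continuing from the split above; the search space and number of iterations are assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

# Illustrative search space; the actual grid in the notebook may differ
param_distributions = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 5],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="accuracy",
    cv=cv,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```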
The data for prediction has the same features, and we preprocess it similarly to the training data. However, in this case, the `fare` column has a null value. For this feature, we impute with the median since the data is right-skewed, as shown in the histogram below. Afterwards, the same steps follow until the data is fed into the chosen model. We predict the survivability of the passengers and save the results as a comma-separated values (CSV) file for submission. Upon submission, Kaggle reveals that the predictions are 77.2% accurate.
We also utilized the automated machine learning package TPOT. This package enables one to find an ML model that works well for the cleaned data. In this case, we use it to find a non-deep-learning model. After execution, the best pipeline found is also a GBDT model with Gini impurity for splitting the nodes. The predictions differ from the previously generated predictions but have the same accuracy score.
XGBClassifier(ZeroCount(SelectFwe(DecisionTreeClassifier(input_matrix, criterion=gini, max_depth=5, min_samples_leaf=4, min_samples_split=9), alpha=0.007)), learning_rate=0.5, max_depth=5, min_child_weight=16, n_estimators=100, n_jobs=1, subsample=0.7000000000000001, verbosity=0)
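A hedged sketch of how TPOT might be invoked on the preprocessed training data to arrive at such a pipeline; the generation and population settings are illustrative assumptions:

```python
from tpot import TPOTClassifier

# Search for a (non-deep-learning) pipeline on the preprocessed training data
tpot = TPOTClassifier(generations=5, population_size=50,
                      scoring="accuracy", random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline as a standalone Python script
tpot.export("best_pipeline.py")
```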
(The associated code and implementation for this project are located in the Jupyter notebook.)
Background
While summer has already ended here in the Philippines, temperatures are rising for those in the Northern Hemisphere. There is no better time than now to hold pool and beach parties. In sync with these plans, the company has decided to host a dance party. A list of tracks covering one hundred twenty-five (125) genres of Spotify music was collected, with each genre containing approximately one thousand (1,000) tracks. Each row represents a track along with some of its audio features, such as danceability and valence. The most recent song in the playlist was released in October 2022.
Objective
We are tasked to curate a dance-themed playlist for the party in order to create an atmosphere that will let attendees dance their hearts out.
Data Preprocessing
To create a dance-themed playlist, a cluster analysis may be employed to group 'similar' tracks. Before implementing clustering, we start by importing the data as a dataframe using the pandas library and cleaning it, if needed. First, we drop rows with duplicate values across all features. Next, we drop rows with duplicates in the `track_id` column, as tracks have unique identifications. Lastly, we drop rows with the same album name, track name, and set of artists. Note that album and track names are not protected by trademark, while artist names are. Observations with missing artists, album name, or track name are also dropped. This is justified because we cannot add an unknown song even if its audio features are given.
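A minimal sketch of this cleaning, where the file name and the album/track name column labels are assumptions (only `track_id` and `artists` are named in the text):

```python
import pandas as pd

tracks = pd.read_csv("spotify_tracks.csv")  # file name is assumed

# Drop exact duplicate rows, then duplicate track IDs
tracks = tracks.drop_duplicates()
tracks = tracks.drop_duplicates(subset=["track_id"])

# Drop rows that repeat the same album name, track name, and set of artists
tracks = tracks.drop_duplicates(subset=["album_name", "track_name", "artists"])

# Drop rows with missing artists, album name, or track name
tracks = tracks.dropna(subset=["artists", "album_name", "track_name"])
```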
Feature Engineering
After cleaning the data, every remaining track is associated with a unique ID, so we can drop the `track_id` column. This also saves memory.
One may argue for dropping the `artists` column since artists may produce songs of different genres. Instead, the column may be replaced by the number of artists present. Observe that the artists are separated by semicolons (;).
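A short sketch of this replacement, continuing from the cleaned `tracks` dataframe (the new column name is an assumption):

```python
# Replace the `artists` column with the number of artists on each track;
# artists are separated by semicolons in this dataset
tracks["num_artists"] = tracks["artists"].str.split(";").str.len()

# Drop the original column and the no-longer-needed track identifier
tracks = tracks.drop(columns=["artists", "track_id"])
```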
Dilemma
The genre feature has one hundred fourteen (114) unique values. Some categories could be grouped together, but differences would remain. For example, J-Pop and K-Pop may both be considered pop music, yet they still differ.
Feature Selection
For now, we will not consider the track genre and will use the remaining audio features. Cluster analysis is usually applied to solely continuous or solely categorical variables. The reason is that using, for example, Euclidean distance in clustering makes no sense for categorical variables. Mixed-type data are more common nowadays, and studies use one-hot encoding to analyze them. For this project, we analyze using solely continuous, solely categorical, and mixed types.
Multicollinearity
To see the correlation between the features, the heatmap corresponding to the correlation coefficients is shown below. None of the features are highly correlated to each other. Danceability and valence are moderately correlated, which is justified since valence represents positiveness, and more positive tracks are typically danceable. However, we choose not to remove valence since it still differs from the danceability score.
Anomaly Detection
As shown in the boxplots, almost all features contain outliers or anomalies.
Model Selection
We must choose clustering algorithms that are insensitive to outliers. Candidates include the density-based spatial clustering of applications with noise (DBSCAN) model and Gaussian mixture models (GMM). GMMs are characterized by high complexity and slow convergence, hence we choose to implement DBSCAN. A closely related algorithm is OPTICS, which is more suitable for large datasets than DBSCAN.
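A hedged sketch of the clustering step, continuing from the `tracks` dataframe; the audio feature subset and the `eps` and `min_samples` values are illustrative assumptions that would require tuning:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Audio features used for clustering (a subset is assumed here)
audio_cols = ["danceability", "energy", "valence", "tempo", "loudness"]
X = StandardScaler().fit_transform(tracks[audio_cols])

# eps and min_samples are illustrative; in practice they require tuning
clusters = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
tracks["cluster"] = clusters  # -1 denotes points labelled as noise
```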
Playlist Creation
By observing the mean (or median) danceability score among the tracks in each cluster, we choose the cluster with the highest score. Moreover, assuming the dance party is for adults, we remove songs with a children or kids genre. Lastly, the dance-themed playlist is completed by filtering the top 50 songs based on danceability; a sketch of this step follows the note below. (The associated code and implementation for this project are located in the Jupyter notebook and the Power BI eXchange file.)
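Continuing from the clustered `tracks` dataframe, the playlist selection described above might look like this; the genre column name and the 'children'/'kids' labels are assumptions about the dataset:

```python
# Pick the cluster with the highest mean danceability
best_cluster = tracks.groupby("cluster")["danceability"].mean().idxmax()
candidates = tracks[tracks["cluster"] == best_cluster]

# Exclude children's genres (genre labels are assumed) and take the top 50
candidates = candidates[~candidates["track_genre"].isin(["children", "kids"])]
playlist = candidates.sort_values("danceability", ascending=False).head(50)
```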
For this project, we use a dataset found on Kaggle presenting the most streamed songs in the year 2023. We showcase some SQL and Power BI skills for personal purposes. First, the data is imported into Microsoft SQL Server Management Studio. The total number of tracks is nine hundred fifty-three (953), with six hundred forty-five (645) distinct artists or groups of artists.
Most of the tracks, specifically four hundred two (402), are songs released in the year two thousand twenty-two (2022). In terms of age, the oldest track is Agudo Magico 3 by Stryx, utku INC, and Thezth, released in 1930. On the other hand, the latest track is Seven by Jung Kook featuring Latto, released on 14 July 2023. In terms of the number of tracks, Taylor Swift has the highest number of songs in the dataset with thirty-four (34), followed by The Weeknd with twenty-two (22).
Now, we examine the songs from the past five years. There are seven hundred sixty-nine (769) tracks, mostly comprised of tracks released in 2022. In this five-year window, the tracks 'As It Was' and 'Blinding Lights' (the latter by The Weeknd) are two of the top ten songs in terms of the number of streams. Non-instrumental songs with a high danceability factor are the most streamed and appear to be preferred by listeners. The average tempo is approximately one hundred twenty-two (122) beats per minute (bpm), which is near the tempo considered perfect for a hit, according to some songwriters. Lastly, as a fun observation that warrants further investigation, about fifteen percent (15%) of the songs were released in May.
In this project, we scrape data from the beer section of the Boozy website using the Beautiful Soup library in Python. Specifically, for each product in the said section, we extract the following:
- Name
- Price
- Star Rating
- Number of Reviews
A screenshot of the first page in the beer section of the website is shown below.
The program for data scraping is shown in the Python file named scraper. In this implementation, we construct a Pandas dataframe where names are stored as strings, prices and ratings as floating point numbers, and numbers of reviews as integers. The following table shows the first five observations; a hedged sketch of such a scraper is shown after the table. Prices are in Philippine pesos while ratings are on a scale of 1 to 5, where 1 is the lowest and 5 is the highest.
Name | Price | Rating | No. of Reviews |
Engkanto Mango Nation - Hazy IPA 330mL Bottle ... | 543.0 | 5.0 | 2 |
Engkanto High Hive - Honey Ale 330mL Bottle 4-... | 407.0 | 4.0 | 5 |
Engkanto Green Lava - Double IPA 330mL Bottle ... | 594.0 | 5.0 | 1 |
Engkanto Live It Up! Lager 330mL Bottle 4-Pack | 407.0 | 5.0 | 1 |
Engkanto Paint Me Purple - Ube Lager 330mL Bot... | 543.0 | 5.0 | 1 |
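A heavily hedged sketch of such a scraper; the URL path and every CSS selector below are placeholders, since the real selectors depend on the markup of the Boozy beer pages:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# NOTE: the URL and every CSS selector below are illustrative placeholders;
# the actual scraper must use the real structure of the Boozy beer section.
url = "https://boozy.ph/collections/beer"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

records = []
for card in soup.select("div.product-card"):  # placeholder selector
    name = card.select_one(".product-title").get_text(strip=True)
    price_text = card.select_one(".price").get_text(strip=True)
    price = float(price_text.replace("₱", "").replace(",", ""))
    rating = float(card.select_one(".rating").get_text(strip=True))
    reviews = int(card.select_one(".review-count").get_text(strip=True))
    records.append({"Name": name, "Price": price,
                    "Rating": rating, "No. of Reviews": reviews})

beers = pd.DataFrame(records)
print(beers.head())
```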
As a web scraping project, the focus is less on producing data for in-depth analysis. However, we can still provide some information about the products. We start by considering the prices. The cheapest beer products, priced at 60 pesos, are the 330mL bottled and 330mL canned versions of Tiger Crystal. On the other hand, the most expensive beer product is the Stella Artois 330mL Bottle Bundle of 24, priced at 3,576 pesos. Next, we consider the number of reviews. The five beer products with the highest number of reviews are shown in the following table.
Name | Reviews |
Heineken 330mL 6-Pack | 99 |
Crazy Carabao Variety Pack #1 | 42 |
Hoegaarden Rosee 750mL | 41 |
San Miguel Pale Pilsen 330mL Can 6-Pack | 33 |
Crazy Carabao IPA 330mL Bottle 6-Pack | 31 |
Now, we consider the star ratings. There are 77 products with a star rating of 5. We can filter this further by also requiring a minimum number of reviews. The four highest-rated beer products with at least 5 reviews are presented in the following table, including their rating and number of reviews; a pandas sketch of this filtering follows the table.
Name | Rating | Reviews |
Pilsner Urquell 330mL Bottle Pack of 6 | 5.0 | 12 |
Sapporo 330mL Bundle of 6 | 5.0 | 12 |
Stella Artois 330mL Bottle Bundle of 6 | 5.0 | 11 |
Paulaner Weissbier Dunkel 500mL Bottle | 5.0 | 11 |
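Continuing from the `beers` dataframe in the scraper sketch above, this filtering might look like:

```python
# Products with a perfect rating and at least 5 reviews, most-reviewed first
top_rated = (beers[(beers["Rating"] == 5.0) & (beers["No. of Reviews"] >= 5)]
             .sort_values("No. of Reviews", ascending=False))
print(top_rated.head(4))
```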
Moreover, the four lowest-rated beer products are shown in the following table.
Name | Rating | Reviews |
Rochefort 8 330mL | 3.0 | 1 |
Royal Dutch Ultra Strong 14% 500mL | 3.5 | 2 |
Tiger Crystal 330mL Can | 3.8 | 5 |
Paulaner Weissbier Party Keg 5L | 3.8 | 5 |