Predicting Online Course Popularity Using LightGBM: A Data Mining Approach on Udemy's Educational Dataset
DOI:
https://doi.org/10.63913/ail.v1i2.11
Keywords:
LightGBM, Online Course Popularity, Machine Learning, Udemy, Predictive Modelling
Abstract
The increasing demand for online education has led to a rapid expansion of platforms such as Udemy, where predicting the popularity of courses can provide valuable insights for course creators and platform managers. This research aims to predict the popularity of online courses on Udemy using LightGBM, a powerful gradient boosting framework that is well-suited for classification tasks. The study begins with a dataset overview, which covers key course features such as payment type (is_paid), price, number of lectures, course level, content duration, subject, published timestamp, and number of subscribers. Preprocessing involves handling missing values, encoding categorical variables, and extracting temporal features from the publication date to capture trends over time. Exploratory Data Analysis (EDA) is conducted to uncover patterns and relationships within the dataset, using descriptive statistics and visualizations to understand distributions and correlations between variables. A correlation heatmap is used to identify significant associations between the predictors and the target variable, course popularity (measured by the number of subscribers). The core of the study employs the LightGBM model, which is trained using a train-test split and evaluated on performance metrics such as accuracy, precision, and recall. The results show that the number of lectures, price, and content duration have the greatest influence on course popularity, while other features, such as course level, have limited impact. A comparison with a baseline model shows that LightGBM outperforms simple mean-based predictions in predictive accuracy. The findings underscore the importance of course content structure and pricing strategies for increasing enrollment.
Finally, the study discusses limitations, such as the lack of course quality metrics, and suggests avenues for future research, including exploring more advanced machine learning techniques and incorporating additional data sources for a more comprehensive model.
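The preprocessing and baseline steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it uses only the Python standard library, the rows are made-up values, every column name other than is_paid is an assumed spelling of the features listed above, and the timestamp format is an assumption. It shows temporal feature extraction, a simple integer encoding of categorical variables, and the error of a mean-based baseline of the kind a trained model would be compared against; fitting LightGBM itself is left out.

```python
from datetime import datetime
from statistics import mean

# Hypothetical rows: values are invented for illustration, and all column
# names except is_paid are assumed spellings of the features in the abstract.
courses = [
    {"is_paid": True, "price": 200.0, "num_lectures": 51, "level": "All Levels",
     "content_duration": 1.5, "subject": "Business Finance",
     "published_timestamp": "2017-01-18T20:58:58Z", "num_subscribers": 2147},
    {"is_paid": True, "price": 75.0, "num_lectures": 274, "level": "All Levels",
     "content_duration": 39.0, "subject": "Business Finance",
     "published_timestamp": "2017-03-09T16:34:20Z", "num_subscribers": 2792},
    {"is_paid": False, "price": 0.0, "num_lectures": 12, "level": "Beginner Level",
     "content_duration": 1.0, "subject": "Web Development",
     "published_timestamp": "2016-12-19T19:26:30Z", "num_subscribers": 9800},
    {"is_paid": True, "price": 45.0, "num_lectures": 36, "level": "Beginner Level",
     "content_duration": 3.0, "subject": "Web Development",
     "published_timestamp": "2015-06-30T17:12:43Z", "num_subscribers": 1276},
]


def preprocess(rows):
    """Encode categoricals as integers and derive year/month from the timestamp."""
    # Integer codes assigned in sorted order of the observed category names.
    level_code = {name: i for i, name in enumerate(sorted({r["level"] for r in rows}))}
    subject_code = {name: i for i, name in enumerate(sorted({r["subject"] for r in rows}))}
    features = []
    for r in rows:
        # Assumed timestamp format: ISO 8601 in UTC with a trailing "Z".
        ts = datetime.strptime(r["published_timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        features.append({
            "is_paid": int(r["is_paid"]),
            "price": r["price"],
            "num_lectures": r["num_lectures"],
            "level": level_code[r["level"]],
            "content_duration": r["content_duration"],
            "subject": subject_code[r["subject"]],
            "publish_year": ts.year,
            "publish_month": ts.month,
        })
    return features


def mean_baseline_rmse(train_targets, test_targets):
    """RMSE of predicting the training-set mean for every test example."""
    prediction = mean(train_targets)
    squared_errors = [(y - prediction) ** 2 for y in test_targets]
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


features = preprocess(courses)
targets = [r["num_subscribers"] for r in courses]

# Toy split for illustration only: first three rows train, last row tests.
baseline_rmse = mean_baseline_rmse(targets[:3], targets[3:])
```

The resulting feature dictionaries could be passed to any gradient boosting library. Integer codes are used here because tree-based models split on thresholds and do not require one-hot encoding, though the encoding the study actually used is not stated in the abstract.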