Improving accuracy of students’ final grade prediction model using optimal equal width binning and synthetic minority over-sampling technique

Jishan, Syed Tanveer; Rashu, Raisul Islam; Haque, Naheena; Rahman, Rashedur M

doi:10.1186/s40165-014-0010-2

Research
Open access
Published: 12 March 2015

Decision Analytics volume 2, Article number: 1 (2015)

Abstract

Demand for higher education has risen steadily all over the world during the last decade; therefore, the need to improve the education system is pressing. Educational data mining is an emerging area in the field of data mining, and it can be applied to better understand the educational systems in Bangladesh. In this research, we present how data can be preprocessed using a discretization method called Optimal Equal Width Binning and an over-sampling technique known as the Synthetic Minority Over-Sampling Technique (SMOTE) to improve the accuracy of the students’ final grade prediction model for a particular course. To validate our method, we used data from a course offered at North South University, Bangladesh. The results of the experiment clearly indicate that the accuracy of the prediction model improves significantly when the discretization and over-sampling methods are applied.

Background

Educational Data Mining (EDM) is an interdisciplinary research area that focuses on the use of data mining in the educational field. Educational data can come from different sources, generally academic institutions, but nowadays online learning systems are also an emerging environment for acquiring educational data, which can be analyzed to extract useful information (Romero & Ventura 2010). The goal of this research is to predict students’ performance using attributes such as Cumulative Grade Point Average, Quiz, Laboratory, Midterm and Attendance marks. To improve the prediction model, we introduced some preprocessing techniques so that the model provides more precise results, which could be used to alert students about their likely final outcome before the final examination.

We received the course data and student information from North South University. After acquiring the data, we preprocessed it and then applied three classification algorithms: Naïve Bayes, Decision Tree and Neural Network. To improve the model, we looked into techniques at the data preprocessing level. First, we discretized the continuous attributes using optimal equal width binning as proposed by Kayah (2008), and then used the Synthetic Minority Over-Sampling Technique (SMOTE) (Chawla et al. 2002) to increase the volume of the data, given that the acquired data contained a limited number of instances. There are four forms of the preprocessed data: the data as acquired, the data with discretization applied, the class-balanced data produced by oversampling, and the data with both discretization and oversampling applied. We built twelve models by preprocessing the data in these four ways and applying each of the three classification techniques. After all the models were built, we compared their accuracy and the precision, recall and F-measure of the class labels. ROC curves were generated for each model, and the Area Under the Curve (AUC) was also calculated and compared.
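The experimental design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variant and classifier labels, and the function names `model_grid`, `per_class_metrics` and `accuracy`, are hypothetical and do not come from the study's own implementation. The sketch enumerates the twelve combinations of data variant and classifier and shows how accuracy and per-class precision, recall and F-measure could be computed from actual and predicted labels.

```python
from itertools import product

# Illustrative labels (assumptions) for the four data variants and three classifiers
PREPROCESSING = ["raw", "discretized", "oversampled", "discretized+oversampled"]
CLASSIFIERS = ["naive_bayes", "decision_tree", "neural_network"]


def model_grid():
    """Return every (preprocessing, classifier) pair: 4 x 3 = 12 models."""
    return list(product(PREPROCESSING, CLASSIFIERS))


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def per_class_metrics(actual, predicted, label):
    """Return (precision, recall, F-measure) for a single class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, with actual labels `["A", "A", "B", "B"]` and predictions `["A", "B", "B", "B"]`, class `"A"` has one true positive, no false positives and one false negative, so its precision is 1.0 and its recall is 0.5.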

Related works

Educational Data Mining is a vast domain with many different applications. Using data mining techniques, it is possible to build course planning systems, detect what type of learner a student is, group similar types of students, predict the performance of students, and help instructors gain insight into how to conduct their classes (Romero & Ventura 2010). Pal and Pal (2013) conducted studies at VBS Purvanchal University, Jaunpur, India, and used classification algorithms to identify the students who need special advising or counseling from their teachers.

Ayers et al. (2009) used several clustering algorithms, such as hierarchical agglomerative clustering, K-means and model-based clustering, to understand the skill levels of students and group them based on their skill sets. Bharadwaj and Pal (2012) found that students’ grades in the senior secondary exam, living location, medium of teaching, mother’s qualification, family annual income, and family status are strongly correlated and help to predict how students perform academically. In another study, Bharadwaj and Pal (2011) used students’ previous semester marks, class test grades, seminar performance, assignment performance, general proficiency, class attendance and lab work to predict end-of-semester marks.

A comparison of machine learning methods has been carried out to predict success in a course (either passed or failed) in Intelligent Tutoring Systems (Hämäläinen & Vinni 2006). Different types of rule-based systems have been applied to predict student performance, such as mark prediction in an e-learning environment using fuzzy association rules (Nebot et al. 2006). Several classification algorithms, such as discriminant analysis, neural networks, random forests and decision trees, have been applied to classify university students into three groups: low-risk, medium-risk and high-risk of failing (Superby et al. 2006).

Zhu et al. (2007) explain how to build a personalized learning recommendation system that tells learners beforehand what they should learn before moving to the next step. Yadav et al. (2012) used students’ attendance, class test grades, seminar and assignment marks, and lab work to predict students’ performance at the end of the semester. They used the decision tree algorithms ID3, CART and C4.5 and made a comparative analysis. In their study, they achieved 52.08%, 56.25% and 45.83% accuracy with these classification techniques, respectively.

Prati et al. (2004) discussed recent work in the field of data mining on overcoming the imbalanced dataset problem. They mainly focused on concepts and methods for dealing with imbalanced datasets. Chawla et al. (2002) found that the majority class and the minority class both have to be equally represented for a dataset to be balanced. They combined over-sampling of the minority class with under-sampling of the majority class to achieve better classifier performance in ROC space. Their main contribution was the Synthetic Minority Over-sampling approach, a new over-sampling technique that gives better results when combined with under-sampling.
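The core idea of synthetic minority over-sampling can be sketched as follows. This is a simplified illustration, not the implementation used in this study: the function names `smote` and `squared_distance`, the default `k=5`, the `seed` parameter and the random choice of base sample are all assumptions made for the example. Each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Assumption: the base sample is drawn at random for each new point
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

Because each new point is a convex combination of two existing minority samples, the synthetic data stays inside the region already occupied by the minority class instead of duplicating existing rows.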

Chen (2009) used several re-sampling techniques to find the maximum classification accuracy from a fully labeled, imbalanced training data set. SMOTE, over-sampling by duplicating minority examples, and random under-sampling were mainly used to create new training data sets. Standard classifiers such as Decision Tree, Naive Bayes and Neural Network were trained on these data sets, and all of the classifiers except Naive Bayes showed improved accuracy. Rahman and Davis (2013) addressed the class imbalance issue in medical datasets. They used under-sampling techniques as well as over-sampling techniques such as SMOTE to balance the classes.

Some studies have used neural networks to predict students’ grades. Gedeon and Turner (1993) compared different types of neural network models that have been used to predict final student grades; they mainly used backpropagation and feedforward neural networks. Wang and Mitrovic (2002) used feedforward and backpropagation to predict the number of errors a student will make. Oladokun et al. (2008) used a multilayer perceptron topology to predict the likely performance of a candidate being considered for admission into the university.

We can see that there are a handful of works on grade prediction models; however, our focus was to address class imbalance and to discretize the continuous attributes effectively instead of relying on an assumption such as a normal distribution. The primary goal was to observe whether the synthetic minority over-sampling method and optimal equal width binning together result in better performance of grade prediction models, given that most of the attributes in course mark sheets or data sets are continuous in nature and the number of instances was low.
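Equal width binning itself is simple to illustrate. The sketch below is a minimal version under stated assumptions: the function names `equal_width_bins` and `pick_bin_count` are hypothetical, and because this part of the text does not describe how the optimal number of bins is selected, the selection step takes a caller-supplied scoring function (for instance, the accuracy of a classifier trained on the binned attribute) rather than claiming to reproduce the cited method.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0 .. n_bins - 1,
    where all bins span the same width between min(values) and max(values)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into one bin
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min() keeps the largest value inside the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def pick_bin_count(values, candidates, score):
    """Return the candidate bin count whose binned output scores highest.

    `score` is a hypothetical caller-supplied function that takes the list
    of bin indices and returns a number (higher is better).
    """
    return max(candidates, key=lambda k: score(equal_width_bins(values, k)))
```

With marks `[0, 25, 50, 75, 100]` and four bins, each bin is 25 marks wide, so the values map to bin indices `[0, 1, 2, 3, 3]`; the maximum value is kept in the last bin.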

Methods

Data selection

The dataset we use contains 181 instances, which is the number of students enrolled in the course during the prior 18 months. The dataset is from a course titled “Numerical Analysis”, a core course in the EEE discipline at North South University, Dhaka, Bangladesh. Originally, the dataset had student ID, student name, five quiz marks, midterm marks, attendance, laboratory marks, final marks and final grade as attributes. Rather than taking all the quizzes into account individually, we selected the attribute that contains the percentage of marks obtained by the students in quizzes. Final grade is considered as the class label. The same dataset is used to create the over-sampled dataset, in which the number of instances is 360.

Data preparation

First, we discarded the student ID from the dataset, since it is not directly required for classification. Students’ CGPA, which was not initially part of the dataset, was retrieved and added as an attribute. All the attributes used for classification are listed in Table 1.

Table 1 Attributes of the dataset
