In this tutorial, we will predict the likelihood that students would succeed (or fail) in a particular assignment in a course. For this prediction task, we will build a classification model that accurately categorises students into two groups: those who achieve a score above 60 (i.e., success) vs others.
To make this classification, we would normally need labelled data to train a machine learning model. However, if we want to offer early help to the students who are likely to fail the assignment, we have to identify them before the submission period even starts. That is, we have to make the predictions before the true class labels exist, so that students (or instructors) can still act on them.
For this purpose, we will explore two transfer learning techniques, namely in-situ learning and transfer across courses, to train models that can predict student engagement in an upcoming assignment. These techniques are simple to apply yet powerful for generating actionable predictions.
In-situ learning is a transfer learning approach used in MOOCs (Massive Open Online Courses) [ref]. Within a particular learning context, this approach involves (1) deriving a proxy label based on student engagement (or success) in a past activity (such as a completed assignment), (2) training a predictive model with the proxy-labelled data, and (3) using the model to make predictions for the targeted future task.
The accuracy of predictions with in-situ learning depends heavily on whether the features that predict student success in the proxy task also predict it in the target task. That is, if the way students engaged (or disengaged) in learning activities, as measured by the features, predicted their success or failure in the proxy task, and students maintain the same patterns, then their success or failure should (more or less) repeat in the target task.
Figure 1 below illustrates how we will apply the in-situ learning approach in the current context.
Let’s put into words what this figure tells us. First, we generate features (features1) using the student activity data until the date when the first assignment (a1) starts. Since a1 is already completed, we can generate labels based on student submissions (class_labels1). So, we have the feature set and the class labels to form the training set (trainingset1), which we will use to train a classifier. Then, we repeat the same procedure to generate the same features (features2), this time using the student activity data until the second assignment (a2) starts. In the final step, we feed these features into the trained model to generate the predictions, which will be available before the activity itself starts.
Transfer Across Courses
The second technique covered in this tutorial for producing actionable predictions is transfer across courses. This technique involves training a model using the data from a past course and using this model in an ongoing course to make predictions regarding student engagement/success in an upcoming task. For this technique to produce accurate results, the courses must be similar (or identical) in terms of their learning designs. For this reason, transferring predictive models between different runs of the same course (with no or minor changes in the learning design) is highly likely to result in more accurate predictions. That is, the key is to transfer the effect of the learning design on student engagement and success.
For example, let’s say we train a model in the first run of a course where quizzes are mandatory elements connected with an assignment. Using several features about students’ quiz engagement and success, we train a model and transfer it to another course with a different learning design, where quizzes are optional and serve for practice. In this scenario, we should not expect good performance, since the role of quizzes in student learning differs between the two courses, which will influence how students engage in quizzes and how this engagement relates to their performance in the assignment.
The following figure visualises the transfer-across-courses approach, predicting student success in an upcoming assignment with a model trained on data from the first run of the same course.
The Experiment Data
The Course Structure
The experiment data belongs to an (imaginary) course with the learning design provided in Figure 3. Please note that, for the scope of this tutorial, we focus on the first 2 weeks (or modules).
Based on Figure 3, the first two weeks follow the same structure:
- Introduction: First page in the module to introduce the learning objectives,
- Content: The content page where the learning concepts are explained comprehensively,
- Video: The video explanation of the concepts,
- Practice Example: A practice example to apply the concepts in practice,
- Discussion: A discussion forum to ask/answer questions about the concepts,
- Quiz: A multiple-choice quiz (#1 could be taken only once; no constraints for #2),
- Assignment: The end-of-the-module assignment to be submitted by the student individually (graded on a scale of 0-100).
The Log Data
The data about students’ learning activities (also called trace data) that you have at hand determines the features that you can potentially generate to build your predictive model. The structure and the content of the educational data are affected by several factors. First, the general structure of the data is largely shaped by the learning design of the course. For example, if there are videos in the course, then the dataset is likely to include tables that record students’ play/pause activities on each video. That is, the educational data is mapped to the learning design of the course.
Second, the content of the data depends on the type of the student interactions (with the course components) that are traced. For example, many MOOC databases include logs of the page visits but do not tell the duration of each visit, which would not allow us to generate a feature about the time spent on ‘important’ pages (although it could be a significant indicator of student engagement).
In our experiment, we will use the following log data. Please download the files and store them in a specific folder on your computer. We will start to explore them in the next section.
Exploring the Data and Generating the Features
OK, let’s start coding! Let’s run Jupyter Notebook, and navigate to the folder where the data files are saved. Then, let’s create a new Notebook file and name it “feature-generation”.
Before we continue, we need to import several Python libraries that we will use in this prediction task. The first library is pandas (which stands for Python Data Analysis Library). It is one of the most widely used libraries for data science with Python. pandas can read a variety of data sources (e.g., csv, tsv, xls, SQL databases) and stores the data in an object called a dataframe, which is composed of rows and columns (similar to tables in SQL, Excel, and R). The dataframe comes with a rich set of features and methods that facilitate editing the data. We will also import numpy, a scientific computation library, and matplotlib, a data visualisation library.
The following code (Figure 3) imports these three libraries. Please note that, by convention, the code block that imports the libraries needed throughout the notebook goes at the top of the page (generally in the first cell).
Data files that we will use in this exercise are:
These are the files you have downloaded from Google Drive, and they should be placed in the same directory as the notebook file that you are currently working on.
Page view features
The first feature set that we will create concerns the page view activities of students. You may remember that there are three types of pages (i.e., introduction, content, practice example), and we want to generate one page view feature for each page type.
Let’s load the log data (called ‘page-view-logs.csv’) using the read_csv function of pandas, which receives the file name as its parameter. read_csv converts the content of the csv into a dataframe, which we assign to a variable called df_pw_logs. Then, using the .head() method we display the top 3 rows of the dataframe, and using the .shape attribute (an attribute, so it is written without parentheses) we display the number of rows and columns of the dataframe. The detail of the code is provided below:
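As a minimal sketch of this loading step, the example below reads a small in-memory stand-in for the log file so that it can run without the download. The column names (student_id, page_id, page_type) are assumptions for illustration; the real file may use different ones.

```python
import io
import pandas as pd

# In-memory stand-in for 'page-view-logs.csv'; the column names are assumed.
sample_csv = io.StringIO(
    "student_id,page_id,page_type\n"
    "s1,1,introduction\n"
    "s1,1,introduction\n"
    "s2,2,content\n"
    "s2,3,introduction\n"
)

# With the downloaded file you would call: pd.read_csv('page-view-logs.csv')
df_pw_logs = pd.read_csv(sample_csv)

print(df_pw_logs.head(3))  # first three rows
print(df_pw_logs.shape)    # (number of rows, number of columns)
```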
In the following code, we create the page view counts for the Introduction pages. The code explanation is provided in the comments within the code. Please note that the code in the first cell creates the view counts for the Introduction pages in the first module, and the second calculates the same count for the second module. Note that the page ids for the first module are 1 and 2, and for the second module are 3 and 4.
You may notice that the code in the second cell repeats the code in the first cell. To prevent redundancy in our code, we can define a function that returns each student’s page views for a particular page requested by its id. Please see the code below:
Note that the code above is almost identical to the code we used before, with the small difference that we filter the page counts by a specific page, provided in the page_id parameter to our function. With this function, we can compute the view counts for any page we want (i.e., page view features) and then we can combine all these features to create a single dataframe. Please check the following code for this purpose:
Everything looks perfect so far. Now, we will create the same type of features for the “Content” pages. We can create a new function adapted from the previous one:
But wait, we created the function to prevent redundancy in our code in the first place, but now we are repeating almost the same function for a different type of page. To avoid this problem, we can create a more generic function that filters the page views by a page type provided as another input. Please check the following code:
Using this generic function, let’s generate all page view features and combine them into a single dataframe:
The code looks a little messy, right? This is because we had to call the function not only for each page type, but also for all possible page ids. Imagine what would happen if we had 10 different pages for each page type. It would probably become unmanageable.
This calls for a better function: our generic function is not generic enough. One revision that we can incorporate is to automatically generate features for all pages available for a specific page type. That is, our function can identify the page ids for a given page type and create a page view feature for each identified page id. Please take a look at the following function (the explanation of the code is given through comments):
Now let’s use our refined function to create the page view features and combine all of them:
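A helper of this kind could be sketched as follows. The function name intro_page_views, the feature names, and the column names of the log are hypothetical choices for illustration, and the log here is a toy dataframe rather than the tutorial’s file.

```python
import pandas as pd

# Toy page-view log; column names (student_id, page_type, page_id) are assumed.
df_pw_logs = pd.DataFrame({
    "student_id": ["s1", "s1", "s2", "s2", "s3"],
    "page_type":  ["introduction"] * 5,
    "page_id":    [1, 1, 1, 2, 2],
})

def intro_page_views(logs, page_id):
    """Return one row per student with the view count of one Introduction page."""
    mask = (logs["page_type"] == "introduction") & (logs["page_id"] == page_id)
    counts = logs[mask].groupby("student_id").size()
    return counts.rename(f"intro_views_{page_id}").reset_index()

views_1 = intro_page_views(df_pw_logs, 1)
views_2 = intro_page_views(df_pw_logs, 2)

# An outer merge keeps students who viewed only one of the pages;
# their missing counts are filled with 0.
features = views_1.merge(views_2, on="student_id", how="outer").fillna(0)
print(features)
```

The outer merge matters here: with an inner merge, a student who never opened one of the pages would silently disappear from the feature table.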
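One possible shape for such a refined function is sketched below. It discovers the page ids on its own by pivoting the counts, so no page id has to be listed by hand. The function name page_view_features and the column names are assumptions, and the data is a toy log.

```python
import pandas as pd

# Toy log; the column names are assumptions, not the tutorial's actual schema.
df_pw_logs = pd.DataFrame({
    "student_id": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "page_type":  ["introduction", "content", "introduction",
                   "introduction", "content", "content"],
    "page_id":    [1, 1, 1, 3, 1, 1],
})

def page_view_features(logs, page_type):
    """Build one view-count column per page id found for the given page type."""
    subset = logs[logs["page_type"] == page_type]
    # Rows: students, columns: page ids, cells: number of views.
    counts = subset.groupby(["student_id", "page_id"]).size().unstack(fill_value=0)
    counts.columns = [f"{page_type}_views_{pid}" for pid in counts.columns]
    # Students with no views of this page type still get a row of zeros.
    all_students = sorted(logs["student_id"].unique())
    return counts.reindex(all_students, fill_value=0)

# One call per page type; the page ids are picked up automatically.
page_types = ["introduction", "content"]
df_pw_features = pd.concat(
    [page_view_features(df_pw_logs, t) for t in page_types], axis=1
)
print(df_pw_features)
```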
The next feature set that we will create is the number of discussion posts made by each student for each discussion forum. This task sounds easier since we do not have “type” of discussion (like “page types”). Let’s first load the discussion post logs.
Now, building on the previous generic function, we can easily write a function for generating the features about students’ discussion activities:
Using this function, we can easily create all the discussion features as seen below:
Video and Quiz features
Next, we will create two sets of features: (1) the number of play/pause activities performed by each student for each unique video; and (2) the number of times students took each particular quiz. Let’s first take a look at these data:
The structures of both sets look very similar to the structure of the discussion data set. Taking advantage of this similarity, we can actually build a more generic function that we can use for discussion, video and quiz features. Please see the code below:
Note that this function requires 3 input parameters:
- logs: the log data (discussion, video, or quiz activities)
- col: the column where the unique id is stored (forum_id, video_id, quiz_id)
- feature_name: the name of the feature being created
Using this function we can create all features as seen below:
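A generic counter with the three parameters listed above might look like the sketch below. The function name count_features, the student_id column, and the feature name prefixes are assumptions, and the logs are toy dataframes.

```python
import pandas as pd

def count_features(logs, col, feature_name):
    """Count log rows per student for each unique id stored in `col`."""
    counts = logs.groupby(["student_id", col]).size().unstack(fill_value=0)
    counts.columns = [f"{feature_name}_{uid}" for uid in counts.columns]
    return counts.reset_index()

# Toy logs; 'student_id' and the id column names are assumed.
df_video_logs = pd.DataFrame({
    "student_id": ["s1", "s1", "s2"],
    "video_id":   [1, 1, 2],
})
df_quiz_logs = pd.DataFrame({
    "student_id": ["s1", "s2", "s2"],
    "quiz_id":    [1, 1, 1],
})

df_video_features = count_features(df_video_logs, "video_id", "video_actions")
df_quiz_features = count_features(df_quiz_logs, "quiz_id", "quiz_attempts")
print(df_video_features)
print(df_quiz_features)
```

The same call works for the discussion logs by passing the forum id column and a suitable feature name.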
Now that all the features are created, we can save them to the computer as CSV files.
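Saving can be sketched as below. The output file name is hypothetical, and a temporary folder is used only so that the example leaves no files behind; in the tutorial you would write to your working folder instead.

```python
import os
import tempfile
import pandas as pd

# Toy feature table standing in for the generated page view features.
df_pw_features = pd.DataFrame({
    "student_id": ["s1", "s2"],
    "introduction_views_1": [2, 1],
})

# Temporary folder and file name are illustrative assumptions.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "page-view-features.csv")

# index=False avoids writing the row index as an extra unnamed column.
df_pw_features.to_csv(out_path, index=False)

# Reading the file back is a quick check that nothing was lost.
reloaded = pd.read_csv(out_path)
print(reloaded)
```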
Training and Testing the Model
We have created all the features that we will use to build our classification model. We will create a new notebook for this task, named ‘training and testing the model’, in the same working folder. First, let’s import the libraries that we will use:
Next, let’s import the grades of the first assignment. Remember that the grades for both assignments are stored in the file “assignment-grads.csv”.
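The labelling step can be sketched as follows, using a toy dataframe in place of the grades file. The column names student_id and grade_a1 are assumptions; only the PASS column name and the threshold of 60 come from the tutorial.

```python
import pandas as pd

# Toy stand-in for the grades file; the column names are assumed.
df_grades = pd.DataFrame({
    "student_id": ["s1", "s2", "s3", "s4"],
    "grade_a1":   [75, 60, 61, 20],
})

# 1 = success (score above 60), 0 = failure (60 or below).
df_grades["PASS"] = (df_grades["grade_a1"] > 60).astype(int)

print(df_grades)
print(df_grades["PASS"].value_counts())  # class sizes
```

Note that a score of exactly 60 is labelled 0, since success is defined as a score above 60.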
As you see in the code above, we converted the student grades into a binary variable and assigned it to a new column called “PASS”. Students with grades higher than 60 (n = 539) are labelled as 1 (i.e., success), and the others (n = 261) are labelled as 0 (i.e., failure).
Let’s load the features we created, starting from the page view features. You may remember that we stored all page view features (for both modules) in a single file. To train the model, we will use the features created from the Module #1 data. To make predictions and to test the model later, we will use the features created from the Module #2 data. So, basically we need to separate these features into different sets as seen in the code:
However, writing all the column names (i.e., feature names) one by one is a little cumbersome. We know that the feature names ending with 1 or 2 belong to the first module, and the names ending with 3 or 4 belong to the second one, because these numbers (the page ids) are attached to the names of the features. We can take advantage of this to filter by the column names as seen in the code below:
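A minimal sketch of this suffix-based split is shown below. The helper name module_columns and the feature names (with an underscore before the page id) are assumptions for illustration.

```python
import pandas as pd

# Toy feature table covering both modules; the feature names are assumed.
df_pw_features = pd.DataFrame({
    "student_id": ["s1", "s2"],
    "introduction_views_1": [2, 1],
    "content_views_2": [0, 3],
    "introduction_views_3": [1, 1],
    "content_views_4": [4, 0],
})

def module_columns(df, suffixes):
    """Return student_id plus the columns whose names end with one of `suffixes`."""
    cols = [c for c in df.columns if c.endswith(tuple(suffixes))]
    return df[["student_id"] + cols]

features_m1 = module_columns(df_pw_features, ["_1", "_2"])  # training features
features_m2 = module_columns(df_pw_features, ["_3", "_4"])  # prediction features
print(features_m1.columns.tolist())
print(features_m2.columns.tolist())
```

Matching on the underscore plus the digit, rather than the digit alone, keeps the filter from picking up unrelated names that merely end in the same digit.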
We can follow the same approach to separate the rest of the features:
Now, we need to combine the features with the labels and display the result:
Above, we used the describe() method to obtain basic statistics about all the features.
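Combining the two tables could look like the sketch below, which joins on an assumed student_id column. The dataframes and feature names are toy stand-ins.

```python
import pandas as pd

# Toy module 1 features and labels; all names here are illustrative assumptions.
features_m1 = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "content_views_1": [4, 0, 2],
    "quiz_attempts_1": [1, 0, 3],
})
labels_a1 = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "PASS": [1, 0, 1],
})

# An inner join keeps only students present in both tables.
data1 = features_m1.merge(labels_a1, on="student_id", how="inner")

print(data1.head())
print(data1.describe())  # count, mean, std, min, quartiles, max per numeric column
```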
Training the Model
Let’s train the model. We will use Logistic Regression as the classifier algorithm. After training the model, we can try to predict the success/failure in the same assignment:
Unsurprisingly, the results are quite positive. This is because we used the same dataset for training and testing. Note that we also used the Area Under the Curve (AUC) measure to assess the model performance. In the social sciences, AUC scores above .70 are generally considered good performance. We will soon use classifier_grade1 to predict success in the second assignment.
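The training step can be sketched with scikit-learn as follows. The data is synthetic and the feature names are assumptions; only the classifier name classifier_grade1 and the choice of Logistic Regression, accuracy, and AUC come from the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic module 1 data; in the tutorial this would be the merged table.
rng = np.random.default_rng(0)
n = 200
data1 = pd.DataFrame({
    "content_views_1": rng.poisson(3, n),
    "quiz_attempts_1": rng.poisson(1, n),
})
# Synthetic label: more activity makes passing more likely.
score = data1["content_views_1"] + 2 * data1["quiz_attempts_1"] + rng.normal(0, 1, n)
data1["PASS"] = (score > score.median()).astype(int)

X1 = data1.drop(columns="PASS")
y1 = data1["PASS"]

classifier_grade1 = LogisticRegression(max_iter=1000)
classifier_grade1.fit(X1, y1)

# Evaluating on the training data itself gives an optimistic estimate.
pred = classifier_grade1.predict(X1)
proba = classifier_grade1.predict_proba(X1)[:, 1]  # probability of PASS = 1
train_accuracy = accuracy_score(y1, pred)
train_auc = roc_auc_score(y1, proba)
print("accuracy:", train_accuracy)
print("AUC:", train_auc)
```

AUC is computed from the predicted probabilities rather than the hard 0/1 predictions, so it reflects how well the model ranks students across all possible thresholds.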
Let’s check the coefficients assigned to each feature:
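One way to inspect the coefficients is to pair them with the feature names in a pandas Series, as in this sketch. The data is synthetic and the feature names are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features; the names are illustrative assumptions.
rng = np.random.default_rng(0)
X1 = pd.DataFrame({
    "content_views_1": rng.poisson(3, 200),
    "video_actions_1": rng.poisson(5, 200),
})
# Synthetic labels that depend only on content views.
y1 = (X1["content_views_1"] + rng.normal(0, 1, 200) > 3).astype(int)

classifier_grade1 = LogisticRegression(max_iter=1000).fit(X1, y1)

# coef_ has shape (1, n_features) for a binary problem, so take the first row.
coefs = pd.Series(classifier_grade1.coef_[0], index=X1.columns).sort_values()
print(coefs)
```

Keep in mind that the size of a coefficient depends on the scale of its feature, so coefficients of unscaled features are not directly comparable in magnitude.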
You may see that some of the features are negatively related to the grades. These coefficients are likely to be affected by the learning design of the course, because the learning design shapes how students engage in learning activities, which in turn relates to their assignment grades.
Testing the Model
Let’s load the scores of the assignment data and then merge them with the features for the second module. We repeat the same steps as we did for the first module.
Now let’s feed the features to the classifier_grade1 that we have created above and make some predictions:
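The prediction step can be sketched as below with synthetic data for both modules. One practical detail is an assumption about how the features are named: if the module 2 columns carry different names (for example a different page id suffix), they need to be mapped onto the training column names, because scikit-learn checks the feature names of a model fitted on a dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def make_toy_module(seed, names, n=300):
    """Synthetic features and PASS labels; purely illustrative."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({name: rng.poisson(3, n) for name in names})
    score = df.sum(axis=1) + rng.normal(0, 1, n)
    df["PASS"] = (score > score.median()).astype(int)
    return df

feat1 = ["content_views_1", "quiz_attempts_1"]  # module 1 feature names (assumed)
feat2 = ["content_views_3", "quiz_attempts_3"]  # matching module 2 features (assumed)
data1 = make_toy_module(1, feat1)
data2 = make_toy_module(2, feat2)

classifier_grade1 = LogisticRegression(max_iter=1000).fit(data1[feat1], data1["PASS"])

# The model expects the training column names, so map module 2 names onto them.
X2 = data2[feat2].rename(columns=dict(zip(feat2, feat1)))
pred2 = classifier_grade1.predict(X2)
proba2 = classifier_grade1.predict_proba(X2)[:, 1]

test_accuracy = accuracy_score(data2["PASS"], pred2)
test_auc = roc_auc_score(data2["PASS"], proba2)
print("accuracy:", test_accuracy)
print("AUC:", test_auc)
```

In a live course the true labels for the second assignment would not exist yet at prediction time; they are used here only to test the model after the fact.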
As you can see from the results, the accuracy of the predictions has dropped. Still, the model trained on the first assignment did not perform badly in predicting achievement in the second assignment. The results could be improved by exploring the influence of the learning design on student activities and mining more features.
Let’s assume a base model in which predictions are generated from the first assignment. That is, students who failed the first assignment are predicted to fail the second assignment as well. Here is the code:
According to the performance scores, the base model performed a lot worse than our predictor model. Had it performed similarly, one could argue that building the model is unnecessary, as success in the second assignment could be predicted simply from performance in the first assignment.
Out of curiosity, let’s train a model with cross validation (CV) using data2, and check the coefficients assigned to the features. This way we can compare them with the coefficients of the model trained for the first assignment. There are probably many differences, which would explain the lower performance above.
If you compare the coefficients with the previous ones, you will notice many differences and some similarities. This comparison can help us understand how students behaved in each module and why, reflect on the reasons why we could not obtain very accurate results, and consider how we could improve them with new features. To do this, we would need the learning design of the course. Given that the data for this tutorial was imaginary, we will leave this task for the future.
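Such a base model needs no training at all, as the sketch below shows: the pass/fail outcome of the first assignment is used directly as the prediction for the second. The grades and column names are toy assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy grades for both assignments; the column names are assumed.
df_grades = pd.DataFrame({
    "student_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "grade_a1":   [80, 40, 65, 90, 30, 70],
    "grade_a2":   [70, 75, 50, 85, 20, 61],
})

pass_a1 = (df_grades["grade_a1"] > 60).astype(int)  # baseline prediction
pass_a2 = (df_grades["grade_a2"] > 60).astype(int)  # true labels

baseline_accuracy = accuracy_score(pass_a2, pass_a1)
baseline_auc = roc_auc_score(pass_a2, pass_a1)
print("baseline accuracy:", baseline_accuracy)
print("baseline AUC:", baseline_auc)
```

Because the baseline outputs only 0 or 1 rather than a probability, its AUC is based on a single threshold, which is worth remembering when comparing it with the classifier’s AUC.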
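One way to do this is sketched below using scikit-learn’s cross_val_score with 5 folds; this is an assumed choice, and the original notebook may have used a different cross validation setup. The data is synthetic and the feature names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for data2 (module 2 features and labels).
rng = np.random.default_rng(42)
n = 300
X2 = pd.DataFrame({
    "content_views_3": rng.poisson(3, n),
    "quiz_attempts_3": rng.poisson(2, n),
})
score = X2["content_views_3"] + X2["quiz_attempts_3"] + rng.normal(0, 1, n)
y2 = (score > score.median()).astype(int)

model = LogisticRegression(max_iter=1000)

# Each fold is scored on data the model did not see during fitting.
auc_scores = cross_val_score(model, X2, y2, cv=5, scoring="roc_auc")
print("mean AUC:", auc_scores.mean())

# cross_val_score fits clones, so fit once on all data to read the coefficients.
model.fit(X2, y2)
coefs2 = pd.Series(model.coef_[0], index=X2.columns)
print(coefs2)
```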