Assorted design notes documenting decisions.
This document records design decisions, explaining them and, where appropriate, highlighting future implications. Be aware, though, that it is rather a brain dump.
The first design decision to note is that these scripts are not intended as a fully refined package. They began as individual scripts meeting the evolving needs of the Measuring Qualification Effects (MQE) project. In the follow-on Refining a Framework project, additional scripts were developed covering additional functionality, and some refactoring was applied. There isn't the time and money available to address all of the technical debt in these scripts. However, they do what they are intended to do, and that's all right.
From the original MQE project's analysis of student forum posts we apply four criteria, all implemented as regular expressions in mqe_regexs:
While you could get away with a simple string search for 'http', it is not implemented that way in this framework.
Python's string find() method does work as a means of searching for 'http' in a post. Indeed, if passed 'http' it will match both 'http' and 'https', the former being a substring of the latter. However, this approach means that any use of 'http' in a post will match. There are very few uses of 'http' in the text of a post and, as it happens, in all cases for which we currently have data, that post also has exactly the sort of external link we are looking for. However, there is no guarantee that this will hold true in future forum analyses.
Therefore, the search for the hypertext protocol as an indicator of an external link is implemented as a regular expression that looks for a following colon too.
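For illustration, a pattern along these lines does the job; the actual expression lives in mqe_regexs and may differ in detail:

```python
import re

# Illustrative only: 'https?:' matches 'http:' and 'https:',
# but not bare mentions of the word 'http' in running text.
HTTP_RE = re.compile(r'https?:')

assert HTTP_RE.search('see http://example.org')         # external link: matched
assert HTTP_RE.search('see https://example.org')        # https too
assert not HTTP_RE.search('the http protocol is used')  # bare mention: skipped
```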
Forums with 'preferences' in them, found when looking for 'eferences':

| Forum | regex: with | regex: w/o | regex: total | string: with | string: w/o | string: total |
|---|---|---|---|---|---|---|
| M813 15E Software issues | 4 | 48 | 52 | 5 | 47 | 52 |
| M813 14E Tracking the leading edge forum | 187 | 279 | 466 | 189 | 277 | 466 |
| M813 14E Organisation and scope forum | 14 | 544 | 558 | 15 | 543 | 558 |
| M811 15K Module discussion | 711 | 1476 | 2187 | 713 | 1474 | 2187 |
The simple inline reference is favoured by students because it is easy to type into a forum post using the VLE editor. It seems to work as a proxy for the full Harvard-style reference that students are meant to use.
The scripts are run directly from a command line. To change the folders a script uses, you need to edit the script. Alternatively, the folders could be passed as CLI arguments when the script is invoked, or be selected via a GUI.
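Were the CLI-argument route taken, a sketch along these lines would do; the argument names and defaults here are hypothetical, not what the scripts currently use:

```python
import argparse
from pathlib import Path

# Hypothetical sketch: the scripts currently hard-code their folders.
parser = argparse.ArgumentParser(description='Process exported forum XML.')
parser.add_argument('--input-dir', type=Path, default=Path('forum-data'),
                    help='folder containing the forum XML files')
parser.add_argument('--output-dir', type=Path, default=Path('reports'),
                    help='folder to write the reports to')
args = parser.parse_args()
```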
Only the final version of a post is included in the analysis. Earlier drafts are excluded by testing the post's 'oldversion' value. This test is implemented in the mqe_common.is_current() function, used in several scripts.
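A minimal sketch of what such a test can look like, assuming posts are parsed into dicts of their XML element values and that, as in Moodle exports, the final version of a post carries an oldversion value of '0' (the actual implementation is in mqe_common):

```python
def is_current(post):
    # Assumption: Moodle marks the final version of a post with
    # oldversion == '0'; earlier drafts carry other values.
    return post.get('oldversion') == '0'

assert is_current({'oldversion': '0'})      # final version: kept
assert not is_current({'oldversion': '1'})  # earlier draft: excluded
```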
All scripts are Python, so we use this efficient, Python-native form of data storage. It might be an issue if data has to be shared with tools written in another language, in which case swap to JSON or YAML.
A core use of the framework is to filter forum posts based on four filtering criteria. See Filtering criteria. To assist in refining the criteria, the reports-filterfalse folder contains reports showing all the posts that did not meet the filtering criteria.
Each forum has an *.html file, listing the excluded posts. For each listing, there is the id of the post, the userid of the post’s author, the post’s creation date and the post itself in its original html format.
The workflow to produce these reports matches that used to produce the filtered reports: reading the forum XML data, current posts only are selected and then filtered. In this case, the filter criteria are reversed, so that posts not meeting the criteria are written to the filterfalse report.
The reports are produced by the filterfalse_posts script. The script is derived from the filter_posts script by:

- reversing the filter test in filter_forum(), the function being renamed filterfalse_forum() in consequence, and
- changing the _filter epithet in the report file names.

A sketch of the reversal follows this list. These reports enable an academic wishing to refine the existing filter criteria, or to develop new ones, to engage in a focused review of only those posts that are excluded by the criteria.
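itertools.filterfalse is the obvious building block for this reversal; the predicate below is a stand-in for the four criteria, and the scripts' actual implementation may differ:

```python
import re
from itertools import filterfalse

# Illustrative predicate standing in for the four filtering criteria.
def meets_criteria(post):
    return bool(re.search(r'https?:', post))

posts = ['see http://example.org', 'hello all', 'ref https://example.com']

included = list(filter(meets_criteria, posts))       # the filtered report
excluded = list(filterfalse(meets_criteria, posts))  # the filterfalse report
print(excluded)  # ['hello all']
```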
While the intention of this project, and subsequent projects, is to study the effect of teaching interventions, in the exported Moodle forum XML data we have only the anonymised Moodle userid. This means it is not possible to distinguish between tutors, moderators and students. A consequence of this can be seen in the study-path results. The study-path analysis, produced by trace_userids.py, is meant to show the path of study across the courses taken by students. On reviewing the study-path results when testing the script, it was immediately apparent that several IDs, such as 016185, are present in both M816 14K and M816 15K, which seemed curious. Looking at the appropriate forum post web pages, it quickly became obvious that this userid is that of a tutor who worked on both presentations.

So, the reports and table include staff, that is tutors and moderators, as well as students; hence the use of the term userid rather than student in the output. Most staff enrolled on Moodle some years ago and hence have much lower userid numbers than their students, though that is not a reliable way to distinguish them. A consequence of this confusion is that staff posts are subject to the same filtering and classifying as student posts. Therefore, for future use, it is recommended to compile a list of staff Moodle userids so that they and their associated posts can be excluded from processing.
The two study-paths outputs produced by trace_userids.py simply report whether a userid, be it that of a student or a tutor, has posted to one of that presentation's forums. Therefore, there is the possibility that a student will not be recorded; the userids of active participants only are captured. In the study to date, all students are active participants in the forums and so are traced, because the courses require students to contribute to a forum as part of an eTMA. Clearly, when applying the framework in future, this may not hold for other courses.
To reduce the work of analysing forum text in suggest_keywords.prepare_text(), stopwords are removed. The list used is that developed by the University of Glasgow's School of Computer Science, downloaded from <http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words> and stored in stop_words.txt. This list is used in preference to nltk's list of stopwords.
To make the keyword-suggestion processing and output manageable, the framework follows standard practice and excludes non-alphabetic 'words' in the text such as numbers, words of only one or two characters, and stop words, i.e. common words such as 'and' and 'the'. See the function suggest_keywords.prepare_text().
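A minimal sketch of this kind of filtering; the real code is in suggest_keywords.prepare_text(), and the tiny STOPWORDS set here merely stands in for the Glasgow list loaded from stop_words.txt:

```python
# Stand-in for the Glasgow stopword list in stop_words.txt.
STOPWORDS = {'and', 'the', 'of', 'to', 'in'}

def candidate_words(text):
    words = text.lower().split()
    return [w for w in words
            if w.isalpha()           # drop numbers and other non-words
            and len(w) > 2           # drop one- and two-character words
            and w not in STOPWORDS]  # drop common words

print(candidate_words('The 3 forums and their 42 posts in total'))
# ['forums', 'their', 'posts', 'total']
```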
Moodle Backup Zip files, file extension .mbz, are actually tar files using gzip for compression. They are our means of downloading the raw forum xml data from Moodle.
LTS could not provide us with the forum xml data used by Moodle, but could provide a full backup of the course. Therefore, we accepted a workflow that started with LTS providing this manual backup from Moodle, followed by our manual extraction of the forum xml from the backup using 7-Zip.
Development of an extract_forum script was halted during the project because it was being built on the basis of a manual full dump of the course, and we were unsure whether this was a long-term method of acquiring the forum xml data. However, since the project ended I have completed the extract_forums script to automate the extraction of forum xml data from Moodle .mbz backups, because I do not think there will be an alternative workflow providing a direct export from Moodle of just the chosen forum files any time soon. Be aware, this may change!
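For illustration, a minimal sketch of the extraction step that extract_forums automates. It assumes the usual Moodle backup layout, in which each forum's data sits at activities/forum_<id>/forum.xml inside the gzipped tar; the function name and layout here are assumptions, not the script's actual code:

```python
import tarfile
from pathlib import Path

def extract_forum_xml(mbz_path, out_dir):
    """Copy every forum.xml found in a .mbz backup into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # .mbz files are gzip-compressed tar archives.
    with tarfile.open(mbz_path, 'r:gz') as backup:
        for member in backup.getmembers():
            if member.name.endswith('forum.xml'):
                data = backup.extractfile(member).read()
                # Flatten the archive path into a file name.
                target = out_dir / member.name.replace('/', '_')
                target.write_bytes(data)
```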
matplotlib advises against having more than 20 figures open at a time, owing to its default memory constraint. The sample forum data contains 26 forums; hence draw_posts draws 26 plots and matplotlib issues a warning. This warning can be ignored because the script is generating simple line charts only, which will not challenge matplotlib's memory constraint.
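If the warning is a nuisance, two standard matplotlib remedies exist; this is a sketch, not the framework's current code:

```python
import matplotlib.pyplot as plt

# Either raise the threshold at which matplotlib warns ...
plt.rcParams['figure.max_open_warning'] = 30

# ... or, better, close each figure once saved, so no more
# than one is open at a time.
for n in range(26):
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, n])
    fig.savefig(f'forum_{n}.png')
    plt.close(fig)
```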
Looking to produce a simple visualisation in the framework, I used the well-known Python package matplotlib for this task. There are no magic criteria involved in this choice over, say, Pillow; simply that matplotlib does the job easily and is a package that is going to stay around.
Not knowing exactly which NLP techniques would be required by the project, I chose nltk as a Python package that should cover all likely requirements.
One approach explored and rejected in developing this framework was the use of Named Entity Recognition (NER) techniques to filter forum posts. We chose to look for person names in messages as a possible indicator that students were making useful cross-references. An ad hoc script was developed using nltk for NER to write a report containing those forum posts that included a person's name. A sample post entry from the report is shown below:
> Post: 11776320 UserID: 16185 Message: <p>Hi Ben</p><p>The materials are provided in Ebooks and Kindle formats for those wishing to read on the move. Better still, is the use of the OUAnywhere app - (Apple/Android).</p>…
As can be seen in this sample, the script has correctly identified this post as having a person's name in it: Ben. However, the name is not the desired reference to an external authority, but a greeting to a student (the post was a tutor's reply to that student's post). This sample is taken from M816 15K's Module discussion forum, which has a total of 280 posts. Of those posts, 202 are identified as containing named persons and 78 as not. Hence, the technique is not sufficiently discriminating to aid analysis of the forum posts.
NER as a technique has been applied successfully to filter forum posts elsewhere. However, in the context of the available forums, more consideration is required before it can be used to discriminate between forum posts meaningfully. Other forums, serving different pedagogical aims, may benefit from the application of NER, selecting other entities such as organizations.
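A sketch along the lines of the rejected ad hoc script (the original's code is not reproduced here); it requires the nltk data packages punkt, averaged_perceptron_tagger, maxent_ne_chunker and words:

```python
import nltk
from nltk import Tree

def mentions_person(message):
    # Tokenize, POS-tag, then chunk into named entities; a PERSON
    # subtree anywhere in the result counts as a person's name.
    tagged = nltk.pos_tag(nltk.word_tokenize(message))
    return any(isinstance(node, Tree) and node.label() == 'PERSON'
               for node in nltk.ne_chunk(tagged))

print(mentions_person('Hi Ben, the materials are provided in Ebooks.'))
```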
Not using NER leaves only tokenization and stemming as the services provided by nltk. There is possible future work that calls on NLP techniques met by nltk. In its absence, it would be possible to remove the external dependency on nltk by implementing a simple tokenizer (remove punctuation and split on whitespace), but stemming remains a problem: you will not want to write your own stemmer, so you still need a Python package. An excellent alternative to nltk for stemming is snowballstemmer, especially in conjunction with PyStemmer. Strongly recommended, because it is far faster than nltk. Note, it will give slightly different results from the existing use of nltk in the framework, because nltk is set to use the Lancaster stemming algorithm whereas snowballstemmer uses Porter2. Both algorithms are an advance on Porter, and in the context of this research it doesn't matter which we use.
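A sketch of the swap; snowballstemmer's 'english' stemmer is Porter2, and installing PyStemmer makes it faster still, as snowballstemmer uses the C implementation automatically when present:

```python
import snowballstemmer

# Porter2 ('english') rather than nltk's Lancaster algorithm.
stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWords(['references', 'referencing', 'referenced']))
# Porter2 reduces all three to 'referenc'.
```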
Using itertools.groupby() to group on date is just not worth it! It is more trouble than it's worth to work out how to consume the multiple iterators it returns.
It might appear tempting to flatten the nested if statements for is_current() and has_links(), in keeping with the Zen of Python. However, this has an unintended consequence: if flattened, the else statement processes all previous edits of a post, whereas we want to analyse the final versions of posts only.
The purpose of the else statement is to let us count the number of posts that do not include external links; hence, we cannot remove it. Simply subtracting the number of posts with links from the total number of posts gives a misleading number, because draft versions of a post are included.
Similarly for filterfalse_posts.filterfalse_forum().
But this is not true for track_filtered_posts.get_post_metadata(), because there is no else statement. We want to process all posts that are the final version and have external links, and ignore the other posts. Hence, we flatten the if statements in this function.
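To make the consequence concrete, a sketch with hypothetical post records showing how flattening would corrupt the count:

```python
posts = [
    {'oldversion': '1', 'message': 'draft with http://example.org'},
    {'oldversion': '0', 'message': 'final with http://example.org'},
    {'oldversion': '0', 'message': 'final without links'},
]

with_links = without_links = 0
for post in posts:
    if post['oldversion'] == '0':        # is_current(): final versions only
        if 'http:' in post['message']:   # has_links()
            with_links += 1
        else:
            without_links += 1           # counts current posts only

# Flattening to `if current and has_links: ... else: ...` would send
# the draft into the else branch, inflating without_links from 1 to 2.
print(with_links, without_links)  # 1 1
```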
The extract_assignment script has a selection sequence based on internal OU codes for assignment types.
These values are hard-coded into the extract_all() function as a quick and easy hack. In an ideal world, this would be a lookup function defined in mqe-common, but time is short and, in deference to YAGNI, let's have some technical debt.