Part 2: Creating a workshop on Data Mining

Linh Ngo
Clemson University

After the first workshop on Data Mining was offered, we had several follow-up appointments with participants. While the majority of these meetings involved clarifying details from the workshop, several were about techniques for developing specific data mining scenarios in actual research work, which the workshop had not covered.

Based on user feedback from the previous workshop and the content of the follow-up meetings, we decided to modify our workshop to reflect users’ needs. More specifically, we decided to drop the following materials:

From Module 1, we removed the sections about real-time data mining (acquiring data from data feeds and streaming Twitter data).
Module 2 was dropped completely.
From Module 3, we dropped the sections about inserting data into SQL storage. Instead, we repeated the lessons learned from Module 1 about saving and organizing mined data as HTML files.

We also added a new module on mining online data via a headless browser, to support data collection from websites that deliver data dynamically rather than as part of the original HTML source code. In addition, we added a number of challenges for users to try on their own. The modules of the updated workshop are organized as follows:

Module 1: Data acquisition.

In this module, we first show participants how to download data from a static link using R. We emphasize that even when there is only one data file to download, it is preferable to embed the download step into the workflow rather than to have two separate processes, one to manually download the data and another to load the data into R. Next, we build on this knowledge to demonstrate how to download multiple data files from websites that have a consistent structural organization (e.g., NCSES’ academic institution profiles). The acquired data are compressed Excel files (tables with headers and footers). Throughout the acquisition process, we also show participants related issues, such as how to organize downloaded data.
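
A minimal sketch of this kind of scripted download, using only base R, is shown below. The base URL, file naming pattern, and directory name are hypothetical placeholders and do not reflect the actual NCSES site layout or the workshop's own code.

```r
# Hypothetical base URL and naming pattern, used for illustration only.
base_url <- "https://example.org/profiles"

# Build one URL per institution id, relying on the site's consistent structure.
build_urls <- function(ids, base = base_url) {
  sprintf("%s/institution_%s.zip", base, ids)
}

# Download every file into one directory, skipping files already on disk.
download_all <- function(urls, dest_dir = "data/raw") {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    if (!file.exists(dest[i])) {
      # mode = "wb" keeps binary files (zip, xlsx) intact on Windows
      download.file(urls[i], destfile = dest[i], mode = "wb")
    }
  }
  invisible(dest)
}

urls <- build_urls(c("001", "002", "003"))
# files <- download_all(urls)                      # run against a real site
# unzip(files[1], exdir = "data/unzipped")         # then extract the Excel file
```

Keeping the download inside the script means the whole analysis can be rerun from scratch, and the skip-if-exists check avoids fetching the same file twice.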

Module 2: Online data crawl from static websites.

In this module, participants learn how to mine data directly from web pages (e.g., Yelp, Airbnb, eBay …) through R. Using Yelp as an example, we demonstrate how users can first acquire the HTML data and then extract relevant data elements by identifying the corresponding tags through examining and comparing the in-browser HTML source. Extracted data are organized and inserted into HTML files, and are post-processed afterward.
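
The tag-based extraction step can be sketched as follows, assuming the rvest package is used (the text above does not name a package). The inline HTML stands in for a downloaded page, and its tag and class names are hypothetical; they do not reflect Yelp's actual markup.

```r
library(rvest)

# Inline stand-in for a downloaded page; tags and classes are hypothetical.
page <- read_html('
<html><body>
  <div class="biz">
    <a class="biz-name" href="/biz/a">Cafe A</a>
    <span class="rating">4.5</span>
  </div>
  <div class="biz">
    <a class="biz-name" href="/biz/b">Diner B</a>
    <span class="rating">3.0</span>
  </div>
</body></html>')

# Select the repeating container first, then the elements inside it.
listings <- html_nodes(page, "div.biz")
names_node <- html_nodes(listings, "a.biz-name")

results <- data.frame(
  name   = html_text(names_node, trim = TRUE),
  link   = html_attr(names_node, "href"),
  rating = as.numeric(html_text(html_nodes(listings, "span.rating"))),
  stringsAsFactors = FALSE
)
```

For a live site, `read_html()` would be given the page URL or a saved HTML file instead of the inline string, and the selectors would come from inspecting the page source in the browser.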

Module 3: Online data crawl from dynamic websites using a headless browser.

We come back to another form of web-based data acquisition in this module. This time, the desired contents are not part of the original HTML source even though they are displayed in the browser. To acquire these data, participants learn how to use R to drive a headless browser (phantomjs) that loads the page and saves its source code as rendered, from which the data are subsequently extracted.
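
One common way to drive phantomjs from R, sketched below, is to have R write a small PhantomJS script and then call the phantomjs executable on it. This is an assumption about the mechanism, not necessarily the approach taken in the workshop; the URL, file names, and wait time are hypothetical.

```r
# Write a PhantomJS script that opens a page, waits for client-side
# rendering, and saves the rendered source to out_html.
write_phantom_script <- function(url, out_html, script = "scrape.js",
                                 wait_ms = 2000) {
  js <- c(
    "var page = require('webpage').create();",
    "var fs = require('fs');",
    sprintf("page.open('%s', function (status) {", url),
    "  // give client-side scripts time to fill in the content",
    "  window.setTimeout(function () {",
    sprintf("    fs.write('%s', page.content, 'w');", out_html),
    "    phantom.exit();",
    sprintf("  }, %d);", as.integer(wait_ms)),
    "});"
  )
  writeLines(js, script)
  invisible(script)
}

# Run the script only if a phantomjs executable is on the PATH.
render_page <- function(script) {
  phantom <- Sys.which("phantomjs")
  if (!nzchar(phantom)) stop("phantomjs not found on PATH")
  system2(phantom, args = script)
}

script <- write_phantom_script(
  url      = "https://example.org/listing",   # hypothetical URL
  out_html = "rendered.html",
  script   = file.path(tempdir(), "scrape.js")
)
# render_page(script)   # afterwards, rendered.html is parsed like a static page
```

Once the rendered source is saved, the extraction proceeds exactly as for a static page, which is why this module can reuse the tag-based techniques from Module 2.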

Module 4: Data mining via large-scale computing resources.

In this module, participants reorganize and combine the code from the previous module into an R script, which can be embedded into a PBS submission script. Participants learn how to submit this script so that Clemson University’s Palmetto Supercomputer runs the data mining process for them.
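
As a rough illustration of how an R script and a PBS submission script fit together, the sketch below uses R to generate a PBS script that runs a given R file with Rscript. The resource requests, job name, module name, and file names are all illustrative assumptions, not the workshop's actual settings.

```r
# Generate a PBS submission script that runs an R script in batch mode.
# All resource values and the module name are hypothetical placeholders.
write_pbs_job <- function(r_script, pbs_file = "mine.pbs",
                          job_name = "data-mining", walltime = "01:00:00") {
  pbs <- c(
    "#!/bin/bash",
    sprintf("#PBS -N %s", job_name),
    sprintf("#PBS -l select=1:ncpus=1:mem=4gb,walltime=%s", walltime),
    "#PBS -j oe",
    "",
    "# move to the directory from which qsub was run",
    "cd \"$PBS_O_WORKDIR\"",
    "module load r   # hypothetical module name; check `module avail`",
    sprintf("Rscript %s", r_script)
  )
  writeLines(pbs, pbs_file)
  invisible(pbs_file)
}

job <- write_pbs_job("mine_yelp.R",
                     pbs_file = file.path(tempdir(), "mine.pbs"))
# On a login node, the job would then be submitted with: qsub mine.pbs
```

The R script itself needs no changes to run in batch mode, as long as it reads its inputs from files and writes its results to disk rather than relying on an interactive session.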

The updated version of the workshop was offered to Clemson’s research community in October 2017. Attendance at this workshop dropped to 60%. As with the first workshop, the primary participants came from non-traditional research computing areas such as economics and management. The post-workshop survey shows a rating of 4.9 out of 5, a significant improvement over the previous version of the workshop. Half of the responses had nothing negative to say about the workshop. The remaining, critical responses still described the workshop as “a little too fast” and “sometimes very fast.” On the other hand, one positive response stated that the participant was able to follow the workshop even with minimal R knowledge. The critical responses also called for more challenges for participants to work through. The updated version of the workshop has become the master branch on https://github.com/clemsonciti/data-mining-r-workshop, and the previous version is stored on the 1.0 branch.

In conclusion, data mining has come to be considered a critical skill across different disciplines, particularly business, the humanities, and the social sciences. By developing and adjusting a data mining workshop based on users’ feedback, we were able to properly address a training need at our institution.