Linh Ngo
Clemson University
By Spring 2017, the Cyberinfrastructure and Technology Integration (CITI) Department at Clemson University had offered a range of introductory research computing workshops. Topics covered by these workshops include Linux, GitHub, Python, R, Hadoop MapReduce, and Apache Spark. However, post-workshop surveys indicate that participants want access to more advanced workshops. One of the most requested topics is data mining.
Module 1: Data acquisition.
In this module, we first show participants how to download data from a static link using R. We emphasize that even when there is only one data file to download, it is preferable to embed the download step in the workflow rather than to maintain two separate processes, one to download the data manually and another to load it into R. Next, we build on this knowledge to demonstrate how to download multiple data files from websites that have a consistent structural organization (e.g., NCSES’ academic institution profiles). The acquired data are compressed Excel files (tables with headers and footers). Throughout the acquisition process, we also show participants how to organize downloaded data, decompress data programmatically in R, and run preliminary checks for missing data and data consistency. In the remainder of this module, participants learn to acquire data from data feeds (RSS) and streaming data (Twitter).
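The acquisition steps described above can be sketched in base R as follows. This is a minimal illustration, not the workshop's actual code: the function names, the directory layout, and the URL pattern (a base URL followed by a file id and ".zip") are assumptions made for the example. Reading the extracted Excel files would need an additional package and is not shown.

```r
# Sketch: fetch several files whose URLs follow one pattern, unzip them,
# and run a first check for missing values.
# All names below (build_urls, fetch_all, ...) are hypothetical helpers.

# Build one URL per file id, assuming a "<base>/<id>.zip" layout.
build_urls <- function(base_url, ids, ext = ".zip") {
  paste0(base_url, "/", ids, ext)
}

# Download every URL into raw_dir and return the local paths.
fetch_all <- function(urls, raw_dir = file.path("data", "raw")) {
  dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(raw_dir, basename(urls))
  for (i in seq_along(urls)) {
    # Skip files already on disk so the workflow can be re-run safely.
    if (!file.exists(dest[i])) {
      download.file(urls[i], destfile = dest[i], mode = "wb", quiet = TRUE)
    }
  }
  dest
}

# Decompress every archive into out_dir; unzip() returns the extracted paths.
extract_all <- function(zip_paths, out_dir = file.path("data", "extracted")) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  unlist(lapply(zip_paths, unzip, exdir = out_dir))
}

# Preliminary check: number of missing values in each column.
count_missing <- function(df) {
  colSums(is.na(df))
}
```

Keeping raw downloads and extracted files in separate directories is one way to organize the data so that the extraction step can be repeated without downloading again.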
Module 2: Data curation.
This module introduces users to the JSON and XML data formats and shows how to convert them into R’s data frames and lists for further analysis. While a number of JSON and XML libraries exist in R to support this conversion, this module provides examples of how to identify and interpret XML and JSON data tags in order to create the corresponding data frames and lists.
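As an illustration of mapping tags to columns by hand, the sketch below converts a small JSON array and an equivalent XML document into data frames. It assumes the jsonlite and xml2 packages, which the text does not name; the sample records and the tag names (person, name, score) are invented for the example.

```r
library(jsonlite)
library(xml2)

# Invented sample data: the same two records in JSON and in XML.
json_input <- '[{"name": "Alice", "score": 90}, {"name": "Bob", "score": 85}]'
xml_input <- paste0(
  "<people>",
  "<person><name>Alice</name><score>90</score></person>",
  "<person><name>Bob</name><score>85</score></person>",
  "</people>"
)

# JSON: parse into a plain list of records, then pull each key into a column.
records <- fromJSON(json_input, simplifyVector = FALSE)
json_df <- data.frame(
  name  = vapply(records, function(r) r$name, character(1)),
  score = vapply(records, function(r) as.numeric(r$score), numeric(1)),
  stringsAsFactors = FALSE
)

# XML: locate the repeating <person> element, then read its child tags.
doc <- read_xml(xml_input)
people <- xml_find_all(doc, "//person")
xml_df <- data.frame(
  name  = xml_text(xml_find_first(people, "./name")),
  score = as.numeric(xml_text(xml_find_first(people, "./score"))),
  stringsAsFactors = FALSE
)
```

In both cases the repeating unit (a JSON object or a `<person>` element) becomes a row, and each key or child tag becomes a column.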
Module 3: Online data crawling and SQL storage.
We return to another form of data acquisition in this module. This time, participants learn how to mine data directly from web pages (e.g., Yelp, Airbnb, eBay …) and store them in SQL databases accessible through R. To ensure that everyone has access to a SQL database, we use SQLite as the backend database. Using Yelp as an example, we demonstrate how users can first acquire the HTML data and then extract relevant data elements by identifying the corresponding tags through examining and comparing the in-browser HTML source. Extracted data are organized and inserted into SQL tables (backed by SQLite) for future access.
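The extract-and-store flow can be sketched as below. The example assumes the rvest, DBI, and RSQLite packages, which the text does not name, and it parses an invented inline HTML snippet rather than a live page; the class names (biz, name, rating) are hypothetical and do not reflect Yelp's actual markup.

```r
library(rvest)
library(DBI)

# Invented HTML standing in for a downloaded listing page.
page <- read_html('<html><body>
  <div class="biz"><span class="name">Cafe A</span><span class="rating">4.5</span></div>
  <div class="biz"><span class="name">Diner B</span><span class="rating">3.0</span></div>
</body></html>')

# Select the repeating listing element, then the tags that hold each field.
listings <- html_elements(page, "div.biz")
businesses <- data.frame(
  name   = html_text2(html_element(listings, "span.name")),
  rating = as.numeric(html_text2(html_element(listings, "span.rating"))),
  stringsAsFactors = FALSE
)

# Store the rows in SQLite; ":memory:" keeps the sketch free of files,
# a file path would be used instead for data that should persist.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "businesses", businesses)
top_rated <- dbGetQuery(con, "SELECT name, rating FROM businesses WHERE rating >= 4")
dbDisconnect(con)
```

For a real page, the CSS selectors would be chosen by inspecting the HTML source in the browser, as described above.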
Module 4: Data mining via large-scale computing resources.
In this module, participants reorganize and combine the code from the previous module into an R script, which can be embedded into a PBS submission script. Participants learn how to submit this script and let Clemson University’s Palmetto Supercomputer run the data mining process for them.
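One possible shape for such a script is sketched below: an R file that runs without interaction and takes its output directory from the command line, so that a submission script only needs to call `Rscript` on it. The file name `mine.R`, the helper names, and the log file are hypothetical; the scheduler directives themselves are site-specific and are not shown.

```r
# mine.R: hypothetical batch entry point, run as `Rscript mine.R <out_dir>`.

# Use the first command-line argument as the output directory, if given.
parse_args <- function(args, default_out = "results") {
  if (length(args) >= 1) args[1] else default_out
}

# Placeholder pipeline: the acquisition, extraction, and storage steps
# from the earlier modules would be called from here.
run_pipeline <- function(out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  log_path <- file.path(out_dir, "run.log")
  writeLines(paste("started", format(Sys.time())), log_path)
  log_path
}

# Only run automatically in non-interactive sessions (e.g., under Rscript).
if (!interactive()) {
  out_dir <- parse_args(commandArgs(trailingOnly = TRUE))
  run_pipeline(out_dir)
}
```

Writing results and logs to a directory passed as an argument lets the same script be reused across jobs without editing it.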
The workshop was first offered to Clemson’s research community in July 2017. This was one of the few CITI workshops that achieved 100% attendance. The participants came primarily from non-traditional research computing areas such as economics and management. The post-workshop survey shows a rating of 4.2 out of 5, and many participants offered verbal compliments and expressed regret that the workshop was not offered earlier. At the same time, participants also provided critical feedback about the limitations of the workshop. In trying to cover the key concepts of data mining, we tried to cram too much content into the workshop. As a result, we were not able to go through all the contents of the modules. The dropped content includes acquiring streaming data from Twitter and the hands-on section on data mining using the supercomputer. The data curation module, while sound in theory, does not fit well into the flow of the workshop. Examples of feedback regarding these problems include “Very fast and unclear”, “Maybe teach too much”, “Too many content”, “Too many things”, “i am tired”, “It’s is a too fast for me”, “some of the structured elements were confusing”, “Too little time a lotted”, and “it goes very fast.”
The current version of the workshop is located at https://github.com/clemsonciti/data-mining-r-workshop. A new version of the workshop is being developed to address this feedback. It will be offered in October 2017.
To be continued ….