DataWhys

PI: Andrew Olney, IIS/Psychology
Funder: NSF, $3,439,035

This project aims to serve the national interest by improving training in data science. Data scientists are needed to power the ongoing revolution in Big Data that is transforming virtually every sector of the economy. Progress in training data scientists is currently limited by a lack of understanding about how data science is learned and by a lack of techniques to optimize that learning. This project will advance understanding of how data science is learned by weaving together statistics, programming, and machine learning with experimental results about student learning. It will use this understanding to create an innovative Artificial Intelligence-enabled data science tutor called “DataWhys.” The DataWhys tutor can be integrated into JupyterLab, an established professional data science tool, and will provide 250 hours of training content.

To advance understanding of how data science is learned and how to optimize that learning, this project will identify the most effective scaffolds for worked examples across varying levels of expertise and identify when scaffolds should be removed. It will then compare a data science intelligent tutoring condition that implements these findings against worked-example and pure problem-solving controls. This approach will synthesize previous work in the related fields of statistics, programming, and machine learning education, each of which has used only a few of the scaffolds and techniques that will be comprehensively investigated in this project. In addition to cross-sectional studies with college freshmen, STEM majors, and graduate students, longitudinal studies will be conducted in partnership with the data science division of St. Jude Children's Research Hospital and through a summer internship for STEM majors from LeMoyne-Owen College. These longitudinal studies will provide additional evidence regarding workforce relevance through usability metrics and progress in personal learning plans. Source code and training material produced under the project will be publicly shared on GitHub, where they can be freely used and modified by anyone under the open-source Apache license. This project is supported by the Accelerating Discovery: Educating the Future STEM Workforce program, which funds projects to educate the STEM workforce in the critical scientific areas defined by the Big Ideas for NSF Investment.