FedEx Institute of Technology

Current Data Science Projects

Data Reduction for Big Data

Lih-Yuan Deng and Dale Bowman

When a dataset is too big for computer memory, traditional statistical methods and many machine learning methods are often not applicable. Data-reduction techniques for big data include (1) dimension reduction and (2) data subsampling. As statisticians, we are uniquely qualified to study the feasibility of various survey sampling techniques for subsampling big data. Our expertise in design of experiments also enables us to study the problem of design-based subsampling. Recently, a paper published in JASA proposed information-based optimal subdata selection (IBOSS), which chooses data points with extreme values along individual dimensions. Potential weaknesses include (1) high sensitivity to outliers, (2) the unrealistic assumption that the "best" statistical model is known, and (3) inefficiency for ultra-high-dimensional data. It is essential that the chosen subsample be representative of the full data set so that subsequent analyses yield consistent results. We propose a new subsampling algorithm that uses a space-filling design on key principal component values.
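For illustration, a minimal sketch of this idea is given below: project the data onto a few leading principal components, then select a space-filling subsample in that reduced space. The greedy maximin selection, component count, and function names are illustrative choices, not the proposed algorithm itself.

```python
# Illustrative sketch: space-filling subsampling in principal component space.
import numpy as np
from sklearn.decomposition import PCA

def spacefilling_subsample(X, n_sub, n_components=2, seed=0):
    """Greedy maximin selection of n_sub rows of X in leading-PC space."""
    rng = np.random.default_rng(seed)
    Z = PCA(n_components=n_components).fit_transform(X)  # key PC scores
    chosen = [int(rng.integers(len(Z)))]                  # random starting point
    dists = np.linalg.norm(Z - Z[chosen[0]], axis=1)      # distance to chosen set
    for _ in range(n_sub - 1):
        nxt = int(np.argmax(dists))                       # farthest from chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(Z - Z[nxt], axis=1))
    return np.array(chosen)

X = np.random.default_rng(1).normal(size=(100_000, 20))   # stand-in for a big data set
idx = spacefilling_subsample(X, n_sub=500)
print(idx.shape)                                          # indices of the selected subsample
```

The greedy maximin step keeps adding the point farthest from those already chosen, which spreads the subsample across the principal-component space.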

Applying a systems medicine approach to epidemiologic research and public health

Fawaz Mzayek

The MRFIT (Multiple Risk Factor Intervention Trial) clinical trial was designed to test the effect of a multi-faceted prevention intervention on reducing cardiovascular mortality. We plan to use the extensive MRFIT data to assess the longitudinal, combined effect of many health determinants on the risk of developing two major public health problems, coronary heart disease (CHD) and diabetes, using a two-step analytical approach: (1) underlying latent factors will be identified using factor analysis, and (2) the identified factors will then be used to predict the two outcomes. This novel approach addresses the extremely complex nature of the processes underlying disease by examining the combined effects of a large number of health determinants on the development of human disease, as opposed to the classical paradigm in which the effects of one or two potential risk factors are tested at a time. This study will potentially provide new information for better understanding CHD and diabetes risk and inform targeted early prevention and management.
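As a rough illustration of the two-step approach, the sketch below identifies latent factors with factor analysis on simulated data and then predicts a binary outcome from the factor scores; the variable counts, factor structure, and outcome definition are hypothetical and do not reflect MRFIT variables.

```python
# Illustrative two-step analysis on simulated data:
# (1) factor analysis on many correlated determinants, (2) outcome prediction.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p, k = 5_000, 30, 5                               # participants, determinants, factors
latent = rng.normal(size=(n, k))                     # true underlying factors
loading = rng.normal(size=(k, p))
X = latent @ loading + rng.normal(scale=0.5, size=(n, p))   # observed determinants
y = (latent[:, 0] + 0.5 * latent[:, 1] + rng.normal(size=n) > 0).astype(int)  # outcome

# Step 1: identify underlying latent factors
scores = FactorAnalysis(n_components=k, random_state=0).fit_transform(X)

# Step 2: predict the outcome from the factor scores
auc = cross_val_score(LogisticRegression(max_iter=1000), scores, y,
                      cv=5, scoring="roc_auc")
print(auc.mean())
```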

Integrating firm and event data from structured and unstructured data sources

Ali Mahdavi Adeli

Empirical studies that address interesting firm-related phenomena in business and economics often require integrating structured and unstructured data from multiple sources. This data integration and preparation task is highly resource-intensive given the messy nature of the data (Boritz and No 2013; McElreath and Wiggins 2006; Neumaier et al. 2016; Srivastava et al. 2019) and has important consequences for research findings (Cadman et al. 2010; Guenther and Rosman 1994; Ulbricht and Weiner 2005). Some studies have recommended guidelines and standardized data processing procedures for specific databases (Balasubramanian and Sivadasan 2010; Thoma et al. 2010), but researchers still have to formulate their own study-specific processes even when using popular sources such as COMPUSTAT or US patent data. This study aims to improve the feasibility, replicability, and comparability of such studies by providing a standardized approach for integrating large structured and unstructured data sources.
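As one small illustration of the kind of integration step such a standardized approach must handle, the sketch below links a structured firm table to records extracted from an unstructured source by normalizing firm names before merging; the column names, example firms, and normalization rules are hypothetical.

```python
# Illustrative sketch: match firm records across structured and unstructured sources.
import re
import pandas as pd

def normalize_name(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)                 # drop punctuation
    name = re.sub(r"\b(inc|corp|co|ltd|llc)\b", "", name)   # drop legal suffixes
    return re.sub(r"\s+", " ", name).strip()

financials = pd.DataFrame({"firm": ["Acme Corp.", "Globex, Inc."],
                           "assets": [120.5, 87.3]})        # structured source
news = pd.DataFrame({"firm_mention": ["ACME CORP", "Globex Inc"],
                     "event": ["merger announced", "CEO change"]})  # unstructured-derived

financials["key"] = financials["firm"].map(normalize_name)
news["key"] = news["firm_mention"].map(normalize_name)
merged = financials.merge(news, on="key", how="inner")
print(merged[["firm", "event"]])
```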

Predicting short term mortality with machine learning in lung cancer patients

Xihua Yu

Lung cancer is the leading cause of cancer mortality, and its five-year relative survival rate is only 18.6%. Aggressive cancer treatment may not be appropriate if a patient's life expectancy is less than six months. Predicting death within six months would facilitate treatment decisions for both physicians and patients. However, no risk prediction model for short-term mortality has been developed for lung cancer. Using cancer registry (SEER) data linked with Medicare claims, we will develop traditional logistic models and compare their predictive performance with that of machine learning models (ensemble decision trees with boosting or random forests, and recurrent neural networks with gated recurrent units). Our models will incorporate both multi-dimensional predictors and the time sequence of medical encounters from Medicare claims. This study will serve as the basis for seeking external collaboration to develop more advanced models incorporating pathophysiological measures and clinical notes.
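A minimal sketch of the planned model comparison, using synthetic data in place of SEER-Medicare records, is shown below; the feature construction and class balance are illustrative assumptions.

```python
# Illustrative comparison of a logistic model with tree-ensemble learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5_000, n_features=40, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)  # ~20% positive outcomes

models = {
    "logistic": LogisticRegression(max_iter=2000),
    "boosting": GradientBoostingClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```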

Development of Fully Automated Software Systems for Large Data Acquisition and Real Time Data Analysis

Thang Hoang

Data science is a thread that runs through many research fields and uses a variety of methods and algorithms to generate, store, and extract information from data in various forms. Researchers in virtually every field must deal with data on either a small or large scale. In materials science, including nanoscience and nanotechnology research, acquiring and handling data requires sophisticated computer software because of the small size of the objects under study and the large amount of data generated. Here, we propose to develop a set of computer software, using the LabVIEW graphical programming language, that can interface with instruments commonly used in materials science research for large-scale data acquisition (up to several gigabytes) and for real-time analysis and mining. This software, including source code, can be made available to the UofM community and used for training in UofM LabVIEW courses.
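For readers unfamiliar with this kind of architecture, the sketch below gives a conceptual Python analogue of the producer-consumer pattern behind real-time acquisition and analysis; the actual software described here is written in LabVIEW, and read_instrument is a hypothetical stand-in for an instrument driver call.

```python
# Conceptual analogue of real-time acquisition (producer) and analysis (consumer).
import queue
import threading
import numpy as np

def read_instrument(n_samples=4096):
    """Hypothetical instrument read; returns one block of raw samples."""
    return np.random.normal(size=n_samples)

def acquire(buf, n_blocks):
    for _ in range(n_blocks):
        buf.put(read_instrument())        # producer: stream raw data blocks
    buf.put(None)                         # sentinel: acquisition finished

def analyze(buf):
    while (block := buf.get()) is not None:
        print(f"block mean={block.mean():+.3f}, rms={np.sqrt((block**2).mean()):.3f}")

buf = queue.Queue(maxsize=16)             # bounded buffer decouples the two loops
t = threading.Thread(target=acquire, args=(buf, 10))
t.start()
analyze(buf)                              # consumer: analyze while acquisition continues
t.join()
```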


Explainable Inference Methods for Relational Data

Deepak Venugopal

Machine learning has arguably been one of the most important drivers of the data revolution we have witnessed over the last decade. Specifically, techniques such as deep learning have achieved unparalleled success in complex problems such as computer vision and language processing, even surpassing the performance of human experts on several tasks. Despite these successes, however, one of the key remaining challenges is to develop human-interpretable reasoning methods. Recent work in explainable learning assumes that data instances are independent of each other, yet real-world data is inherently relational in nature (e.g., medical records, social networks). The aim of this proposal is to develop foundational algorithms for explainable inference in relational data and to apply them to practical problems. We will develop our methods on top of Markov Logic Networks (MLNs), arguably the most popular probabilistic relational model and one in which the PI is a well-recognized expert.
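As a concrete reminder of the MLN semantics this work builds on, the sketch below computes a query probability in a toy domain by enumerating possible worlds, where a world's unnormalized probability is the exponentiated sum of the weights of its satisfied ground formulas; the domain, formula, and weight are illustrative.

```python
# Toy Markov Logic Network: weighted formula Smokes(x) => Cancer(x) over two people.
import itertools
import math

people = ["A", "B"]
atoms = [f"Smokes({p})" for p in people] + [f"Cancer({p})" for p in people]

def groundings(world):
    """Yield (weight, satisfied) for each grounding of Smokes(x) => Cancer(x)."""
    for p in people:
        satisfied = (not world[f"Smokes({p})"]) or world[f"Cancer({p})"]
        yield 1.5, satisfied                      # weight 1.5 per grounding

def unnormalized(world):
    return math.exp(sum(w for w, sat in groundings(world) if sat))

# Enumerate all possible worlds to normalize (feasible only for tiny domains).
worlds = [dict(zip(atoms, values))
          for values in itertools.product([False, True], repeat=len(atoms))]
Z = sum(unnormalized(w) for w in worlds)
query = sum(unnormalized(w) for w in worlds
            if w["Smokes(A)"] and w["Cancer(A)"]) / Z
print(f"P(Smokes(A), Cancer(A)) = {query:.3f}")
```

Exact enumeration is feasible only for tiny domains; the proposed work concerns scalable, explainable inference rather than brute-force enumeration.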


The Evolution of Civil Rights Enforcement and Economic Prosperity of Minorities

Jamein P. Cunningham and Jose Joaquin Lopez

Title VII of the Civil Rights Act of 1964 and Title VIII of the Civil Rights Act of 1968 established provisions for remediating grievances related to discrimination in employment and housing based on personal characteristics. We analyze changes in civil rights complaints in U.S. district courts between 1964 and 2014 using the Federal Court Cases: Integrated Data Base on civil terminations to better understand trends in the types of cases and trials brought by plaintiffs, as well as the verdicts reached in cases involving complaints of discrimination. In particular, we study the effects of fee-shifting statutes such as the Civil Rights Attorney's Fee Award Act (1976) and the Equal Access to Justice Act (1980) on the types of complaints brought to trial. Finally, we study the impact of these changes on the economic outcomes of minority groups in the United States.


Poverty Prediction Using Deep Learning, Social Network Analysis, Education and Ethnicity – The case of City of Memphis

Chen Zhang and Srikar Velichety

We develop prediction models for poverty rates using data on infrastructure development and revitalization projects, neighborhood characteristics, and education and ethnicity measures for Shelby County. We combine data from a variety of sources, including the US Census Bureau, the City of Memphis CIO's Office, the University of Memphis Office of Institutional Research, and Google Earth Pro satellite imagery. Using historic satellite imagery of census tracts, we construct infrastructure development measures by uniquely combining Convolutional Neural Networks (CNNs) for image classification with Recurrent Neural Networks (RNNs) for identifying sequential development. We also quantify the spatial impact of neighborhood characteristics on the poverty rate of a location using social network analysis. In doing so, we provide ways in which researchers can leverage a combination of advanced deep learning and social network analysis techniques to construct relevant measures.
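As a small illustration of the network component, the sketch below treats census tracts as nodes and spatial adjacency as edges, then summarizes neighborhood spillovers as the average characteristic of adjacent tracts; the tract IDs and values are toy data, not Shelby County measures.

```python
# Illustrative spatial-adjacency network of census tracts with a spillover measure.
import networkx as nx

G = nx.Graph()
adjacency = [("T1", "T2"), ("T2", "T3"), ("T1", "T3"), ("T3", "T4")]
G.add_edges_from(adjacency)

blight_rate = {"T1": 0.10, "T2": 0.35, "T3": 0.22, "T4": 0.05}  # toy tract characteristic

spillover = {t: sum(blight_rate[n] for n in G.neighbors(t)) / G.degree(t)
             for t in G.nodes}                     # average characteristic of neighbors
centrality = nx.degree_centrality(G)               # how connected each tract is
for t in G.nodes:
    print(t, round(spillover[t], 3), round(centrality[t], 3))
```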


Modeling of stochastic physical processes by analyzing collision time distributions

Ranganathan Gopalakrishnan

The proposed research aims to develop an analysis technique for parameterizing ion-ion collision time distributions obtained by numerically solving the linear stochastic Langevin equation. The methodology includes analyzing the distributions to infer how they vary with physical parameters. The Langevin equation is a powerful tool for reducing N-body interactions in gas-phase systems to 2-body interactions, making it computationally inexpensive for modeling aerosols, dusty plasmas, and ionic systems. Preliminary data outlining our approach are presented in this proposal, along with details from a manuscript on this topic that is currently under review. A work plan for Summer 2019 that will result in a peer-reviewed publication and data for several external grant proposals is outlined. The culmination of this study will be a modeling approach that distills Langevin-equation calculations of ion-ion collision times into accurate predictive models accounting for gas density (pressure and temperature), ion shape, mobility, and mass, and ion-specific potential interactions, thereby accurately describing the underlying chemical physics.
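For context, the sketch below shows the kind of underlying computation involved: Euler-Maruyama integration of a one-dimensional Langevin equation for the relative ion-ion coordinate, recording the time to reach a collision radius. The parameter values and one-dimensional geometry are illustrative simplifications, not the parameterization developed in this project.

```python
# Illustrative Langevin integration to sample collision times (reduced units).
import numpy as np

rng = np.random.default_rng(0)
kB_T, m, f = 1.0, 1.0, 2.0          # thermal energy, reduced mass, friction coefficient
q2 = 5.0                            # strength of attractive Coulomb-like force
x0, r_c, dt, t_max = 10.0, 0.5, 1e-3, 200.0

def collision_time():
    x, v, t = x0, rng.normal(scale=np.sqrt(kB_T / m)), 0.0
    while t < t_max:
        force = -q2 / x**2                              # attraction toward the origin
        kick = np.sqrt(2 * f * kB_T * dt) * rng.normal()  # fluctuation-dissipation noise
        v += (force - f * v) / m * dt + kick / m
        x += v * dt
        t += dt
        if x <= r_c:
            return t                                    # collision occurred
    return np.nan                                       # no collision within t_max

times = np.array([collision_time() for _ in range(200)])
print(np.nanmean(times), np.nanstd(times))              # sample collision time statistics
```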


Preliminary study of a novel unbiased machine-learning approach, hybrid clustering with multidimensional vectors

Hongmei Zhang and Bernie J. Daigle

There is growing recognition that the pathogenesis of many common diseases is a consequence of concerted activities of genetics and epigenetics, and a single data type is often insufficient to explain their pathophysiology. However, one challenge is the lack of appropriate methods to integratively analyze multiple data sources, particularly for "omics" data such as genome-scale genetic and epigenetic data. Current analytical techniques suffer from at least two critical limitations: linearity and additivity assumptions when integrating different sources. These assumptions are known to be biologically unrealistic. Discarding these assumptions, we will develop a computationally efficient non-parametric method to examine patterns in gene features via novel clustering on integrated genetic, DNA methylation (DNAM), and gene expression (GE) data. The method is a hybrid of divisive and agglomerative clustering with partitioning around medoids (PAM). The goal of this proposal is to produce preliminary results on this novel approach to prepare for resubmission.
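As a small illustration of the PAM building block, the sketch below runs a basic alternating k-medoids procedure on a concatenated toy multi-omics feature matrix; the hybrid divisive/agglomerative scheme proposed here is not reproduced.

```python
# Illustrative PAM-style (k-medoids) clustering of integrated gene features.
import numpy as np

def pam(X, k, n_iter=50, seed=0):
    """Basic alternating k-medoids: assign points, then recompute medoids."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # nearest-medoid assignment
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            # new medoid = member minimizing total distance to other members
            new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids

# Toy "integrated" data: concatenated genetic, DNAM, and GE features per gene.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 12)), rng.normal(3, 1, size=(50, 12))])
labels, medoids = pam(X, k=2)
print(np.bincount(labels), medoids)
```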