This article outlines an approach for automatically extracting behavioral indicators from video, audio, and text, and explores the possibility of using those indicators to predict human-interpretable judgments of involvement, dominance, tension, and arousal. We used two-dimensional spatial inputs extracted from video, acoustic properties extracted from audio, and verbal content transcribed from face-to-face interactions to construct a set of multimodal features. Multiple predictive models were built using the extracted features as predictors and human-coded perceptions of involvement, tension, and arousal as the criteria. These predicted perceptions were then used as independent variables in classifying truth and deception. Although the predicted perceptions performed comparably to human-coded perceptions in detecting deception, the results were unsatisfactory. The extracted multimodal features were therefore used to predict deception directly, yielding classification accuracy substantially higher than typical human deception detection performance. Through this research, we assess the feasibility and validity of the approach and identify how it could contribute to the broader community.
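The abstract describes two modeling routes: a two-stage pipeline that first regresses the human-coded perceptions on the multimodal features and then classifies deception from the predicted perceptions, and a direct classifier trained on the features themselves. The article does not specify the model families, so the sketch below uses ridge regression and logistic regression from scikit-learn as stand-ins, with synthetic placeholder data; every array, dimension, and variable name here is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data standing in for the study's corpus:
# one row per interaction, with multimodal features drawn from
# video (2-D spatial), audio (acoustic), and transcribed text cues.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))           # multimodal features (hypothetical)
perceptions = rng.normal(size=(200, 3))  # human-coded involvement, tension, arousal
y = rng.integers(0, 2, size=200)         # 1 = deceptive, 0 = truthful

# Stage 1: regress each human-coded perception on the multimodal features.
stage1 = [Ridge().fit(X, perceptions[:, j]) for j in range(perceptions.shape[1])]
predicted_perceptions = np.column_stack([m.predict(X) for m in stage1])

# Stage 2: classify deception from the predicted perceptions.
two_stage_acc = cross_val_score(
    LogisticRegression(), predicted_perceptions, y, cv=5).mean()

# Direct route: classify deception straight from the multimodal features.
direct_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print(f"two-stage accuracy: {two_stage_acc:.2f}")
print(f"direct accuracy:    {direct_acc:.2f}")
```

For brevity the sketch fits the stage-1 regressors on the full data; in practice they would be trained on held-out folds so the stage-2 classifier never sees in-sample predictions.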