This article outlines an approach for automatically extracting behavioral indicators
from video, audio, and text, and explores the possibility of using those indicators
to predict human-interpretable judgments of involvement, dominance, tension, and arousal.
We used two-dimensional spatial inputs extracted from video, acoustic properties
extracted from audio, and verbal content transcribed from face-to-face interactions
to construct a set of multimodal features. Multiple predictive models were created
using the extracted features as predictors and human-coded perceptions of involvement,
tension, and arousal as the criteria. These predicted perceptions were then used
as independent variables in classifying truth and deception. Although the predicted
perceptions performed comparably to human-coded perceptions in detecting
deception, the results were not satisfactory. Thus, the extracted multimodal features
were used to predict deception directly. Classification accuracy was substantially
higher than typical human deception detection performance. Through this research,
we consider the feasibility and validity of the approach and identify how it
could contribute to the broader community.
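
To make the two modeling strategies concrete, the sketch below contrasts the indirect pipeline (predict perceptions from multimodal features, then classify deception from the predicted perceptions) with the direct pipeline (classify deception straight from the features). It is only an illustration under assumed names and synthetic data, not the study's implementation; the feature matrix, perception ratings, labels, and the choice of random-forest models are all hypothetical stand-ins.

    # Illustrative sketch only: variable names, data, and model choices are hypothetical.
    # X: multimodal feature matrix (video kinesics, acoustics, linguistic cues)
    # perceptions: human-coded ratings (e.g., involvement, tension, arousal)
    # y: binary truth (0) vs. deception (1) labels
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, d = 200, 30
    X = rng.normal(size=(n, d))            # synthetic multimodal features
    perceptions = rng.normal(size=(n, 3))  # synthetic perception ratings
    y = rng.integers(0, 2, size=n)         # synthetic truth/deception labels

    X_tr, X_te, p_tr, p_te, y_tr, y_te = train_test_split(
        X, perceptions, y, test_size=0.3, random_state=0)

    # Indirect strategy: predict perceptions from features, then classify
    # deception using the predicted perceptions as independent variables.
    perception_model = RandomForestRegressor(random_state=0).fit(X_tr, p_tr)
    indirect_clf = RandomForestClassifier(random_state=0).fit(
        perception_model.predict(X_tr), y_tr)
    print("indirect accuracy:", indirect_clf.score(perception_model.predict(X_te), y_te))

    # Direct strategy: classify deception straight from the multimodal features.
    direct_clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("direct accuracy:", direct_clf.score(X_te, y_te))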