
Premium content
Access to this content requires a subscription. You must be a premium user to view this content.

workshop paper
Learning Human Action Representations from Temporal Context in Lifestyle Vlogs
keywords:
vlogs
co-occurence
human actions
multimodal
social media
We address the task of human action representation and show how the approach to generating word representations based on co-occurrence can be adapted to generate human action representations by analyzing their co-occurrence in videos. To this end, we formalize the new task of human action co-occurrence identification in online videos, i.e., determine whether two human actions are likely to co-occur in the same interval of time. We create and make publicly available the Co-Act (Action Co-occurrence) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains.