Simulating and optimizing behavioral aspects of video and image content, such as memorability and rewatchability, using a large language model. The first work to embed content and the human response it elicits in the same space. Embedded vision into a Vicuna-13B LLM and instruction fine-tuned it to model the relationship between human behavior and video content. Outperformed few-shot GPT-4, suggesting that current SoTA models do not yet understand behavior.
Automated scraping and processing of terabytes of data for multiple projects.