General Framework for image captioning (Task 1):
1. Take a screenshot every 30 seconds.
2. Pause the video and run image captioning on the screenshot.
3. Insert the caption (as audio) before/after each captured frame (a sketch follows this list).
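
A minimal sketch of steps 1-2, assuming OpenCV for frame grabbing; caption_image is a hypothetical stand-in for whatever captioning model ends up being used, not a specified part of the plan:

import cv2

def caption_image(frame):
    """Hypothetical stand-in for the actual image-captioning model."""
    raise NotImplementedError

def captions_every_30s(path, interval_s=30):
    """Grab one frame every interval_s seconds and caption it."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = int(fps * interval_s)      # frames between screenshots
    results = []                      # (timestamp_s, caption) pairs
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            results.append((idx / fps, caption_image(frame)))
        idx += 1
    cap.release()
    return results
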
General Framework for video captioning:
1. Camera shot segmentation using optical flow (first sketch after this list).
2. Verify that total caption length = video length - conversation length (Task 4).
3. (Task 2) Do video-to-text translation with a Convolutional LSTM (second sketch below).
4. (Task 3) Do length-constrained video-to-text translation with a Convolutional LSTM + a special loss function (second sketch below).
5. (Task 2) Pause the video and insert the caption TTS before/after each video shot.
6. (Tasks 3-4) Sync the TTS / emotional TTS with the video.
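
A minimal sketch of step 1, assuming OpenCV's Farneback dense optical flow; a shot boundary is flagged when the mean flow magnitude between consecutive frames spikes above a threshold (the threshold value is an assumption to be tuned):

import cv2

def shot_boundaries(path, thresh=8.0, size=(320, 180)):
    """Return timestamps (s) where mean optical-flow magnitude spikes,
    a crude proxy for a camera-shot cut."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    ok, prev = cap.read()
    if not ok:
        return []
    prev = cv2.cvtColor(cv2.resize(prev, size), cv2.COLOR_BGR2GRAY)
    cuts, idx = [], 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        if mag.mean() > thresh:   # large global motion => likely a cut
            cuts.append(idx / fps)
        prev, idx = gray, idx + 1
    cap.release()
    return cuts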
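
For steps 3-4, a minimal PyTorch sketch of a ConvLSTM cell (the standard convolutional variant of the LSTM update) plus a hypothetical length-penalized loss; the penalty weight lam and the quadratic form of the "special loss" are assumptions, since the notes do not specify them:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: all four gates come from a single conv
    over the concatenated [input, hidden] feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.conv(torch.cat([x, h], 1)), 4, 1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def length_constrained_loss(logits, targets, pred_len, target_len, lam=0.1):
    """Cross-entropy over caption tokens plus an assumed quadratic
    penalty that pushes the caption duration toward the available gap."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
    return ce + lam * (pred_len - target_len) ** 2

Here target_len would be the gap computed in step 2 (video length minus conversation length).
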
Baseline 0: No audio description, dry video
Test 1: Async image-captioning-based description every 30 seconds; pause and insert.
Every 30 seconds, capture the current video frame, run image captioning, and put the caption back in as audio, pausing playback while it plays.
Test 2: Async video-captioning-based audio description; pause and insert.
Test 2.1: Cut the video at every scene boundary, run video captioning, and insert the caption before the scene plays; pause playback meanwhile (sketch below).
Test 2.2: Cut the video at every scene boundary, run video captioning, and insert the caption after the scene has played; pause playback meanwhile.
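
A minimal sketch of the pause-and-insert behavior in Tests 1-2, assuming moviepy 1.x; the file paths and the pre-generated TTS clip are placeholders:

from moviepy.editor import (VideoFileClip, AudioFileClip,
                            concatenate_videoclips)

def insert_caption_before(video_path, cut_s, tts_path, out_path):
    """Freeze the frame at cut_s for the duration of the TTS caption,
    playing the caption audio over the frozen frame (Test 2.1 style)."""
    clip = VideoFileClip(video_path)
    tts = AudioFileClip(tts_path)
    frozen = (clip.to_ImageClip(cut_s)   # still frame at the boundary
                  .set_duration(tts.duration)
                  .set_audio(tts))
    result = concatenate_videoclips([clip.subclip(0, cut_s),
                                     frozen,
                                     clip.subclip(cut_s)])
    result.write_videofile(out_path)

Swapping the order of the frozen segment and the scene gives the Test 2.2 variant (caption after the scene).
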
Test 3: Sync video-captioning-based audio description on a separate sound channel (sketch below).
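
For Test 3, a minimal sketch using pydub (an assumed library choice; any audio toolkit would do): program audio on the left channel, the description track on the right, so a player can route them separately:

from pydub import AudioSegment

def two_channel_mix(program_path, description_path, out_path):
    """Left channel = original program audio, right channel = AD track."""
    program = AudioSegment.from_file(program_path).set_channels(1).set_frame_rate(44100)
    ad = AudioSegment.from_file(description_path).set_channels(1).set_frame_rate(44100)
    # Pad the shorter track with silence so both channels align.
    n = max(len(program), len(ad))
    program += AudioSegment.silent(duration=n - len(program))
    ad += AudioSegment.silent(duration=n - len(ad))
    stereo = AudioSegment.from_mono_audiosegments(program, ad)
    stereo.export(out_path, format="wav")
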
Test 4: Sync video-captioning-based audio description that avoids significant music and conversation, merged with the original audio (gap-finding sketch after Test 4.2).
Test 4.1: Neutral TTS
Test 4.2: Emotional TTS (not sure where to find one?)
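
For Test 4, a minimal sketch of finding low-energy gaps (no loud music or speech) where descriptions can be merged in; the RMS threshold and window size are assumptions to be tuned, and a proper voice-activity or music detector would be a stronger choice. Input is assumed to be a float sample array in [-1, 1]:

import numpy as np

def find_quiet_gaps(samples, sr, win_s=0.5, rms_thresh=0.02, min_gap_s=2.0):
    """Return (start_s, end_s) spans whose windowed RMS stays below
    rms_thresh for at least min_gap_s seconds."""
    win = int(sr * win_s)
    n_win = len(samples) // win
    rms = np.sqrt(np.mean(
        samples[:n_win * win].reshape(n_win, win) ** 2, axis=1))
    quiet = rms < rms_thresh
    gaps, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                     # gap opens
        elif not q and start is not None:
            if (i - start) * win_s >= min_gap_s:
                gaps.append((start * win_s, i * win_s))
            start = None                  # gap closes
    if start is not None and (n_win - start) * win_s >= min_gap_s:
        gaps.append((start * win_s, n_win * win_s))
    return gaps
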
Comments