{notes} {paper} {project page} {code, Keras}
# Contextual Attention for Hand Detection in the Wild, ICCV'19 (arXiv 1904.04882)

Supreeth Narasimhaswamy†, Zhengwei Wei†, Yang Wang, Justin Zhang, Minh Hoai
## Objective
Perform hand detection in the wild (but mostly third-person views, by the look of it)
Release datasets for this task
## Datasets

- TV-Hand dataset
  - images come from the ActionThread dataset (4757 videos), with one to two frames extracted per video
  - split by source video: images from 2433 videos for training, 810 for validation, 1514 for testing
  - 9498 images: 4853 train / 1618 validation / 3027 test (split totals are sanity-checked in the snippet after this list)
  - number of hands: 4085 train / 1362 validation / 3199 test
  - image height: 360 pixels
- COCO-Hand dataset
  - obtained by automatically annotating a subset of Microsoft's COCO dataset
  - 26,499 images with 45,671 hands
  - COCO-Hand-S: a final verification step keeps only images with good and complete annotations, 4534 images with 10,845 hands
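As a quick consistency check on the TV-Hand split sizes above (a throwaway Python snippet; the dictionary layout is mine, not anything from the paper or the released code):

```python
# Per-split counts copied from the notes above; only the totals are derived.
tv_hand = {
    "videos": {"train": 2433, "val": 810, "test": 1514},   # expect 4757 total
    "images": {"train": 4853, "val": 1618, "test": 3027},  # expect 9498 total
    "hands":  {"train": 4085, "val": 1362, "test": 3199},
}

for quantity, splits in tv_hand.items():
    print(f"TV-Hand {quantity}: {splits} -> total {sum(splits.values())}")
```

The totals come out to 4757 videos, 9498 images, and 8646 hands, consistent with the dataset-level figures.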
## Method

Hand-CNN: extend Mask R-CNN with an attention module that incorporates contextual cues into the detection process, capturing non-local dependencies between features (hands are constrained by the arms and bodies they attach to).

On the data side, the paper is explicit:

> Undoubtedly, the availability of this large-scale dataset is one reason for the impressive performance of our hand detector.
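The notes above don't reproduce the exact module, so here is a minimal Keras sketch (Keras to match the released code) of a generic non-local attention block of the kind the contextual attention builds on. The `non_local_block` helper, channel counts, and input shape are illustrative assumptions, not the authors' implementation:

```python
from tensorflow.keras import layers, Model

def non_local_block(x, inter_channels=128):
    """Generic non-local attention: every spatial position attends to every other."""
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    theta = layers.Conv2D(inter_channels, 1)(x)  # queries
    phi = layers.Conv2D(inter_channels, 1)(x)    # keys
    g = layers.Conv2D(inter_channels, 1)(x)      # values
    # Flatten spatial dimensions so attention runs over all H*W positions
    theta = layers.Reshape((h * w, inter_channels))(theta)
    phi = layers.Reshape((h * w, inter_channels))(phi)
    g = layers.Reshape((h * w, inter_channels))(g)
    # Pairwise affinities between positions: (batch, HW, HW)
    attn = layers.Dot(axes=(2, 2))([theta, phi])
    attn = layers.Softmax(axis=-1)(attn)
    # Aggregate value features with the attention weights: (batch, HW, inter_channels)
    out = layers.Dot(axes=(2, 1))([attn, g])
    out = layers.Reshape((h, w, inter_channels))(out)
    out = layers.Conv2D(c, 1)(out)  # project back to the input channel count
    return layers.Add()([x, out])   # residual connection, as in non-local networks

# Toy usage on a backbone-sized feature map
inp = layers.Input(shape=(32, 32, 256))
model = Model(inp, non_local_block(inp))
model.summary()
```

This kind of attention is what lets a hand region pull in evidence from contextually related regions (arms, bodies) elsewhere in the image.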
## Experiments

Improvements upon the Mask R-CNN baseline on two datasets (average precision):

- Oxford-Hand: 69.9% → 73.0%
- TV-Hand: 59.9% → 60.3%

So the baseline is already pretty strong, and it might well be faster?