Tony Feng

M.Sc. in Computer Science

Brown University

[Computer Vision] Visual Bag of Words

Authored by Tony Feng

Created on Mar 9th, 2022

Last Modified on Mar 9th, 2022

Intro

This series of posts summarizes materials and readings from CSCI 1430 Computer Vision, a course I took at Brown University. The course covers the fundamentals of image formation, camera imaging geometry, feature detection and matching, stereo, motion estimation and tracking, image classification, scene understanding, and deep learning with neural networks. I post these “Notes” (what I’ve learnt) for study and review only.

History of Recognition

Geometric Data

1960s – early 1990s
Main sources of variability: camera position and illumination
Recognition as an alignment problem
- e.g. fitting a model to a transformation between feature pairs
Recognition by components

Appearance-based Models

Sliding Window Approaches

Mid 1990s
sliding window + image pyramid $\rightarrow$ scale + location
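
As a rough illustration of how a sliding window combined with an image pyramid covers both location and scale, here is a minimal sketch in plain Python. The function names, the fixed downscaling factor of 2, and the toy image size are assumptions made for this example; a real detector would also score each window with a classifier.

```python
def sliding_windows(width, height, window, stride):
    """Yield the (left, top) corner of every window that fits in the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield left, top


def multiscale_windows(width, height, window, stride):
    """Yield (left, top, size) boxes expressed in original-image coordinates.

    The window size stays fixed while the image is halved at each pyramid
    level, so a window at a coarser level covers a larger original region.
    """
    factor = 1  # how much the current level has been shrunk so far
    while width >= window and height >= window:
        for left, top in sliding_windows(width, height, window, stride):
            yield left * factor, top * factor, window * factor
        width, height, factor = width // 2, height // 2, factor * 2


# Toy usage: an 8x8 image scanned with a 4x4 window and a stride of 4.
boxes = list(multiscale_windows(8, 8, window=4, stride=4))
```

The first pyramid level contributes four small boxes at different locations, and the second level contributes one box that covers the whole image, which is the "scale + location" search in miniature.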

Local Features

Late 1990s
Local features for object instance recognition
Large-scale Image Search

Parts-and-shape Models

Early 2000s
Model
- Objects as a set of parts
- Relative locations between parts
- Appearance of part
Constellation Models
Pictorial Structure Model

Bags of Features

Mid-2000s
Origins
- Texture Recognition
- Bag-of-words models

Bags of Features

The bag-of-features representation works well for image-level classification and for recognizing object instances.

Steps

Feature extraction
- Regular Grids
- Interest Regions
- …

Form a “visual vocabulary”

Quantize features using visual vocabulary

Learn the visual vocabulary
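
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the descriptors are taken as already extracted (here they are made-up 2-D points), the vocabulary is learned with a basic k-means loop (a common choice, but not one the notes name), and `learn_vocabulary`, `nearest_word` and `bag_of_words` are hypothetical helper names.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (quantization)."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def learn_vocabulary(descriptors, k, iterations=20, seed=0):
    """Learn k visual words with a basic k-means loop (assumed method)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bag_of_words(descriptors, vocabulary):
    """Normalized histogram of visual-word counts for one image."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy usage with made-up 2-D "descriptors" pooled from training images.
training = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
            (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
vocab = learn_vocabulary(training, k=2)
hist = bag_of_words([(0.0, 0.0), (5.0, 5.0), (5.2, 4.9)], vocab)
```

The resulting histogram has one entry per visual word and discards where in the image each feature came from, which is why the representation is called a "bag".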

Issues

How to choose the size of the visual vocabulary?
- Too small: visual words are not representative of all features
- Too large: quantization artifacts, overfitting
Computational efficiency

Spatial Pyramid Matching

Color Histogram

All of these images have the same color histogram. How can we encode the spatial layout?
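
To make the question concrete, the sketch below builds two tiny images with identical global color histograms but different layouts, then shows that concatenating per-cell histograms over a 2x2 grid tells them apart. This is a simplified illustration of the idea behind spatial pyramids, not the full matching scheme; the images, the color labels "R" and "B", and the helper names are all invented for the example.

```python
from collections import Counter


def histogram(pixels, bins):
    """Count how many pixels fall into each color bin."""
    counts = Counter(pixels)
    return [counts[b] for b in bins]


def global_histogram(image, bins):
    """Histogram over the whole image, ignoring pixel positions."""
    return histogram([p for row in image for p in row], bins)


def grid_histogram(image, bins, cells=2):
    """Concatenate one histogram per cell of a cells x cells grid."""
    h, w = len(image), len(image[0])
    features = []
    for gy in range(cells):
        for gx in range(cells):
            cell = [image[y][x]
                    for y in range(gy * h // cells, (gy + 1) * h // cells)
                    for x in range(gx * w // cells, (gx + 1) * w // cells)]
            features.extend(histogram(cell, bins))
    return features


BINS = ["R", "B"]
# Red on the left half, blue on the right half.
left_right = [["R", "R", "B", "B"] for _ in range(4)]
# Red on the top half, blue on the bottom half.
top_bottom = [["R"] * 4, ["R"] * 4, ["B"] * 4, ["B"] * 4]

same_globally = global_histogram(left_right, BINS) == global_histogram(top_bottom, BINS)
same_on_grid = grid_histogram(left_right, BINS) == grid_histogram(top_bottom, BINS)
```

Here `same_globally` is true while `same_on_grid` is false: the grid version keeps coarse information about where each color occurs.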

Pyramids

A pyramid is built from multiple copies of the image at decreasing resolutions.
Each level in the pyramid is $\frac{1}{4}$ the size of the previous level (half the width and half the height).
The lowest level has the highest resolution.
The highest level has the lowest resolution.
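
The pyramid described above can be sketched as follows. As a simplifying assumption, each level is produced by averaging non-overlapping 2x2 blocks of the previous level (a Gaussian pyramid would blur with a Gaussian filter before subsampling); `downsample` and `build_pyramid` are hypothetical names, and the image is a plain list of rows of grayscale values.

```python
def downsample(image):
    """Halve width and height by averaging each 2x2 block of pixels."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * y][2 * x] + image[2 * y][2 * x + 1]
              + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4
             for x in range(w)]
            for y in range(h)]


def build_pyramid(image, levels):
    """Return [level 0, level 1, ...], where level 0 is the original image."""
    pyramid = [image]
    for _ in range(levels - 1):
        current = pyramid[-1]
        if len(current) < 2 or len(current[0]) < 2:
            break  # too small to halve again
        pyramid.append(downsample(current))
    return pyramid


# Toy usage: a 4x4 grayscale image whose pixel values are 0..15.
image = [[4 * y + x for x in range(4)] for y in range(4)]
pyramid = build_pyramid(image, levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

For the 4x4 input, `sizes` is `[(4, 4), (2, 2), (1, 1)]`: each level holds a quarter of the pixels of the level below it.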