Ch12. Recognition
Posted: 2022.12.20.
Last modified: 2022.12.20.
What Matters in Recognition?
Learning Techniques
- E.g. choice of classifier or inference method
Representation
- Low level: SIFT, HoG, gist, edges
- Mid level: Bag of words, sliding window, deformable model
- High level: Contextual information
Data
- More is always better
- Annotation is the difficult part
Video Google System
- Collect all words within query region
- Inverted file index to find relevant frames
- Compare word counts
- Spatial verification
Simple idea
See how many query keypoints are close to keypoints in each database image
⇒ slow
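A minimal sketch of this brute-force baseline, assuming OpenCV with SIFT descriptors (the ratio-test threshold is a common default, not from the lecture). Scoring the query against every database image means one full all-pairs matching pass per image, which is exactly what makes it slow:

```python
import cv2

sift = cv2.SIFT_create()
bf = cv2.BFMatcher(cv2.NORM_L2)

def count_close_keypoints(query_img, db_img, ratio=0.75):
    # Extract SIFT keypoints/descriptors from both images.
    _, q_desc = sift.detectAndCompute(query_img, None)
    _, d_desc = sift.detectAndCompute(db_img, None)
    # All-pairs matching with Lowe's ratio test.
    matches = bf.knnMatch(q_desc, d_desc, k=2)
    return sum(1 for m, n in matches if m.distance < ratio * n.distance)

# The slow part: one full matching pass per database image.
# scores = [count_close_keypoints(query, img) for img in database]
```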
Indexing local features
Points that are close in feature space have similar descriptors, which indicates similar local image content.
Visual words
Map high-dimensional descriptors to tokens/words by quantizing the feature space
- Quantize via clustering, let cluster centers be the prototype words
- Determine which word to assign to each new image region by finding the closest cluster center
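A minimal sketch of this quantization step, assuming scikit-learn's KMeans; `all_descriptors` and the vocabulary size are placeholders, not values from the lecture:

```python
import numpy as np
from sklearn.cluster import KMeans

# `all_descriptors` is assumed: stacked local descriptors from many
# training images, e.g. 128-D SIFT, shape (num_descriptors, 128).
vocab_size = 1000  # number of visual words (a design choice, see below)
kmeans = KMeans(n_clusters=vocab_size, n_init=4).fit(all_descriptors)

# The cluster centers are the prototype words.
vocabulary = kmeans.cluster_centers_

def to_words(descriptors):
    # Assign each new descriptor to its closest cluster center.
    return kmeans.predict(descriptors)  # array of word ids
```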
Issues:
- Vocabulary size, number of words
- Sampling strategy: grid or interest points
- Clustering / quantization algorithm
- Unsupervised vs. supervised
- What corpus provides features (universal vocabulary?)
Inverted file index
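In sketch form, an inverted file index is just a map from each visual word to the images that contain it, so a query only touches images sharing at least one word with it (the posting-list layout here is an assumption, not a fixed format):

```python
from collections import Counter, defaultdict

# word id -> list of (image id, count) postings
inverted_index = defaultdict(list)

def add_image(image_id, word_ids):
    # Index an image by the visual words it contains.
    for w, c in Counter(word_ids).items():
        inverted_index[w].append((image_id, c))

def candidate_images(query_word_ids):
    # One lookup per query word; images sharing no word are never visited.
    hits = set()
    for w in set(query_word_ids):
        hits.update(img for img, _ in inverted_index[w])
    return hits
```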
Instance recognition: remaining issues
How to summarize the content of an entire image? And gauge overall similarity?
Bags of visual words
- Summarize entire image based on its distribution (histogram) of word occurrences.
- Comparing bags of words: nearest neighbor search
- Inverted file index and bags-of-words similarity (see the sketch after this list)
- Extract words in query
- Use the inverted file index to find relevant frames
- Compare word counts
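A sketch of the bag-of-words summary and its comparison, assuming word ids from a quantizer like the one above; histogram intersection is one common similarity choice, not the only one:

```python
import numpy as np

def bag_of_words(word_ids, vocab_size):
    # Histogram of word occurrences, L1-normalized so that
    # images with more features are not favored.
    hist = np.bincount(word_ids, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

def similarity(h1, h2):
    # Histogram intersection; nearest neighbor = highest score.
    return np.minimum(h1, h2).sum()
```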
How large should the vocabulary be? How to perform quantization efficiently?
Recognition with K-tree
Use Maximally Stable Extremal Regions (MSER) by Matas et al., then extract SIFT descriptors from the MSER regions.
- Run k-means on the descriptor space, where k is the branch factor of the tree (3 in this example); see the sketch after this list
- Run k-means again, recursively on each of the resulting quantization cells.
- This defines the vocabulary tree, which is essentially a hierarchical set of cluster centers and their corresponding Voronoi regions.
- We typically use a branch factor of 10 and six levels, resulting in about a million (10^6) leaf nodes
- To add an image to the database, we perform feature extraction; each descriptor vector is then dropped down from the root of the tree and quantized very efficiently into a path down the tree, encoded by a single integer
- Each node in the vocabulary tree has an associated inverted file index.
- Number of words given tree parameters
- branching factor
- number of levels
- Higher branch factor works better (but slower)
- Word assignment cost vs. flat vocabulary
- sampling strategies
- sparse, at interest points
- dense, uniformly
- randomly
- multiple interest operators
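A minimal sketch of the hierarchical k-means construction and the per-descriptor descent, assuming scikit-learn; with branch=10 and depth=6 as in the text this would yield about 10^6 leaves (tiny inputs work too):

```python
import numpy as np
from sklearn.cluster import KMeans

class VocabTreeNode:
    def __init__(self, descriptors, branch=10, depth=6):
        self.kmeans = None
        self.children = []
        if depth == 0 or len(descriptors) < branch:
            return  # leaf: no further quantization cells
        # Split this quantization cell into `branch` sub-cells.
        self.kmeans = KMeans(n_clusters=branch, n_init=2).fit(descriptors)
        for c in range(branch):
            cell = descriptors[self.kmeans.labels_ == c]
            self.children.append(VocabTreeNode(cell, branch, depth - 1))

    def quantize(self, desc):
        # Drop a descriptor from the root down to a leaf; the branch
        # choices along the way form the path (encodable as one integer).
        path, node = [], self
        while node.kmeans is not None:
            c = int(node.kmeans.predict(desc.reshape(1, -1))[0])
            path.append(c)
            node = node.children[c]
        return tuple(path)
```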
Is having the same set of visual words enough to identify the object/scene? How to verify spatial agreement?
- With spatial information, the first one is better
- Without it, there is no difference
Spatial Verification
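A sketch of the usual verification step, assuming OpenCV: fit a geometric model (a homography here; an affine model is also common) to the tentative matches with RANSAC, and score by inlier count:

```python
import cv2
import numpy as np

def spatial_verify(q_pts, d_pts, thresh=4.0):
    # q_pts, d_pts: matched keypoint coordinates, shape (N, 2), N >= 4.
    H, inlier_mask = cv2.findHomography(
        np.float32(q_pts), np.float32(d_pts), cv2.RANSAC, thresh)
    if inlier_mask is None:
        return 0  # no geometrically consistent subset found
    return int(inlier_mask.sum())  # inlier count = verified score
```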
How to score the retrieval results?
💡
tf-idf weighting
Term frequency - inverse document frequency
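Following the Video Google formulation, word $i$ in image $d$ is weighted as

$$t_i = \frac{n_{id}}{n_d}\,\log\frac{N}{n_i}$$

where $n_{id}$ is the number of occurrences of word $i$ in image $d$, $n_d$ the total number of words in $d$, $n_i$ the number of database images containing word $i$, and $N$ the number of database images. Words frequent in this image are weighted up; words common across the whole database are weighted down.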
Query Expansion
query → results → spatial verification → new query → new results
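A sketch of that loop; `retrieve`, `spatially_verify`, and `merge_words` are hypothetical helpers standing in for the pieces above:

```python
def query_expansion(query_words, database, rounds=1):
    results = retrieve(query_words, database)  # initial ranked results
    for _ in range(rounds):
        # Keep only results that pass geometric verification.
        verified = [r for r in results if spatially_verify(query_words, r)]
        if not verified:
            break
        # Enrich the query with words from the verified results.
        query_words = merge_words(query_words, verified)
        results = retrieve(query_words, database)  # re-query
    return results
```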
Keys to efficiency
Simple approach: keypoint matching
- All-pairs local feature matching is slow ⇒ quantize features and build a bag-of-features representation ⇒ lossy ⇒ spatial verification can help
- Finding the overlap in visual words based on the bags of features is still too slow ⇒ inverted file index, one lookup per word
- Even quantizing the local features into a visual word is too slow ⇒ vocabulary tree ⇒ lossy