urjnasw xkfjjkn's hot blog

Monday, April 22, 2013

Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches



---By Michael Oltmans

Introduction:

This paper introduces a shape recognition approach based on the appearance of the ink on the page (a visual approach to sketch recognition).

Challenges:

Since user-drawn sketches are informal, two people can have completely different drawing styles. This introduces several big challenges in the informal sketch recognition field, such as overtracing, signal noise (variations that users do not intend), conceptual variation (the variety of ways a symbol can be drawn), segmentation (grouping the strokes that correspond to a single shape), etc.

Isolated Shape Classification:

Similar to recognition based on geometric semantic primitive parts, the classification approach here is also based on primitive parts. The difference is that it focuses on the visual parts of shapes rather than their geometric semantic parts. A visual part is an abstraction of the appearance of a region of the shape: it does not correspond to a semantic part of the shape (e.g., one of the lines of a rectangle) but instead represents the appearance of a region of it (e.g., the top right-hand corner of a rectangle).
Each visual part is a circular region subdivided into rings and wedges, like a dartboard. Each subdivision of the circular region is a histogram bin that counts how many pixels of ink fall in that region. The part is represented as a vector formed from the count in each bin. The strength of this representation is that it can capture a wide variety of shapes, including those that cannot easily be described with geometric primitives.
To form the representation, they first calculate the visual parts at a sampling of locations that cover the sketched symbol.
Next they make use of a standard vocabulary of parts, which is used to describe each sketched symbol in terms of the same set of parts. This vocabulary is called the codebook, and generally contains 100 different parts. The representation of the sketched symbol is a vector of distances that indicates the degree to which each codebook part appears in the symbol. This vector of distances is called the match vector. It is calculated by forming a matrix of distances between each codebook part and each part of the symbol, and then taking the minimum distance in each row.
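The match-vector computation can be sketched as a small NumPy routine (the Euclidean distance between bin-count vectors and all names here are illustrative assumptions, not the paper's):

```python
import numpy as np

def match_vector(codebook, symbol_parts):
    """For each codebook part, find the distance to the closest
    visual part extracted from the sketched symbol.

    codebook     : (C, D) array, one row per codebook part
    symbol_parts : (P, D) array, one row per part found in the symbol
    returns      : (C,) match vector of minimum distances
    """
    # Pairwise distance matrix: rows = codebook parts, cols = symbol parts.
    dists = np.linalg.norm(codebook[:, None, :] - symbol_parts[None, :, :], axis=2)
    # The match vector is the minimum distance in each row.
    return dists.min(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
parts = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
print(match_vector(codebook, parts))  # [1. 0.]
```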

Summary of Shape Localization:

This is accomplished in three steps:
(1)    Scan the input sketch to identify a large number of candidate locations. These locations may contain a shape, a part of a shape, parts of multiple shapes, or nothing at all.
(2)    Apply the classifier to each of the candidate locations.
(3)    Sort through these candidates to produce a final set of predictions of where shapes are in the sketch.

Representation of Visual Parts:

To line up two shapes more accurately, we represent each shape as a collection of “parts”, where each part represents the appearance of a portion of the shape. We then base the classification of the shape on the parts it is composed of.

Bulls-eye Feature:

Bulls-eye features are calculated at a point and represent the appearance of a circular region surrounding that point. The central point of the feature is called the interest point. The circular region surrounding the interest point is divided, like a dartboard, by a series of concentric rings and pie slices. Each of the wedges in the circular region is a histogram bin that counts the number of ink points contained inside it. The appearance of ink within the region is represented by a vector containing the point count for each bin.
As can be clearly seen from the feature diagram, the radial bins are not evenly spaced. The inner rings are denser and represent more detailed information about the appearance, while the outer rings represent the “context” for the inner rings. This is useful for our goals of suppressing signal noise and being able to represent a wide range of appearances.
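A minimal sketch of the bulls-eye histogram might look like the following (the paper's exact ring spacing is not given; geometrically spaced rings are used here so the inner rings are denser, and all names are illustrative):

```python
import numpy as np

def bullseye_histogram(points, center, radius=1.0, n_rings=4, n_wedges=8):
    """Count ink points in the ring/wedge bins of a circular region.

    points : (N, 2) array of ink point coordinates
    center : (2,) interest point at the middle of the bulls-eye
    returns: (n_rings, n_wedges) histogram of point counts
    """
    d = points - center
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)

    # Geometrically spaced ring edges: inner rings narrower than outer ones.
    edges = radius * np.geomspace(1.0 / 2 ** n_rings, 1.0, n_rings + 1)
    hist = np.zeros((n_rings, n_wedges), dtype=int)
    for ri, ti in zip(r, theta):
        if ri > radius:
            continue  # outside the bulls-eye region
        ring = max(np.searchsorted(edges, ri) - 1, 0)  # clamp near-center points
        wedge = int(ti / (2 * np.pi / n_wedges)) % n_wedges
        hist[ring, wedge] += 1
    return hist
```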


Strokes Have Direction:

Inputs to the system are still digital sketches. We focus on the trajectory of each stroke rather than the sequence of strokes, so there are no constraints on the order in which users draw the components of each shape.

Making Features Rotational Invariant:

By calculating the bulls-eye’s histogram relative to the stroke direction, instead of relative to the x-axis, the bulls-eyes are rotationally invariant and do not change when the shape is drawn at a different orientation.
Two details must be dealt with to achieve true invariance.
(1)    Identical-looking strokes can be drawn from either end.
(2)    The inherent noise in stroke trajectories makes points along the stroke fall haphazardly onto either side of a bin boundary. This effect is eliminated by orienting the bulls-eye to the interest point's direction and then rotating it by an additional half of a bin's angular width.
The stroke direction can be added as a third dimension to the histogram. It provides a count of how many points fall into each spatial bin at each direction.

How to calculate stroke direction:

To calculate the direction of the pen at a point along the stroke, we find the tangent of the stroke at that point. To deal with error caused by the imprecision of pen movement, the tangent is calculated by fitting a line to a window of points immediately preceding and following the point. We find that calculating the tangent over a window of 19 points generally gives good results.
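The window-based tangent estimate could be sketched like this (total least squares via SVD stands in for the unspecified line-fitting method, so vertical segments are handled too; names are illustrative):

```python
import numpy as np

def stroke_directions(points, window=19):
    """Estimate pen direction at each point by fitting a line to a
    window of neighboring points (total least squares via SVD).

    points : (N, 2) resampled stroke points
    returns: (N,) tangent angles in radians, in [0, pi)
    """
    half = window // 2
    n = len(points)
    angles = np.empty(n)
    for i in range(n):
        w = points[max(0, i - half): min(n, i + half + 1)]
        centered = w - w.mean(axis=0)
        # The first right singular vector is the dominant direction.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        dx, dy = vt[0]
        angles[i] = np.arctan2(dy, dx) % np.pi  # tangent is unoriented
    return angles
```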

Stroke preprocessing:

Size normalization: scale the shape so that the maximum distance between any two points in the shape is 75 pixels.
Resampling: ensure the points on the stroke are evenly spaced. If two consecutive points are more than 1 pixel apart, additional sample points are interpolated between them.
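Both preprocessing steps can be sketched as follows (function names are illustrative):

```python
import numpy as np

def normalize_size(points, target=75.0):
    """Scale the shape so the maximum pairwise distance is `target` pixels."""
    diffs = points[:, None, :] - points[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum(axis=2)).max()
    return points * (target / diameter)

def resample(points, spacing=1.0):
    """Interpolate extra points so consecutive samples are at most
    `spacing` pixels apart."""
    out = [points[0]]
    for p in points[1:]:
        prev = out[-1]
        dist = np.linalg.norm(p - prev)
        n_new = int(np.ceil(dist / spacing))
        for k in range(1, n_new + 1):
            out.append(prev + (p - prev) * k / n_new)
    return np.array(out)
```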

Calculating distance between Bulls-eyes:

The distance between two bulls-eyes is calculated by comparing their vectors of bin counts.

Representation of Shapes:

What is codebook?

---It is the standard vocabulary of parts, derived from the training set. We select a set of parts that spans the range of parts appearing in the training set: first calculate the collection of bulls-eye parts for each shape in the training set, then cluster them into groups of similar parts, and finally form the codebook by taking a representative from each cluster center.
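A minimal sketch of codebook construction, using plain k-means as a stand-in for whatever clustering the paper actually uses (the first-k initialization is also an assumption):

```python
import numpy as np

def build_codebook(part_vectors, k=100, iters=20):
    """Cluster bulls-eye part vectors and keep one representative
    (the cluster mean) per cluster.

    part_vectors : (N, D) array of bulls-eye vectors from training shapes
    returns      : (k, D) codebook of representative parts
    """
    # Initialize with the first k parts (a simplification).
    centers = part_vectors[:k].astype(float).copy()
    for _ in range(iters):
        # Assign each part to its nearest center.
        d = np.linalg.norm(part_vectors[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned parts.
        for j in range(k):
            members = part_vectors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```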

What is match vector?

---A kind of representation that indicates the degree to which each of the codebook parts appears in the input shape.

Recognition:

Use the representation of Class label plus Matched Class Probability for each shape.

Steps:

Selecting candidate regions, classifying the candidate regions, and combining the classified regions to form the final predictions.
After clustering the initial candidates to form the intermediate candidates, there may still be predictions that overlap one another. To make a final set of predictions we follow a greedy strategy based on the total weight of each cluster: first select the highest-scoring prediction, then remove any remaining predictions that significantly overlap the region covered by it, and repeat.
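This greedy step resembles non-maximum suppression and might be sketched like this (axis-aligned boxes and an intersection-over-union overlap test are my assumptions; the paper's overlap measure may differ):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_predictions(candidates, overlap_thresh=0.5):
    """Greedily keep the highest-weight candidate and drop any
    remaining candidate that significantly overlaps it.

    candidates : list of (weight, box) pairs
    returns    : list of kept (weight, box) pairs
    """
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [c for c in remaining
                     if iou(c[1], best[1]) < overlap_thresh]
    return kept
```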

Bibliography:

http://rationale.csail.mit.edu/publications/Oltmans2007Envisioning.pdf

A Visual Approach to Sketched Symbol Recognition



Introduction:

The primary contribution of this paper is an original way to recognize freehand sketches by their visual appearance, combining temporal and spatial information, in contrast to approaches based on individual strokes or geometric primitives. Another contribution is a new classifier that is invariant to symbol rotation and deformation.

Approach:

Unlike pure off-line shape recognition, the extra information about the temporal nature of the strokes is used as well.

Procedure:

(1)    Symbol Normalization:
Resampling, scaling, and translation. Resampling assigns points at a fixed distance interval along the stroke. The shape is then scaled horizontally and vertically until it has unit standard deviation on both axes, and translated so that its center of mass lies at (0, 0).
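The scaling and translation steps can be sketched in a few lines (resampling at a fixed interval is assumed to have happened already; names are illustrative):

```python
import numpy as np

def normalize_symbol(points):
    """Move the center of mass to the origin and scale each axis
    to unit standard deviation.

    points : (N, 2) resampled symbol points
    """
    pts = points - points.mean(axis=0)   # center of mass -> (0, 0)
    std = pts.std(axis=0)
    std[std == 0] = 1.0                  # guard against degenerate axes
    return pts / std                     # unit std on both axes
```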
(2)    Feature Representation:
Four orientation features (0, 45, 90, and 135 degrees) measure how horizontal, vertical, or diagonal the stroke is at each point. If the stroke angle equals the reference angle, the value is 1; if they differ by more than 45 degrees, the value is 0.
The fifth feature is an endpoint feature testing whether the starting and ending points of the stroke overlap. If they do, the feature value is 1; otherwise it is 0.
The authors use five 24x24 grids to represent the five features. These grids can be thought of as feature images, in which the intensity of a pixel is determined by the maximum feature value of the sample points that fall within its cell. For a horizontal stroke, for example, the intensity of the 0-degree orientation image is high.
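A sketch of the four orientation features (the linear falloff between a perfect match and a 45-degree difference is an assumed interpolation consistent with the endpoints given above):

```python
import numpy as np

def orientation_features(stroke_angles_deg):
    """Compute the four orientation feature values for each sample
    point's stroke angle, against reference angles 0, 45, 90, 135.

    stroke_angles_deg : (N,) stroke directions in degrees
    returns           : (N, 4) feature values in [0, 1]
    """
    refs = np.array([0.0, 45.0, 90.0, 135.0])
    diff = np.abs(stroke_angles_deg[:, None] - refs[None, :]) % 180.0
    diff = np.minimum(diff, 180.0 - diff)   # unoriented angular distance
    # 1 at a perfect match, falling linearly to 0 at a 45-degree difference.
    return np.clip(1.0 - diff / 45.0, 0.0, 1.0)
```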
(3)    Smoothing and Downsampling:
To make recognition more tolerant to local shifts and distortions, the authors apply a Gaussian smoothing function to each feature image, spreading feature values to neighboring pixels.
They then downsample the images by a factor of 2 using a MAX filter, where each pixel in the downsized image is the maximum of the four corresponding pixels in the original.
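The MAX-filter downsampling can be sketched in a few lines (the Gaussian smoothing step is omitted here):

```python
import numpy as np

def downsample_max(image):
    """Downsample a feature image by a factor of 2: each output pixel
    is the maximum of the corresponding 2x2 block (a MAX filter).
    Assumes even height and width."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```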
(4)    Recognition:
The Image Deformation Model (IDM) allows every point in the input image to shift within a 3*3 local window to form the best match to the prototype image. To avoid overfitting, the authors include the local context around each point, shifting 3*3 image patches instead of single pixels.
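A brute-force sketch of the IDM distance, with 3x3 patches allowed to shift within a 3x3 window (names and the sum-of-squares patch cost are illustrative assumptions):

```python
import numpy as np

def idm_distance(input_img, prototype, patch=1, shift=1):
    """Image Deformation Model distance: each (2*patch+1)^2 patch of
    the input may shift within a (2*shift+1)^2 window to find its
    best match in the prototype; the distance is the sum of the best
    per-patch match costs.  Both images are 2D arrays of equal shape.
    """
    h, w = input_img.shape
    total = 0.0
    for y in range(patch, h - patch):
        for x in range(patch, w - patch):
            p = input_img[y - patch: y + patch + 1, x - patch: x + patch + 1]
            best = np.inf
            for dy in range(-shift, shift + 1):
                for dx in range(-shift, shift + 1):
                    yy, xx = y + dy, x + dx
                    if patch <= yy < h - patch and patch <= xx < w - patch:
                        q = prototype[yy - patch: yy + patch + 1,
                                      xx - patch: xx + patch + 1]
                        best = min(best, ((p - q) ** 2).sum())
            total += best
    return total
```

Allowing the patches to shift makes the distance tolerant of small local distortions that a rigid pixel-to-pixel comparison would penalize.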

(5)    Performance Optimization:
      Coarse Candidate Pruning:
Images are indexed by their first K principal components; the distances between these reduced feature vectors are then used to find the nearest candidates.

Hierarchical Clustering:
Agglomerative hierarchical clustering is applied to the training examples in each class, organizing them into groups based on complete-link distance. The process first places each symbol in its own cluster, then progressively merges the two nearest clusters until there is only one cluster per class. At each step, it records the two sub-clusters that were merged to form the current cluster.


Rotational Invariance:
Rotated versions of the input symbol are generated and matched against each of the training examples. The implementation uses 32 evenly spaced orientations from 0 to 360 degrees.


Bibliography:

http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewFile/281/918