Envisioning Sketch
Recognition: A Local Feature Based Approach to Recognizing Informal Sketches
---By Michael Oltmans
Introduction:
This
paper introduces the shape recognition approach based on the ink on the page
(visual approach sketch recognition).
Challenges:
Since
user-drawn sketches are informal, two people can have completely different
writing styles. This introduces several big challenges in information sketch
recognition field, such as overtracing, signal noises (variations that users do
not intend), conceptual variations (A variety of ways some symbols can be
drawn), segmentation (grouping strokes that correspond to a single shape) etc.
Isolated Shape
Classification:
Similar
to geometric semantic primitive part recognition, the classification approach
here is also based on primitive parts. The only difference is we will focus on
visual part instead of geometric semantic part of shapes. A visual part is an
abstraction of the appearance of a region of the shape. It does not correspond
to the semantic part of a shape (one of the lines in rectangle). It represents
the appearance of a region of a shape (the top right hand corner of a rectangle).
Each
visual part is a circular region that is subdivided into rings and wedges, like
a dartboard. Each of the subdivisions of the circular region is a histogram bin
that measures how many pixels of ink are contained on that region. The part is
represented as a vector formed from the count in each bin. The good point of
this presentation is it allows us to represent a wide variety of shapes,
including those cannot be easily described with geometric primitives.
To
form the representation, they first calculate the visual parts at a sampling of
locations that cover the sketched symbol.
Next
step they make use of a standard vocabulary of parts. The standard vocabulary
is used to describe each sketched symbol in terms of the same set of parts. The
vocabulary of parts is called the codebook, and generally contains 100
different parts. The representation of the sketched symbol is a vector of
distances that indicates the degree to which each of the codebook parts appears
in the sketched symbol. This vector of distances is called the match vector. It
is calculated by finding the minimum distance in each row of the matrix.
Summary of Shape
Localization:
This
is accomplished in three steps:
(1)
Scan the input sketch to identify a large number
of candidate locations. These locations may contain a shape, a part of a shape,
parts of multiple shapes, or nothing at all.
(2)
The second step is to apply the classifier to
each of the candidate location.
(3)
The third step is to sort through this
candidates to produce a final set of predictions of where shapes are in the
sketch
Representation of Visual Parts:
For more accurate lining up between two shapes, we
represents the shapes as a collection of “parts” where each part represents the
appearance of a portion of the shape. We then base our classification of the
shape on the parts that it is composed of.
Bulls-eye Feature:
Bull-eyes features are calculated at a point and
represent the appearance of a circular region surrounding that point. The
central point of the feature is called the interest point. The circular region
surrounding the interest point is divided into wedges, like a dartboard, by a
series of concentric rings and pie slices. Each of the wedges in the circular
region is a histogram bin that counts the number of ink points contained inside
it. The appearance of ink within the region is represented by a vector
containing the point count for each bin.
As can be clearly seen from the feature diagram, the
radial bins are not evenly spaced. The inner rings are denser represents more
detailed information about the appearance compared with the outer rings, which
represents the “context” for inner rings. This is useful for our goals of
suppressing signal noise and being able to represent a wide range of
appearances.
Strokes Have Direction:
Inputs
to system are still digital sketches. We focus on the trajectory of a stroke
rather than the sequence of strokes. Therefore, there will be no constraints on
the order users draw the components of each shape.
Making Features Rotational Invariant:
By
calculating the bulls-eye’s histogram relative to the stroke direction, instead
of relative to the x-axis, the bulls-eyes are rotationally invariant and do not
change when the shape is drawn at a different orientation.
Two
details must be dealt with to achieve true invariance.
(1)
Identical looking strokes can be drawn from
either end.
(2) The inherent noise in stroke trajectories
results in the points along the stroke falling haphazardly onto either side of
this boundary. This effect is eliminated by orienting the bullseye to the
interest point's orientation and then rotating it by an additional amount
equivalent to half a bin's angular width.
The
stroke direction can be added as a third dimension to the histogram. It
provides a count of how many points fall into each spatial bin at each direction.
How to calculate stroke direction:
To
calculate the direction of the pen at a point along the stroke, we find the
tangent of the stroke at that point. To deal with error caused by imprecision
of pen movement, we deal with the direction of pen at a point in the following
way: Calculating the tangent by fitting a line to a window of points
immediately preceding and following the point. We find calculating the tangent
over a window of 19 points generally provides good result.
Stroke
preprocessing:
Size
normalization: individual points based normalization: scale the shape so that
the maximum distance between any two points in the shape is 75 pixels.
Resampling:
Ensure the points on the stroke are constantly spaced. If there exits two
points, whose distance is larger than 1 pixel, we will interpolate additional
sample points between them.
Calculating
distance between Bulls-eyes:
Representation of Shapes:
What is codebook?
---It
is the standard vocabulary of parts. It is derived from training set. We select
a set of parts that span the range of parts that appear in the training set.
Then we calculate the collection of bulls-eyes parts for each shape in our
training set. After that, cluster them into groups of similar parts. Finally,
form a codebook by using a representative from each cluster center.
What is match vector?
---A
kind of representation that indicates the degree to which each of the codebook
parts appears in the input shape.
Recognition:
Use
the representation of Class label plus Matched Class Probability for each
shape.
Steps:
Selecting
candidate regions, classifying the candidate regions, and combining the
classified regions to form the final predictions.
After
clustering the initial candidates to form the intermediate candidates, there
may still be predictions that overlap one another. To make a final set of
predictions we follow a greedy strategy based on the total weight of each
cluster. The final set of predictions is made by first selecting the highest
scoring prediction and then removing any remaining predictions that
significantly overlap the covered by that prediction.
Bibliography:
http://rationale.csail.mit.edu/publications/Oltmans2007Envisioning.pdf