urjnasw xkfjjkn's hot blog

Monday, April 22, 2013

Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches

---By Michael Oltmans

Introduction:

This paper introduces a shape recognition approach based on the appearance of the ink on the page (a visual approach to sketch recognition).

Challenges:

Since user-drawn sketches are informal, two people can have completely different drawing styles. This introduces several big challenges in the field of informal sketch recognition, such as overtracing, signal noise (variations that users do not intend), conceptual variation (the variety of ways a symbol can be drawn), and segmentation (grouping the strokes that correspond to a single shape).

Isolated Shape Classification:

Similar to recognition based on geometric/semantic primitive parts, the classification approach here is also based on primitive parts. The difference is that it focuses on the visual parts of shapes instead of their geometric/semantic parts. A visual part is an abstraction of the appearance of a region of the shape. It does not correspond to a semantic part of a shape (e.g., one of the lines of a rectangle); it represents the appearance of a region of a shape (e.g., the top right corner of a rectangle).
Each visual part is a circular region that is subdivided into rings and wedges, like a dartboard. Each subdivision of the circular region is a histogram bin that measures how many pixels of ink are contained in that region. The part is represented as a vector formed from the count in each bin. The advantage of this representation is that it allows us to describe a wide variety of shapes, including those that cannot be easily described with geometric primitives.
To form the representation, they first calculate the visual parts at a sampling of locations that cover the sketched symbol.
Next, they make use of a standard vocabulary of parts, which is used to describe each sketched symbol in terms of the same set of parts. This vocabulary of parts is called the codebook, and it generally contains 100 different parts. The representation of the sketched symbol is a vector of distances that indicates the degree to which each of the codebook parts appears in the symbol. This vector of distances is called the match vector. It is calculated by taking the minimum distance in each row of the matrix of distances between codebook parts and the parts extracted from the symbol.
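A minimal sketch of that match-vector computation, assuming Euclidean distance between part histograms (the paper's exact distance function may differ):

```python
import numpy as np

def match_vector(codebook, symbol_parts):
    """codebook: (C, B) array of C codebook part histograms.
    symbol_parts: (P, B) array of bullseye histograms from one symbol.
    Returns a length-C match vector: for each codebook part, the distance
    to the closest part found in the symbol."""
    # pairwise distances: rows = codebook parts, columns = symbol parts
    dists = np.linalg.norm(codebook[:, None, :] - symbol_parts[None, :, :], axis=2)
    # minimum distance in each row = degree to which that codebook part appears
    return dists.min(axis=1)
```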

Summary of Shape Localization:

This is accomplished in three steps:
(1)    Scan the input sketch to identify a large number of candidate locations. These locations may contain a shape, part of a shape, parts of multiple shapes, or nothing at all.
(2)    The second step is to apply the classifier to each of the candidate locations.
(3)    The third step is to sort through these candidates to produce a final set of predictions of where shapes are in the sketch.

Representation of Visual Parts:

To line up two shapes more accurately, we represent each shape as a collection of "parts", where each part represents the appearance of a portion of the shape. We then base the classification of the shape on the parts it is composed of.

Bulls-eye Feature:

Bulls-eye features are calculated at a point and represent the appearance of a circular region surrounding that point. The central point of the feature is called the interest point. The circular region surrounding the interest point is divided into wedges, like a dartboard, by a series of concentric rings and pie slices. Each wedge in the circular region is a histogram bin that counts the number of ink points contained inside it. The appearance of ink within the region is represented by a vector containing the point count for each bin.
As can be seen from the feature diagram, the radial bins are not evenly spaced. The inner rings are denser and represent more detailed information about the appearance, while the outer rings represent the "context" for the inner rings. This is useful for our goals of suppressing signal noise and being able to represent a wide range of appearances.
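A minimal sketch of a bulls-eye histogram, assuming 4 rings with geometrically growing radii and 8 angular slices (the paper's exact bin layout is not given here, and the stroke-direction rotation normalization described later is omitted):

```python
import numpy as np

def bullseye(points, center, n_rings=4, n_slices=8, r_max=40.0):
    """points: (N, 2) array of ink points; center: the interest point.
    Returns a flattened (n_rings * n_slices) histogram of point counts."""
    d = points - np.asarray(center, dtype=float)
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    # ring edges grow geometrically, so inner rings are denser than outer ones
    edges = r_max * (0.5 ** np.arange(n_rings - 1, -1, -1))
    ring = np.searchsorted(edges, r)          # index of the ring each point falls in
    slice_ = (theta / (2 * np.pi) * n_slices).astype(int)
    hist = np.zeros((n_rings, n_slices))
    inside = ring < n_rings                   # ignore points beyond the outer ring
    np.add.at(hist, (ring[inside], slice_[inside]), 1)
    return hist.ravel()
```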


Strokes Have Direction:

The inputs to the system are still digital (on-line) sketches. We focus on the trajectory of each stroke rather than the sequence of strokes; therefore, there are no constraints on the order in which users draw the components of each shape.

Making Features Rotational Invariant:

By calculating the bulls-eye's histogram relative to the stroke direction instead of relative to the x-axis, the bulls-eyes become rotationally invariant and do not change when the shape is drawn at a different orientation.
Two details must be dealt with to achieve true invariance.
(1)    Identical looking strokes can be drawn from either end.
(2)  The inherent noise in stroke trajectories results in the points along the stroke falling haphazardly onto either side of this boundary. This effect is eliminated by orienting the bullseye to the interest point's orientation and then rotating it by an additional amount equivalent to half a bin's angular width.
The stroke direction can be added as a third dimension to the histogram. It provides a count of how many points fall into each spatial bin at each direction.

How to calculate stroke direction:

To calculate the direction of the pen at a point along the stroke, we find the tangent of the stroke at that point. To deal with the error caused by imprecise pen movement, we calculate the tangent by fitting a line to a window of points immediately preceding and following the point. Calculating the tangent over a window of 19 points generally provides good results.
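A minimal sketch of that windowed tangent estimate, assuming a total-least-squares line fit (via the window's principal direction), which handles vertical strokes gracefully:

```python
import numpy as np

def stroke_direction(points, i, window=19):
    """Estimate the pen direction at points[i] by fitting a line to the
    window of samples around it and taking the line's angle."""
    half = window // 2
    lo, hi = max(0, i - half), min(len(points), i + half + 1)
    w = np.asarray(points[lo:hi], dtype=float)
    # principal direction of the centered window points = fitted line direction
    centered = w - w.mean(axis=0)
    _, vecs = np.linalg.eigh(centered.T @ centered)
    tangent = vecs[:, -1]                      # eigenvector of the largest eigenvalue
    return np.arctan2(tangent[1], tangent[0])  # angle of the tangent line
```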

Stroke preprocessing:

Size normalization: normalization based on individual points: scale the shape so that the maximum distance between any two points in the shape is 75 pixels.
Resampling: ensure the points on the stroke are evenly spaced. If there exist two adjacent points whose distance is larger than 1 pixel, we interpolate additional sample points between them.

Calculating distance between Bulls-eyes:

The distance between two bulls-eyes is calculated by comparing their vectors of bin counts.

Representation of Shapes:

What is codebook?

---It is the standard vocabulary of parts, derived from the training set. We select a set of parts that spans the range of parts appearing in the training set: first we calculate the collection of bulls-eye parts for each shape in our training set, then cluster them into groups of similar parts, and finally form the codebook by taking a representative from each cluster center.
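A minimal sketch of codebook construction, assuming k-means clustering over the pooled training histograms (the paper's exact clustering method may differ); each cluster center becomes one codebook entry:

```python
import numpy as np

def build_codebook(all_parts, k=100, iters=20, seed=0):
    """all_parts: (N, B) array of bullseye histograms pooled from every
    training shape. Returns a (k, B) codebook of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = all_parts[rng.choice(len(all_parts), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each part to its nearest center
        d = np.linalg.norm(all_parts[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned parts
        for j in range(k):
            members = all_parts[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```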

What is match vector?

---A kind of representation that indicates the degree to which each of the codebook parts appears in the input shape.

Recognition:

Each classified shape is represented by its class label plus the matched class probability.

Steps:

Selecting candidate regions, classifying the candidate regions, and combining the classified regions to form the final predictions.
After clustering the initial candidates to form the intermediate candidates, there may still be predictions that overlap one another. To make a final set of predictions we follow a greedy strategy based on the total weight of each cluster: first select the highest scoring prediction, then remove any remaining predictions that significantly overlap the region covered by that prediction, and repeat.
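A minimal sketch of this greedy overlap suppression, assuming rectangular candidate regions and a hypothetical overlap threshold of 0.5:

```python
def greedy_predictions(candidates, overlap_threshold=0.5):
    """candidates: list of (score, box) where box = (x0, y0, x1, y1).
    Greedily keeps the best-scoring regions, dropping overlapping ones."""
    def overlap(a, b):
        # intersection area divided by the smaller box's area
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = min((a[2]-a[0]) * (a[3]-a[1]), (b[2]-b[0]) * (b[3]-b[1]))
        return inter / area if area else 0.0

    kept = []
    for score, box in sorted(candidates, reverse=True):
        if all(overlap(box, k) < overlap_threshold for _, k in kept):
            kept.append((score, box))
    return kept
```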

Bibliography:

http://rationale.csail.mit.edu/publications/Oltmans2007Envisioning.pdf

A Visual Approach to Sketched Symbol Recognition

Introduction:

The primary contribution of this paper is an original way to recognize freehand sketches by their visual appearance, in contrast to methods built on individual strokes or geometric primitives, while still combining temporal and spatial information. Another contribution is a new symbol classifier that is invariant to rotation and local deformation.

Approach:

Unlike pure off-line shape recognition, the extra information about the temporal nature of the strokes is used as well.

Procedure:

(1)    Symbol Normalization:
Resampling, scaling, and translation. Resampling places points along each stroke at a fixed distance interval. The symbol is then scaled horizontally and vertically until it has unit standard deviation in both axes, and finally translated so that its center of mass lies at (0, 0).
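A minimal sketch of this normalization for one symbol, assuming a hypothetical resampling interval (the paper's value is not given here):

```python
import numpy as np

def normalize_symbol(strokes, interval=2.0):
    """strokes: list of (N_i, 2) arrays. Resample each stroke at a fixed
    distance interval, then scale to unit std dev and center at (0, 0)."""
    resampled = []
    for s in strokes:
        s = np.asarray(s, dtype=float)
        # cumulative arc length along the stroke
        d = np.concatenate([[0.0], np.cumsum(np.linalg.norm(np.diff(s, axis=0), axis=1))])
        t = np.arange(0.0, d[-1], interval)
        resampled.append(np.column_stack([np.interp(t, d, s[:, 0]),
                                          np.interp(t, d, s[:, 1])]))
    pts = np.vstack(resampled)
    center, std = pts.mean(axis=0), pts.std(axis=0)
    std[std == 0] = 1.0                         # guard against degenerate strokes
    return [(r - center) / std for r in resampled]
```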
(2)    Feature Representation:
Four orientation features (0, 45, 90, and 135 degrees) measure how horizontal, vertical, or diagonal the stroke is at each point. If the stroke angle equals the reference angle, the feature value is 1; if they differ by more than 45 degrees, the value is 0.
The fifth feature is an endpoint feature that marks stroke endpoints: if a sample point is the starting or ending point of a stroke, the feature value is 1, else it is 0.
The authors use five 24×24 grids to represent the five features. These grids can be thought of as feature images, in which the intensity of a pixel is determined by the maximum feature value of the sample points that fall within its cell. For example, for a horizontal stroke, the intensity of the 0-degree orientation image is high.
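A minimal sketch of the four orientation feature images (the endpoint image is omitted for brevity), assuming per-point stroke angles estimated from neighboring samples and points roughly within ±2.5 after normalization:

```python
import numpy as np

def feature_images(points, grid=24):
    """points: (N, 2) array of normalized sample points for one stroke
    (N >= 2). Returns a (4, grid, grid) stack of orientation images."""
    pts = np.asarray(points, dtype=float)
    # per-point stroke angle from the vector to the next sample, folded to [0, 180)
    diffs = np.diff(pts, axis=0)
    angles = np.degrees(np.arctan2(diffs[:, 1], diffs[:, 0])) % 180.0
    angles = np.append(angles, angles[-1])       # reuse the last angle for the endpoint
    # map normalized coordinates to grid cells
    cells = np.clip(((pts + 2.5) / 5.0 * grid).astype(int), 0, grid - 1)
    images = np.zeros((4, grid, grid))
    for ref_i, ref in enumerate([0.0, 45.0, 90.0, 135.0]):
        # angular difference to the reference orientation, wrapped to [0, 90]
        diff = np.abs(angles - ref)
        diff = np.minimum(diff, 180.0 - diff)
        value = np.clip(1.0 - diff / 45.0, 0.0, 1.0)  # 1 at match, 0 beyond 45 degrees
        for (cx, cy), v in zip(cells, value):
            images[ref_i, cy, cx] = max(images[ref_i, cy, cx], v)  # max per cell
    return images
```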
(3)    Smoothing and Downsampling:
To make recognition of freehand sketches more tolerant to local shifts and distortions, the authors apply a Gaussian smoothing function to each image that spreads feature values to neighboring pixels.
They then downsample the images by a factor of 2 using a MAX filter, where each pixel in the downsized image is the maximum of the four corresponding pixels in the original.
(4)    Recognition:
The Image Deformation Model (IDM) allows every point in the input image to shift within a 3×3 local window to form the best match to the prototype image. To avoid overfitting, the authors include the local context around each point, shifting 3×3 image patches instead of single pixels.
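A minimal sketch of the IDM distance between two single-channel feature images, assuming zero padding at the borders:

```python
import numpy as np

def idm_distance(input_img, proto_img):
    """Sum over all pixels of the best squared 3x3 patch difference when
    each input patch may shift by up to one pixel in the prototype."""
    H, W = input_img.shape
    pad_in = np.pad(input_img, 1)
    pad_pr = np.pad(proto_img, 2)                # extra margin for the shift
    total = 0.0
    for y in range(H):
        for x in range(W):
            patch = pad_in[y:y+3, x:x+3]         # 3x3 context around (y, x)
            best = np.inf
            for dy in (-1, 0, 1):                # search the 3x3 shift window
                for dx in (-1, 0, 1):
                    cand = pad_pr[y+1+dy:y+4+dy, x+1+dx:x+4+dx]
                    best = min(best, ((patch - cand) ** 2).sum())
            total += best
    return total
```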

(5)    Performance Optimization:
Coarse Candidate Pruning:
Index the images using their first K principal components, then use the distance between these reduced feature vectors to find the nearest candidates.
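A minimal sketch of this pruning step, assuming PCA computed by SVD on the training images and a hypothetical K:

```python
import numpy as np

def pca_index(train_vecs, K=32):
    """train_vecs: (N, D) flattened feature images. Returns the mean and
    the top-K principal directions used to build the coarse index."""
    mean = train_vecs.mean(axis=0)
    _, _, vt = np.linalg.svd(train_vecs - mean, full_matrices=False)
    return mean, vt[:K]

def prune_candidates(query_vec, train_vecs, mean, components, n_keep=50):
    """Project everything to K dimensions and keep the n_keep nearest
    training examples; only these go to the expensive IDM comparison."""
    reduced_train = (train_vecs - mean) @ components.T
    reduced_query = (query_vec - mean) @ components.T
    d = np.linalg.norm(reduced_train - reduced_query, axis=1)
    return np.argsort(d)[:n_keep]
```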

Hierarchical Clustering:
Apply agglomerative hierarchical clustering to the training examples in each class, organizing them into groups based on complete-link distance. The process first places each symbol in its own cluster, then progressively merges the two nearest clusters until there is only one cluster per class. At each step, it records the two sub-clusters that were merged to form the parent.
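A minimal sketch using SciPy's agglomerative clustering with complete linkage; the returned linkage matrix records, per merge step, which two sub-clusters formed each parent:

```python
from scipy.cluster.hierarchy import linkage

def class_hierarchy(class_vecs):
    """class_vecs: (N, D) feature vectors of one class. Each row of the
    returned linkage matrix is (cluster_a, cluster_b, distance, size):
    the two sub-clusters merged to form a new parent cluster."""
    return linkage(class_vecs, method='complete')
```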


Rotational Invariance:
Generate rotated versions of the input symbol and match each against the training examples. The implementation uses 32 evenly spaced orientations from 0 to 360 degrees.


Bibliography:

http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewFile/281/918

Thursday, April 11, 2013

ShortStraw: A Simple and Effective Corner Finder for Polylines
Summarized by urjnasw xkfjjkn
This paper introduces a polyline corner-finding algorithm called ShortStraw. It is simple to implement and outperforms state-of-the-art corner finders for polylines.
The implementation is comprised of the following two basic steps:
(1)    Resampling:
Unlike traditional resampling, which uses a fixed time period to determine resampling points, ShortStraw uses a fixed distance to determine when to insert a point into the original stroke. The fixed distance is empirically set to (diagonal length of the bounding box / 40). Specifically, the resampled point set is generated by walking along the original points and emitting a new point each time the accumulated distance reaches this interspacing distance; the termination condition is i > |number of points in the original stroke|.
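A minimal sketch of that distance-based resampling, following the description above:

```python
import math

def resample(points, diagonal):
    """points: list of (x, y). Insert a new point every S units of arc
    length, where S = diagonal of the bounding box / 40."""
    S = diagonal / 40.0
    pts = [tuple(p) for p in points]    # local copy; entries are overwritten below
    resampled = [pts[0]]
    dist_acc = 0.0
    i = 1
    while i < len(pts):                 # terminates when i exceeds the input points
        (x0, y0), (x1, y1) = pts[i - 1], pts[i]
        d = math.hypot(x1 - x0, y1 - y0)
        if dist_acc + d >= S:
            t = (S - dist_acc) / d      # fraction along the current segment
            q = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            resampled.append(q)
            pts[i - 1] = q              # continue measuring from the new point
            dist_acc = 0.0
        else:
            dist_acc += d
            i += 1
    return resampled
```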
(2)    Corner Finding:
There are two approaches for corner finding:
[1] Bottom-up:
First of all, the concept of a straw should be clarified. The straw at point Pi is the Euclidean distance between points Pi+w and Pi−w, where w is the window size; empirically, w = 3 gives a higher corner recognition rate. Taking the median of the straw lengths over all resampled points gives the straw median, and 0.95 × straw median is used as a threshold. Where the straw length of a point falls below the threshold, we take the local minimum as a corner.
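A minimal sketch of the bottom-up pass, assuming the resampled points from step (1):

```python
import math

def straws(points, w=3):
    """Straw length at each interior point: distance from P[i-w] to P[i+w]."""
    n = len(points)
    s = [0.0] * n
    for i in range(w, n - w):
        (xa, ya), (xb, yb) = points[i - w], points[i + w]
        s[i] = math.hypot(xb - xa, yb - ya)
    return s

def bottom_up_corners(points, w=3):
    """Corners are local minima of straw length below 0.95 * straw median."""
    s = straws(points, w)
    interior = sorted(s[w:len(points) - w])
    t = 0.95 * interior[len(interior) // 2]      # 0.95 * straw median
    corners = [0]                                # the endpoints count as corners
    i = w
    while i < len(points) - w:
        if s[i] < t:
            # take the local minimum within this below-threshold run
            j = i
            while j < len(points) - w and s[j] < t:
                j += 1
            corners.append(min(range(i, j), key=lambda k: s[k]))
            i = j
        else:
            i += 1
    corners.append(len(points) - 1)
    return corners
```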
[2] Top-down:
Top-down is run after bottom-up. It is used to find missed corners and to reject false positives.
After running bottom-up, we already have a set of corners. For each pair of consecutive corners, we run an is-line test:
r = (Euclidean distance between the two corners) / (path length along the stroke between them)
Basically, this is the simplest line test ever. If the segment is a line, r equals 1; otherwise r is less than 1. Here, 0.95 is used as the threshold. When r between two consecutive corners is less than the threshold, we relax the straw threshold to find a missed corner around the middle half of the segment, and this process is repeated until r is larger than the threshold.
On the other hand, we also run a collinearity check on consecutive corner triples cl, cm, cn: if the segment from cl to cn passes the is-line test, cm is removed as a false positive. What should be emphasized is that all returned corners come from the resampled points obtained in the first step.
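A minimal sketch of the is-line test and the collinearity pruning; the missing-corner search with a relaxed threshold is omitted for brevity:

```python
import math

def path_length(points, a, b):
    """Arc length along the resampled points from index a to index b."""
    return sum(math.dist(points[i], points[i + 1]) for i in range(a, b))

def is_line(points, a, b, threshold=0.95):
    """A segment is a line if chord length / path length exceeds the threshold."""
    chord = math.dist(points[a], points[b])
    path = path_length(points, a, b)
    return path > 0 and chord / path > threshold

def prune_collinear(points, corners):
    """Remove corner c_m when (c_l, c_n) already passes the is-line test."""
    changed = True
    while changed:
        changed = False
        for i in range(1, len(corners) - 1):
            if is_line(points, corners[i - 1], corners[i + 1]):
                del corners[i]
                changed = True
                break
    return corners
```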
Pros and Cons for ShortStraw:
Pros: compared with other common corner-finding techniques, such as those of Sezgin and Kim, ShortStraw is quicker, easier to implement, and more accurate. It does not need any temporal information.
Cons: it is only applicable to strokes consisting solely of polylines; it cannot handle arcs, and it is not sensitive to obtuse angles.

Bibliography:

A. Wolin, B. Eoff, and T. Hammond. 2008. ShortStraw: a simple and effective corner finder for polylines. In Proceedings of the Fifth Eurographics conference on Sketch-Based Interfaces and Modeling (SBM'08), Christine Alvarado and Marie-Paule Cani (Eds.). Eurographics Association, Aire-la-Ville, Switzerland, Switzerland, 33-40.
https://diglib.eg.org/EG/DL/WS/SBM/SBM08/033-040.pdf

Tuesday, April 9, 2013

Sketch Recognition Algorithms for Comparing Complex and Unpredictable Shapes

This paper introduces Mechanix, a computer-assisted tutoring system for engineering students. It can recognize freehand sketches and provide instant feedback. Mechanix relieves the burden of grading so that instructors can assign more free-response questions as homework. Quite interesting!
First, let me clarify the concept of a "body" in this paper: any stroke or set of strokes that forms a closed shape counts as a possible "body". To correctly recognize bodies, we need the following steps:

Body Identification:

To check whether a collection of segments forms a "body" (closed shape), we pick a random starting point on one of the segments. From the other end of that segment, we find the closest endpoint of another segment in the collection. If the distance between these two endpoints is within a threshold of 9% of the total length of the strokes in the collection, we predict that the two strokes connect. This process repeats until the endpoint of the last segment connects back to the starting point of the first segment. If it does, we predict that the collection of segments is a "body".
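A minimal sketch of this closed-shape check, assuming each segment is given by its two endpoints:

```python
import math

def is_body(segments):
    """segments: list of ((x0, y0), (x1, y1)) endpoint pairs.
    Chain segments end-to-end; a 'body' closes back on its start."""
    total = sum(math.dist(a, b) for a, b in segments)
    threshold = 0.09 * total                 # 9% of the total stroke length
    remaining = list(segments)
    start, current = remaining.pop(0)
    while remaining:
        # find the segment whose nearer endpoint is closest to the chain's tip
        best = min(remaining, key=lambda s: min(math.dist(current, s[0]),
                                                math.dist(current, s[1])))
        a, b = best
        near, far = (a, b) if math.dist(current, a) <= math.dist(current, b) else (b, a)
        if math.dist(current, near) > threshold:
            return False                     # chain breaks: not a closed shape
        remaining.remove(best)
        current = far
    return math.dist(current, start) <= threshold
```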

Body Comparison:

One big difference between "body comparison" and "instance-based recognition" is that only one template is used, producing a binary "match/no match" classification.
Three key metrics determine the similarity of a body and a template: the Hausdorff distance, the modified Hausdorff distance, and the Tanimoto coefficient.

Hausdorff distance:
Let PA be the set of points in the template and PB the set of points in the "body". Writing D(a, B) = min over b in B of ||a − b||, the directed distance is h(A, B) = max over a in A of D(a, B), and the Hausdorff distance is the maximum of the two directed distances: H(A, B) = max( h(A, B), h(B, A) ).

Modified Hausdorff distance:
The modified Hausdorff distance replaces each directed maximum with a directed average: hmod(A, B) = (1/|A|) × Σ over a in A of D(a, B). The overall value is the average of the two directed averages, hmod(A, B) and hmod(B, A).
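A minimal sketch of both distances over point arrays:

```python
import numpy as np

def directed(A, B, reduce):
    """Apply `reduce` to the min-distances from each point of A to the set B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return reduce(d.min(axis=1))

def hausdorff(A, B):
    """Maximum of the two directed maxima."""
    return max(directed(A, B, np.max), directed(B, A, np.max))

def modified_hausdorff(A, B):
    """Average of the two directed averages, as described above."""
    return (directed(A, B, np.mean) + directed(B, A, np.mean)) / 2.0
```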
To determine whether the "body" matches the reference (template), the authors introduce a "match probability", in which the constant 20 is half of the bounding box size.

Tanimoto coefficient:

T(A, B) = nAB / (nA + nB − nAB)

where nAB is the number of overlapping pixels and nA, nB are the total numbers of pixels in A and B respectively.
For easier computation, we usually use a lax definition of overlap, dealing with points rather than pixels: nAB becomes the number of points of A that overlap points of B, where two points overlap if the distance between them is no more than 10% of the bounding box size.
T(A,B) represents the overall percentage of overlap between two shapes.
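A minimal sketch of the point-based Tanimoto coefficient under that lax overlap definition:

```python
import numpy as np

def tanimoto(A, B, bbox_size):
    """A, B: (N, 2) point arrays. Two points 'overlap' when they are within
    10% of the bounding box size; T = n_AB / (n_A + n_B - n_AB)."""
    tol = 0.10 * bbox_size
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n_ab = int((d.min(axis=1) <= tol).sum())     # points of A matched in B
    return n_ab / (len(A) + len(B) - n_ab)
```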

About Trusses:

Trusses are important structures for engineers because of their wide range of applications, such as bridges, airplane wings, and the supporting roofs of buildings.

Truss Identification:

We first use the PaleoSketch algorithm to preprocess each stroke and identify all of the line segments. We use all of the line segments in a diagram to build a graph with each segment as a node, and edges between segments that intersect. Once we have built the connection graph, we consider each edge AB as a potentially shared edge. We remove the AB edge from the graph and run a breadth-first search (BFS) to search for another path from A to B. If we do not find a path, then AB is not even part of a polygon, and so can’t be a shared edge. If we do find a path then AB is part of a polygon, and we remove all of the edges in that path from the graph and use BFS to search for a third path from A to B (AB being the first path). If we do not find a path, then AB is not a shared edge. If we do find a path, then the two paths we found with BFS each represent a polygon sharing the AB edge.
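A minimal sketch of this shared-edge test on the segment connection graph, assuming an adjacency-set representation:

```python
from collections import deque

def bfs_path(adj, a, b, banned):
    """BFS from a to b in graph adj, ignoring edges in `banned`.
    Returns the path as a list of nodes, or None if none exists."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            path = [b]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for v in adj[u]:
            if v not in parent and frozenset((u, v)) not in banned:
                parent[v] = u
                queue.append(v)
    return None

def is_shared_edge(adj, a, b):
    """Edge AB is shared by two polygons if, with AB removed, BFS finds a
    second A-B path that is edge-disjoint from the first one found."""
    banned = {frozenset((a, b))}
    first = bfs_path(adj, a, b, banned)
    if first is None:
        return False                          # AB is not part of any polygon
    banned |= {frozenset(e) for e in zip(first, first[1:])}
    return bfs_path(adj, a, b, banned) is not None
```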

Truss Comparison:

When comparing a student's truss to the reference solution, we consider both the sketch and the adjacency graph. It is important that the student's submission have exactly the same graph structure as the reference solution.

Bibliography:

http://dl.acm.org/citation.cfm?id=2283803&bnc=1

A Domain-Independent System for Sketch Recognition

This paper primarily introduces a domain-independent sketch recognition system, which uses direction and curvature graphs to identify primitive shapes, uses feature-area verification as a double check, and applies a post-process to produce hierarchical output.
Let's start with imprecise stroke approximation:
The initial stroke approximation is done with the help of the direction graph and the curvature graph.
These two graphs are obtained by analyzing the sampled points of the original stroke.

where dn = arctan( (y(n+1) − yn) / (x(n+1) − xn) ) is the direction of the segment from point n to point n+1, and cn is the curvature around point n. The numerator of cn is the total direction change between points (n−k) and (n+k), and the denominator is the path length between these two points.
Then Direction graph and Curvature graph can be obtained:
For example,


For the star stroke, the direction changes four times, so there are four big jumps in the direction graph.
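A minimal sketch of both graphs, assuming the curvature numerator accumulates the absolute direction changes over the window (one plausible reading of the definition above) and a hypothetical window k:

```python
import math

def direction_graph(points):
    """d_n: angle of the segment from point n to point n+1."""
    return [math.atan2(y1 - y0, x1 - x0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def curvature_graph(points, k=2):
    """c_n: direction change between points n-k and n+k divided by the
    path length between them."""
    d = direction_graph(points)
    c = [0.0] * len(points)
    for n in range(k, len(points) - k):
        change = sum(abs(d[i + 1] - d[i]) for i in range(n - k, n + k - 1))
        path = sum(math.dist(points[i], points[i + 1]) for i in range(n - k, n + k))
        c[n] = change / path if path else 0.0
    return c
```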
Feature area is used for stroke verification. It works like this: the shaded area in the figure is the difference between the original stroke and the reference shape. When the difference is very small, it indicates that the reference shape is the one intended by the original stroke.
From now on, we dive into specific cases of primitive shape detection:

For vertex detection:

When an input stroke cannot be approximated by any kind of primitive shape, we divide the stroke into two sub-strokes. The dividing point is the point with the highest curvature value in the current stroke.

For line detection:

Generally, there are two ways to detect a line. The first is to check whether the direction curve can be fitted with a horizontal line in the direction graph. The second is to check whether the actual coordinates of the stroke can be fitted by a line. Feature-area verification is then used as a double check, as shown in Figure 3(a).

For curve detection:

If a line with non-zero slope appears in the direction graph, the stroke is likely to be a curve. If the slope is close to 2π/N (N being the number of points), the stroke is likely to be a circle. In that case we hypothesize a candidate circle whose center is the center of the stroke's bounding box and whose radius is the mean distance between that center and each stroke point. If the difference between the stroke and the candidate circle is within a reasonable threshold, we identify the stroke as a circle.
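A minimal sketch of the circle hypothesis, with a hypothetical relative error threshold standing in for the paper's feature-area test:

```python
import math

def fit_circle(points, tol=0.15):
    """Hypothesize a circle: center = bounding-box center, radius = mean
    distance from the center to the stroke points. Accept if the mean
    deviation from that radius is small relative to the radius."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    cx, cy = (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    r = sum(dists) / len(dists)
    err = sum(abs(d - r) for d in dists) / len(dists)
    return (cx, cy, r) if err / r < tol else None
```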

For arc detection:

The algorithm for arc recognition is devised as follows:
(1) Check whether the direction curve can be approximated with a line, and abort if it fails.
(2) Find the perpendicular bisector, denoted M, of the chord between the first and last stroke points; calculate the intersection point A between line M and the stroke path; then find the circle on which the first and last stroke points, as well as point A, reside. The circle's center is taken as the center of the candidate arc.
(3) Take the mean distance from each stroke point to the center as the radius.
(4) Compute the ratio of the feature area between the stroke and the candidate arc to the arc's standard area. Check whether the ratio is less than a predefined threshold, and abort if it fails.
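A minimal sketch of steps (2) and (3); as a simplification, point A is taken as the stroke point closest to the chord midpoint rather than the exact bisector intersection:

```python
import math

def circumcenter(p, q, r):
    """Center of the circle through three points (None if collinear)."""
    ax, ay = p; bx, by = q; cx, cy = r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy)

def candidate_arc(points):
    """Fit the candidate arc: center from the circle through the two
    endpoints and point A, radius as the mean distance to that center."""
    first, last = points[0], points[-1]
    mid = ((first[0] + last[0]) / 2.0, (first[1] + last[1]) / 2.0)
    a = min(points[1:-1], key=lambda p: math.dist(p, mid))  # approximate point A
    center = circumcenter(first, a, last)
    if center is None:
        return None
    radius = sum(math.dist(p, center) for p in points) / len(points)
    return center, radius
```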
Handling Self-Intersecting Strokes:
Given a stroke that cannot be approximated by any of our predefined primitive shapes: if it has no self-intersection point, follow the original procedure; otherwise, make two copies of the stroke, divide them using different strategies (at the self-intersection points versus at the point of maximum curvature), compare the respective recognition results, and choose the better one.

Post-Process:

In this stage, we perform simple relation retrieval, cleanup, and basic object recognition. Once everything is done, we can output the sketch recognition result.

Bibliography:

http://dl.acm.org/citation.cfm?id=604499


Wednesday, March 27, 2013

A few things to learn about machine learning

(1)  Learning = Representation + Evaluation + Optimization
Representation: a classifier must be represented in some formal language that the computer can handle. The hypothesis space is the set of classifiers that a learner can learn.
Evaluation: an objective function used to distinguish good classifiers from bad ones.
Optimization: a method to search among the classifiers in the language for the highest-scoring one.
(2)  It’s generalization that counts
The fundamental goal of machine learning is to generalize beyond the examples in the training set. It is a huge mistake to train and test on the same data, but holding data out reduces the amount available for training. One solution to this contradiction is cross-validation: randomly divide your training data into (say) 10 subsets, hold out each one while training on the rest, test each learned classifier on the examples it did not see, and average the results to see how well the particular parameter setting does.
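As a concrete illustration, a minimal sketch of 10-fold cross-validation, assuming hypothetical train() and score() functions supplied by the caller:

```python
import random

def cross_validate(data, train, score, k=10, seed=0):
    """data: list of labeled examples. train(examples) -> model,
    score(model, examples) -> accuracy. Returns the mean held-out score."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, f in enumerate(folds) if j != i for ex in f]
        scores.append(score(train(training), held_out))
    return sum(scores) / k
```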
(3)  Data alone is not enough
Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This is called induction: it turns a small amount of input knowledge into a large amount of output knowledge.
A corollary is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it.
(4)  Overfitting has many faces
Overfitting occurs when we try to determine a classifier with insufficient knowledge and data.
One way to understand overfitting is to decompose generalization error into bias and variance. Bias is a learner's tendency to consistently learn the same wrong thing; variance is the tendency to learn random things irrespective of the real signal.
(5)  Intuition Fails in high Dimensions
The next big problem after overfitting is the curse of dimensionality: generalizing gets harder as the number of dimensions grows. If you have 100 boolean dimensions and a trillion training samples, you cover only about 10^-18 of the possible combinations of the input space. That's tiny! And more dimensions often mean more noise.
(6)  Theoretical Guarantees are not What they seem
In machine learning, the most common theoretical guarantee is a bound on the number of examples needed to ensure good generalization. One of the major developments of recent decades has been the realization that we can in fact have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.
(7)  Feature engineering is the Key
The most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class.
(8)  A dumb algorithm with lots and lots of data beats a clever one with modest amounts of it.
Having a sufficient number of examples often matters more than the cleverness of the algorithm.
(9)  Learn many models, not Just one
The best learner varies from application to application. If instead of selecting the best variation found we combine many variations, the results are better, often much better, and at little extra effort for the user. Bagging is an example of this, in which "we simply generate random variations on the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias."
(10)  Simplicity Does not imply Accuracy
Occam's razor, applied to machine learning, would state that given two classifiers with the same training error, the simpler one has the lower test error. However, the "no free lunch" theorems imply this cannot be true in general. Simplicity is a preference in its own right, not a guarantee of accuracy.
(11)  Just because a function can be represented does not mean it can be learned
Don't be lured in by claims that every problem can be represented by a certain algorithm. A function may be representable but not learnable given your training set.
(12)  Correlation Does Not Imply Causation
Correlation is potentially causation, but it should not be stated as such without proof.

Bibliography:
Pedro Domingos. A Few Useful Things to Know About Machine Learning. Communications of the ACM, Volume 55, Issue 10 (October 2012), pages 78-87.

Sunday, March 24, 2013

The Word-Gesture Keyboard: Reimagining Keyboard Interaction


Summary:
This paper primarily talks about the Word-Gesture Keyboard, a creative keyboard-based text input method. On average, users consider it preferable, more fun, and less physically (though more visually) demanding. It is designed with the following three aims in mind:
(1)        Fast input speed
(2)        Minimal recognition load on new users
(3)        Improved efficiency with users getting more familiar with Word-Gesture Keyboard
The Word-Gesture Keyboard allows the user to write each and every word via a word gesture. Here, a "word" is not limited to the lexicon; it can be any token defined by an arbitrary string of characters, such as "Gmail".
Word-Gesture Keyboard Feasibilities:
One big problem for a word-gesture keyboard is that most word gestures run across letters that are not part of the intended word. Fortunately, this can be solved with the statistical regularities of natural language: some character sequences are much more likely than others, and most simply don't exist as legitimate words. This implies that all valid letter combinations form a finite set that can be captured in a language model. With this insight, all possible words can be represented geometrically on a given keyboard layout as word gestures and matched against users' input gestures. As mentioned above, "word" here has a generalized meaning: it can be a rare name, jargon, an email address, a password, etc. We call these out-of-vocabulary (OOV) words. In a word-gesture keyboard, OOV letter sequences can always be entered by typing the individual letter keys. If these OOV sequences are frequently used, they may be added to the system's list of recognized words, either manually or automatically. An additional part of the solution is N-best suggestions: when a "tip"/"top" style conflict occurs, it can be resolved by manual selection from the alternative N-best suggestions, or automatically according to word context.
Word-Gesture Keyboard Efficiency:
One continuous movement: in comparison to tapping-based touchscreen keyboards, gesture keyboards do not require up and down movements for each letter, so they are faster. The speed advantage of single-stroke word gesture input can also be understood in motor control modeling terms: tapping the individual letters of a word can be viewed as a sequence of discrete target-pointing tasks, each modeled by Fitts' law:
t(k,k+1) = a + b × log2( D(k,k+1)/S + 1 )
Here t(k,k+1) is the time from tapping the kth letter (key) to the (k+1)th letter in the word; D(k,k+1) is the movement distance from the kth letter to the (k+1)th letter; and S is the size of the target key. a and b are the two constants of Fitts' law. The logarithmic term, ID = log2(D/S + 1), is called Fitts' index of difficulty, measured in bits.
The experts' conclusion is that a goal-crossing task is faster than tapping on same-sized targets as long as ID is less than 4 bits. Here, a "goal" means a letter key needed in a word.
Auto word ending and spacing: each time a user lifts the finger from the touch surface, a word and a space are entered. Not having to enter a space character after each word is another efficiency advantage of a gesture keyboard.
Error-tolerance: Error tolerance allows the user to cut corners, to be inaccurate but fast.
One-finger operation: this is the only aspect in which the Word-Gesture Keyboard is not as good as two-handed typing. This is particularly true on the conventional QWERTY layout, where consecutive letters of a word tend to alternate between the left and right sides of the keyboard. With two-handed typing, when one hand strikes a letter the other hand can, to some degree, move towards the next letter in parallel.
Word-Gesture Keyboard Ease of Use:
First, typing on a keyboard is a familiar text input method to most, if not all computer and smartphone users.
Second, drawing or doodling is a fun and easy action that even children enjoy doing.
Third, the user does not have to have learned any gestures before using a word-gesture keyboard.
Word-Gesture Keyboard Ease vs. Efficiency:
The two types of behavior are two ends of a continuum. Our main behavioral theory of word-gesture keyboards is that their use automatically shifts from the ease end (visual tracing) to the efficient end (recall gesturing).
Importantly, we do not expect users to gesture every word without looking at the keyboard. Due to the Zipf's law effect, a small number of words are used disproportionately frequently, and their stroke patterns are memorized early. Longer and less common words are typically made of common fragments whose shapes can be quickly remembered. An important word-gesture keyboard property is that it does not force the user into either "mode": the user gradually progresses from the easy end to the more efficient end in use. In this sense, a word-gesture keyboard is a "progressive user interface."
Word-Gesture Keyboard Gesture Recognition:
Recognition searches for the word W maximizing the posterior P(W|G) = P(G|W) × P(W) / P(G), where P(G|W) is the likelihood of W's word gesture matching the user's input gesture G, and P(W) reflects the system's estimate of the prior probability that W is the user's intended word. The denominator P(G) depends only on the user's gesture and is invariant during the search.
The search for the user's intended word is thus driven by the product of two model estimates: P(G|W) reflects the gestural model and P(W) reflects the language model.
To estimate P(G|W), various techniques, such as dynamic time warping and template matching, are used to compute gesture shape similarities.
Word-Gesture Keyboard Two Novel Functions:
(1)        Command Strokes:
With our systems, the user may issue commands (such as “Copy” and “Paste”) by tracing out the command names on the keyboard starting from a designated key (e.g. a Cmd key). The system suggests the command effect as soon as the command stroke is unambiguous.
(2)        Case Key:
We introduced a new key on the keyboard, the Case key (see the lower left corner of Figure 1). This key cycles through the different word-case alternatives for the word just entered or preceding the text caret. The Case key uses dictionary information to intelligently support nonstandard casing conventions for some words, such as "iPhone". Since the Case key modifies the word preceding the current text caret position ("reverse Polish"), it enables users to perform case corrections after the word is entered and only when they are actually needed.
Bibliography:
Zhai, Shumin, and Per Ola Kristensson. "The Word-gesture Keyboard: Reimagining Keyboard Interaction." Communications of the ACM 55.9 (2012): 91-101. ACM Digital Library. Web. 23 Mar. 2013. <http://dl.acm.org/citation.cfm?id=2330689>.

This blog content was created by urjnasw xkfjjkn (Xu Yan) on March 23, 2013.

Saturday, March 2, 2013

Sketch Based Interfaces: Early Processing for Sketch Understanding
Introduction:
This paper starts from the idea of using the powerful processing capability of a computer to realize an easy and flexible UI for sketch recognition. In other words, the interface should make users feel as if they were drawing on paper. The first problem in realizing this goal is converting digitized pen strokes into the intended geometric objects. This is done by interpreting the pixels of the user's sketch through a three-step process.

Three-step processing:
(1) Approximation:
Aim: match the most basic geometric primitives (lines and curves) to a given set of pixels.
Vertex detection:
Mark corners of sketch with vertices.
(1.1) How do we determine the vertices of a sketch?
---Curvature: the change in direction with respect to arc length.
Points where the absolute value of curvature is at a maximum are vertices.
In Figure 1, the points in the red circles are vertices.
---Speed:
Points where the speed is at a minimum are vertices.
In Figure 1, the points in the green circles are vertices.


Figure 1 Direction, Curvature and Speed graph for Stroke
To avoid the bad influence of noise in the data, the authors introduce average-based filtering. In Figures 2 and 3, the red line is the threshold: we look for curvature maxima only where the curvature is already high, and speed minima only where the speed is already low. Thresholds are selected empirically; for the curvature graph it is the mean value, while for the speed graph it is 90% of the mean value.
Figure 2 Speed Graph for Square with threshold
Figure 3 Curvature Graph for Square with threshold
(1.2) Generating Hybrid Fits:
Computing vertex certainties:
The certainty metric for a curvature candidate vertex vi is the scaled magnitude of the curvature in a local neighborhood around the point, computed as |di−k − di+k|/L, where L is the curve length between points Si−k and Si+k and k is a small integer defining the neighborhood size around vi. The certainty metric for a speed-fit candidate vertex vi is a measure of the pen slowdown at the point, 1 − vi/vmax, where vmax is the maximum pen speed in the stroke. The certainty values are normalized to [0, 1].
Generating a set of hybrid fits:
The initial hybrid fit H0 is the intersection of Fd (the curvature fit) and Fs (the speed fit). A succession of additional fits is then generated by appending to Hi the highest scoring curvature and speed candidates not already in Hi. To do this, on each cycle we create two new fits: H'i = Hi + vs (Hi augmented with the best remaining speed candidate) and H''i = Hi + vd (Hi augmented with the best remaining curvature candidate).
Selecting the best fit:
We set an upper bound on the error and designate as our final fit Hf the Hi with the fewest vertices that also has an error below the threshold.
Handling curves:
Approximate each curved region with a Bezier curve, defined by two end points and two control points. Let u = Si and v = Sj, i < j, be the end points of the part of S to be approximated with a curve. We compute the control points as c1 = k·t1 + u and c2 = k·t2 + v, with k = (1/3) × Σ |Sm+1 − Sm| summed for m = i..j−1.
Figure 4 Control points calculation formula
Here t1 and t2 are the unit-length tangent vectors pointing inwards at the ends of the curve segment to be approximated. The 1/3 factor in k controls how much we scale t1 and t2 in order to reach the control points; the summation is the arc length of the stroke between Si and Sj.
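A minimal sketch of this fit, assuming the inward tangents are estimated from the neighboring resampled points (the paper computes them from the fitted line segments):

```python
import numpy as np

def bezier_fit(points, i, j):
    """Fit a cubic Bezier to stroke points S_i..S_j using the control-point
    formula described above. Returns (u, c1, c2, v)."""
    s = np.asarray(points, dtype=float)
    u, v = s[i], s[j]
    # k = one third of the arc length between S_i and S_j
    seg = np.linalg.norm(np.diff(s[i:j+1], axis=0), axis=1)
    k = seg.sum() / 3.0
    t1 = s[i+1] - s[i]; t1 /= np.linalg.norm(t1)   # unit tangent at u, pointing inwards
    t2 = s[j-1] - s[j]; t2 /= np.linalg.norm(t2)   # unit tangent at v, pointing inwards
    c1 = u + k * t1
    c2 = v + k * t2
    return u, c1, c2, v
```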
(2) Beautification:
Modifies the output of the approximation layer to make it visually more appealing without changing its meaning, and secondarily aids the third phase, basic recognition.
Figure 5 The effect of Beautification
(3) Basic Recognition:
Recognition of the most basic objects that can be built from the line and curve segments produced so far, e.g. simple geometric objects (ovals, circles, rectangles, and squares).

Bibliography:


Tevfik Metin Sezgin, Thomas Stahovich, and Randall Davis. Sketch Based Interfaces: Early Processing for Sketch Understanding. In Proceedings of the 2001 Workshop on Perceptive User Interfaces (PUI '01), pages 1-8.
http://faculty.cs.tamu.edu/hammond/courses/SR/papers/Sezgin/Sezgin2001Sketchbased.pdf


Thursday, February 28, 2013

urjnasw xkfjjkn's new blog on Protractor: A Fast and Accurate Gesture Recognizer


Protractor: A Fast and Accurate Gesture Recognizer
urjnasw xkfjjkn's extract on Protractor Paper
Protractor is faster and more accurate than peer recognizers because it employs a novel method to measure the similarity between gestures: it calculates the minimum angular distance between them with a closed-form solution. Its lower memory demand and higher speed make it especially suitable for mobile computing.

What is a template-based recognizer? What are its pros and cons?
---In a template-based recognizer, training samples are stored as templates, and at runtime an unknown gesture is compared against these templates.
These recognizers are purely data-driven; they do not assume a distribution model that the target gestures have to fit. As a result, they can easily be customized for different domains or users, as long as training samples for the domain or user are provided.
Since a template-based recognizer needs to compare an unknown gesture with all stored templates to make a prediction, it can be both time and space consuming, especially on mobile devices with limited processing power and memory. Protractor, however, is a special case.

How does Protractor work?
(1)     Protractor first resamples a gesture into a fixed number N of equidistantly spaced points, using the procedure described previously for the $1 recognizer, and translates them so that their centroid becomes (0, 0). This step removes variation in drawing speed and screen location.
(2)     Next, Protractor reduces noise in gesture orientation.
When Protractor is specified to be orientation invariant, it rotates the resampled gesture around its centroid by its indicative angle, which is defined as the direction from the centroid to the first point of the resampled gesture.
When Protractor is specified to be orientation sensitive, it employs a different procedure to remove orientation noise: it aligns the indicative orientation of the gesture with the one of eight base orientations that requires the least rotation. Since Protractor is data-driven, it can still become orientation invariant even when specified to be orientation sensitive, e.g. if the user provides gesture samples in each direction for the same category.

Based on the above process, we acquire an equal-length vector of the form (x1, y1, x2, y2, ..., xN, yN) for each gesture. Note that Protractor does not rescale the resampled points to fit a square, as the $1 recognizer does, because rescaling narrow gestures to a square would seriously distort them and amplify the noise in their trajectories.
(3)     Classification by Calculating Optimal Angular Distances
For each pairwise comparison between a gesture template t and the unknown gesture g, Protractor uses the inverse cosine distance between their vectors, vt and vg, as the similarity score S of t to g:

S(t, g) = 1 / arccos( (vt · vg) / (|vt| |vg|) )

From this, we can see Protractor is inherently scale invariant because the gesture size, reflected in the magnitude of the vector, becomes irrelevant to the distance.
Since the indicative angle is only an approximate measure of a gesture’s orientation, the alignment in the preprocessing cannot completely remove the noise in gesture orientation. This can lead to an imprecise measure of similarity and hence an incorrect prediction. To address this issue, at runtime, Protractor rotates a template by an extra amount so that it results in a minimum angular distance with the unknown gesture and better reflects their similarity.
Protractor employs a closed-form solution to find a rotation that leads to the minimum angular distance.

Since we intend to rotate a preprocessed template gesture t by a hypothetical amount so that the resulting angular distance is the minimum (i.e., the similarity reaches its maximum), we formalize this intuition as finding the rotation angle that minimizes the angular distance; this has a closed-form solution.
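A minimal sketch of that closed-form solution, where a and b are the paired dot products of the two flattened gesture vectors:

```python
import math

def optimal_cosine_distance(vt, vg):
    """Closed-form minimum angular distance between template vector vt and
    gesture vector vg, both flattened as (x1, y1, ..., xN, yN)."""
    a = sum(vt[i] * vg[i] + vt[i+1] * vg[i+1] for i in range(0, len(vt), 2))
    b = sum(vt[i] * vg[i+1] - vt[i+1] * vg[i] for i in range(0, len(vt), 2))
    angle = math.atan2(b, a)                        # rotation minimizing the distance
    nt = math.sqrt(sum(x * x for x in vt))
    ng = math.sqrt(sum(x * x for x in vg))
    cos_sim = (a * math.cos(angle) + b * math.sin(angle)) / (nt * ng)
    return math.acos(max(-1.0, min(1.0, cos_sim)))  # angular distance in radians

# similarity score: S = 1 / distance (larger means more similar)
```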
Evaluation:
Protractor is significantly faster than the $1 recognizer, although the time needed to recognize a gesture still increases linearly with the number of stored templates.
As the training size increases, Protractor performs significantly more accurately than the $1 recognizer on this data set.
Protractor uses N = 16, while the $1 paper suggests good results with 32 <= N <= 256; as a result, Protractor uses 1/4 of the space required by the $1 recognizer. It would be interesting to see how the closed-form solution helped in decreasing N while still providing good recognition results.

Bibliography:
Yang Li. Protractor: A Fast and Accurate Gesture Recognizer. In Proceedings of CHI 2010. (Yang Li works at Google and has done some amazing work in the area of HCI.)