Monday 23 December 2013

Receiver operating characteristic (ROC) curve

The ROC curve is a plot of the true positive rate (TP / total actual positives, TP / (TP + FN)) vs. the false positive rate (FP / total actual negatives, FP / (TN + FP)). It shows the performance of a classifier as a trade-off between specificity (the true negative rate, TN / (TN + FP)) and sensitivity (the recall rate, or true positive rate).

false positive rate = 1 - true negative rate


Choosing the operating point

Let 
alpha = cost of false positive 
beta = cost of missing a positive (false negative)
p = proportion of positive cases

The average expected cost of classification at point x, y in the ROC space is 
C = (1-p) alpha x + p beta (1-y)

The error rate is obtained by setting both misclassification costs to unity (alpha = beta = 1). So
error rate = (1 - p) x + p (1 - y)

This equation means that points in ROC space with equal error rate lie on straight lines (iso-cost lines).

The equal error rate (EER) is at the point where the false positive rate equals the false negative rate.
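The cost and error-rate formulas above can be sketched in Python (the function names and confusion-matrix counts are mine, just for illustration):

```python
def roc_point(tp, fp, tn, fn):
    """Map confusion-matrix counts to a point (x, y) in ROC space."""
    tpr = tp / (tp + fn)  # sensitivity / recall, the y coordinate
    fpr = fp / (fp + tn)  # 1 - specificity, the x coordinate
    return fpr, tpr

def expected_cost(x, y, p, alpha, beta):
    """Average cost at ROC point (x, y): C = (1-p)*alpha*x + p*beta*(1-y)."""
    return (1 - p) * alpha * x + p * beta * (1 - y)

x, y = roc_point(tp=80, fp=10, tn=90, fn=20)
# With alpha = beta = 1 this reduces to the error rate (1-p)x + p(1-y):
err = expected_cost(x, y, p=0.5, alpha=1.0, beta=1.0)  # 30 errors out of 200 = 0.15
```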



Sunday 15 December 2013

Sydney trip

Money matters

  • Withdrawal from a Westpac ATM (BoA affiliated) has a foreign transaction fee of 3%. There is no additional ATM withdrawal fee.
  • Use credit cards with no foreign transaction fees as often as possible.

Saturday 26 October 2013

Underscores or dashes

Both underscores and dashes can be used to connect words in a file system. Using dashes may make the names look nicer, but there are reasons to use underscores:

  • In Python, files must be named with underscores (not dashes) to be importable, since module names must be valid identifiers.
  • Many languages have the convention to name files with underscores: Python, Ruby, C++.
For coding, only underscores can be used in variable names, because dashes would be parsed as the minus sign. In Ruby, you can't use dashes in symbols either. So for consistency, just use underscores in programs too, even in string keys.
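A quick Python illustration of why dashes can't appear in identifiers (the names are just for demonstration):

```python
file_count = 3      # underscores are fine in identifiers
# file-count = 3    # SyntaxError: the dash is parsed as subtraction

# The same rule is what makes dashed module names unimportable:
assert "my_module".isidentifier()
assert not "my-module".isidentifier()
```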

For URL naming, it is a different story. If you have an underscore in the URL, Google will combine the two words. So bla.com/kw1_kw2.html wouldn't show up by itself for kw1 or kw2; you would have to search for kw1_kw2 to bring up that page.

Wednesday 23 October 2013

C# coding conventions

Capitalization
http://msdn.microsoft.com/en-us/library/ms229043.aspx

Accessibility modifiers
Default accessibility for class members is private, so it can be omitted for succinctness.

Classes and structs that are declared directly within a namespace (in other words, that are not nested within other classes or structs) can be either public or internal. Internal is the default if no access modifier is specified.

Private means the type or member can be accessed only by code in the same class or struct.
Internal means the type or member can be accessed by any code in the same assembly, but not from another assembly.

IDisposable interface
The primary use of this interface is to release unmanaged resources. The garbage collector automatically releases the memory allocated to a managed object when that object is no longer used. However, it is not possible to predict when garbage collection will occur. Furthermore, the garbage collector has no knowledge of unmanaged resources such as window handles, or open files and streams.

Use the Dispose method of this interface to explicitly release unmanaged resources in conjunction with the garbage collector. The consumer of an object can call this method when the object is no longer needed.

As a rule, when you use an IDisposable object, you should declare and instantiate it in a using statement. The using statement calls the Dispose method on the object in the correct way, and (when you use it as shown earlier) it also causes the object itself to go out of scope as soon as Dispose is called. Within the using block, the object is read-only and cannot be modified or reassigned.
The using statement ensures that Dispose is called even if an exception occurs while you are calling methods on the object. You can achieve the same result by putting the object inside a try block and then calling Dispose in a finally block; in fact, this is how the using statement is translated by the compiler. 

Thursday 17 October 2013

Covariance matrix



If X is a random vector
  \mathbf{X} = \begin{bmatrix}X_1 \\  \vdots \\ X_n \end{bmatrix},
the covariance matrix is


\Sigma
= \begin{bmatrix}
 \mathrm{E}[(X_1 - \mu_1)(X_1 - \mu_1)] & \mathrm{E}[(X_1 - \mu_1)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_1 - \mu_1)(X_n - \mu_n)] \\ \\
 \mathrm{E}[(X_2 - \mu_2)(X_1 - \mu_1)] & \mathrm{E}[(X_2 - \mu_2)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_2 - \mu_2)(X_n - \mu_n)] \\ \\
 \vdots & \vdots & \ddots & \vdots \\ \\
 \mathrm{E}[(X_n - \mu_n)(X_1 - \mu_1)] & \mathrm{E}[(X_n - \mu_n)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_n - \mu_n)(X_n - \mu_n)]
\end{bmatrix}.


Det(\(\Sigma\)) = \(\prod_{i=1}^n\lambda_i \geq 0\) where \(\lambda_i\)'s are eigenvalues of \(\Sigma\).

Since \(\Sigma\) is symmetric and positive semidefinite, it can be diagonalized; its eigenvalues are all real and nonnegative, and its eigenvectors can be chosen orthonormal.
\begin{align}
\det(\Sigma) = \det(V\Lambda V^T) = \det(V)\cdot \det(\Lambda) \cdot \det(V^T) = \det(\Lambda)
\end{align}

\(\det(V) = \pm 1\) because \(\det(VV^{-1}) = \det(V)\det(V^{-1}) = \det(V)\det(V^T) = \det(V)^2 = 1\)
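These properties are easy to check numerically; a small NumPy sketch (the sample size, dimension, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))    # 500 samples of a 3-d random vector
sigma = np.cov(X, rowvar=False)      # 3x3 sample covariance matrix

eigvals = np.linalg.eigvalsh(sigma)  # real, since sigma is symmetric
# det(Sigma) is the product of the eigenvalues, and all of them are >= 0:
assert np.isclose(np.linalg.det(sigma), np.prod(eigvals))
assert np.all(eigvals >= 0)
```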

References:
http://www.ece.unm.edu/faculty/bsanthan/EECE-541/covar.pdf

Wednesday 16 October 2013

Header file and source file in C++

Header files

The compiler doesn't compile header files since these are meant to be included into source files.

Even though one can put declarations and definitions for a class in the same file, there are good reasons to separate them, and sometimes there are also good reasons to put them together.

Member functions are implicitly inline when they are defined inside their class.

Why separate them
  • The definition may change more often than the declaration. If you put them together and this file is included in another source file, any change to the definition will cause the dependent files to be recompiled.

Monday 14 October 2013

Rvalue reference

Notes from the following sources:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2027.html
http://thbecker.net/articles/rvalue_references/section_01.html

An rvalue reference is formed by placing an && after some type.
A a;
A&  a_ref1 = a;    // an lvalue reference
A&& a_ref2 = A();  // an rvalue reference, bound to a temporary
An rvalue reference behaves just like an lvalue reference except that it can bind to a temporary (an rvalue), whereas you cannot bind a (non-const) lvalue reference to an rvalue. Note that under the final C++11 rules an rvalue reference cannot bind directly to an lvalue either (A&& r = a; is an error); use std::move(a) to convert.

A&  a_ref3 = A();  // Error!
A&& a_ref4 = A();  // Ok

An lvalue is an expression e that may appear on the left or on the right hand side of an assignment, whereas an rvalue is an expression that can only appear on the right hand side of an assignment. For example,
  int a = 42;
  int b = 43;

  // a and b are both l-values:
  a = b; // ok
  b = a; // ok
  a = a * b; // ok

  // a * b is an rvalue:
  int c = a * b; // ok, rvalue on right hand side of assignment
  a * b = 42; // error, rvalue on left hand side of assignment

What we get from rvalue references is more general and better performing libraries.

Move Semantics

Eliminating spurious copies

Copying can be expensive. For example, for std::vectors, v2=v1 typically involves a function call, a memory allocation, and a loop. This is of course acceptable where we actually need two copies of a vector, but in many cases, we don't: We often copy a vector from one place to another, just to proceed to overwrite the old copy. Consider:
template <class T> void swap(T& a, T& b)
{
    T tmp(a);   // now we have two copies of a
    a = b;      // now we have two copies of b
    b = tmp;    // now we have two copies of tmp (aka a)
}
But, we didn't want to have any copies of a or b, we just wanted to swap them. Let's try again:
template <class T> void swap(T& a, T& b)
{
    T tmp(std::move(a));
    a = std::move(b);   
    b = std::move(tmp);
}
This move() gives its target the value of its argument, but is not obliged to preserve the value of its source. So, for a vector, move() could reasonably be expected to leave its argument as a zero-capacity vector to avoid having to copy all the elements. In other words, move is a potentially destructive read.
In this particular case, we could have optimized swap by a specialization. However, we can't specialize every function that copies a large object just before it deletes or overwrites it. That would be unmanageable.
The first task of rvalue references is to allow us to implement move() without verbosity, or runtime overhead.

move

The move function really does very little work. All move does is accept either an lvalue or rvalue argument, and return it as an rvalue without triggering a copy construction:
template <class T>
typename remove_reference<T>::type&&
move(T&& a)
{
    // Inside the function body a is an lvalue, so an explicit cast
    // is needed to return it as an rvalue reference.
    return static_cast<typename remove_reference<T>::type&&>(a);
}
The functions that accept rvalue reference parameters (including move constructors, move assignment operators, and regular member functions such as std::vector::push_back) are selected, by overload resolution, when called with rvalue arguments (either prvalues such as a temporary object, or xvalues such as the one produced by std::move). If the argument identifies a resource-owning object, these overloads have the option, but aren't required, to move any resources held by the argument. For example, a move constructor of a linked list might copy the pointer to the head of the list and store NULL in the argument instead of allocating and copying individual nodes.

It is now up to client code to overload key functions on whether their argument is an lvalue or rvalue (e.g. copy constructor and assignment operator). When the argument is an lvalue, the argument must be copied from. When it is an rvalue, it can safely be moved from.

Return by value

Be honest: how does the following code make you feel?
std::vector<std::string> get_names();
…
std::vector<std::string> const names = get_names();
Frankly, even though I should know better, it makes me nervous. In principle, when get_names() returns, we have to copy a vector of strings. Then, we need to copy it again when we initialize names, and we need to destroy the first copy. If there are N strings in the vector, each copy could require as many as N+1 memory allocations and a whole slew of cache-unfriendly data accesses as the string contents are copied.
Rather than confront that sort of anxiety, I’ve often fallen back on pass-by-reference to avoid needless copies:
get_names(std::vector<std::string>& out_param );
…
std::vector<std::string> names;
get_names( names );
Unfortunately, this approach is far from ideal.
  • The code grew by 150%
  • We’ve had to drop const-ness because we’re mutating names.
  • As functional programmers like to remind us, mutation makes code more complex to reason about by undermining referential transparency and equational reasoning.
  • We no longer have strict value semantics for names.

Copy Elision and the RVO

The reason I kept writing above that copies were made “in principle” is that the compiler is actually allowed to perform some optimizations based on the same principles we’ve just discussed. This class of optimizations is known formally as copy elision. For example, in the Return Value Optimization (RVO), the calling function allocates space for the return value on its stack, and passes the address of that memory to the callee. The callee can then construct a return value directly into that space, which eliminates the need to copy from inside to outside. The copy is simply elided, or “edited out,” by the compiler. 
Also, although the compiler is normally required to make a copy when a function parameter is passed by value (so modifications to the parameter inside the function can’t affect the caller), it is allowed to elide the copy, and simply use the source object itself, when the source is an rvalue.
Guideline: Don’t copy your function arguments. Instead, pass them by value and let the compiler do the copying. But if the function makes no copy of the argument, it is better to pass by const reference.

Saturday 21 September 2013

CUDA programming

Coalescing: compute capability >= 1.2

  • Possible bus transaction sizes: 32B, 64B, or 128B
  • Memory segment must be aligned: first address = multiple of segment size

Add CUDA programming support in Visual Studio

Sunday 8 September 2013

String matching algorithms

Rabin-Karp algorithm

This is an application of rolling hash. The expected running time is O(n + m) where m is the length of the pattern p[0...m - 1] and n is the length of the source text.

Horspool

The Horspool algorithm is a simplification of the Boyer-Moore algorithm. Like the Knuth-Morris-Pratt algorithm, these algorithms are based on the idea that, using information from the pattern and the text, we can shift the position of comparison by more than 1. They all require precomputing a shift table.

The shift table in the Horspool algorithm is called the bad character shift table. For each character c in p[0...m-2], bad_char_skip[c] = m - 1 - (position of the last occurrence of c in p[0...m-2]). When a character that does not occur in the pattern is encountered, we can safely skip ahead by the whole length of the pattern.

The algorithm is as follows:
  1. Align the pattern p with the text and move from right to left, comparing each character in p[0...m-1] with the corresponding character in t[i...i+m-1].
  2. If the left end of p is reached, return the position in t that aligns with the leftmost character of p.
  3. If we get a mismatch before reaching the left end of p, shift right. The number of positions to shift is based on t[i+m-1], i.e., bad_char_skip[t[i+m-1]].
Complexity: it has an average-case complexity of O(n) on random text, although it is O(mn) in the worst case.
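The steps above can be transcribed into a compact Python function (a sketch, not from any reference implementation):

```python
def horspool(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # Bad character shift table over p[0...m-2]:
    # skip[c] = m - 1 - (position of last occurrence of c in p[0...m-2])
    skip = {c: m - 1 - i for i, c in enumerate(pattern[:m - 1])}
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:  # compare right to left
            j -= 1
        if j < 0:
            return i  # reached the left end of the pattern: match at i
        # Shift based on the text character aligned with the pattern's last
        # position; characters not in the pattern allow a full-length skip.
        i += skip.get(text[i + m - 1], m)
    return -1
```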

Z algorithms

Hashing

Hash table is implemented as an array that allows random access. There are two common ways to deal with key collisions:
  • Chaining: colliding elements are put in a linked list in each slot of the table. This increases the total size of the data structure. The expected running time for search is \(\Theta(1 + \alpha)\) where \(\alpha\) = (# keys stored in the table) / (# slots in the table), i.e., the load factor.
  • Open addressing: the hash function takes two arguments, the key k and the trial number i. The hash function has the property that if we keep trying h(k, i), we will eventually hit every slot of the table. An example of such a hash function is double hashing: \(h(k, i) = (h_1(k) + i\cdot h_2(k)) \mod m\)
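A toy double-hashing probe sequence (my own sketch; when the table size m is prime, \(h_2(k)\) is coprime with m, so the sequence visits every slot):

```python
def probe_sequence(key, m):
    """Yield the open-addressing probe sequence h(k, i) = (h1 + i*h2) mod m."""
    h1 = hash(key) % m
    h2 = 1 + hash(key) % (m - 1)  # never 0, so successive probes always move
    for i in range(m):
        yield (h1 + i * h2) % m

# With m = 7 (prime), the first m probes cover all slots of the table:
slots = list(probe_sequence("some key", 7))
```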

Monday 26 August 2013

Multithreaded programming

In multithreaded programming, a thread is sometimes required to block until some condition is true. In thread systems predating Java, this mechanism is generally called a "condition variable" and corresponds to a separately allocated object. For example, using the pthread library, you create and use a condition variable like this:
pthread_cond_t count_threshold_cv;
pthread_mutex_t count_mutex;
// count_mutex must already be locked by the caller
pthread_cond_wait(&count_threshold_cv, &count_mutex); // atomically unlocks mutex while waiting
pthread_cond_signal(&count_threshold_cv);
When locks and condition variables are used together, the result is called a monitor:

  • A collection of procedures manipulating a shared data structure.
  • One lock must be held whenever accessing the shared data
  • One or more condition variables used for waiting.
In C#, there is no separate type for the condition variable. Instead, every object inherently implements one condition variable, and the Monitor class provides static Wait, Pulse and PulseAll methods to manipulate an object's condition variable. They are used together with lock, or Monitor.Enter and Monitor.Exit.
public static KV GetFromList() {
  KV res;
  lock (typeof(KV)) {
    while (head == null) Monitor.Wait(typeof(KV));
    res = head; head = res.next;
    res.next = null; // for cleanliness
  }
  return res;
}

In C# every object also inherently implements a mutual exclusion lock.

Friday 23 August 2013

Windows 7 shortcut


  • Win+Home: Clear all but the active window.
  • Win+Space: All windows become transparent so you can see through to the desktop.
  • Win+Up arrow: Maximize the active window.
  • Shift+Win+Up arrow: Maximize the active window vertically.
  • Win+Down arrow: Minimize the window/Restore the window if it's maximized.
  • Win+Left/Right arrows: Dock the window to each side of the monitor.
  • Shift+Win+Left/Right arrows: Move the window to the monitor on the left or right.

Monday 8 July 2013

HHMM and HMM

An HMM is a stochastic finite automaton, where each state generates an observation. The observation can be a discrete symbol or a feature vector. The parameters of the model are the initial state distribution, the transition model, and the observation model. An HMM can be represented by a state transition diagram or by a Bayesian network.

Any HHMM can be converted to an HMM. The resulting state space may be smaller, because it does not contain abstract states; this occurs if there are not many shared sub-structures. The state space may be larger if there are shared sub-structures, because they must be duplicated. Whether the state space will be larger or smaller depends on the ratio between the number of hidden states \(n_h^S\) in the HHMM and the number of hidden states \(n^S\) in the HMM.

In general, a dynamic Bayesian network (DBN) can be converted to an HMM if all the hidden nodes are discrete. In this case, the HMM inference engine can be faster than the junction tree inference engine for small models because the constant factors of the algorithm are lower, but it can be exponentially slower for models with many variables (e.g., > 6 binary hidden nodes).

HMM parameter learning: EM algorithm
n sequences with length m
\begin{align}
t^t(s'|s) = \frac{\sum_{i = 1}^n\sum_{j=1}^m p(S_j = s, S_{j+1} = s'|x_{i, 1}\ldots x_{i, m};\underline{\theta})}{\sum_{i = 1}^n\sum_{j=1}^m \sum_{s'} p(S_j = s, S_{j+1} = s'|x_{i, 1}\ldots x_{i, m};\underline{\theta})}
\end{align}
$$\sum_{s'} t(s'|s) = 1$$

Sunday 7 July 2013

Ruby on Rails

Association

Auto-generated methods

Singular associations (one-to-one)

                                  |            | belongs_to   |
generated methods                 | belongs_to | :polymorphic | has_one
----------------------------------+------------+--------------+---------
#other                            |     X      |      X       |    X
#other=(other)                    |     X      |      X       |    X
#build_other(attributes={})       |     X      |              |    X
#create_other(attributes={})      |     X      |              |    X
#other.create!(attributes={})     |            |              |    X
#other.nil?                       |     X      |      X       |
Collection associations (one-to-many / many-to-many)

                                  |       |          | has_many
generated methods                 | habtm | has_many | :through
----------------------------------+-------+----------+----------
#others                           |   X   |    X     |    X
#others=(other,other,...)         |   X   |    X     |    X
#other_ids                        |   X   |    X     |    X
#other_ids=(id,id,...)            |   X   |    X     |    X
#others<<                         |   X   |    X     |    X
#others.push                      |   X   |    X     |    X
#others.concat                    |   X   |    X     |    X
#others.build(attributes={})      |   X   |    X     |    X
#others.create(attributes={})     |   X   |    X     |    X
#others.create!(attributes={})    |   X   |    X     |    X
#others.size                      |   X   |    X     |    X
#others.length                    |   X   |    X     |    X
#others.count                     |   X   |    X     |    X
#others.sum(args*,&block)         |   X   |    X     |    X
#others.empty?                    |   X   |    X     |    X
#others.clear                     |   X   |    X     |    X
#others.delete(other,other,...)   |   X   |    X     |    X
#others.delete_all                |   X   |    X     |
#others.destroy_all               |   X   |    X     |    X
#others.find(*args)               |   X   |    X     |    X
#others.find_first                |   X   |          |
#others.uniq                      |   X   |    X     |    X
#others.reset                     |   X   |    X     |    X

Auto completer

Saturday 6 July 2013

Triangulate a graph

Triangulation ensures a graph G is triangulated (chordal), i.e., every cycle of length > 3 has a chord. During the process, we can also find maximal cliques in the triangulated graph.

Pseudo code:

// G is moralized.
triangulate(G, order):
1. eliminated = {}
2. cliques = {}
3. for i = 1 : n
       u = order(i)
       nodes = {neighbors of u not yet eliminated} \(\cup\) {u}
       make every node in nodes connected to each other
       add u to eliminated
       if nodes is not a subset of any clique already found, add nodes to cliques
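A direct Python transcription of this pseudocode (the dict-of-sets graph representation is my own choice, not from any library):

```python
def triangulate(graph, order):
    """graph: {node: set of neighbors}, already moralized; mutated in place
    by adding fill-in edges. Returns the maximal cliques found."""
    eliminated = set()
    cliques = []
    for u in order:
        nodes = {v for v in graph[u] if v not in eliminated} | {u}
        for a in nodes:                  # connect every pair in nodes
            graph[a] |= nodes - {a}
        eliminated.add(u)
        if not any(nodes <= c for c in cliques):
            cliques.append(nodes)
    return cliques

# Triangulating the 4-cycle a-b-c-d-a with order a,b,c,d adds the chord b-d:
g = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'a', 'c'}}
cliques = triangulate(g, ['a', 'b', 'c', 'd'])
```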
     
     

Tuesday 2 July 2013

Algorithm complexity

http://bigocheatsheet.com

Shortest path

Dijkstra: \(\Theta(V)\) inserts into priority queue, \(\Theta(V)\) EXTRACT_MIN operations, \(\Theta(E)\) DECREASE_KEY operations.
Min-heap implementation: \(\Theta((V + E)\log V)\)
Fibonacci heap implementation: \(\Theta(\log V)\) for exact min, \(\Theta(1)\) for decrease key and insertion, total amortized cost \(\Theta(V\log V + E)\)
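A binary-heap Dijkstra in Python using heapq (since heapq has no DECREASE_KEY, stale entries are skipped instead, which keeps the same \(\Theta((V + E)\log V)\) bound; the graph below is made up):

```python
import heapq

def dijkstra(adj, source):
    """adj: {node: [(neighbor, weight), ...]}. Returns shortest distances."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue  # stale entry: u was already settled with a shorter path
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))  # push again instead of DECREASE_KEY
    return dist

adj = {'a': [('b', 1), ('c', 4)], 'b': [('c', 2)], 'c': []}
```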

Fibonacci heap


A list of heap-ordered trees. We access the heap by a pointer to the tree root with the overall minimum key.

Main operations

Inserting a node: create a new tree containing only x and insert it into the root list of H.
Decrease-key: cut the node from its parent, then insert the node into the root list with a new key. \(O(1)\)
Delete-min: remove the root with the smallest key, add its children to the root list, and scan through the linked list of root nodes to find the new root of minimum key. We also need to consolidate the root list by merging roots with the same degree (same number of children). We can think of consolidation as allocating buckets of size up to the maximum possible rank for any root node, which is O(log n). We put each node into the appropriate bucket and march through the buckets, starting at the smallest one, consolidating everything possible. Therefore the cost of the delete-min operation is O(# of children of the minimum root) plus O(# of root nodes).

Population control for roots

We want to make sure that every node has a small number of children. This can be done by ensuring that the total number of descendants of any node is exponential in the number of its children. One way to do this is by only merging trees that have the same number of children (degree).

Let \(s_k\) be the minimum size of any node of degree k in any Fibonacci heap, and let \(y_1, y_2, \ldots, y_k\) denote the children of x (a node of degree k) in the order in which they were linked to x. Then
\begin{align}
size(x) &\ge s_k \\
& \ge 2 + \sum_{i = 2}^k s_{y_i.degree} \\
& \ge 2 + \sum_{i = 2}^k s_{i - 2} \\
& \ge 2 + \sum_{i = 2}^k F_i \; \text{(by induction)}\\
& = 1 + \sum_{i = 0}^k F_i \\
& = F_{k + 2}\\
&\ge \phi^k
\end{align}

Fibonacci numbers

\begin{align}
F_0 &= 0, \\
F_1 &= 1, \\
F_i &= F_{i - 1} + F_{i - 2} \;\text{for } i \ge 2.
\end{align}

Fourier transformation

The convolution integral

  • The output y(t) to an input x(t) is seen as a weighted superposition of impulse responses time-shifted by \(\tau\).
  • The expression is called the convolution integral and is denoted by the same symbol * as in the discrete-time case: \[y(t) = \int_{-\infty}^\infty x(\tau)h(t-\tau)d\tau = x(t)*h(t)\]

Harmonic signals (complex exponentials) are eigenfunctions of LTI systems. That is,
\[\mathcal{H}f = \lambda f\].

Suppose the input is \(x(t) = Ae^{st}\). The output of the system with impulse response h(t) is then
\begin{align}\int_{-\infty}^\infty h(t-\tau)Ae^{s\tau}d\tau & = \int_{-\infty}^{\infty}h(\tau)Ae^{s(t-\tau)}d\tau = Ae^{st}\int_{-\infty}^{\infty}h(\tau)e^{-s\tau}d\tau \\ & =Ae^{st}H(s)\end{align}
where H(s) is a scalar dependent only on the parameter s.

Fourier series
Continuous time
\[X[k] = \frac{1}{T}\int_0^Tx(t)e^{-jk\omega_0t}dt\]

Discrete time
\[X[k] = \frac{1}{N}\sum_{n=<N>}x[n]e^{-jk\Omega_0n}\]
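A small numerical check of the discrete-time formula (NumPy; the choice of signal is arbitrary):

```python
import numpy as np

def dtfs(x):
    """Discrete-time Fourier series: X[k] = (1/N) sum_n x[n] e^{-jk(2pi/N)n}."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-1j * k * (2 * np.pi / N) * n)) / N
                     for k in range(N)])

N = 8
x = np.cos(2 * np.pi / N * np.arange(N))  # one period of a cosine
X = dtfs(x)
# The coefficients match the FFT scaled by 1/N, with weight 0.5 at k = 1 and k = N-1.
```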

Friday 21 June 2013

Junction tree algorithm for DBN

A naïve approach to inference on a dynamic Bayesian network (DBN) is to unroll the DBN for the desired number of timesteps and treat it as a static BN. The problems with this are:

  • Typically the number of timesteps is unknown, so one may have to use a fixed-length time window or do some kind of segmentation.
  • The space requirement is high.
  • Dynamic filtering requires rebuilding the entire time history of the process on every timestep. (?)

Understanding the 2TBN (2 time-slice Bayes net) junction tree algorithm implementation

First of all, a junction tree is a clique tree that satisfies the junction tree property. A clique tree comes from the definition of a clique graph. A clique graph of G is a graph where the set of nodes is precisely the set of maximal cliques in G. And if the clique graph is also a tree, it is a clique tree. The junction tree property means that all the maximal cliques containing the same node v must form a connected subtree.

This is one of the algorithms in Kevin Murphy's Bayes Net Toolbox. The file name is jtree_2TBN_inf_engine.m.

In the constructor of jtree_2TBN_inf_engine, it creates two jtree engines: one engine just for slice 1, and another engine for 1.5 slices, where the extra 0.5 slice means the interface nodes from the previous slice. Interface nodes are nodes with children in the next slice. In the case of the one-level AHMM, the interface nodes include \(G_t, S_t, F_t^G\), so in this case the interface algorithm (Murphy, 2002) is the same as the frontier algorithm (Zweig, 1996), in which the frontier is the full set of hidden nodes. It also always makes the interface nodes a clique.

One-level AHMM

Resources: http://ttic.uchicago.edu/~kgimpel/papers/jt-dbn.pdf

Thursday 20 June 2013

Computational complexity

http://courses.csail.mit.edu/6.006/fall11/lectures/lecture23.pdf

NP = {decision problems with solutions that can be "checked" in polynomial time}
  • It is based on the nondeterministic model where an algorithm makes guesses and then says YES or NO.
  • Examples: Tetris
NP-hard = "as hard as" every problem in NP

NP-complete = NP \(\cap\) NP-hard
  • Tetris is also NP-complete
  • Other examples: knapsack, 3-partition (given n integers, can you divide them into triples of equal sum?), traveling salesman problem, longest common subsequence of k strings, minesweeper, Sudoku, most puzzles, SAT, 3-coloring a given graph, finding the largest clique in a given graph

Graphical model

A graphical model is a method of using graphs (nodes and edges) to represent dependence or independence relationships among random variables.

Approximate inference in Bayesian networks
Gibbs sampling

Friday 7 June 2013

Matlab style guide

Here is a Matlab style guide combined from various sources. I added some changes based on personal preference and consistency.
  1. Kevin Murphy's style guide
  2. http://www.datatool.com/downloads/matlab_style_guidelines.pdf

Names

  • Capitalize constants.
  • Use camel back notation for variables, as in hiddenMarkovModel. CamelBack always starts with a lower case letter, and then uses an upper case letter for each new word. Acronyms, such as Gaussian HMM, also follow this convention, so it would be written as gaussianHmm.
  • Names of functions should be written in lower case. 
    • It is clearest to have the function and its m-file names the same. Using lower case avoids potential filename problems in mixed operating system environments. 
    • To improve readability, underscore may be used.
  • Prefix variables denoting a number of elements with the letter n as in nValue for the number of values. 
  • Suffix variables storing indices with Ndx as in dataNdx

Comments

Sunday 2 June 2013

Git submodules

When you clone a git repository with submodules, you should add the --recursive option.
git clone --recursive git://github.com/foo/bar.git
If you forget to do that during the cloning, you can update the submodules by
git submodule update --init
init adds the submodule repository URLs to .git/config, and update checks out a specific commit, rather than the tip of a branch.

Thursday 23 May 2013

Thursday 16 May 2013

Namespaces in C# Visual Studio development

There are several namespaces in C# Visual Studio development for which it is not obvious where to find and how to add them.

System.Windows is in the PresentationCore assembly. System.Windows.Media, System.Windows.Controls and System.Windows.Shapes are in the PresentationFramework assembly.

Wednesday 15 May 2013

Vim on Windows

On Linux, all the Vim plugin and autoload files are put in ~/.vim folder. On Windows, they should be put in the ~/vimfiles folder. For example to use pathogen, put the pathogen.vim file in ~/vimfiles/autoload; otherwise Vim will not be able to find the .vim file.

Tuesday 14 May 2013

Running STIP feature detection on Ubuntu 12.04

Ivan Laptev's code on Space-time interest points (STIP) detection can be downloaded here. To run it on Ubuntu 12.04, there are several dependencies you need to install.

First you need to install OpenCV. As the newer version of OpenCV changed the library names and the code was compiled against the older version, you need to make symbolic links pointing from the old library names to the new libraries. A stackoverflow post lists all the symbolic links you need to make.

In order to make video reading work with OpenCV, you need to compile OpenCV with libgtk2.0-dev and pkg-config installed. Also you need libgstreamer0.10-dev and libgstreamer0.10-vaapi-dev. The libgstreamer0.10-vaapi-dev provides the headers such as gst/video/video.h which is required in compiling OpenCV.

Tuesday 23 April 2013

mmread.m

mmread.m is a Matlab function for reading different media files on different platforms (Linux, Windows and Macs). It is used in the sample code of 2011/2012 One-shot-learning gesture recognition challenge.

When I first ran example.m in the sample code, it asked me to install libavbin.so and I clicked yes. However, I got a segmentation fault after the installation. I also tried installing libavbin.so from the version downloaded from https://code.google.com/p/avbin/, but this causes an error (undefined symbol: avbin_open_filename_with_format) in FFGrab.mexa64. The author noted that he fixed this bug, so I downloaded mmread from the Matlab file exchange and replaced the one in the sample code. Now mmread works.


Tuesday 16 April 2013

A Unified Framework for Concurrent Usage of Hand Gesture, Shape and Pose

Natural interaction with hands

  • hand trajectory
    • HMM based continuous recognition
    • not really natural
  • hand shape
    • Use depth image directly / label / recognition
  • articulated hand pose
    • Hand skeleton extraction



One-shot gesture recognition

Data
User dependent: the gestures in one batch are performed by a single user. There is a single labeled training example of each gesture of the vocabulary in a given batch. The goal of the challenge is, for each batch, to train a system on the training examples and to make predictions of the labels for the test examples. The test labels in the validation batches are withheld. Additional batches finalXX will be provided for final testing.

Overall analysis
https://docs.google.com/file/d/0B08QS7nJpK7mX3pFalVzckxoU2M/edit

Winners' methods
1st place
https://docs.google.com/file/d/0B4jW8HPqnNiuU2RiQWl6TnpfQzQ/edit

Initial image preprocessing

  • Used depth information only
  • Identify outliers (pixels returned as 0 by the Kinect) and remove them (simple/fast wavelet reconstruction)

Representation of visual features

  • Mimic behavioral and neural mechanisms underlying visual processing
  • Selected features of interest (emphasize gestures moving close to the camera)
  • Feature/background separation
  • Encode features' time-varying shape and trajectory
  • Similarity measure (robust to variability in feature selection or location)

Gesture recognition algorithm

  • General Bayesian network model similar to the speech recognition literature
  • Can perform simultaneous recognition and segmentation
  • Compute similarities between each input video frame and the sample gesture video frames

Tuesday 26 March 2013

Gesture classification pipeline

There are a couple of gesture classification pipelines that provide a clean and nice framework. Here I am going to list their basic structures.

    • Gesture recognition toolkit
      • pipeline
        • pre-processing
        • feature extraction: should be chainable
        • classifier
        • post-processing 
      • methods
        • train
        • test
          • F measure
          • precision
          • recall
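As a sketch of the structure above, here is a minimal chainable pipeline in Python. All class and method names are illustrative, not taken from any real toolkit; the point is the flow: pre-processing, then a chain of feature extractors, then the classifier, with train/test methods reporting precision, recall, and F measure.

```python
class Pipeline:
    """Minimal gesture-classification pipeline sketch (illustrative names)."""

    def __init__(self, preprocessors, feature_extractors, classifier, postprocessors):
        self.preprocessors = preprocessors            # run in order
        self.feature_extractors = feature_extractors  # chainable: each output feeds the next
        self.classifier = classifier                  # any object with fit/predict
        self.postprocessors = postprocessors

    def _features(self, x):
        for p in self.preprocessors:
            x = p(x)
        for f in self.feature_extractors:
            x = f(x)
        return x

    def train(self, samples, labels):
        self.classifier.fit([self._features(s) for s in samples], labels)

    def test(self, samples, labels):
        """Evaluate on binary labels {0, 1} and report precision/recall/F measure."""
        predictions = [self.classifier.predict(self._features(s)) for s in samples]
        tp = sum(1 for p, y in zip(predictions, labels) if p == y == 1)
        fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return {"precision": precision, "recall": recall, "f_measure": f_measure}
```

Post-processing (e.g. smoothing a label sequence) would hook in after `predict` in the same chained style.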

    Monday 18 March 2013

    Mahalanobis distance and Gaussian distribution

    The Mahalanobis distance is a distance measure that accounts for the covariance or "stretch" of the shape in which the data lies.

    $$D_{\text{Mahalanobis}}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T\Sigma^{-1}(\mathbf{x} - \mathbf{y})}$$

    This is very similar to the exponential term in the Gaussian distribution.

    It is useful to get an intuitive feel about the standard deviation. In the one-dimensional case, where
    $$ p(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
    the probability density at one standard deviation away from the center is \(e^{-1/2}\) (about 61%) of the peak density.


    Source:
    https://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg
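A small numerical illustration of the Mahalanobis distance (a sketch using numpy): with an identity covariance there is no "stretch", so it reduces to the ordinary Euclidean distance, while inflating the variance along one axis makes distance along that axis cheaper.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y under covariance matrix cov."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x = [1.0, 2.0]
mu = [0.0, 0.0]

# Identity covariance: reduces to Euclidean distance, sqrt(1^2 + 2^2) = sqrt(5)
print(mahalanobis(x, mu, np.eye(2)))

# Variance 4 along the first axis: that coordinate contributes 1^2/4
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
print(mahalanobis(x, mu, cov))   # sqrt(1/4 + 4)
```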

    Friday 15 March 2013

    Go and Scala

    Start to learn Go and Scala. I'm going to jot down some notes here.

    Both Scala and Go have similar type declaration syntax, with type name on the right.
    Scala:
    class Person(first: String, lastName: String)
    Go:
    func needInt(x int) int {return x * 10 + 1}

    Wednesday 13 March 2013

    Machine learning

    Perceptron

    Repeat until convergence:
    For t = 1 ... n
    1. \(y'\) = sign(\(\underline{x}_t\cdot\underline{\theta}\))
    2. If \(y'\ne y_t\) Then \(\underline{\theta} = \underline{\theta} + y_t\underline{x}_t\), Else leave \(\underline{\theta}\) unchanged.
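The two steps above can be sketched directly in Python; this is a minimal version run on a toy linearly separable dataset (the data and loop bound are illustrative).

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron. X: (n, d) array; y: labels in {-1, +1}.
    Repeats the update rule until an epoch makes no mistakes."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if np.sign(x_t @ theta) != y_t:   # step 1: predict
                theta = theta + y_t * x_t     # step 2: update on a mistake
                mistakes += 1
        if mistakes == 0:                     # converged
            break
    return theta

# Toy separable data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])
theta = perceptron(X, y)
print(np.sign(X @ theta))   # matches y on separable data
```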

    Kernel form of the perceptron

    • Definition: for any \(\underline{x}\), define \(g(x) = \sum_{j=1}^n\alpha_jy_jK(\underline{x}_j,\underline{x})\) where \(K(\underline{x}_j,\underline{x}) = \phi(\underline{x}_j)\cdot\phi(\underline{x})\)
    • Repeat until convergence:
      • For t = 1 ... n
        1. \(y'\) = sign(g(\(\underline{x}_t\)))
        2. If \(y'\ne y_t\) Then \(\alpha_t = \alpha_t + 1\)
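A corresponding sketch of the kernel form, keeping a mistake count \(\alpha_j\) per example and precomputing the Gram matrix (names and the toy data are illustrative). With a linear kernel it reduces to the ordinary perceptron.

```python
import numpy as np

def kernel_perceptron(X, y, K, max_epochs=100):
    """Kernel perceptron: alpha[j] counts the mistakes made on example j.
    K(a, b) is the kernel function."""
    n = len(X)
    alpha = np.zeros(n)
    # Gram matrix G[i, j] = K(x_i, x_j)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for t in range(n):
            g = np.sum(alpha * y * G[:, t])   # g(x_t) = sum_j alpha_j y_j K(x_j, x_t)
            if np.sign(g) != y[t]:
                alpha[t] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, X, y, K, x):
    return np.sign(sum(a * yj * K(xj, x) for a, yj, xj in zip(alpha, y, X)))

linear = lambda a, b: float(np.dot(a, b))   # linear kernel
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, linear)
print([predict(alpha, X, y, linear, x) for x in X])
```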

    Monday 11 March 2013

    Algorithms for inference

    Definitions

    Chordal graph: a graph is chordal if any cycle of the graph of size >= 4 has a chord.

    Formulas

    Parameter estimation for HMM with Gaussian distribution for emission probabilities:
    $$\underline{\mu}_s^t = \frac{\sum_{i=1}^n\sum_{j=1}^mp(S_j = s|\underline{x}_{i,1}\dots\underline{x}_{i,m};\theta)\underline{x}_{i,j}}{\sum_{i=1}^n\sum_{j=1}^mp(S_j = s|\underline{x}_{i,1}\dots\underline{x}_{i,m};\theta)}$$
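The update is just a posterior-weighted average of the observations. A minimal numpy sketch, assuming the posteriors gamma[i, j] = p(S_j = s | x_{i,1} ... x_{i,m}; theta) for one state s have already been computed (e.g. by the forward-backward algorithm):

```python
import numpy as np

def update_mean(X, gamma):
    """Mean update for one state s.
    X: (n, m, d) observations; gamma: (n, m) posteriors for state s."""
    return (gamma[..., None] * X).sum(axis=(0, 1)) / gamma.sum()

X = np.array([[[0.0], [2.0]],
              [[4.0], [6.0]]])      # n=2 sequences, m=2 steps, d=1
gamma = np.array([[1.0, 1.0],
                  [1.0, 1.0]])      # uniform responsibility -> plain mean
print(update_mean(X, gamma))        # [3.]
```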

    General form of EM algorithm

    The basic idea is that we want to maximize the likelihood of the "complete" data. Because we don't know the "complete" data, we need to take the expectation of the "complete" data with respect to the probability distribution of the hidden variables.

    Let \(\mathbf{Z}\) denote all the hidden variables for all the examples, and let \(\boldsymbol{\theta}\) be all the parameters for the probability model. $$\begin{aligned}\boldsymbol{\theta}^{(i+1)} &= \underset{\boldsymbol{\theta}}{\arg\max}\sum_{\boldsymbol{z}}P(\boldsymbol{Z} = \boldsymbol{z}|\boldsymbol{x}, \boldsymbol{\theta}^{(i)})L(\boldsymbol{x}, \boldsymbol{Z} = \boldsymbol{z}|\boldsymbol{\theta}) \\ &=\underset{\boldsymbol{\theta}}{\arg\max}E_{P(\boldsymbol{Z} = \boldsymbol{z}|\boldsymbol{x}, \boldsymbol{\theta}^{(i)})}L(\boldsymbol{x}, \boldsymbol{Z}= \boldsymbol{z}|\boldsymbol{\theta}) \end{aligned}$$
    The E-step is the computation of the summation, which is the expectation of the log likelihood of the "completed" data with respect to the distribution \(P(\boldsymbol{Z} = \boldsymbol{z}|\boldsymbol{x}, \boldsymbol{\theta}^{(i)})\).
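As a concrete toy instance, here is a minimal EM sketch for a 1-D mixture of two unit-variance Gaussians with equal weights, estimating only the means (all other modeling choices are simplifications for illustration). The E-step computes the posterior over the hidden component for each point; the M-step maximizes the expected complete-data log likelihood, which here works out to posterior-weighted means.

```python
import math, random

def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em_two_means(xs, mu=(-1.0, 1.0), iters=50):
    mu0, mu1 = mu
    for _ in range(iters):
        # E-step: posterior responsibility of component 1 for each point
        r = [normal_pdf(x, mu1) / (normal_pdf(x, mu0) + normal_pdf(x, mu1))
             for x in xs]
        # M-step: posterior-weighted means
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu0 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu0, mu1

random.seed(0)
xs = ([random.gauss(-3, 1) for _ in range(200)] +
      [random.gauss(3, 1) for _ in range(200)])
mu0, mu1 = em_two_means(xs)
print(mu0, mu1)   # close to the true means -3 and 3
```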

    Sunday 10 March 2013

    Complexity theory

    The class NP consists of decision problems whose solutions can be "checked" in polynomial time. Equivalently, it can be defined as the set of decision problems solvable in polynomial time by a "lucky" algorithm: a nondeterministic model where the algorithm makes guesses and then says YES or NO. The name "NP" stands for "nondeterministic polynomial time".

    A problem is in the class NP-complete, if it is in NP and is as "hard" as any problem in NP. So NP-complete problems are the hardest problems in NP.

    Examples of NP-complete problems:

    • Knapsack (pseudopolynomial, O(nS) where n is number of items and S is the size of the knapsack)
    • Traveling salesman problem
    • SAT
    • Tetris
    • 3-coloring a given graph

    MIT 6.006 notes: http://courses.csail.mit.edu/6.006/fall11/lectures/lecture23.pdf

    TED talk: How movies teach manhood



    I really enjoyed this talk. Colin Stokes is both eloquent and humorous. I agree with his observation that there are movies that teach girls to fight against patriarchy (like many Disney movies), but they are all tailored to girls, and there are very few movies that teach boys to fight against patriarchy. As a result, there are no role models for boys to learn how to respect women and treat them as equals. In a quest to achieve equality, we need to work on both sides: encouraging girls to pursue their dreams without being restrained by stereotypes, and fostering in boys the idea that female leadership is as normal as male leadership. That's why Stokes asks for more movies that send positive messages to boys: that cooperation is heroic, and respecting women is as manly as defeating the villain.

    In his talk, he mentioned the Bechdel test for movies, which is used to identify gender bias in fiction. I had heard of this test before on a graduate women's mailing list, in the context of showing movies that pass it. The test has 3 requirements:

    1. Are there at least 2 named female characters?
    2. Did the two talk to each other?
    3. Did they talk about something other than men?
    You have to agree that these criteria are somewhat simplistic, but they can be a good start. Even with these simple criteria, only about 50% of the movies in a year pass.

    Stokes also mentioned the movie Brave, which is a nontraditional princess movie. Unlike the usual princess movies, which revolve around a male romantic attachment, this movie is purely about a mother-daughter relationship. At no point in the movie does the princess become romantically involved with a male character. This is indeed quite rare in Hollywood movies. It's also good that Brave won the Oscar for best animated film.

    Sunday 3 March 2013

    Matlab save path

    Matlab allows you to add paths to the search path, and you are supposed to be able to save them so that you don't need to add the paths again the next time. However, when I tried to save the paths, the new paths didn't show up when I restarted Matlab. I'm running MATLAB R2012a on Ubuntu 12.04.

    This post helped me solve the problem. After saving the new paths, a new file is created in ~/Documents/MATLAB because I do not have write permission in the default installation location. However, the new pathdef.m file is not read when Matlab starts. To solve this, you can add a startup.m in ~/Documents/MATLAB if that's your userpath. Add the following line to the file:
    path(pathdef)

    On Windows 7, when you set the path through the dialog box, it will add the new paths to the pathdef.m file in the installation folder. This will affect other users' paths. To add the paths so that only you will see them, you can do the same thing as before by adding a startup.m file in C:\Users\[username]\Documents\MATLAB if that is your startup path. On Windows, it won't ask you to specify a new file to save the paths, so you need to copy the pathdef.m file from the installation folder (default is C:\Program Files\MATLAB\R2013a\toolbox\local\pathdef.m) to the userpath.

    However, one problem I encountered when starting from the userpath is that some functions in the parallel computing toolbox no longer work. It turns out that there seem to be some conflicts between that toolbox and the Bayesian Network Toolbox (BNT) I'm using. After moving the BNT to the bottom of the search path, everything works again.

    Wednesday 27 February 2013

    Ray Kurzweil



    Ray Kurzweil is a prolific inventor who has been involved in fields like OCR, text-to-speech synthesis, and speech recognition. He did his undergraduate degree at MIT. In this talk, he emphasized the exponential growth of information technology. I think everyone can see that information technology is growing very fast. Just in the short span of my life so far, my daily life has transitioned from one that was almost devoid of electronic devices to one in which computers, smart phones, tablets, etc. are indispensable. However, I hadn't thought much about the actual growth speed before this talk. Grasping the concept of exponential growth is almost eye-opening. This makes me feel very excited, because the range of possibilities that technology can achieve is almost unfathomable. It's exciting to live in this age of the world, and I also feel an urgency in my work. I have to work fast, otherwise the work will be outdated very soon.

    Sunday 24 February 2013

    Vim cheat sheet

    Compound command

    Compound command | Equivalent in longhand | Effect
    C                | c$                     | clear to the end of the line
    s                | cl                     | clear to the right
    S                | ^C                     | clear the whole line
    I                | ^i                     | insert at the start of the line
    A                | $a                     | insert after the end of the line
    o                | A<CR>                  | insert on a new line below
    O                | ko                     | insert on a new line above

    Wednesday 20 February 2013

    Bayesian network

    The posterior probability is the probability of the parameters given the evidence : \(P(\theta|X)\).
    Likelihood is the probability of the evidence given the parameters: \(P(X|\theta)\).

    The maximum a posteriori (MAP) estimate is a mode of the posterior distribution. It can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to maximum likelihood (ML) estimation, but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation.

    Maximum likelihood estimate of \(\theta\) is $$\hat\theta_{ML}(x) = \underset{\theta}{\arg\max}f(x|\theta)$$
    If \(g(\theta)\) is a prior distribution over \(\theta\), $$\hat{\theta}_{MAP} = \underset{\theta}{\arg\max}f(x|\theta)g(\theta)$$

    Conjugate prior

    In Bayesian probability theory, if the posterior distributions \(p(\theta|x)\) are in the same family as the prior probability distribution \(p(\theta)\), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood. For example, the Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian.

    The posterior distribution of a parameter \(\theta\) given some data x is $$p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int{p(x|\theta)p(\theta)d\theta}}$$ A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior.

    Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution. The Dirichlet distribution with parameters \(\alpha_1, \dots, \alpha_K > 0\) has a probability density function given by $$f(x_1, \dots, x_{K-1};\alpha_1, \dots, \alpha_K) \propto \prod_{i = 1}^Kx_i^{\alpha_i-1}$$ for \(x_i > 0\) and \(\sum x_i = 1\). This makes it suitable to be a prior distribution for a model parameter \(\boldsymbol{\theta}\) for a multinomial distribution where \(\boldsymbol{\theta}_i = x_i\).
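A small numerical illustration of this conjugacy (a sketch; the prior parameters and counts are made up): after observing multinomial counts \(c_i\), the posterior is again Dirichlet with parameters \(\alpha_i + c_i\), so the update is just an addition.

```python
import numpy as np

alpha = np.array([1.0, 2.0, 3.0])   # Dirichlet prior parameters
counts = np.array([10, 0, 5])       # observed multinomial counts

posterior_alpha = alpha + counts    # conjugate update: still a Dirichlet
posterior_mean = posterior_alpha / posterior_alpha.sum()

print(posterior_alpha)              # [11.  2.  8.]
print(posterior_mean)               # posterior mean of theta, sums to 1
```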

    Linear algebra review

    Homogeneous systems

    $$b_{11}x_1 + b_{12}x_2 + \dots + b_{1n}x_n = 0 \\ b_{21}x_1 + b_{22}x_2 + \dots + b_{2n}x_n = 0 \\ \dots \\ b_{p1}x_1 + b_{p2}x_2 + \dots + b_{pn}x_n = 0$$

    Properties:
    • Has at least one solution: [0, 0, ..., 0].
    • If it has a non-zero solution, then it has an infinite number of solutions.

    Inverse of a matrix

    A matrix \(A\) has an inverse if and only if any (equivalently, all) of the following holds:
    • \(\det(A)\neq 0\)
    • The reduced row echelon form of \(A\) is the identity matrix.
    • \(A\) has full rank.
    • The homogeneous equation \(Ax=0\) has only the trivial solution.
    • The equation \(Ax = b\) has a unique solution for every \(b\).

    Eigenvalues and eigenvectors

    An eigenvector is a non-zero vector which is transformed to a scalar multiple of itself.
    The eigenvalue equation for a matrix \(A\) is \(Av-\lambda v = 0\), which is equivalent to \((A - \lambda I)v = 0\).
    This equation has non-zero solutions if and only if \(\det(A - \lambda I) = 0\), i.e. \((A - \lambda I)\) is singular (not invertible).
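These two facts can be checked numerically with numpy (the matrix here is an arbitrary example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)

for lam, v in zip(eigvals, eigvecs.T):            # columns of eigvecs are eigenvectors
    assert np.allclose(A @ v, lam * v)            # Av = lambda v
    # det(A - lambda I) vanishes at each eigenvalue
    assert abs(np.linalg.det(A - lam * np.eye(2))) < 1e-9

print(sorted(eigvals.real))
```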

    Showing that an eigenbasis makes for good coordinate systems

    Theorems 

    Number of eigenvalues of a matrix

    Suppose that A is a square matrix of size n with distinct eigenvalues \(\lambda_1, \lambda_2, \lambda_3,\dots,\lambda_k\). Then \(\sum_{i=1}^k\alpha_A(\lambda_i) = n\), where \(\alpha_A(\lambda_i)\) is the algebraic multiplicity of \(\lambda_i\).

    Maximum number of eigenvalues of a matrix

    Suppose that A is a square matrix of size n. Then A cannot have more than n distinct eigenvalues.

    Spectral theorem

    Consider a Hermitian map A on a finite-dimensional real or complex inner product space V endowed with a positive definite Hermitian inner product. The Hermitian condition means
    $$\forall x, y \in V: \langle Ax, y \rangle = \langle x, Ay \rangle.$$
    An equivalent condition is that \(A^* = A\), where \(A^*\) is the Hermitian conjugate of \(A\). In the case that \(A\) is identified with a Hermitian matrix (one which is equal to its own conjugate transpose), the matrix of \(A^*\) can be identified with its conjugate transpose. If \(A\) is a real matrix, this is equivalent to \(A^T = A\) (that is, \(A\) is a symmetric matrix).
    Theorem. There exists an orthonormal basis of V consisting of eigenvectors of A. Each eigenvalue is real.

    So in less precise terms, and considering only real numbers: if A is a symmetric matrix, its eigenvectors form an orthonormal basis.
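A quick numerical check of this statement using numpy's eigh, which is designed for symmetric/Hermitian matrices: the eigenvalues come out real, the eigenvector columns form an orthonormal basis, and \(A = Q\Lambda Q^T\).

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
assert np.allclose(A, A.T)                         # A is symmetric

eigvals, Q = np.linalg.eigh(A)                     # real eigenvalues, columns = eigenvectors
assert np.allclose(Q.T @ Q, np.eye(3))             # eigenvectors are orthonormal
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)  # A = Q Lambda Q^T

print(eigvals)
```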

    Principal component analysis

    One of the applications involving eigenvalues and eigenvectors is PCA. It transforms a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible).

    PCA is used in the eigenface technique for face recognition.

    One question one would ask is why the eigenvectors of the covariance matrix \(\textbf{V}\) of the data are the principal components. By definition, the first principal component \(\textbf{w}\) has the largest possible variance. If we project all the data onto this direction, the variance of the resulting data is \(\textbf{w}^T\textbf{Vw}\). So we want to choose \(\textbf{w}\) to maximize this variance. Note that we need to constrain the maximization, otherwise the objective function has no maximum. The constraint is that \(\textbf{w}\) is a unit vector, so that \(\textbf{w}^T\textbf{w} = 1\). To do constrained optimization, we use a Lagrange multiplier. The Lagrange function is thus
    \begin{align}L(\textbf{w}, \lambda) &= \textbf{w}^T\textbf{Vw} - \lambda(\textbf{w}^T\textbf{w} - 1)\\
    \frac{\partial L}{\partial \textbf{w}} &= 2\textbf{Vw} - 2\lambda\textbf{w} = 0 \\
    \textbf{Vw} &= \lambda\textbf{w}
    \end{align}
    (If the centered data are stacked as the columns of a matrix \(\textbf{A}\), then \(\textbf{V} \propto \textbf{AA}^T\), so \(\textbf{w}\) is equally an eigenvector of \(\textbf{AA}^T\).) This means the maximizing vector is the eigenvector with the largest eigenvalue. The principal component vector is also a linear combination of the original variables.
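This derivation can be checked numerically. A small sketch (the synthetic data and mixing matrix are made up for illustration) that computes the first principal component of correlated 2-D data via the eigendecomposition of the sample covariance, and verifies that the variance of the projected data equals the top eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data, stretched roughly along the (1, 1) direction
A = rng.normal(size=(500, 2)) @ np.array([[3.0, 2.0], [2.0, 3.0]])
A = A - A.mean(axis=0)                      # center the data

V = A.T @ A / (len(A) - 1)                  # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(V)        # eigenvalues in ascending order
w = eigvecs[:, -1]                          # eigenvector with the largest eigenvalue

# The variance of the data projected onto w equals the top eigenvalue w^T V w.
projected_var = (A @ w).var(ddof=1)
assert np.isclose(projected_var, eigvals[-1])
print(w)                                    # roughly +/-(0.707, 0.707)
```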

    Singular value decomposition

    Singular value decomposition can be expressed as
    $$M = U\Sigma V^*$$

    The columns of \(V\) are eigenvectors of \(M^*M\). If \(M\) is positive semi-definite, the eigenvalue decomposition of \(M\) is the same as its singular value decomposition. However, the eigenvalue decomposition and the singular value decomposition differ for all other matrices \(M\).
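A quick numpy check of the relationship between the SVD and the eigenvectors of \(M^*M\) (the matrix is an arbitrary example): each column \(v_i\) of \(V\) satisfies \(M^TMv_i = \sigma_i^2 v_i\).

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# Each column of V is an eigenvector of M^T M with eigenvalue sigma_i^2.
for sigma, v in zip(s, V.T):
    assert np.allclose(M.T @ M @ v, sigma ** 2 * v)

# Reconstruct M = U Sigma V^T from the decomposition.
assert np.allclose((U * s) @ Vt, M)
print(s)   # singular values in descending order
```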