Parsing information from Google Street View images

Reference Paper:

http://vision.ucsd.edu/sites/default/files/sean_cities.pdf

Data Set Construction

1. Data Collection:

The difficulty with Flickr and other consumer photo-sharing websites for geographical tasks is that there is a strong data bias towards famous landmarks. To correct for this bias and provide a more uniform sampling of the geographical space, we turn to GOOGLE STREET VIEW – a huge database of street-level imagery, captured as panoramas using specially-designed vehicles. This enables extraction of roughly fronto-parallel views of building facades and, to some extent, avoids dealing with large variations of camera viewpoint.

Building data sets from Google Street View: http://cmp.felk.cvut.cz/ftp/articles/gronat/Gronat-TR-2011-16.pdf
Panorama images are used to reconstruct street-level views of the city.

2. Data Classification Method:

a. Main Idea:

We propose an approach that avoids partitioning the entire feature space into clusters. Instead, we start with a large number of randomly sampled candidate patches (the seeds of the clusters), and then give each candidate a chance to converge to a cluster that is both frequent (occurring often within the given locale) and discriminative (geographically discriminative, i.e., occurring rarely in other cities); such clusters are labeled positive. We first compute the nearest neighbors of each candidate, and reject candidates with too many neighbors in the negative set. Then we gradually build clusters by applying iterative discriminative learning (SVM training) to each surviving candidate:
http://graphics.cs.cmu.edu/projects/whatMakesParis/paris_sigg_reduced.pdf

First, the initial geo-informativeness of each patch (from the large pool of randomly sampled candidates) is estimated by finding its top 20 nearest-neighbor (NN) patches in the full dataset (both positive and negative), measured by normalized correlation. Patches portraying non-discriminative elements tend to match similar elements in both the positive and negative sets, while patches portraying a non-repeating element will have more-or-less random matches, again in both sets. Thus, we keep the candidate patches that have the highest proportion of their nearest neighbors in the positive set, while also rejecting near-duplicate patches (measured by spatial overlap of more than 30% between any 5 of their top 50 nearest neighbors). This reduces the number of candidates to about 1000.
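The nearest-neighbor ranking step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code: the function and parameter names are assumptions, and patch descriptors are assumed to be precomputed row vectors.

```python
import numpy as np

def rank_candidates(candidates, dataset, labels, k=20):
    """Score each candidate patch by the fraction of its top-k nearest
    neighbors (under normalized correlation) that come from the positive
    (target-city) set.

    candidates: (m, d) array of candidate patch descriptors
    dataset:    (n, d) array of all patch descriptors (positive + negative)
    labels:     (n,) boolean array, True = positive set
    """
    # Normalize rows (zero mean, unit norm) so a dot product equals
    # normalized correlation.
    def normalize(x):
        x = x - x.mean(axis=1, keepdims=True)
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    c, d = normalize(candidates), normalize(dataset)
    sims = c @ d.T                             # (m, n) correlations
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of top-k neighbors
    # Fraction of the top-k neighbors that lie in the positive set.
    return labels[topk].mean(axis=1)
```

Keeping the candidates with the highest scores (and rejecting near-duplicates by spatial overlap, which this sketch omits) yields the ~1000 surviving seeds.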
Then the SVM learning is iterated (Figure 3, rows 2-4).
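The iterative discriminative learning step can be sketched with scikit-learn's LinearSVC standing in for the soft-margin linear SVM (C = 0.1 as in the paper). The helper name, iteration count, and cluster size below are illustrative assumptions, not details from the paper's implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_cluster(seed_members, positives, negatives, iterations=3, top_m=5):
    """Train an SVM on the current cluster members vs. the negative set,
    then re-form the cluster from the top-scoring detections in the
    positive set, and repeat.

    seed_members: (m, d) descriptors of the initial cluster (the seed
                  patch and its nearest neighbors)
    positives:    (p, d) all positive-set patch descriptors
    negatives:    (n, d) all negative-set patch descriptors
    """
    members = seed_members
    for _ in range(iterations):
        X = np.vstack([members, negatives])
        y = np.concatenate([np.ones(len(members)), np.zeros(len(negatives))])
        svm = LinearSVC(C=0.1)          # soft-margin SVM, C fixed to 0.1
        svm.fit(X, y)
        # Re-rank every positive-set patch by SVM score and keep the
        # top detections as the new cluster members.
        scores = svm.decision_function(positives)
        members = positives[np.argsort(-scores)[:top_m]]
    return svm, members
```

Each iteration tightens the cluster around patches that the current detector scores highly while remaining separable from the negative (other-city) set.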

b. Implementation Details:

The implementation considers only square patches (although it would not be difficult to add other aspect ratios), and takes patches at scales ranging from 80-by-80 pixels all the way up to height-of-image size. Patches are represented with standard HOG [Dalal and Triggs 2005] (8x8x31 cells), plus an 8x8 color image in Lab colorspace (a and b channels only). Thus the resulting feature has 8x8x33 = 2112 dimensions. During iterative learning, we use a soft-margin SVM with C fixed to 0.1. The full mining computation is quite expensive; a single city requires approximately 1,800 CPU-hours. But since the algorithm is highly parallelizable, it can be done overnight on a cluster.
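The descriptor layout above can be made concrete with a small sketch. The function name and the precomputed inputs are assumptions; only the concatenation scheme and the resulting dimensionality come from the text.

```python
import numpy as np

def patch_descriptor(hog_cells, lab_patch):
    """Assemble the patch descriptor: an 8x8 grid of 31-dimensional HOG
    cells, plus an 8x8 color thumbnail using only the a and b channels
    of Lab space (the L channel is dropped to discount lighting).

    hog_cells: (8, 8, 31) precomputed HOG features for the patch
    lab_patch: (8, 8, 3) downsampled patch in Lab colorspace
    """
    color = lab_patch[:, :, 1:3]   # keep a and b only
    feat = np.concatenate([hog_cells.ravel(), color.ravel()])
    # 8*8*31 + 8*8*2 = 1984 + 128 = 2112 dimensions
    return feat
```

The 2112-dimensional vectors produced this way are what the nearest-neighbor ranking and SVM training operate on.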

c. Reference:

Calculate the HOG+color descriptor, then use an SVM to train the model:
https://hal.inria.fr/file/index/docid/548512/filename/hog_cvpr2005.pdf

HOG Learning: http://blog.sina.com.cn/s/blog_60e6e3d50101bkpn.html

SVM Learning: http://blog.pluskid.org/?page_id=683