Justin Manweiler, Puneet Jain, and Romit Roy Choudhury
This paper attempts to solve the following problem: can a distant object be localized by viewing it through a smartphone? As an example use case, while driving on a highway into New York, we want to point the smartphone camera at one of the skyscrapers and compute its GPS location. While the problem would have been far more difficult five years ago, the growing number of sensors on smartphones, combined with advances in computer vision, has opened up important opportunities.
We harness these opportunities through a system called Object Positioning System (OPS) that achieves reasonable localization accuracy. Our core technique uses computer vision to create an approximate 3D structure of the object and the camera positions, and applies mobile phone sensors to scale and rotate the structure into its absolute configuration. Then, by solving (nonlinear) optimizations on the residual (scaling and rotation) error, we ultimately estimate the object's GPS position.
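To make this alignment step concrete, below is a minimal sketch (our illustration under stated assumptions, not the authors' implementation) of fitting a similarity transform by nonlinear least squares: the structure-from-motion reconstruction lives in an arbitrary frame, so we solve for the scale, rotation, and translation that best map the reconstructed camera positions onto their GPS fixes (converted to local east/north meters), then push the object's reconstructed point through the same transform. The function names and the SciPy-based formulation are our own.

```python
# Sketch only: align an arbitrary-scale SfM frame to absolute coordinates
# by fitting a 2D similarity transform (scale s, rotation theta, translation
# tx/ty) that maps SfM camera positions onto GPS-derived positions, then
# map the object's SfM point through the fitted transform.
import numpy as np
from scipy.optimize import least_squares

def _residuals(params, cams_sfm, cams_gps):
    # Misfit between transformed SfM camera positions and GPS fixes.
    s, theta, tx, ty = params
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    pred = s * cams_sfm @ R.T + np.array([tx, ty])
    return (pred - cams_gps).ravel()

def localize_object(cams_sfm, cams_gps, obj_sfm):
    """cams_sfm: Nx2 camera positions in the SfM frame (arbitrary scale).
    cams_gps:  Nx2 camera positions in local meters (east, north) from GPS.
    obj_sfm:   the object's 2D position in the SfM frame.
    Returns the object's estimated position in local meters."""
    x0 = np.array([1.0, 0.0, 0.0, 0.0])  # initial guess: unit scale, no rotation
    sol = least_squares(_residuals, x0, args=(cams_sfm, cams_gps))
    s, theta, tx, ty = sol.x
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    return s * R @ obj_sfm + np.array([tx, ty])
```

In practice the camera positions themselves carry the GPS error discussed in the paper, so a real system would weight these residuals by sensor confidence rather than treat the fixes as exact.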
We have developed OPS on Android Nexus S phones and experimented with localizing 50 objects on the Duke University campus. We believe that OPS shows promising results, enabling a variety of applications. Our ongoing work is focused on coping with large GPS errors, which prove to be the prime limitation of the current prototype.
Public Review uploaded by Gaetano Borriello:
Determining a user's location is a well-known problem in mobile computing. GPS-enabled smartphones have done much to solve it, even if accuracy remains somewhat limited when large parts of the sky are obstructed. This paper addresses an interesting new twist on the problem: determining the geo-location of an object in the distance by using the phone's camera. This capability would be very useful in navigation and tourism. More generally, it provides a way to tag objects with information that a user can access as long as the object is observable from their vantage point. Of course, we are talking about large objects such as buildings, bridges, mountains, and billboards, not the streetscapes of other work, which can assume a basic scale and compare against a large collection of photos, as in StreetView.
The paper brings together a set of techniques from the computer vision community. Specifically, it leverages 'structure from motion' as well as information from the phone's inertial sensors to triangulate an object from a small set of photographs taken a few paces apart. The authors did some interesting work to integrate all this data so as to minimize the location error, given significant errors from the various sensors. They demonstrate reasonable accuracy using buildings on a campus. Once the geo-location is determined, information about the object can be delivered to the user.
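To illustrate the basic triangulation intuition referenced here (a simplified stand-in for, not a reproduction of, the paper's pipeline), each photograph contributes a camera position and a compass bearing toward the object; the object then lies near the least-squares intersection of the bearing rays. The helper below is hypothetical:

```python
# Sketch only: least-squares intersection of bearing rays in 2D.
import numpy as np

def triangulate(positions, bearings_deg):
    """positions: Nx2 camera locations in local meters (east, north).
    bearings_deg: N compass bearings to the object (degrees clockwise
    from north). Returns the point minimizing the total squared
    perpendicular distance to all bearing rays."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, brg in zip(np.asarray(positions), np.radians(bearings_deg)):
        d = np.array([np.sin(brg), np.cos(brg)])  # unit vector (east, north)
        P = np.eye(2) - np.outer(d, d)            # projects out the ray direction
        A += P
        b += P @ p
    return np.linalg.solve(A, b)
```

With noisy phone compasses and GPS fixes, such ray intersections are highly sensitive to error, which is why the paper integrates structure from motion with the sensor data rather than relying on bearings alone.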
The paper stood out during the review process because it enables an entirely new capability. The methods aim to be general and do not rely on special landmarks in a pre-organized database. The limitations of the procedure are presented honestly, and the early achievements with this technique are not overblown. That said, there are many practical issues with the approach, including the difficulty of using the methods at night or on cloudy days, when features on distant objects are dulled by more diffuse lighting, and the question of how to make the user aware of errors that would retrieve information about the wrong object, and how to let the user correct such errors through manual selection or by taking more photos.
We thank Gaetano Borriello for his comments and generous help throughout the shepherding process.
We broadly agree with the comments in this review. Primarily, the value of this paper lies in defining this new mobile localization problem and making substantial, if not complete, progress toward solving it. The problem is a challenging one and, as we describe in the paper, it necessitated experimenting with a disparate set of techniques before we arrived at our eventual solution.
We agree with the summary of limitations. Yes, the techniques we have borrowed from computer vision hinge on the quality of photographs. Further, yes, accuracy is still shy of fully practical use. In spite of these challenges, we believe our techniques are valid and valuable, and can broadly form the basis of a mature solution that more fully addresses problematic edge cases. In the discussion section, we have outlined some promising extensions for building such a mature solution. For example, as mentioned in the review, a more sophisticated user interface could provide live feedback to help the user take "good" photographs. Other enhancements are less straightforward, and leave the research community ample opportunities for exploration.