After the arm is calibrated, the pose and structure of the Lincoln Log
assembly are determined solely from visual input. The structure is then
disassembled.
Disassembly of a Lincoln Log structure from multiple views and linguistic input
After the arm is calibrated, the pose and structure of the Lincoln Log
assembly are determined. One view is insufficient to disambiguate the
structure, but a second view resolves the ambiguity. The second view is
then discarded and a linguistic constraint is applied instead, which also
disambiguates the structure. The structure thus determined is then
disassembled.
Lincoln Log structure estimation from a single image
Once we have determined the pose of a Lincoln Log assembly (left), we can
correctly determine the types and positions of the logs (shown in green) that
constitute the assembly (right).
Structure estimation from spatially distinct views
Due to occlusion, a single view (left) may provide insufficient information
to support correct structure estimation (the false negative shown in orange).
Integrating information from a second view (right) of the same structure,
prior to disassembly, can correct the error.
Correctly determined absence of logs is shown in blue.
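
One way to picture how a second view can correct a false negative is to fuse per-view evidence for each log position. The following is a minimal sketch, not the method used in this work: it assumes each view yields a hypothetical probability that a log is present at each grid position, that views are conditionally independent, and that the prior is uniform, so the log-odds of the views simply add. The names `fuse_views`, `left`, and `right`, and all numbers, are illustrative.

```python
import math

def log_odds(p):
    """Convert a probability into log-odds."""
    return math.log(p / (1.0 - p))

def fuse_views(view_probs):
    """Fuse per-view P(log present) for each position.

    view_probs: list of dicts mapping position -> probability, one per view.
    Assumes independent views and a uniform prior (illustrative assumptions).
    """
    positions = set().union(*view_probs)
    fused = {}
    for pos in positions:
        total = sum(log_odds(v[pos]) for v in view_probs if pos in v)
        fused[pos] = 1.0 / (1.0 + math.exp(-total))
    return fused

# Hypothetical evidence: position (1, 0) is occluded in the left view,
# so that view alone weakly (and wrongly) suggests the log is absent.
left = {(0, 0): 0.90, (1, 0): 0.40}
right = {(0, 0): 0.85, (1, 0): 0.95}

single_view = {pos for pos, p in left.items() if p > 0.5}
two_views = {pos for pos, p in fuse_views([left, right]).items() if p > 0.5}
```

With these made-up numbers, the left view alone misses the log at `(1, 0)`, while the fused estimate recovers it.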
Structure estimation from temporally distinct views
Another way to recover occluded information is to begin the task of
disassembly with partial information (left) and then reimage the structure from
the same camera pose part-way through disassembly after the occlusion has been
eliminated.
The information from two temporally distinct views of distinct assembly states
can be integrated to yield a correct model of the initial structure (right).
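
The integration of temporally distinct views can be sketched as follows. This is a simplified illustration under stated assumptions rather than the system's actual procedure: each view is represented as a hypothetical mapping from position to `True` (log present), `False` (log absent), or `None` (occluded), and the arm is assumed to remove only logs it already knew about, so any position that was occluded initially still shows its original state when reimaged.

```python
def integrate_temporal(initial, later):
    """Merge an initial partial model with a later view from the same camera pose.

    initial, later: dicts mapping position -> True, False, or None (occluded).
    Positions known in the initial view keep their initial state, because
    partial disassembly may since have removed those logs. Positions occluded
    initially are filled in from the later view (assumed undisturbed).
    """
    model = {}
    for pos, state in initial.items():
        model[pos] = state if state is not None else later.get(pos)
    return model

# Hypothetical example: "back" is hidden behind "front" in the first image.
initial = {"front": True, "back": None, "side": False}
# After the arm removes "front", the same camera pose now sees "back".
later = {"front": False, "back": True, "side": False}

initial_structure = integrate_temporal(initial, later)
```

Here `initial_structure` describes the structure as it was before disassembly began, even though no single image showed every position.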
Structure estimation given constraints
Occluded information can be recovered from a single image by constraining the
space of possible structures (in this case, by specifying the piece
inventory).
Our goal is for multiple agents to communicate such constraints linguistically
and infer such constraints through high-level reasoning.
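
The effect of an inventory constraint can be illustrated with a small brute-force search. This is a sketch under assumptions, not the estimator used here: each position carries hypothetical per-label scores from a single image, candidate structures are enumerated exhaustively, and only those whose piece counts match the stated inventory are kept. The labels `"none"` and `"short"` and the function `best_structure` are invented for illustration.

```python
from collections import Counter
from itertools import product
import math

def best_structure(scores, inventory=None):
    """Return the highest-scoring label assignment, optionally under an inventory.

    scores: dict mapping position -> {label: probability}, with "none" for no log.
    inventory: dict mapping piece label -> exact count required (nonzero counts).
    """
    positions = sorted(scores)
    label_options = [sorted(scores[p]) for p in positions]
    best, best_score = None, -math.inf
    for assignment in product(*label_options):
        if inventory is not None:
            used = Counter(label for label in assignment if label != "none")
            if used != Counter(inventory):
                continue  # candidate violates the piece inventory
        score = sum(math.log(scores[p][l]) for p, l in zip(positions, assignment))
        if score > best_score:
            best, best_score = dict(zip(positions, assignment)), score
    return best

# Hypothetical single-image evidence: position "b" is occluded,
# so the image alone slightly favors "none" there.
scores = {
    "a": {"none": 0.1, "short": 0.9},
    "b": {"none": 0.6, "short": 0.4},
}

unconstrained = best_structure(scores)
constrained = best_structure(scores, inventory={"short": 2})
```

Without the constraint, the occluded log at `"b"` is missed. Knowing that the structure contains two short logs rules out that interpretation. Exhaustive enumeration is only practical for very small structures and is used here purely for clarity.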
From language to motor control to vision to language
One reason for grounding language is to allow information to transfer
between modalities. Language can drive motor control: we give the system
a sentence (Seven doors exist), and it finds a structure that satisfies
the sentence. Once built, the structure is recognized by the vision
system, which produces the same sentence. "Seven doors exist" fully
describes this structure given its size.