We consider indoor 3D object detection from a single RGB(-D) frame acquired with a commodity handheld device. We seek to significantly advance the state of the art with respect to both data and modeling. First, we establish that existing datasets have significant limitations in scale, accuracy, and diversity of objects. As a result, we introduce the Cubify Anything 1M (CA-1M) dataset, which exhaustively labels over 400K 3D objects on over 1K highly accurate laser-scanned scenes with near-perfect registration to over 3.5K handheld, egocentric captures. Next, we establish Cubify Transformer (CuTR), a fully Transformer-based 3D object detection baseline which, rather than operating in 3D on point- or voxel-based representations, predicts 3D boxes directly from 2D features derived from RGB(-D) inputs. While this approach lacks any 3D inductive biases, we show that, paired with CA-1M, CuTR outperforms point-based methods, accurately recalling over 62% of objects in 3D, and is significantly more capable of handling the noise and uncertainty present in commodity LiDAR-derived depth maps, while also providing promising RGB-only performance without architecture changes. Furthermore, by pre-training on CA-1M, CuTR can outperform point-based methods on a more diverse variant of SUN RGB-D, supporting the notion that while 3D inductive biases are useful at the smaller scales of existing datasets, they fail to scale to the data-rich regime of CA-1M. Overall, this dataset and baseline model provide strong evidence that we are moving towards models which can effectively Cubify Anything.
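To make the detection paradigm concrete, the sketch below shows a minimal DETR-style model that predicts 3D boxes directly from 2D features of an RGB(-D) frame, with no point clouds or voxels. This is not the authors' CuTR implementation: the class name `Toy2DTo3DDetector`, the layer counts, the query count, and the box parameterization (center, size, yaw) are all illustrative assumptions.

```python
# Minimal sketch of "2D features in, 3D boxes out" detection, assuming a
# DETR-style query decoder. NOT the actual CuTR architecture; all names,
# shapes, and hyperparameters are illustrative.
import torch
import torch.nn as nn


class Toy2DTo3DDetector(nn.Module):
    def __init__(self, in_channels=4, dim=256, num_queries=100, patch=16):
        super().__init__()
        # Patchify the RGB(-D) frame into 2D feature tokens (4 channels = RGB + depth).
        self.patch_embed = nn.Conv2d(in_channels, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Learned object queries attend to the 2D tokens; no 3D inductive biases.
        self.queries = nn.Embedding(num_queries, dim)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        # Each query regresses a 3D box, here center (3) + size (3) + yaw (1),
        # plus an objectness score.
        self.box_head = nn.Linear(dim, 7)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, rgbd):  # rgbd: (B, 4, H, W)
        tokens = self.patch_embed(rgbd).flatten(2).transpose(1, 2)  # (B, N, dim)
        memory = self.encoder(tokens)
        q = self.queries.weight.unsqueeze(0).expand(rgbd.size(0), -1, -1)
        hs = self.decoder(q, memory)
        return self.box_head(hs), self.score_head(hs).sigmoid()


# Usage: one forward pass on a dummy batch of two 224x224 RGB-D frames.
boxes, scores = Toy2DTo3DDetector()(torch.randn(2, 4, 224, 224))
print(boxes.shape, scores.shape)  # (2, 100, 7) and (2, 100, 1)
```

Because depth enters only as an extra input channel here, dropping to RGB-only input is a one-argument change (`in_channels=3`), which mirrors the abstract's claim that RGB-only operation requires no architecture changes.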