Associating objects accurately across cameras and frames is essential but challenging for vision-based perception in autonomous driving systems. Following the vanilla tracking-by-detection paradigm, most prior works associate detected objects across views and time through a large number of heuristic matching rules. In this work, we propose a simple yet efficient method named EMAT, or End-to-end Multi-view Association Tracker, which jointly performs 3D detection and tracking from multi-camera and multi-frame images in an end-to-end manner. Our key design is to predict each object's appearance status and its affinity across sequential frames from temporally fused object query embeddings, which are extracted by a temporal fusion module designed to learn association. Without any post-processing, 3D tracklets are built up across frames, along with 3D detections and velocity estimates. Additionally, we propose a novel strategy that leverages the object appearance information to boost velocity estimation. Experiments on the large-scale nuScenes dataset demonstrate that our approach outperforms the 3D detection baseline we build upon, achieving superior camera-based 3D tracking and velocity estimation performance. It also surpasses traditional 3D tracking methods, showcasing its effectiveness in real-world scenarios.
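As a rough illustration of the association idea summarized above, the sketch below shows how temporally fused query embeddings could be used to predict a per-object appearance (presence) score and an inter-frame affinity matrix, from which tracklets are read off without heuristic matching. This is a minimal sketch under assumed interfaces; the module and head names (`TemporalFusion`, `AssociationHead`, `presence_head`) are hypothetical and do not reflect the paper's actual implementation.

```python
# Hypothetical sketch: fuse per-frame object queries over time, then predict
# presence scores and a cross-frame affinity matrix for association.
# Names and shapes are illustrative assumptions, not the paper's interfaces.
import torch
import torch.nn as nn


class TemporalFusion(nn.Module):
    """Fuses current-frame queries with the previous frame's queries."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries_t: torch.Tensor, queries_prev: torch.Tensor) -> torch.Tensor:
        # queries_t, queries_prev: (batch, num_queries, dim)
        fused, _ = self.cross_attn(queries_t, queries_prev, queries_prev)
        return self.norm(queries_t + fused)


class AssociationHead(nn.Module):
    """Predicts appearance (presence) scores and a cross-frame affinity matrix."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_head = nn.Linear(dim, 1)  # is this query a visible object?
        self.embed_head = nn.Linear(dim, dim)   # embedding used for affinity

    def forward(self, fused_t: torch.Tensor, fused_prev: torch.Tensor):
        presence = self.presence_head(fused_t).squeeze(-1)       # (B, N)
        e_t = self.embed_head(fused_t)                           # (B, N, D)
        e_prev = self.embed_head(fused_prev)                     # (B, N, D)
        affinity = torch.einsum("bnd,bmd->bnm", e_t, e_prev)     # (B, N, N)
        return presence, affinity


if __name__ == "__main__":
    B, N, D = 1, 300, 256
    fusion, head = TemporalFusion(D), AssociationHead(D)
    # In a real pipeline the previous frame's fused queries would be cached;
    # random tensors are used here only to make the sketch runnable.
    q_prev, q_t = torch.randn(B, N, D), torch.randn(B, N, D)
    fused_t = fusion(q_t, q_prev)
    presence, affinity = head(fused_t, q_prev)
    # Tracklets can be read off by matching each current query to the
    # previous-frame query with the highest affinity, without heuristics.
    match = affinity.argmax(dim=-1)
    print(presence.shape, affinity.shape, match.shape)
```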