Multiuser detection (MUD) strategies have the potential to significantly increase the capacity of wireless communications systems, but for these to be useful they must also be practical for implementation in very large scale integrated (VLSI) circuits. In particular, while the maximum-likelihood (ML) solution is optimal in bit error rate, it cannot be directly implemented due to its exponential computational complexity. Lattice decoders, such as the sphere search, exhibit near-optimal ML performance with reduced complexity, but their application is still limited by computational requirements. Here, we present a number of optimisations designed to reduce the computational cost of the sphere search, in the context of VLSI implementation. We also propose parallel implementation strategies for such a detector, suitable for implementation in VLSI. We then combine this with a single-pass tree search approach that can be designed to not significantly impair error-rate performance. While our design is targeted towards a MUD application, the concepts may also be applied to a multiple-input multiple-output (MIMO) system, or similar applications.