Multiple Input-Multiple Output (MIMO) systems are of great interest due to their ability to significantly increase the capacity of wireless communications systems, but for these to be useful they must also be practical for implementation in very large scale integrated (VLSI) circuits. A particularly difficult part of these systems is the detector, where the maximum-likelihood (ML) solution cannot be directly implemented due to its exponential complexity. Lattice decoders, such as the sphere search, exhibit near-ML performance with reduced complexity, but their application is still limited by computational requirements. Here, a number of optimisations are presented, designed to reduce the computational cost of the sphere search in the context of VLSI implementation for MIMO applications. We also propose parallel implementation strategies for such a detector, suitable for implementation in VLSI. This is then combined with a single-pass tree search approach and it is demonstrated that it can be designed so that the error-rate performance is not significantly impaired.