Today, Cerebras Systems launched its new AI inference solution, Cerebras Inference, which the company says is the fastest AI inference solution in the world. The service delivers 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, which Cerebras says is 20x faster than NVIDIA GPU-based hyperscale clouds.
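To put those throughput figures in perspective, here is a minimal back-of-the-envelope sketch in Python. The 500-token response length is a hypothetical example, and the GPU baseline is simply the claimed 20x slowdown applied to the 70B figure, not an independent measurement:

```python
# Rough single-response latency implied by the quoted throughput numbers.
RESPONSE_TOKENS = 500  # hypothetical completion length

rates = {
    "Cerebras, Llama 3.1 8B": 1800,   # tokens/second (quoted)
    "Cerebras, Llama 3.1 70B": 450,   # tokens/second (quoted)
    "GPU cloud, Llama 3.1 70B (claimed 20x slower)": 450 / 20,
}

for name, tps in rates.items():
    print(f"{name}: {RESPONSE_TOKENS / tps:.2f} s for {RESPONSE_TOKENS} tokens")
```

By this arithmetic, a 500-token answer arrives in under a third of a second on the 8B model, versus more than 20 seconds on the claimed GPU baseline for 70B.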
Llama 3.1 Tokens Verified by Cerebras
Unlike approaches that sacrifice accuracy to boost performance, Cerebras delivers its speed while maintaining industry-leading accuracy by keeping the entire inference process in the 16-bit domain.
While Cerebras scored well across benchmarks, Cerebras Inference costs a fraction of what GPU-based rivals charge: under the pay-as-you-go model, 10 cents per million tokens for Llama 3.1 8B and 60 cents per million tokens for Llama 3.1 70B.
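At those rates, per-workload costs are simple to estimate. A minimal sketch using the quoted prices (the model labels and the 2M-token workload are illustrative placeholders, not necessarily the service's actual model identifiers):

```python
# Pay-as-you-go cost estimate at the quoted per-million-token prices.
PRICE_PER_M_TOKENS = {
    "Llama 3.1 8B": 0.10,   # USD per 1M tokens (quoted)
    "Llama 3.1 70B": 0.60,  # USD per 1M tokens (quoted)
}

def cost_usd(model: str, tokens: int) -> float:
    """Cost of processing `tokens` tokens at the quoted rate."""
    return PRICE_PER_M_TOKENS[model] * tokens / 1_000_000

# Example: a 2M-token workload on each model.
for model in PRICE_PER_M_TOKENS:
    print(f"{model}: ${cost_usd(model, 2_000_000):.2f} for 2M tokens")
```

That works out to $0.20 for 2 million tokens on the 8B model and $1.20 on the 70B model.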
Cerebras addresses the inherent memory bandwidth problem of GPUs, where the model weights must be moved to the compute cores for every output token. This limits inference speed, especially for larger models such as Llama 3.1 70B, which has 70 billion parameters and a roughly 140GB memory footprint.
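The bandwidth bottleneck is easy to quantify. Here is a minimal sketch of the upper bound it imposes; the 3.3 TB/s figure is an assumed HBM bandwidth for a current-generation data-center GPU, used only for illustration:

```python
# Upper bound on single-stream decoding speed when every output token
# requires streaming all model weights from memory to the compute cores.

PARAMS = 70e9             # Llama 3.1 70B parameter count
BYTES_PER_PARAM = 2       # 16-bit weights
HBM_BANDWIDTH = 3.3e12    # bytes/s; assumed GPU memory bandwidth (illustrative)

model_bytes = PARAMS * BYTES_PER_PARAM          # ~140 GB, matching the article
max_tokens_per_s = HBM_BANDWIDTH / model_bytes  # bandwidth-bound ceiling

print(f"Model size: {model_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s per GPU")
```

Under these assumptions, single-stream decoding on such a GPU is capped at roughly two dozen tokens per second, which is the limitation the memory bandwidth argument refers to.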
Beyond the performance claims, Cerebras is marketing its service as cheaper than existing solutions. The company said pricing begins at $0.10 per million tokens, which it claims amounts to 100x higher price-performance for AI inference.