Why is that? I mean why does the locally run version suck?
Because it has fewer parameters and (in some cases) it’s quantized. The hardware needed to run the full model locally is out of reach for most people. That said, its release will probably still have a broad impact on the quality of upcoming smaller models, whether they are distilled from it, trained on synthetic data generated by it, or merged with it.
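To put rough numbers on the hardware point, here is a back-of-the-envelope sketch of how much memory the weights alone take at a given parameter count and precision. The parameter counts below are made-up round figures for illustration, not the sizes of any particular model, and `weight_memory_gb` is just a helper name I picked. It also ignores activations, the KV cache, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed just to store the weights, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical sizes for illustration only; not figures for any specific model.
examples = [("small local model", 7e9), ("large full model", 600e9)]

for name, params in examples:
    for bits in (16, 4):  # 16-bit weights vs. 4-bit quantized weights
        gb = weight_memory_gb(params, bits)
        print(f"{name}: {bits}-bit weights need roughly {gb:.1f} GB")
```

Under those assumptions, quantizing to 4 bits cuts the weight footprint to a quarter, which is why small quantized models fit on consumer hardware while a full-size model at 16-bit precision generally does not.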