I haven’t looked into the issue of PCIe lanes and the GPU.
I don’t think it should matter with a smaller PCIe bus, in theory, if I understand correctly (unlikely). The only time a lot of data is transferred is when the model layers are initially loaded. Like with Oobabooga when I load a model, most of the time my desktop RAM monitor widget does not even have the time to refresh and tell me how much memory was used on the CPU side. What is loaded in the GPU is around 90% static. I have a script that monitors this so that I can tune the maximum number of layers. I leave overhead room for the context to build up over time but there are no major changes happening aside from initial loading. One just sets the number of layers to offload on the GPU and loads the model. However many seconds that takes is irrelevant startup delay that only happens once when initiating the server.
So assuming the kernel modules and hardware support the more narrow bandwidth, it should work… I think. There are laptops that have options for an external FireWire GPU too, so I don’t think the PCIe bus is too baked in.
deleted by creator