Hi ReassuredTiger98,
Do you see the "Starting upload:..." message in the log?
As far as I know the automatic binding uses async upload, which should be verbose.
Yea, and the script ends with "clearml.Task - INFO - Waiting to finish uploads".
I use torch.save to store a very large model, so it hangs forever when it uploads the model. Is there some flag to show a progress bar?
I'm assuming the upload is http upload (e.g. the default files server)?
If this is the case, the main issue is that we do not have callbacks on http uploads to update the progress (which I would love a PR for, but this is actually a "requests" issue).
I think we had a draft somewhere, but I'm not sure ...
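For reference, one common way to get progress out of an http upload is to wrap the file object so that every read reports how many bytes have gone out. The following is only a minimal sketch of that idea, not the draft mentioned above and not ClearML's implementation; `ProgressReader` and its callback signature are hypothetical names chosen for illustration.

```python
import io


class ProgressReader:
    """File-like wrapper that reports how many bytes have been read so far.

    Hypothetical helper for illustration only; not part of ClearML or requests.
    """

    def __init__(self, fileobj, total_bytes, callback):
        self._fileobj = fileobj
        self._total = total_bytes
        self._sent = 0
        self._callback = callback  # called as callback(sent_bytes, total_bytes)

    def read(self, size=-1):
        chunk = self._fileobj.read(size)
        if chunk:
            self._sent += len(chunk)
            self._callback(self._sent, self._total)
        return chunk

    def __len__(self):
        # Lets an HTTP client that inspects len(body) learn the upload size
        return self._total


# Demo on an in-memory buffer: read 10 bytes in chunks of 4
progress = []
reader = ProgressReader(
    io.BytesIO(b"x" * 10), 10, lambda sent, total: progress.append((sent, total))
)
while reader.read(4):
    pass
print(progress)
```

An HTTP client that streams a file-like request body (the "requests" library documents this for objects passed as `data=`) would call `read()` repeatedly, so the callback could drive a progress bar. Whether this fits into ClearML's upload path is an open question here.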
Yea, correct! No problem. Uploading such large artifacts as I am doing seems to be an absolute edge case 🙂
An upload of 11GB took around 20 hours which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not I am going to start debugging with the hardware/network.
That is very, very slow. This is about 152 KB/s (roughly 1.2 Mbit/s)...
ReassuredTiger98 after 20 hours, was it done uploading?
What do you see in the Task resource monitoring? (Notice there is a network_tx_mbs metric that, according to this, should be around 0.152.)
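The arithmetic behind that figure, as a quick sketch. It assumes decimal units (1 GB = 1000 MB) and a constant rate over the whole 20 hours; the function name is just for illustration.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / (hours * 3600)


rate = throughput_mb_per_s(11, 20)
# Multiply by 8 to convert megabytes per second to megabits per second
print(f"{rate:.3f} MB/s = {rate * 8:.2f} Mbit/s")
```

So 11 GB in 20 hours works out to roughly 0.15 MB/s, which is what the network_tx_mbs graph would be expected to show during the upload.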
Yea, it was finished after 20 hours. Since the artifact only started uploading once the experiment had finished, there is no resource reporting for the time during which it uploaded. I will debug it and report what I find out.
So my network seems to be fine. Downloading artifacts from the server to the agents is around 100 MB/s, while uploading from the agent to the server is slow.
I see a "python3 fileserver.py" process running on a single thread with 100% load.
I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.
I'm assuming you need multiple "file-server" instances running on the "clearml-server" with a load-balancer of a sort...
It is only a single agent that is sending a single artifact. server-->agent is fast, but agent-->server is slow.
Seems more like a bug or something is not properly configured on my side.
Then multiple connections will not help; this is the bottleneck of the upload speed of your machine, regardless of what the target is (file-server, S3, etc...)
But it is not related to network speed, rather to clearml. A simple file transfer test gives me approximately 1 Gbit/s transfer rate between the server and the agent, which is to be expected from the 1 Gbit/s network.
The agent and server also have similar hardware, so I would expect the same read/write speed.
Ohhh I missed that. What speed do you get when uploading artifacts to the server? (You can test it with simple toy artifact upload code.)
```python
import time

import numpy as np
from clearml import Task
from clearml.config import running_remotely

# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="artifacts example")
task.set_base_docker(
    "my_docker",
    docker_arguments="--memory=60g --shm-size=60g -e NVIDIA_DRIVER_CAPABILITIES=all",
)
if not running_remotely():
    # Enqueue this task on the "docker" queue and stop the local process
    task.execute_remotely("docker", clone=False, exit_process=True)

# Time the artifact upload (time.perf_counter stands in for the original Timer helper)
start = time.perf_counter()
# add and upload Numpy Object (stored as .npz file)
task.upload_artifact("Numpy Eye", np.eye(100000, 100000))
print(time.perf_counter() - start)

# we are done
print("Done")
```
Agent runs in docker mode. I ran the agent on the same machine as the server this time.
481.2130692792125 seconds
This is very slow.
It makes no sense, it cannot be the network (this is basically an http post, and I'm assuming both machines are on the same LAN, correct?)
My guess is the filesystem on the clearml-server... Are you having any other performance issues?
(I'm thinking HD degradation, which could lead to slow write speeds, which would affect Elastic/Mongo as well.)
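One rough way to check the slow-write theory is to time a sequential write into the directory the file server stores its data in. This is only a sketch: `measure_write_mb_per_s` is a hypothetical helper, the sizes are arbitrary, and the demo writes to the system temp directory because the actual data folder path depends on the installation.

```python
import os
import tempfile
import time


def measure_write_mb_per_s(directory: str, size_mb: int = 256, chunk_mb: int = 4) -> float:
    """Write about size_mb of random data to a temp file in `directory`; return MiB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = max(1, size_mb // chunk_mb)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(n_chunks):
            f.write(chunk)
        f.flush()
        # Force the data to disk so the page cache does not hide slow writes
        os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    return n_chunks * chunk_mb / elapsed


if __name__ == "__main__":
    # Replace with the file server's data folder on the server machine (assumption)
    target = tempfile.gettempdir()
    print(f"{measure_write_mb_per_s(target):.1f} MiB/s")
```

If the number is far below what the SSD should deliver, the storage (or the way it is mounted) is a more likely culprit than the network.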
ReassuredTiger98 is it possible the fileserver component's data folder mount is incorrect? This would mean the docker FS is used and can maybe account for the low performance?
AgitatedDove14 Yea, I also had this problem: https://github.com/allegroai/clearml-server/issues/87 I have a Samsung 970 Pro 2TB on all machines, but maybe something is misconfigured like SuccessfulKoala55 suggested. I will take a look. Thank you for now!