Answered

Hi Everyone, Is It Possible To Show The Upload Progress Of Artifacts?

Hi everyone, is it possible to show the upload progress of artifacts? E.g. I use torch.save to store a very large model, and the script appears to hang forever while it uploads the model. Is there a flag to show a progress bar?

  				
Posted 3 years ago by ReassuredTiger98


Answers 26

I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.

I'm assuming you need multiple "file-server" instances running on the "clearml-server" with a load-balancer of a sort...

  				
Posted 3 years ago by AgitatedDove14

Seems more like a bug or something is not properly configured on my side.

  				
Posted 3 years ago by ReassuredTiger98

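```
import time

import numpy as np
from clearml import Task
from clearml.config import running_remotely


class Timer:
    # Stand-in for the poster's Timer helper, which is not shown in the thread
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.duration = time.perf_counter() - self._start


# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="artifacts example")
task.set_base_docker(
    "my_docker",
    docker_arguments="--memory=60g --shm-size=60g -e NVIDIA_DRIVER_CAPABILITIES=all",
)

if not running_remotely():
    task.execute_remotely("docker", clone=False, exit_process=True)

timer = Timer()
with timer:
    # add and upload Numpy Object (stored as .npz file)
    task.upload_artifact("Numpy Eye", np.eye(100000, 100000))

print(timer.duration)

# we are done
print("Done")
```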

  				
Posted 3 years ago by ReassuredTiger98

481.2130692792125 seconds

This is very slow.
It makes no sense; it cannot be the network (this is basically an HTTP POST, and I'm assuming both machines are on the same LAN, correct?).
My guess is the filesystem on the clearml-server... Are you having any other performance issues?
(I'm thinking of HD degradation, which could lead to slow write speeds, which would affect Elastic/Mongo as well.)
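One way to check the slow-write theory is to time a sequential write on the server's disk. The snippet below is only a rough sketch: the function name `measure_write_speed` and its parameters are made up for illustration, and the directory to test (ideally the one the file server stores its data in) is an assumption you need to fill in yourself.

```python
import os
import tempfile
import time


def measure_write_speed(directory, total_mb=64, chunk_mb=4):
    """Write total_mb of random data into directory and return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = total_mb // chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temporary file
    return (n_chunks * chunk_mb) / elapsed


# Example call (the path is a placeholder, not taken from the thread):
# print(f"{measure_write_speed('/path/to/fileserver/data'):.1f} MB/s")
```

A healthy SSD should report a figure far above the sub-1 MB/s rate seen in the upload; a very low number here would point at storage rather than ClearML.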

  				
Posted 3 years ago by AgitatedDove14

server-->agent is fast, but agent-->server is slow.

Then multiple connections will not help; this is the bottleneck of the upload speed of your machine, regardless of what the target is (file-server, S3, etc.).

  				
Posted 3 years ago by AgitatedDove14

Simple file transfer test gives me approximately 1 Gbit/s transfer rate between the server and the agent, which is to be expected from the 1 Gbit/s network.

Ohhh I missed that. What speed do you get when uploading artifacts to the server? (You can test it with simple toy artifact upload code.)

  				
Posted 3 years ago by AgitatedDove14

Artifact Size: 74.62 MB

  				
Posted 3 years ago by ReassuredTiger98

It is only a single agent that is sending a single artifact. server-->agent is fast, but agent-->server is slow.

  				
Posted 3 years ago by ReassuredTiger98

The agent and server also have similar hardware, so I would expect the same read/write speed.

  				
Posted 3 years ago by ReassuredTiger98

Agent runs in docker mode. I ran the agent on the same machine as the server this time.

  				
Posted 3 years ago by ReassuredTiger98

I see a python3 fileserver.py process running on a single thread at 100% load.

  				
Posted 3 years ago by ReassuredTiger98

Yea, it finished after 20 hours. Since the artifact only started uploading once the experiment had otherwise finished, there is no reporting for the time during which it uploaded. I will debug it and report what I find out.

  				
Posted 3 years ago by ReassuredTiger98

ReassuredTiger98 after 20 hours, was it done uploading?
What do you see in the Task resource monitoring? (Notice there is a network_tx_mbs metric, which according to this should be around 0.152.)

  				
Posted 3 years ago by AgitatedDove14

481.2130692792125 seconds
Done

  				
Posted 3 years ago by ReassuredTiger98

Hi ReassuredTiger98 ,
Do you see the Starting upload:... message in the log?

  				
Posted 3 years ago by SuccessfulKoala55

An upload of 11GB took around 20 hours which cannot be right.

That is very, very slow: this is about 152 KB/s...
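For reference, the rate can be worked out from the numbers quoted in this thread. This is just back-of-the-envelope arithmetic using decimal units (1 MB = 10**6 bytes); it assumes the 74.62 MB artifact size reported elsewhere in the thread belongs to the same toy run as the 481 s timing, which the thread does not state explicitly.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return size_mb / seconds


# 11 GB uploaded in roughly 20 hours
large = throughput_mb_per_s(11 * 1000, 20 * 3600)
# 74.62 MB artifact from the toy script, timed at about 481.21 s
toy = throughput_mb_per_s(74.62, 481.21)
# A 1 Gbit/s link moves at most 1000 / 8 = 125 MB/s
link = 1000 / 8

print(f"large upload: {large:.3f} MB/s")  # 0.153
print(f"toy upload:   {toy:.3f} MB/s")    # 0.155
print(f"link limit:   {link:.0f} MB/s, about {link / large:.0f}x faster")  # 125 MB/s, 818x
```

Both uploads land at roughly 0.15 MB/s, which suggests a fixed per-byte cost somewhere in the upload path and not a problem tied to one particular file.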

  				
Posted 3 years ago by AgitatedDove14

I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.

  				
Posted 3 years ago by ReassuredTiger98

Yea, correct! No problem. Uploading such large artifacts as I am doing seems to be an absolute edge case 🙂

  				
Posted 3 years ago by ReassuredTiger98

Yea, and the script ends with clearml.Task - INFO - Waiting to finish uploads

  				
Posted 3 years ago by ReassuredTiger98

An upload of 11GB took around 20 hours which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not I am going to start debugging with the hardware/network.

  				
Posted 3 years ago by ReassuredTiger98

ReassuredTiger98 is it possible the fileserver component's data folder mount is incorrect? This would mean the docker FS is used and can maybe account for the low performance?

  				
Posted 3 years ago by SuccessfulKoala55

So my network seems to be fine. Downloading artifacts from the server to the agents is around 100 MB/s, while uploading from the agent to the server is slow.

  				
Posted 3 years ago by ReassuredTiger98

AgitatedDove14 Yea, I also had this problem: https://github.com/allegroai/clearml-server/issues/87 I have a Samsung 970 Pro 2TB on all machines, but maybe something is misconfigured, like SuccessfulKoala55 suggested. I will take a look. Thank you for now!

  				
Posted 3 years ago by ReassuredTiger98

But it is not related to network speed, rather to ClearML. A simple file transfer test gives me approximately 1 Gbit/s transfer rate between the server and the agent, which is to be expected from the 1 Gbit/s network.

  				
Posted 3 years ago by ReassuredTiger98

As far as I know the automatic binding uses async upload, which should be verbose

  				
Posted 3 years ago by SuccessfulKoala55

I use torch.save to store some very large model, so it hangs forever when it uploads the model. Is there some flag to show a progress bar?

I'm assuming the upload is an HTTP upload (e.g. to the default files server)?
If this is the case, the main issue is that we do not have callbacks on HTTP uploads to update the progress (which I would love a PR for, but this is actually a "requests" issue).
I think we had a draft somewhere, but I'm not sure ...
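For anyone wanting to experiment with the missing-callback problem, a common workaround is to wrap the file object so that every `read()` reports progress. The sketch below is not ClearML code and not the draft mentioned above; `ProgressReader` and `print_progress` are hypothetical names, and the demo reads an in-memory payload instead of performing a real upload.

```python
import io
import os


class ProgressReader:
    """File-like wrapper that reports how many bytes have been read so far."""

    def __init__(self, fileobj, total, callback):
        self._fileobj = fileobj
        self._total = total
        self._seen = 0
        self._callback = callback

    def __len__(self):
        # Lets a consumer learn the body size up front
        return self._total

    def read(self, size=-1):
        data = self._fileobj.read(size)
        self._seen += len(data)
        self._callback(self._seen, self._total)
        return data


def print_progress(seen, total):
    print(f"\ruploaded {seen / total:6.1%}", end="", flush=True)


# Local demonstration: consume the payload in 64 KiB chunks, as a client would
payload = io.BytesIO(os.urandom(1_000_000))
reader = ProgressReader(payload, 1_000_000, print_progress)
while reader.read(64 * 1024):
    pass
print()
```

An HTTP client that streams a file-like request body (for example by passing the wrapper as `data=` to requests) would then trigger the callback as it sends each chunk. Whether this can be plugged into ClearML's own upload path is an assumption the thread does not confirm.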

  				
Posted 3 years ago by AgitatedDove14
