Answered

Hi everyone, is it possible to show the upload progress of artifacts? E.g. I use torch.save to store some very large model, so it hangs forever when it uploads the model. Is there some flag to show a progress bar?

  
  
Posted 3 years ago

Answers 26


I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.

I'm assuming you need multiple "file-server" instances running on the "clearml-server" with a load-balancer of a sort...

  
  
Posted 3 years ago

Seems more like a bug or something is not properly configured on my side.

  
  
Posted 3 years ago

```
# Connecting ClearML with the current process;
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="artifacts example")
task.set_base_docker(
    "my_docker",
    docker_arguments="--memory=60g --shm-size=60g -e NVIDIA_DRIVER_CAPABILITIES=all",
)

if not running_remotely():
    task.execute_remotely("docker", clone=False, exit_process=True)

timer = Timer()
with timer:
    # add and upload a NumPy object (stored as .npz file)
    task.upload_artifact("Numpy Eye", np.eye(100000, 100000))

print(timer.duration)

# we are done
print("Done")
```

  
  
Posted 3 years ago

481.2130692792125 seconds

This is very slow.
It makes no sense; it cannot be the network (this is basically an HTTP POST, and I'm assuming both machines are on the same LAN, correct?).
My guess is the filesystem on the clearml-server... Are you having any other performance issues?
(I'm thinking HD degradation, which could lead to slow write speeds and would affect Elastic/Mongo as well.)
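If you want to rule out disk degradation, you can time a large sequential write on the clearml-server's data volume. A rough sketch (the path and sizes here are just placeholders; run it on the disk that backs the fileserver):

```python
import os
import time

def write_speed_mb_s(path="./disk_test.bin", size_mb=256, chunk_mb=16):
    """Time a large sequential write and return throughput in MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure data actually hits the disk
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed

# e.g.: print(f"{write_speed_mb_s():.1f} MB/s")
```

A healthy SSD should report hundreds of MB/s here; anything near the ~0.15 MB/s seen above would point at the disk rather than at ClearML.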

  
  
Posted 3 years ago

server-->agent is fast, but agent-->server is slow.

Then multiple connections will not help; this is the bottleneck of the upload speed of your machine, regardless of the target (file-server, S3, etc.)

  
  
Posted 3 years ago

Simple file transfer test gives me approximately 1 GBit/s transfer rate between the server and the agent, which is to be expected from the 1Gbit/s network.

Ohhh, I missed that. What is the speed you get for uploading the artifacts to the server? (You can test it with simple toy artifact upload code.)

  
  
Posted 3 years ago

Artifact Size: 74.62 MB

  
  
Posted 3 years ago

It is only a single agent that is sending a single artifact. server-->agent is fast, but agent-->server is slow.

  
  
Posted 3 years ago

The agent and server have similar hardware also. So I would expect same read/write speed.

  
  
Posted 3 years ago

Agent runs in docker mode. I ran the agent on the same machine as the server this time.

  
  
Posted 3 years ago

I see a python3 fileserver.py process running on a single thread at 100% load.

  
  
Posted 3 years ago

Yea, it was finished after 20 hours. Since the artifact only starts uploading when the experiment finishes, there is no reporting for the time during which it uploaded. I will debug it and report what I find out.

  
  
Posted 3 years ago

ReassuredTiger98 after 20 hours, was it done uploading?
What do you see in the Task resource monitoring? (Notice there is a network_tx_mbs metric, which should read about 0.152 according to this.)

  
  
Posted 3 years ago

481.2130692792125 seconds
Done

  
  
Posted 3 years ago

Hi ReassuredTiger98 ,
Do you see the Starting upload:... message in the log?

  
  
Posted 3 years ago

An upload of 11GB took around 20 hours which cannot be right.

That is very, very slow; this is about 152 KB/s ...

  
  
Posted 3 years ago

I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.

  
  
Posted 3 years ago

Yea, correct! No problem. Uploading such large artifacts as I am doing seems to be an absolute edge case 🙂

  
  
Posted 3 years ago

Yea, and the script ends with clearml.Task - INFO - Waiting to finish uploads

  
  
Posted 3 years ago

An upload of 11GB took around 20 hours which cannot be right. Do you have any idea whether ClearML could have something to do with this slow upload speed? If not I am going to start debugging with the hardware/network.

  
  
Posted 3 years ago

ReassuredTiger98 is it possible the fileserver component's data folder mount is incorrect? This would mean the docker FS is used, which might account for the low performance.

  
  
Posted 3 years ago

So my network seems to be fine. Downloading artifacts from the server to the agents is around 100 MB/s, while uploading from the agent to the server is slow.

  
  
Posted 3 years ago

AgitatedDove14 Yea, I also had this problem: https://github.com/allegroai/clearml-server/issues/87 I have a Samsung 970 Pro 2TB on all machines, but maybe something is misconfigured like SuccessfulKoala55 suggested. I will take a look. Thank you for now!

  
  
Posted 3 years ago

But it is not related to network speed, rather to ClearML. A simple file transfer test gives me approximately 1 Gbit/s transfer rate between the server and the agent, which is to be expected from the 1 Gbit/s network.
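To isolate raw network throughput from the ClearML stack, a plain TCP push test can be sketched like this (host/port are placeholders; run the receiver on the server and the sender on the agent):

```python
import socket
import time

def send_throughput(host, port, total_mb=256, chunk_mb=4):
    """Push total_mb of zeros over a TCP socket and return MB/s."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    with socket.create_connection((host, port)) as sock:
        start = time.perf_counter()
        for _ in range(total_mb // chunk_mb):
            sock.sendall(chunk)
        elapsed = time.perf_counter() - start
    return total_mb / elapsed

def receive_all(port):
    """Accept one connection and drain it (run this on the server side)."""
    with socket.socket() as srv:
        srv.bind(("0.0.0.0", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            while conn.recv(1 << 20):
                pass
```

On a 1 Gbit/s LAN this should report on the order of 100 MB/s; if it does while artifact uploads crawl at ~0.15 MB/s, the bottleneck is not the wire.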

  
  
Posted 3 years ago

As far as I know, the automatic binding uses async upload, which should be verbose.

  
  
Posted 3 years ago

I use torch.save to store some very large model, so it hangs forever when it uploads the model. Is there some flag to show a progress bar?

I'm assuming the upload is an HTTP upload (e.g. the default files server)?
If that is the case, the main issue is that we do not have callbacks on HTTP uploads to update the progress (I would love a PR for that, but this is actually a requests limitation).
I think we had a draft somewhere, but I'm not sure ...
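In the meantime, a common workaround for streaming HTTP uploads is to wrap the file in a file-like object whose read() reports progress, and hand that to the HTTP client as the request body. A sketch (the class name, chunk size, and how you pass it to your client are all assumptions, not ClearML API):

```python
import os

class ProgressFile:
    """File wrapper that prints progress as a streaming HTTP client
    consumes it via read() (or iterates over it in chunks)."""

    def __init__(self, path, chunk_size=1024 * 1024):
        self._f = open(path, "rb")
        self.total = os.path.getsize(path)
        self.read_so_far = 0
        self.chunk_size = chunk_size

    def read(self, size=-1):
        if size < 0:
            size = self.chunk_size
        data = self._f.read(size)
        self.read_so_far += len(data)
        pct = 100.0 * self.read_so_far / max(self.total, 1)
        print(f"\rupload: {pct:5.1f}%", end="", flush=True)
        return data

    def __iter__(self):
        # some clients iterate over the body instead of calling read()
        while True:
            data = self.read(self.chunk_size)
            if not data:
                return
            yield data

    def __len__(self):  # lets the client set Content-Length
        return self.total

    def close(self):
        self._f.close()
```

This only helps for uploads you drive yourself; wiring it into ClearML's own artifact upload would need the callback support discussed above.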

  
  
Posted 3 years ago
1K Views