Hi UnsightlyBeetle11
Is it possible to report the model's architecture (PyTorch model) automatically on ClearML, as we do via Netron or other neural network visualisation tools?
You mean like the actual network layout? Unfortunately, there is currently no option to do that; you can, however, manually store a plot/image that represents it
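For example, one way to manually attach such an image to the experiment (a minimal sketch; the project/task names and the "architecture.png" file are hypothetical, assuming you exported the layout image from Netron or a similar tool first):
```python
from clearml import Task

# attach a pre-rendered image of the network layout to the experiment
task = Task.init(project_name="examples", task_name="report architecture")
task.get_logger().report_image(
    title="model",
    series="architecture",
    iteration=0,
    local_path="architecture.png",  # hypothetical file exported from Netron
)
```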
BTW: I think that at the beginning Netron was somehow integrated, but it was rarely used and supporting it was not trivial, so it was phased out. You can ho...
You can set torch to be installed last:
post_packages: ["horovod", "torch"]
This will make sure the trains-agent installs the torch version you specified in the "installed packages" last.
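For reference, this setting lives under the agent's package_manager section in the config file (clearml.conf / trains.conf); a minimal sketch:
```
agent {
    package_manager {
        # packages listed here are installed after all other requirements,
        # so torch ends up being installed last
        post_packages: ["horovod", "torch"]
    }
}
```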
no need for it actually
HighOtter69 inside the legend, click on the color rectangle next to the series name; you can change the color of the series on the graph. This property is stored, so it will always remember your color preferences (yes, even when logging from another machine 🙂)
create inside another task that would again run remotely
This Task will be run on another node, user / permissions will be dealt with by the agent on the other node running the Task
ExcitedFish86 this is a general "dummy agent" that pulls Tasks and executes them (no env created, no code cloned, as you suggested)
how does this work with HPO?
The HPO clones Tasks, changes their arguments, pushes them into a queue, and monitors the metrics in real time. The missing part (from my understanding) was that the execution of the Tasks themselves required setup, and that you wanted multiple-machine support; to overcome that, I posted a dummy agent that just runs the Tasks.
(Notice...
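To make the flow concrete, a minimal sketch of what the HPO controller does per trial (the task ID, parameter name, and queue name are hypothetical):
```python
from clearml import Task

# clone the base experiment, override one argument, and push it to a queue;
# the dummy agent (or any regular agent) pulls the queue and runs the trial
base = Task.get_task(task_id="<base_task_id>")
trial = Task.clone(source_task=base, name="hpo trial")
trial.set_parameter("Args/learning_rate", 0.01)
Task.enqueue(trial, queue_name="default")
```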
Like getting the Tasks that use the metrics API the most?
LOL my pleasure - I guess we should have a link in the docstring of add_requirements to set_packages, I will tell the guys
I can verify the behavior; I think it has to do with the way the subparser was set up.
This was the only way for me to get it to run: script.py test blah1 blah2 blah3 42
When I passed specific arguments (for example --steps) it ignored them...
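For context, a minimal sketch of the kind of subparser setup being described (the subcommand and argument names are hypothetical, mirroring the command line above):
```python
import argparse

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="command")

# hypothetical "test" subcommand, matching "script.py test blah1 blah2 blah3 42"
test = subparsers.add_parser("test")
test.add_argument("values", nargs="+")             # positional: blah1 blah2 blah3 42
test.add_argument("--steps", type=int, default=1)  # the optional argument that was ignored

args = parser.parse_args()
print(args)
```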
BitterStarfish58 I would suspect the upload was corrupted (I think this is the discrepancy between the file size logged and the actual file size uploaded)
@<1539780258050347008:profile|CheerfulKoala77> make sure the AMI ID matches the zone of the EC2 machine
DefeatedCrab47 If I remember correctly, v1+ has its arguments coming from argparse.
1. Are you using this feature?
2. How do you set the TB HParams?
Currently Trains does not support TB HParams; the reason is that the set of HParams needs to match a single experiment. Is that your case?
Hi ScantChimpanzee51
having the ClearML auto scaler at all is super great and an impressive tool!
Thank you! 😍
As all data resides within the container, it is lost afterwards.
Nothing to fear there, if you are using the StorageManager, the destination is always the cache folder, which the agent automatically mounts to the host machine.
That said if the EC2 instance is taken down (i.e. idle) then the cache is lost with it.
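For reference, a minimal sketch of that pattern (the remote URL is hypothetical):
```python
from clearml import StorageManager

# the file is downloaded into the ClearML cache folder; when running inside
# an agent's docker container, that folder is mounted from the host machine,
# so the local copy is not lost when the container exits
local_path = StorageManager.get_local_copy(
    remote_url="s3://my-bucket/datasets/data.zip"
)
print(local_path)
```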
Make sense?
Kind of, as it tries to do "apt-get install"...
what did you have in mind ?
Awesome! Any way to hear the talk w/o registering for the whole conference?
CloudySwallow27 Anyway, we will make sure we upload the talk to the ClearML YouTube channel after the talk
I think you can watch it after GTC on the NVIDIA website, and a week after that we will be able to upload it to the YouTube channel 🙂
Was trying to figure out how the method knows that the docker image ID belongs to ECR. Do you have any insight into that?
Basically you should perform the docker login (to the ECR registry) before running the agent; then the agent uses docker to run the image from ECR.
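For example, a typical ECR login before starting the agent could look like this (the region and account ID are placeholders):
```
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
clearml-agent daemon --queue default --docker
```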
Make sense?
BTW
Grafana can visualize endpoint request latency, as well as prediction result value distributions
Queues can have multiple workers, and that implies multiple instances of a task can run concurrently.
@<1533619716533260288:profile|SmallPigeon24> as long as these are the exact same instances, you can have them running simultaneously (think multi-node training); that said, each one should "know" not to report over the others, because of course it would overwrite the reports.
Back to your point on multiple agents:
You cannot have two Tasks in the same queue, that means that a single agen...
- Maybe we should add an option to archive components as well ...
PompousHawk82 unfortunately this is kind of binary: either you have full tracking of load/save operations or you do not.
This warning message will disappear in the next version as we will be able to log multiple models under the same Task :)
An example for something like spacy would be useful for the community.
That's awesome, any chance you can PR something? (no need for it to be perfect, we can take it from there)
Hi @<1695969549783928832:profile|ObedientTurkey46>
Use --services-mode in the agent; it will run many Tasks on the same machine. This is usually associated with the services queue, but it can be run on any queue. This way you could easily have the same machine running those multiple "control" tasks.
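For example (the queue name here is just the usual convention):
```
clearml-agent daemon --queue services --services-mode --docker
```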
wdyt?
Hmm, so this is kind of a hack for ClearML AWS autoscaling?
And is every instance running an agent, or a single Task?
Hi MortifiedCrow63
I have to admit this is very strange, I think the fact it works for the artifacts and not for the model is kind of a fluke ...
If you use "wait_on_upload" argument in the upload_artifact you end up with the same behavior. Even if uploaded in the background, the issue is still there, for me it was revealed the minute I limited the upload bandwidth to under 300kbps.It seems the internal GS timeout assumes every chunk should be uploaded in under 60 seconds.
The default chunk...
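For reference, the wait_on_upload pattern mentioned above looks like this (the project/task/artifact names are illustrative):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact upload")
# block until the upload actually completes, instead of returning while
# the artifact is still being uploaded in the background
task.upload_artifact(name="data", artifact_object={"a": 1}, wait_on_upload=True)
```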