That seems like the k8s routing, can you try the web server curl?
JitteryCoyote63 virtualenv v20 is supported; pip v21 needs the latest trains/trains-agent RC.
Hmm, what's the clearml version? What's the python version, what's the OS? And the pytorch version?
Really what I need is for A and B to be separate tasks, but guarantee they will be assigned to the same machine so that the clearml dataset cache on that machine will be warm.
I think what you are looking for is a multi-machine cache (which is fully supported). Basically mount an NFS/SMB folder from a NAS to any of those machines, configure the cache folder to point to it, and now you do not need to worry about affinity, no?
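If it helps, this is the knob to turn; a minimal clearml.conf sketch, assuming the NAS is mounted at /mnt/nas/clearml-cache on every machine:
sdk {
    storage {
        cache {
            # all machines point their cache at the shared mount
            default_base_dir: "/mnt/nas/clearml-cache"
        }
    }
}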
Is there a way to group A and B into a sub-pipeline, h...
which to my understanding has to be given before a call to an argparser,
SmarmySeaurchin8 You can call argparse before Task.init, no worries, it will catch the arguments and trains-agent will be able to override them :)
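A minimal sketch (the argument itself is just an example):
from argparse import ArgumentParser
from clearml import Task

parser = ArgumentParser()
parser.add_argument('--lr', type=float, default=0.001)
args = parser.parse_args()  # called before Task.init

# Task.init still picks up the parsed arguments,
# and a trains-agent run can override them
task = Task.init(project_name='example', task_name='experiment')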
WackyRabbit7
Cool - so that means the fileserver which comes with the host will stay empty? Or is there anything else being stored there?
Debug Images and artifacts will be automatically stored to the file server.
If you want your models to be automagically uploaded, add the following:
task = Task.init('example', 'experiment', output_uri=' ')
(You can obviously point it to any other http/S3/GS/Azure storage)
Hi @<1545216070686609408:profile|EnthusiasticCow4> let me know if this one solves the issue
pip install clearml==1.14.2rc0
each epoch runs about 55 minutes, and that screenshot I posted earlier kind of shows the logs for the rest of the info being output, if you wanted to check that out
I thought you disabled the stdout log, no?
Maybe ClearML is using tensorboard in ways that I can fine tune? I...
You can open your TB and see, every report there is logged into clearml
Hi UptightBeetle98
The hyper parameter example assumes you have agents ( trains-agent ) connected to your account. These agents will pull the jobs from the queue (where they are now, aka pending), set up the environment for the jobs (venv or docker+venv), and execute the job with the specific arguments the optimizer chose.
Make sense?
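If you don't have an agent up yet, it's a one-liner; the queue name here is just an example:
trains-agent daemon --queue default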
It does not seem to be related to the upload. The upload itself finished... What's your Trains version?
Hi PanickyMoth78
So do not tell anyone, but the next version will have reports built into clearml, as well as the ability to embed graphs in 3rd party tools (think Notion, GitHub, markdown, etc.)
Until then (ETA mid Dec), the easiest is to download an image or just use the url (it encodes the full view, so when someone clicks on it they get the exact view you are seeing)
Yea I know, I reported this
LOL, apologies, these days it's a miracle I still remember my login passwords 🙂
@<1558624430622511104:profile|PanickyBee11> how are you launching the code on multiple machines?
are they all reporting to the same Task?
Hi RipeGoose2
Could you expand on "inconsistency in the iteration reporting"? Also, regarding "calling trainer.fit multiple" times: would you expect it to show as a single experiment, or is it a kind of param search?
You mean why you have two processes?
Could you send me the console log of both tasks, the failing and the passing one?
Hi SharpDove45
what was suggested about how it fails on bad/missing credentials
Yes, this is correct; since you specifically set the hosts, worst case you will end up with wrong credentials 🙂
BoredHedgehog47
is this ( https://clearml.slack.com/archives/CTK20V944/p1665426268897429?thread_ts=1665422655.799449&cid=CTK20V944 ) the same issue (or solution)?
Hi VexedCat68
Could it be the python version is not the same? (this is the only reason a specific python package version would not be found)
How does deferred_init affect the process?
It defers all the networking and stuff to the background (usually the part that might slow down the Task initialization process)
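For example (project/task names are placeholders):
from clearml import Task

# Task.init returns almost immediately; the backend handshake
# and networking happen in a background thread
task = Task.init(project_name='example', task_name='experiment', deferred_init=True)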
Also, is there a way of specifying a blacklist instead of a whitelist of features?
BurlyPig26 you can whitelist per framework and file name, for example:
task = Task.init(..., auto_connect_frameworks={'pytorch': '*.pt', 'tensorflow': ['*.h5', '*.hdf5']})
What am I missing?
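Not a blacklist per se, but the closest thing I can think of is disabling a framework outright, e.g.:
task = Task.init(..., auto_connect_frameworks={'pytorch': False})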
Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following:
trains-agent execute --docker --id aabbcc
This means you manually spin two simultaneous copies of the same experiment; once they are up and running, will your code be able to make the connection between them? (i.e. openmpi, torch distributed, etc.?)
ReassuredTiger98 no, but I might be missing something.
How do you mean project-specific?
What's the error you are getting?
(open the browser web developer tools, see if you get something in the console log)
The agents are docker containers, how do I modify the startup script so it creates a queue?
Hmm actually not sure about that, might not be part of the helm chart.
So maybe the easiest is:
from clearml.backend_api.session.client import APIClient

c = APIClient()
c.queues.create(name="new_queue")
Yep 🙂
Also maybe worth changing the entry point of the agent docker to always create a queue if it is missing?
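BTW, if I remember correctly the agent can already handle this itself, so a custom entry point may not even be needed:
clearml-agent daemon --queue new_queue --create-queue
should create the queue if it does not already exist.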
(I mean new logs; while we are here, did it report any progress?)
Hi ZippyAlligator65
You can configure it in the clearml.conf, see here:
https://github.com/allegroai/clearml-agent/blob/ebb955187dea384f574a52d059c02e16a49aeead/clearml_agent/backend_api/config/default/agent.conf#L202
Hi HelpfulDeer76
I mean that the task was being monitored on the demo ClearML server created by Allegro
Yes that is consistent with what I would expect to have happened
Basically if you are running it as a k8s job, you can just configure the following environment variables:
CLEARML_WEB_HOST: 
CLEARML_API_HOST: 
CLEARML_FILES_HOST: 
CLEARML_API_ACCESS_KEY: <clearml access>
CLEARML_API_SECRET_KEY: <clearml secret>
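In the Job spec that could look something like this; the hostnames and the clearml-credentials secret are placeholders:
env:
  - name: CLEARML_WEB_HOST
    value: "https://app.clearml.example.com"
  - name: CLEARML_API_HOST
    value: "https://api.clearml.example.com"
  - name: CLEARML_FILES_HOST
    value: "https://files.clearml.example.com"
  - name: CLEARML_API_ACCESS_KEY
    valueFrom:
      secretKeyRef:   # keep the credentials out of the spec itself
        name: clearml-credentials
        key: access-key
  - name: CLEARML_API_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: clearml-credentials
        key: secret-key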