no need for it actually
Hi RattySeagull0
I'm trying to execute trains-agent in docker mode with conda as package manager, is it supported?
It should. That said, we really don't recommend using conda as the package manager: it is a lot slower than pip, and it can create an environment that is very hard to reproduce, because conda's internal "compatibility matrix" may change from one conda version to another.
"trains_agent: ERROR: ERROR: package manager "conda" selected, but 'conda' executable...
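For reference, the package manager the agent uses is selected in the agent's config file. A hedged sketch of the relevant `trains.conf` section (key names as in the agent's configuration reference):

```
agent {
    package_manager {
        # "pip" (recommended) or "conda"
        type: pip
    }
}
```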
You can query the system and get all the experiments based on date, then grab the machine GPU metrics.
DefeatedCrab47 check the cleanup service, it queries the system with the Apiclient.
https://github.com/allegroai/trains/blob/10ec4d56fb4a1f933128b35d68c727189310aae8/examples/services/cleanup/cleanup_service.py#L72
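A rough sketch of that query pattern (hedged: it assumes a configured server and an `APIClient` instance as in the linked cleanup-service example; the exact filter field formats may differ):

```python
from datetime import datetime, timedelta


def cutoff(days_back):
    """Timestamp used as the 'status_changed' lower bound for the query."""
    return (datetime.utcnow() - timedelta(days=days_back)).timestamp()


def recent_tasks(client, days_back=7):
    # client = APIClient()  -- see the linked cleanup_service.py example
    # Filter tasks whose status changed in the last `days_back` days,
    # newest first; GPU metrics can then be pulled per returned task.
    return client.tasks.get_all(
        status_changed=[">{:.0f}".format(cutoff(days_back))],
        order_by=["-last_update"],
        page_size=100,
    )
```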
The data I'm syncing comes from a data provider which supports only an FTP connection...
Right ... that makes sense :)
No worries WickedGoat98, feel free to post questions when they arise. BTW: we are now improving the k8s glue, so by the time you get there the integration will be even easier 🙂
If you spin two agents on the same GPU, they are not aware of one another... so this is expected behavior.
Make sense?
PungentLouse55 I'm checking something here, you might have stumbled on a bug in parameter overriding. Updating here soon...
Hi DilapidatedCow43
I'm assuming the returned object cannot be pickled (which is ClearML's way of serializing it)
You can upload it as a model with
```
uploaded_model_url = Task.current_task().update_output_model(model_path="/path/to/local/model")
...
return uploaded_model_url
```
wdyt?
I think I found something,
https://github.com/allegroai/clearml/blob/e3547cd89770c6d73f92d9a05696018957c3fd62/clearml/storage/helper.py#L1442
What's the boto version you have installed?
Hi John, sort of. It seems that archiving pipelines does not also archive the tasks that they contain, so...
This is correct, the rationale is that the components (i.e. Tasks) might be used (or already used) as cached steps ...
Wait ResponsiveHedgehong88, I'm confused: if you integrated your code with clearml, didn't you run it manually even once (on any machine, local/remote)?
Hi ScaryBluewhale66
The TaskScheduler I created: the status is still `running`. Any idea?
The TaskScheduler needs to actually run in order to trigger the jobs (think cron daemon)
Usually it will be executed on the clearml-agent services queue/machine.
Make sense?
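The "think cron daemon" point can be illustrated with a toy loop (pure Python, names illustrative, not the ClearML API): nothing fires unless a live process keeps checking the schedule.

```python
def tick(jobs, now):
    """Fire every job whose interval has elapsed; return the fired job names.

    Each job is a dict: {"name": str, "interval": seconds, "last_run": timestamp}.
    A scheduler is just a process calling this in a loop; stop the process
    (i.e. stop the TaskScheduler task) and nothing ever fires again.
    """
    fired = []
    for job in jobs:
        if now >= job["last_run"] + job["interval"]:
            job["last_run"] = now
            fired.append(job["name"])
    return fired
```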
I was not able to reproduce with the example code 🙂
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
I execute the `clearml-session` with the `--docker` flag.
This is to control the docker image the agent will spin up for you (think of the dev environment you want to work in, e.g. an nvidia pytorch container that already has everything you need).
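For example (the image name is illustrative; assumes `clearml-session` is installed and a queue with agents is available):

```
clearml-session --docker nvcr.io/nvidia/pytorch:22.12-py3
```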
Hmm, notice that it does store symlinks to parent data versions (to save on multiple copies of the same file). If you call `get_mutable_local_copy()` you will get a standalone copy.
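The symlink-vs-standalone distinction can be sketched like this (a toy illustration, not the ClearML Dataset implementation): a mutable copy materializes symlinked files into real ones.

```python
import shutil


def make_standalone_copy(cache_dir, target_dir):
    """Copy cache_dir into target_dir, following symlinks so the result
    contains real files instead of links into parent versions."""
    shutil.copytree(cache_dir, target_dir, symlinks=False)  # follow links
    return target_dir
```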
Hi @<1743079861380976640:profile|HighKitten20>
but when I try to use code stored in a GIT (Bitbucket) repo I got a repository cloning error, specifically
did you configure the git repo application/pass here: None
Notice that the new pip syntax: `packagename @ <some_link_here>`
is actually interpreted by pip as:
install "packagename"; if it is not installed, use "<some_link_here>" to install it.
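For example, in a `requirements.txt` (the package name and link are placeholders):

```
packagename @ https://example.com/wheels/packagename-1.0-py3-none-any.whl
```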
WickedGoat98 Basically you have two options:
Option 1: Build a docker image with wget installed, then in the UI specify this image as the "Base Docker Image".
Option 2: Configure the `trains.conf` file on the machine running the trains-agent with the above script. This will cause trains-agent to install wget on any container it is running, so it is available for you to use (saving you the trouble of building your own container).
With either of these two, by the time your code is executed, wget is installed an...
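A hedged sketch of the second option; the key name `agent.extra_docker_shell_script` is taken from the agent's configuration reference:

```
agent {
    # commands executed inside every container the agent spins up
    extra_docker_shell_script: ["apt-get update", "apt-get install -y wget"]
}
```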
It looks somewhat familiar... 🙂
SuccessfulKoala55 any idea?
Correct 🙂
I'm assuming the Task object is not your Current task, but a different one?
Generally speaking I would say the Nvidia deep-learning AMI:
https://aws.amazon.com/marketplace/pp/prodview-7ikjtg3um26wq
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
How come the second one is one line?
Hi @<1545216070686609408:profile|EnthusiasticCow4>
My biggest concern is what happens if the TaskScheduler instance is shutdown.
Good question! Follow-up: what happens to the machine running the cron service if it fails?!
TaskScheduler instance is shutdown.
And yes you are correct if someone stops the TaskScheduler instance
it is the equivalent of stopping the cron service...
btw: we are working on moving some of the cron/triggers capabilities to the backend , it will not be as flexi...
Anyhow, if StorageManager.upload was fast, upload_artifact calls that exact function, so I don't think we actually have an issue here. What do you think?
JitteryCoyote63
So there will be no concurrent cached files access in the cache dir?
No concurrent creation of the same entry 🙂 It is optimized...
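A minimal sketch of what "no concurrent creation of the same entry" means (hedged: a toy lock-around-the-cache illustration, not ClearML's actual cache code):

```python
import threading


class EntryCache:
    """Only one caller ever builds a given entry; later callers reuse it."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()

    def get_or_create(self, key, create_fn):
        # The lock serializes entry creation, so two threads asking for the
        # same key never build the same cache entry twice.
        with self._lock:
            if key not in self._entries:
                self._entries[key] = create_fn()
            return self._entries[key]
```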