No worries, the cudatoolkit is not part of it. "trains-agent" will create a new clean venv for every experiment, and by default it will not inherit the system packages.
So basically I think you are "stuck" with the cuda drivers you have on the system
Hi MistakenDragonfly51
I'm trying to set `default_output_uri` in
This should be set either on your client side, or on the worker machine (running the clearml-agent).
Make sense?
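If it helps, here is a minimal sketch of setting it from the client side (the storage URI is just a placeholder); the same value can also be set on the worker in clearml.conf under sdk.development.default_output_uri:
```python
from clearml import Task

# Passing output_uri here overrides sdk.development.default_output_uri from clearml.conf
task = Task.init(
    project_name="examples",                      # placeholder
    task_name="upload-artifacts",                 # placeholder
    output_uri="s3://my-bucket/clearml-models",   # placeholder storage URI
)
```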
That should not be complicated to implement. Basically you could run 'clearml-task execute --id taskid' as the SageMaker cmd. Can you manually launch it on SageMaker?
ConfusedPig65 could you send the full log (console) of this execution?
JitteryCoyote63 Hmmm in theory, yes.
In practice you need to change this line:
https://github.com/allegroai/clearml/blob/fbbae0b8bc933fbbb9811faeabb9b6d9a0ea8d97/clearml/automation/aws_auto_scaler.py#L78
```
python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 0 --detached
python -m clearml_agent --config-file '/root/clearml.conf' daemon --queue '{queue}' {docker} --gpus 1 --detached
python -m clearml_agent --config-file '/root/clearml.conf' d...
```
Hi GleamingGrasshopper63
How well can the ML Ops component handle job queuing on a multi-GPU server
This is fully supported 🙂
You can think of queues as a way to simplify resources for users (you can do more than that, but let's start simple)
Basically you can create a queue per type of GPU, for example a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then when you spin up the agents, you attach each agent to the "correct" queue for its machine type.
Int...
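For example, a rough sketch of sending a cloned experiment to one of those queues from Python (the project/task names are placeholders):
```python
from clearml import Task

# Clone a template experiment and enqueue it on the queue matching the resources it needs
template = Task.get_task(project_name="examples", task_name="train-model")  # placeholders
cloned = Task.clone(source_task=template, name="train-model (2 GPUs)")
Task.enqueue(task=cloned, queue_name="on_prem_2gpus")
```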
Hi DilapidatedDucks58 ,
I'm not aware of anything of this nature, but I'd like to get a bit more information so we could check it.
Could you send the web-server logs? Either from the docker or from the browser itself.
🙏 thank you so much @<1556450111259676672:profile|PlainSeaurchin97> !!!
Hi GrotesqueOctopus42 ,
BTW: is it better to post the long error message on a reply to avoid polluting the channel?
Yes, that is appreciated 🙂
Basically logs in the thread of the initial message.
To fix this I had to spin the agent using the --cpu-only flag (--docker --cpu-only)
Yes, if you do not specify --cpu-only it will default to trying to access the GPUs
Nice!
I think this all ties into the non-standard git repo definition. I cannot find any other reason for it. Is it actually stuck for 5 min at the end of the process, waiting for the repo detection?
Hmm I suspect the 'set_initial_iteration' does not change/store the state on the Task, so when it is launched, the value is not overwritten. Could you maybe open a GitHub issue on it?
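For reference, this is roughly how I'd expect it to be used (project/task names are placeholders); if the offset is lost when the Task is relaunched, that would match the state not being stored on the Task:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="resume-training")  # placeholders
task.set_initial_iteration(1000)  # continue reporting from iteration 1000
```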
OddAlligator72 let's separate the two issues:
1. Continue reporting from a previous iteration
2. Retrieving a previously stored checkpoint
Now for the details:
Are you referring to a scenario where you execute your code manually (i.e. without the trains-agent) ?
PompousParrot44
You can always manually store/load models, example: https://github.com/allegroai/trains/blob/65a4aa7aa90fc867993cf0d5e36c214e6c044270/examples/reporting/model_config.py#L35
Sure, you can patch any framework with something similar to what we do in xgboost, any such PR will be greatly appreciated! https://github.com/allegroai/trains/blob/master/trains/binding/frameworks/xgboost_bind.py
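If it helps, a rough sketch of the manual store/load flow, loosely based on those examples (assuming the clearml package; the file name and model id are placeholders):
```python
from clearml import Task, InputModel, OutputModel

task = Task.init(project_name="examples", task_name="manual-model-io")  # placeholders

# Store: register a locally saved weights file on the Task
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="model.pt")  # placeholder file

# Load: fetch a previously stored model by its id and get a local copy of the weights
input_model = InputModel(model_id="<model_id>")  # placeholder id
local_weights_path = input_model.get_weights()
```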
What's the "working directory" ?
What's the trains-agent version?
(yes this should have worked, as long as the package "test" is there)
now it stopped working locally as well
At least this is consistent 🙂
How so? Is the "main" Task still running?
Still figuring out what is the best orchestration tool which can run this end-2-end.
DeliciousBluewhale87 / PleasantGiraffe85 based on the scenario above, what is the missing step that you need to cover? Is it the UI presenting the entire workflow? Or maybe a start trigger that can be configured?
but the debug samples and monitored performance metric show a different count
Hmm, could you expand on what you are getting, and what you are expecting to get?
models been trained stored ...
MongoDB will store URL links; the upload itself is controlled via the "output_uri" argument to the Task.
If None is provided, Trains logs the locally stored model (i.e. a link to where you stored your model); if you provide one, Trains will automatically upload the model (into a new subfolder) and store the link to that subfolder.
- how can I enable the tensorboard and have the graphs been stored in trains?
Basically if you call Task.init all your...
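For example, a minimal sketch of the TensorBoard flow (assuming torch.utils.tensorboard is installed; project/task names are placeholders): once Task.init is called, scalars written by the writer are picked up automatically and stored on the Task:
```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name="examples", task_name="tb-logging")  # placeholders

writer = SummaryWriter(log_dir="./runs")
for step in range(100):
    # Scalars reported to TensorBoard are auto-captured and appear under the Task's scalars
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()
```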
Yeah I think using voxel for forensics makes sense. What's your use case?
but I don't see any change...where is the link to the file removed from
In the metadata section, check the artifacts "state" object
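If it helps, a rough sketch of inspecting that object from Python (the task id is a placeholder):
```python
from clearml import Task

t = Task.get_task(task_id="<dataset_task_id>")  # placeholder id
state_artifact = t.artifacts.get("state")
if state_artifact is not None:
    print(state_artifact.url)    # where the artifact is stored
    print(state_artifact.get())  # download and deserialize its content
```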
How are these two datasets different?
Like comparing two experiments :)
Hmm is this similar to this one https://allegroai-trains.slack.com/archives/CTK20V944/p1597845996171600?thread_ts=1597845996.171600&cid=CTK20V944
ReassuredTiger98 I think it is using moviepy for the encoding... No?
BTW: 0.14.3 solved the issue you are referring to, so you can import trains before / parsing the args without an issue. Regarding passing project/name as parameters, a few thoughts: (1) you can always rename / move projects from the UI (2) If you are running it with trains-agent there is no meaning to these arguments, as by definition the Task was already created... Maybe we should give an option to exclude a few arguments from argparser, I think this topic came up a few times... What d...
or point to the self-signed certificate:
export REQUESTS_CA_BUNDLE=/path/to/your/certificate.pem
Hi RoundMosquito25
The main problem here is that there is no way to know, before running the Task, how much memory it would need... And without that parameter, maximizing GPU utilization is quite challenging. wdyt?
Also, finally the columns will be movable and resizable. I can't wait for the next version ;)