Reputation
Badges 1
25 × Eureka!UnevenDolphin73 are you positive, is this reproducible? What are you getting?
RattySeagull0 I think you are correct, python 3.6 is the installed inside the docker. Is it important to have 3.7 ? You might need another docker (or change the installation script and install python 3.7 inside)
Okay, I'll pass to front-end, see what they can do about it.
Notice that the StorageManager has default configuration here:
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L76
Then a per bucket credentials list, with detials:
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L81
It is way too much to pass on env variable 😞
The imports inside the functions are because the function itself becomes a stand-alone job running on a remote machine, not the entire pipeline code. This also automatically picks packages to be installed on the remote machine. Make sense?
create inside another task that would again run remotely
This Task will be run on another node, user / permissions will be dealt with by the agent on the other node running the Task
@<1538330703932952576:profile|ThickSeaurchin47> can you try the artifacts example:
None
and in this line do:
task = Task.init(project_name='examples', task_name='Artifacts example', output_uri="
")
nfs version 3
That's the thing, NFS will automatically set file access and flags based on the mount options you cannot change them post mount.
How about creating a new user just for the agent, it makes sense from security / credentials perspective
do I need to have the repo that I am running on my account
If it is a public repo, then no need, credentials are only needed for private repos 🙂
Am I missing something ?
Thank you AttractiveWoodpecker16 !
Removing the uncommitted changes so that you can launch it from an agent? Or is it visual only?
why would root cause the user to become nobody with group nogroup?
It is exactly the case, they inherit the cron service user (uid/gid) which would look like nobody/nogroup
Hi @<1541954607595393024:profile|BattyCrocodile47> and @<1523701225533476864:profile|ObedientDolphin41>
"we're already on AWS, why not use SageMaker?"
TBH, I've never gone through the ML workflow with SageMaker.
LOL I'm assuming this is why you are asking 🙂
- First, you can use SageMaker and still log everything to ClearML (2 lines integration). At least you will have visibility to everything that is running/failing 🙂
- SageMaker job is a container, which means for ...
I can raise this as an issue on the repo if that is useful?
I think this is a good idea, at least increased visibility 🙂
Please do 🙏
Hi CooperativeFly2
is it possible to create multiple train-agent per gpu
Yes you can, that said memory cannot be actually shared between GPU processes (GPU time is obviously shared) so you have to be careful with the Tasks actually being executed in parallel.
For instance:TRAINS_WORKER_NAME=host_a trains-agent daemon --gpus 0 --queue default TRAINS_WORKER_NAME=host_b trains-agent daemon --gpus 0 --queue default
So why is it trying to upload to "//:8081/files_server:" ?
What do you have in the trains.conf on the machine running the experiment ?
But how do you specify the data hyperparameter input and output models to use when the agent runs the experiment
They are autodetected if you are using Argparse / Hydra / python-fire / etc.
The first time you are running the code (either locally or with an agent), it will add the hyper parameter section for you.
That said you can also provide it as part of the clearml-task
command with --args
(btw: clearml-task --help
will list all the options, https://clear.ml/docs/...
Hi RobustGoldfish9 Kudos on the mount, and my apologies for forgetting to mention it.
You are absolutely right, I'll make sure we have it in the documentation, there is no way to know that obscure env variable 🙂
And the agent section on this machine is:api_server:
web_server:
files_server:
Is that correct?
WickedGoat98 Actually the fileserver replied, so it all looks fine to me.
Try to run the text example again, see if you are still getting the fileserver error .
SmarmySeaurchin8 yes, the package containing the Controller is only RC, plan is to release the stable one in a couple of days. In the meantime:pip install git+
Notice this is only when:
Using Conda as package manager in the agent the requested python version is already installed (multiple python version installation on the same machine/container are supported)
TenseOstrich47 this looks like elasticserach is out of space...