Hi @<1528908687685455872:profile|MassiveBat21>
However no useful template is created for downstream executions - the source code template is all messed up,
Interesting, could you provide the code that is "created", or even better some way to reproduce it? It sounds like some sort of a bug, or maybe support for a feature that is missing.
My question is - what is a best practice in this case to be able to run exported scripts (python code not made availa...
This should have worked with the latest clearml RC.
And you verified it is not working?
Thanks MagnificentSeaurchin79 !
Let me check what's the status with this one, could it be the same as this one?
https://github.com/allegroai/clearml/issues/322
should i only do mongodb
No, you should do all 3 DBs: ELK, Mongo, Redis
GiganticTurtle0 adding --stop to the exact daemon execution will stop it (meaning if you have multiple agents on the same machine launched with different parameters, just add the --stop to retire the specific one)
Hmmm that is a good use case to have (maybe we should have --stop get an argument ?)
Meanwhile you can do:
$ clearml-agent daemon --gpus 0 --queue default
$ clearml-agent daemon --gpus 1 --queue default
then to stop only the second one:
$ clearml-agent daemon --gpus 1 --queue default --stop
wdyt?
GiganticTurtle0 can you please add a github issue with feature request to clearml-agent? I think this is a great use case!
SmarmySeaurchin8 it could be a switch, the problem is that when you have automatic stopping flows, they will abort a task, which is legitimate (i.e. it should not be considered failed)
How come you have aborted tasks in the pipeline? If you want to abort the pipeline, you need to first abort the pipeline Task and then the tasks themselves.
Why? The task should have completed successfully, how is this aborting?
Early stopping by the HPO process, like HyperBand, e.g. "this training model is going nowhere, let's stop it".
JitteryCoyote63 are you calling:
my_task.output_uri = "s3://my-bucket"
in the code itself?
Why not set it with Task.init(output_uri=...)?
Also, since this is running remotely there is no need for that; use the Execution -> Output -> Destination field in the UI and put it there, it will do everything for you 🙂
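For reference, a minimal sketch of setting it from code with Task.init (bucket name and project/task names here are placeholders):
from clearml import Task

# output_uri tells ClearML where to upload models and artifacts
# (any s3://, gs://, azure:// URI or a shared folder path works)
task = Task.init(
    project_name="examples",
    task_name="output destination demo",
    output_uri="s3://my-bucket",
)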
The second seems like a botocore issue:
https://github.com/boto/botocore/issues/2187
Legit, if you have a cached_file (i.e. exists and accessible), you can return it to the caller
works seamlessly throughout and in our current on premise servers...
I'm assuming via something close to what I suggested above with .netrc ?
Hi @<1523702000586330112:profile|FierceHamster54>
I think I'm missing a few details on what is logged, and a reference to the git repo?
Hi @<1523702932069945344:profile|CheerfulGorilla72>
Please tell me what RAM metric is tracked by ClearML?
Free RAM is the entire machine's free RAM
Yeah, htop shows odd numbers as it doesn't "count" allocated buffers
specifically you can see the code here:
None
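If you want to see why the numbers differ from htop, a quick psutil check makes the gap between "free" and "available" visible (just an illustration, not the exact code the monitoring runs):
import psutil

vm = psutil.virtual_memory()
# "free" counts only completely unused pages, which is why htop looks odd;
# "available" also includes buffers/cache the kernel can reclaim
print(f"total     : {vm.total / 1024**3:.1f} GiB")
print(f"free      : {vm.free / 1024**3:.1f} GiB")
print(f"available : {vm.available / 1024**3:.1f} GiB")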
HealthyStarfish45 what exactly did you have in mind, in terms of the widget ?
Simple git clone on that repo works well
On the machine running the trains-agent ?
Could you right-click on the failed experiment, select reset, and send it again for execution?
Could that error be a random network issue ?
(Basically this seems like a generic network error not actually related to the trains-agent)
Is the trains-agent running in docker mode or venv mode?
CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of it. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more process workers, you just need to make sure the machine has enough memory to support it.
Meanwhile you can just sleep for 24 hours and put it all on the services queue. It should work 🙂
Example here:
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
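In case it helps, a minimal sketch of that pattern (project/task names are placeholders; the linked cleanup_service.py is the full version, and on the older trains package the import is from trains import Task):
from time import sleep
from clearml import Task

# enqueue this Task on the "services" queue and let it run forever
task = Task.init(project_name="DevOps", task_name="periodic maintenance")

while True:
    # ... do the periodic work here (e.g. clean up old experiments) ...
    sleep(60 * 60 * 24)  # wake up once every 24 hours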
command line 🙂
cmd.exe / bash
or shall I call the Task.init even from the agent
WorriedParrot51 I think something is lost here.
Task.init() is always called, even when the agent is executing the code. The difference is in what happens inside the Task.init() call. When the codebase itself is executed by the trains-agent, it signals through OS environment variables to Task.init() that instead of creating a new task, it should use the already created one. From this point all data flows from the trains-server back into the c...
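In other words, the exact same call covers both cases (a minimal sketch; names are placeholders, and on the older trains package the import is from trains import Task):
from clearml import Task

# local run: a new Task is created and registered on the server
# agent run: OS environment variables tell Task.init to attach to the
#            already-created Task instead of creating a new one
task = Task.init(project_name="examples", task_name="my experiment")

# from here on, parameters flow from the server back into the code
params = task.connect({"batch_size": 32, "lr": 0.001})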
Hi WorriedParrot51
Take a look at the Experiment execution section:
there is a script path
and a working directory
the working directory is the base of the git repository (which is cloned inside the docker)
So if for some reason trains did not properly detect the current working dir, here is what should solve the issue, without changing the PYTHONPATH:
script path: ./sub_folder/script.py
working directory: .
What do you think?
Hi WorriedParrot51
So I think what you need is to map your external code into the docker, is that correct?
Also you want to always set the PYTHONPATH.
You can achieve both by configuring the trains.conf:
Here you can always add a predefined environment variable and mount point, regardless of the docker image or other docker arguments:
https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L98
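For example, something along these lines in the agent section (a sketch only; paths are placeholders and the exact key name may differ between trains-agent/clearml-agent versions, so check the linked sample conf):
agent {
    # mount the external code into every docker and expose it on PYTHONPATH
    extra_docker_arguments: [
        "-v", "/home/user/my_external_code:/mnt/my_external_code",
        "-e", "PYTHONPATH=/mnt/my_external_code"
    ]
}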
Will this solve the issue?
WorriedParrot51 I now see ...
Two solutions that I can quickly think of:
1. In the code add:
import sys
sys.path.append('./my_sub_module')
Assuming you always have to add the sub-directories to make the code work, and assuming they are part of the repository, this is probably the stable solution.
2. In the UI, in the Docker base image field, add -e PYTHONPATH=/folder
or from code (which is exactly what you did), with a clean interface:
task.set_base_docker("nvidia/cuda -e PYTHONPATH=/folder")
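For completeness, a short sketch of that from-code approach (image name and path are placeholders; the single-string form matches the line above):
from clearml import Task

task = Task.init(project_name="examples", task_name="docker args demo")
# tell the agent which docker image and extra docker arguments to use
task.set_base_docker("nvidia/cuda:11.8.0-runtime-ubuntu22.04 -e PYTHONPATH=/folder")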