Hi PompousParrot44
Unfortunately this is still not available in the UI. As part of the Controllers, we thought of having a "Cron" controller that clones base experiments at a given time and schedules them for execution. We are looking for specific use cases, to make sure this will actually answer the requirements of users.
It looks as if that might be what you are after, is this correct? What exactly is the use case here? Is it a stable daily cron job (for example retrain an experiment ...
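To make the "Cron" controller idea concrete, here is a minimal sketch of the scheduling part only. It is not an existing controller; `next_daily_run`, `run_due_jobs` and the `clone_and_enqueue` callable are hypothetical names. In a real setup the callable would presumably wrap the SDK's `Task.clone` and `Task.enqueue`, but that wiring is an assumption and is left out so the snippet runs on its own.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable


def next_daily_run(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Return the first hour:minute slot that is strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_due_jobs(
    now: datetime,
    due_at: datetime,
    base_task_ids: Iterable[str],
    clone_and_enqueue: Callable[[str], None],
) -> datetime:
    """Launch a copy of every base experiment once `due_at` has passed.

    `clone_and_enqueue` is a hypothetical hook supplied by the caller.
    Returns the time of the next scheduled run.
    """
    if now < due_at:
        return due_at  # nothing to do yet
    for task_id in base_task_ids:
        clone_and_enqueue(task_id)
    return next_daily_run(now, due_at.hour, due_at.minute)
```

A service loop would call `run_due_jobs(datetime.now(), due_at, ids, hook)` every minute or so and keep the returned value as the new `due_at`.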
should I update nodejs in centos image ?
I think so, it might have been forgotten
JitteryCoyote63 while it's running, could you give me a few details on the setup, maybe I can reproduce it.
Is it using pytorch distributed ?
Are all models uploaded to S3 ?
etc.
RobustGoldfish9
I think you need to set the trains-agent docker to be aware of the host, so it knows how to mount data/cache/configurations into the sibling docker
It should look something like:
TRAINS_AGENT_DOCKER_HOST_MOUNT="/mnt/host/data:/root/.trains"
So if running a docker:
docker run -e TRAINS_AGENT_DOCKER_HOST_MOUNT="/mnt/host/data:/root/.trains" ...
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
It's building a GPU-enabled docker...
you might want a different container, or to specify --cpu-only
DiminutiveToad80 try to turn on:
enable_git_ask_pass: true
Hi QuaintJellyfish58
You mean some "daemon service" aborting Tasks that do not end after X hours? or is it based on CPU/GPU utilization?
RattySeagull0 I think you are correct, python 3.6 is the one installed inside the docker. Is it important to have 3.7? You might need another docker (or change the installation script and install python 3.7 inside)
Do you have python 3.7 in the docker ?
SmilingFrog76 this is not a weird mechanism at all, this is a proper HPC scheduler :) trains-agent is not actually aware of other nodes, it is responsible for launching a Task on its own hardware (with whatever configuration it was set). What can be done is to use the trains-agent inside a 3rd party scheduler and have the scheduler allocate the node and trains-agent spin the experiment. There is a k8s example here: basically pulling jobs from the trains-server queue and pushing ...
Firstly, thank you for your efforts and your support.
Thanks SmugOx94 !
Are you running trains-agent in docker mode? The aforementioned scripts are executed before the experiment is cloned, they are meant to be a part of the docker setup, not a per-experiment script.
You could try to edit the experiment and have:
Working Directory: "."
(that means the root of the repository)
Script Path: "experiments_that_uses_library/train.py"
This will make sure you can do "import l...
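To see why the working directory matters for imports, here is a small self-contained sketch. It builds a throwaway repository, then runs the script from the repository root with the root on `PYTHONPATH`. The package name `mylib` is a hypothetical stand-in (the real library name is not shown above), and the assumption that the agent makes the working directory importable in a similar way is mine, not something confirmed here.

```python
import os
import subprocess
import sys
import tempfile


def run_from_repo_root() -> str:
    """Run train.py with the repo root as cwd and on PYTHONPATH; return its output."""
    with tempfile.TemporaryDirectory() as repo:
        # Hypothetical top-level package living at the root of the repository.
        os.makedirs(os.path.join(repo, "mylib"))
        with open(os.path.join(repo, "mylib", "__init__.py"), "w") as f:
            f.write('GREETING = "hello from mylib"\n')

        # The script sits in a sub-folder, as in the "Script Path" setting.
        os.makedirs(os.path.join(repo, "experiments_that_uses_library"))
        script = os.path.join("experiments_that_uses_library", "train.py")
        with open(os.path.join(repo, script), "w") as f:
            f.write("import mylib\nprint(mylib.GREETING)\n")

        # Python only adds the script's own folder to sys.path, so the repo
        # root has to be made importable explicitly (assumed agent behaviour).
        env = dict(os.environ, PYTHONPATH=repo)
        result = subprocess.run(
            [sys.executable, script],
            cwd=repo, env=env, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
```

Without the root on the import path the same `import mylib` fails, because only `experiments_that_uses_library/` is searched by default.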
HealthyStarfish45 you mean like replace the debug image viewer with custom widget ?
For the images themselves, you can get their URLs, then embed them in your static html.
You could also have your html talk directly with the server REST API.
What did you have in mind?
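A rough sketch of the static html option, assuming you already have the image URLs as plain strings (how you fetch them is left out). `build_gallery` is a hypothetical helper, not part of any SDK.

```python
from html import escape
from typing import Iterable


def build_gallery(image_urls: Iterable[str], title: str = "Debug images") -> str:
    """Return a static HTML page with one <img> tag per URL."""
    tags = "\n".join(
        # Escape the URL so '&' and quotes cannot break the attribute.
        f'    <img src="{escape(url, quote=True)}" alt="debug image">'
        for url in image_urls
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"{tags}\n"
        "  </body>\n"
        "</html>\n"
    )
```

Writing the returned string to a `.html` file is enough to view it, as long as the browser is allowed to reach the URLs.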
HealthyStarfish45 what exactly did you have in mind, in terms of the widget ?
HealthyStarfish45 this sounds very cool! How can I help?
Sure thing, feel free to ping :)
LudicrousParrot69,
Are you trying to parse the attached Table post-execution, then put it into a CSV on the HPO Task?
I see now, give me a minute I'll check
LudicrousParrot69 I would advise the following:
Put all the experiments in a new project.
Filter based on the HPO tag, and sort the experiments based on the metric we are optimizing (see adding custom columns to the experiment table).
Then select + archive the experiments that are not used.
BTW: I think someone already suggested we do the auto archiving inside the HPO process itself. Thoughts?
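The filter / sort / archive steps could be scripted along these lines. This is only a sketch on plain dictionaries: the record layout (`tags`, `metrics`) and the helper name `split_hpo_experiments` are assumptions for illustration, and the actual archiving call is left to whatever API you use.

```python
from typing import Dict, List, Tuple


def split_hpo_experiments(
    experiments: List[Dict],
    hpo_tag: str,
    metric: str,
    keep_top: int,
    maximize: bool = True,
) -> Tuple[List[Dict], List[Dict]]:
    """Return (experiments to keep, experiments to archive).

    Only experiments carrying `hpo_tag` are considered; they are ranked by
    `metric` and everything after the first `keep_top` is marked for archive.
    """
    tagged = [e for e in experiments if hpo_tag in e["tags"]]
    ranked = sorted(tagged, key=lambda e: e["metrics"][metric], reverse=maximize)
    return ranked[:keep_top], ranked[keep_top:]
```

Experiments without the tag are left alone, so unrelated runs in the same project are never archived by mistake.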
Is the team open to PRs from external people?
Yes please do! PRs are welcomed! I thought we fixed the GitHub readme to reflect it, anyhow I'll make sure we do :)
Found it, definitely a bug in the callback, it has no effect on the HPO process itself
Hi JitteryCoyote63 ,
I remember seeing something similar on our GitHub...
The error itself is pip failing to run "git clone" , seems like a weird network connection error (TLS is the HTTPS security layer)
Shout-out to Emilio for quickly stumbling on this rare bug and letting us know. If you have a feeling your process is stuck on exit, just upgrade to 1.0.1 :)
JitteryCoyote63 what am I missing?
What are the errors you are getting (with / without the envs)
the second seems like a botocore issue :
https://github.com/boto/botocore/issues/2187
Yes, hopefully they have a different exception type so we could differentiate ... :) I'll check
Do you want to PR it? should be a quick fix
Legit, if you have a cached_file (i.e. exists and accessible), you can return it to the caller
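Something along these lines is what I have in mind, as a sketch only. `get_local_copy` and the `download` callable are hypothetical names, not the actual code being patched.

```python
import os
from typing import Callable, Optional


def get_local_copy(cached_file: Optional[str], download: Callable[[], str]) -> str:
    """Return the cached file when it exists and is readable, else download."""
    if (
        cached_file
        and os.path.isfile(cached_file)
        and os.access(cached_file, os.R_OK)
    ):
        return cached_file
    # Cache miss (no path, file gone, or not readable): fall back to download.
    return download()
```

The readability check covers the "accessible" part, for example a cache folder left behind by another user.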
BattyLion34 the closest I can think of is the monitoring class, which can easily be extended.
Datasets are a type of Task, so we can monitor a project and trigger an action when we see a change in number of Tasks/Datasets that are completed.
Monitoring class:
https://github.com/allegroai/clearml/blob/master/clearml/automation/monitor.py
Monitoring example:
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py
I think a dataset monitoring example wil...
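The "trigger when the number of completed Tasks/Datasets changes" logic boils down to something like the sketch below. It deliberately does not use the interface of the monitoring class linked above; `CompletedCountWatcher`, `count_completed` and `on_change` are hypothetical names, and the counting callable is where a real query against the project would go.

```python
from typing import Callable, Optional


class CompletedCountWatcher:
    """Call `on_change(previous, current)` when the completed count changes."""

    def __init__(
        self,
        count_completed: Callable[[], int],
        on_change: Callable[[int, int], None],
    ) -> None:
        self._count_completed = count_completed
        self._on_change = on_change
        self._last: Optional[int] = None  # no baseline until the first poll

    def poll(self) -> bool:
        """Check once; return True if a change was detected and reported."""
        current = self._count_completed()
        previous, self._last = self._last, current
        if previous is not None and current != previous:
            self._on_change(previous, current)
            return True
        return False
```

The first `poll()` only records a baseline, so starting the service on a project that already has completed Datasets does not fire the action.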
if I use report_image can I get a URL to it somehow?
Let me check ...