Hi PompousBeetle71
I remember it was an issue, but it was solved a while ago. Which Trains version are you using?
RoundCat60, can you access the web UI over https?
can we also put the path to the CA?
Yes :)
this is not the case as all the scalars report the same iterations
MassiveHippopotamus56 could it be the machine statistics? (i.e. cpu/gpu etc., these are considered scalars as well...)
No, an old experiment changed, nothing was rerun
Ohh, that is odd. I think the max iteration value is stored in the DB, so it is strange that it would change after an update.
BTW: just making sure, could it be these Tasks were imported ? (i.e. offline execution + import)
can configuration objects refer to one another internally in ClearML?
Interesting, please explain?
Hi WickedGoat98
This sounds like a great design (obviously you have scale in mind 😉 ). Feel free to ask "stupid" questions; based on what you already wrote, I doubt they will be.
A few questions that come to mind (probably a few others after):
You mentioned FS synchronization. From where, i.e. what is the single source of truth? K8s (Rancher 2.0 is basically a k8s manager) can take care of mounting volumes, so there is no need to sync. Is this a valid solution?
BTW : (you can drag and drop an i...
and about a month later for some reason the initial iteration seems to have changed to 0
Hmm, I see your point. Just so I fully understand: you are not saying old experiments were changed, but that new experiments (running the same code-ish) have a totally different max iterations value. Is this correct?
We're not using a load balancer at the moment.
The easiest way is to add an ELB and have Amazon add the HTTPS on top (basically a few clicks in their console).
The data I'm syncing comes from a data provider which supports only an FTP connection....
Right ... that makes sense :)
No worries WickedGoat98 , feel free to post questions when they arise. BTW: we are now improving the k8s glue, so by the time you get there the integration will be even easier 🙂
Should work, follow the backup process, and restore into a new machine:
Hmm is this similar to this one https://allegroai-trains.slack.com/archives/CTK20V944/p1597845996171600?thread_ts=1597845996.171600&cid=CTK20V944
Hi JitteryCoyote63
The NVIDIA_VISIBLE_DEVICES is set automatically for the process the trains-agent spins, so from your code it is transparent: you can only "see" GPU 0.
(Obviously, when not using docker, you can forcefully change the OS environment at runtime, but you should avoid that ;))
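For illustration, a quick sanity check you can run from the spawned process (torch is optional here, only used for the device count; nothing below is trains-specific API):
```python
import os

# Inside a process spun by trains-agent this typically shows a single GPU index,
# e.g. "0", even on a multi-GPU machine.
print("NVIDIA_VISIBLE_DEVICES =", os.environ.get("NVIDIA_VISIBLE_DEVICES"))

try:
    import torch
    # The visible device count should reflect the same restriction.
    print("torch.cuda.device_count() =", torch.cuda.device_count())
except ImportError:
    pass
```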
The .ssh is mounted, but the owner is my local user,
sudo -H clearml-agent ...
to allow sudo to access home
ScantWorm7
Tensorboard is automatically captured and sent to the trains server. This is in addition to the local copy of your TB files. Actually in most cases the local copy is redundant
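For example, a minimal sketch (assuming PyTorch's SummaryWriter; the project/task names are placeholders). The only requirement is that Task.init() is called before the writer is created:
```python
from trains import Task
from torch.utils.tensorboard import SummaryWriter

# Once the Task is initialized, TB reports are captured and sent to the server automatically.
task = Task.init(project_name="examples", task_name="tb_auto_capture")

writer = SummaryWriter(log_dir="./tb_logs")  # the local copy, usually redundant
for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)
writer.close()
```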
What do you have here in your docker-compose:
Hi RipeGoose2
Can you try with the latest from git?
pip install -U git+
Okay, I think I understand, but I'm missing something. It seems you call get_parameters from the old API. Is your code actually calling get_parameters? The trains-agent runs the code externally; whatever happens inside the agent should have no effect on the code. So who exactly is calling task.get_parameters, and well, why? :)
RoughTiger69 yes I think "Scale" tier covers it 😉
To store all the debug samples; it can also store all the models (if you configure output_uri='http://file_server_here:8081').
Yes: instead of the file server, use 's3://<ip_of_minio>:9000/bucket', and make sure you add the credentials for the minio in the trains.conf.
Yes, basically once you have the credentials in the trains.conf, you could do StorageManager.get_local_copy('s3://<minio>:9000/bucket/file') (also upload of course 🙂 )
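For reference, a rough sketch of the client side (the minio address and bucket are placeholders, and the credentials are assumed to already be under sdk.aws.s3.credentials in trains.conf):
```python
from trains import Task
from trains.storage import StorageManager

# Send model/artifact uploads to the minio bucket instead of the default file server.
task = Task.init(
    project_name="examples",
    task_name="minio_storage",
    output_uri="s3://<ip_of_minio>:9000/bucket",  # placeholder address and bucket
)

# Download (and locally cache) a file stored in the same bucket.
local_copy = StorageManager.get_local_copy("s3://<ip_of_minio>:9000/bucket/file")
print(local_copy)
```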
PompousBeetle71 you can check this example:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_torch_distributed.py
I think it should help, if you want a more manual approach, you can check the POpen subprocesses here:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_subprocess.py
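If you go the manual route, this is roughly the shape it takes (just a sketch, not the linked example; MY_WORKER_RANK / MY_TASK_ID are made-up environment variable names):
```python
import os
import subprocess
import sys

from trains import Task

if os.environ.get("MY_WORKER_RANK") is None:
    # Parent: create the Task and spawn a few worker processes.
    task = Task.init(project_name="examples", task_name="manual_subprocess")
    workers = []
    for rank in range(4):
        env = dict(os.environ, MY_WORKER_RANK=str(rank), MY_TASK_ID=task.id)
        workers.append(subprocess.Popen([sys.executable, __file__], env=env))
    for proc in workers:
        proc.wait()
else:
    # Worker: attach to the parent's Task and report under a per-rank series.
    rank = int(os.environ["MY_WORKER_RANK"])
    task = Task.get_task(task_id=os.environ["MY_TASK_ID"])
    task.get_logger().report_scalar("worker", "rank_%d" % rank, value=float(rank), iteration=0)
```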
Hmm, you will have to set up the trains-server on a machine somewhere; it can be any machine (Win / Mac / Linux).
Hi WickedGoat98
Will I need to wrap their execution in python by system calls?
That would probably be the easiest solution 🙂
Then you can plug it into your pipeline as a preprocessing Task:
You can check this example:
https://github.com/allegroai/trains/tree/master/examples/pipeline
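A rough sketch of what that looks like with the PipelineController (project/step/queue names are placeholders; double-check the API against the trains version you use):
```python
from trains import Task
from trains.automation.controller import PipelineController

# The controller itself runs as a Task.
task = Task.init(project_name="examples", task_name="pipeline_controller")

pipe = PipelineController(default_execution_queue="default", add_pipeline_tags=False)
# First step: the wrapper Task that shells out to the external preprocessing tool.
pipe.add_step(name="preprocess",
              base_task_project="examples",
              base_task_name="preprocessing task")
# Second step runs only after preprocessing finished.
pipe.add_step(name="train",
              parents=["preprocess"],
              base_task_project="examples",
              base_task_name="training task")

pipe.start()
pipe.wait()
pipe.stop()
```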
Hi CrookedSeal85
However, I systematically notice a jump of some number of "ghost iterations" when resuming my trainings...
Try the following:
task = Task.init(..., continue_last_task=0)
From the Task.init docstring (notice this value can be both boolean and integer; see the small example after the excerpt):
:param bool continue_last_task: Continue the execution of a
...
- An integer - Specify initial iteration offset (override the auto automatic last_iteratio...
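For example (project/task names are placeholders):
```python
from clearml import Task

# continue_last_task as an integer sets the initial iteration offset;
# 0 resets it, so the resumed run reports from iteration 0 with no "ghost iterations".
task = Task.init(
    project_name="examples",
    task_name="resumed_training",
    continue_last_task=0,
)
```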
looks like service-writing-time for me!
Nice!
persist/restore state so that tasks are restartable?
You mean if you write preemption-ready training code ?
I'm wondering what happens if I were to host the instance and one of these were to go down from time to time in production, as the deployments provided by the helm chart are not redundant.
Long story short, it will break the clearml-server. Please do not take them down; if you do need to, also take down the clearml-server. The Python clients will wait until it is up again, so no session will be destroyed.
are you referring to the same line? 47 in cache.py?
preempting lower priority tasks to allow a higher priority task to come in
Well this is usually outside of the scope of "single researcher" / "tiny team"...
This is typically a large-scale problem.
That said, it would be fairly easy to write a service that aborts Tasks, tags them "to be continued", then later (at night?!) pushes them back into a queue... wdyt?
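A very rough sketch of such a service (the tag and queue names are placeholders, and the exact abort/re-enqueue calls should be verified against the clearml SDK version you use; depending on the setup you may need to reset a task before re-enqueuing it):
```python
from clearml import Task

PREEMPT_TAG = "to-be-continued"  # placeholder tag
QUEUE_NAME = "default"           # placeholder queue

def preempt_running_tasks(project_name):
    # Abort the currently running tasks in the project and tag them for later.
    for task in Task.get_tasks(project_name=project_name,
                               task_filter={"status": ["in_progress"]}):
        task.add_tags([PREEMPT_TAG])
        task.mark_stopped(force=True)  # rough equivalent of the UI "Abort"

def requeue_preempted_tasks(project_name):
    # Later (at night?!) push the tagged tasks back into the queue.
    for task in Task.get_tasks(project_name=project_name,
                               tags=[PREEMPT_TAG],
                               task_filter={"status": ["stopped"]}):
        Task.enqueue(task, queue_name=QUEUE_NAME)
```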
Are you getting the error from boto failing to launch additional ec2 instances ?