Are you also adding those metrics to the experiment table as extra columns ?
GreasyPenguin14 we never had trouble with Task.init
(or any other clearml call) while working with the PyCharm debugger, we use it quite extensively ...
Actually on a very similar setup...
Could you send the full log?
Or maybe a code snippet to reproduce this behavior ?
(We did notice they fixed a few issues with the debugger in 2020.3.3 so it's worth upgrading)
Hi @<1598487094601191424:profile|MysteriousCow84>
only one of them uses an already created venv from cache for this task. And the other node starts to re-create the same virtual environment.
Just to be clear, the second one is running, but it does not use the same venv as the other one (that is running in parallel), is that correct?
PompousParrot44 unfortunately not yet 😞
But the gist is :
MongoDB stores experiment data (i.e. execution parameters, git ref etc.)
ElasticSearch stores results (i.e. metrics, console logs, debug image links etc.)
Does that help?
PompousParrot44 I see what you mean, yes, multiple context switches might cause a bit of a decline in performance, not sure how much though ... The alternative of course is to set CPU affinity... Anyhow, if you do get there we can try to come up with something that makes sense, but at the end of the day there is no magic there 🙂
According to you the VPN shouldn't be a problem right?
Correct, as long as all parties are on the same VPN it should work; all the connections are plain HTTP, so it's basically trivial communication
Why does my task execution freeze after pip installation (running agent in foreground mode)?
Hi AdventurousButterfly15
Are you running in agent docker mode or venv mode ?
What do you mean by freeze? do you see anything on the Task console log in the UI? what's the host OS ?
Hmmm, can you view the settings? That's the only thing I can think of at the moment that would be different between your setup and the working one...
Also, is there a way for you to have the trains-server behind https (on your GCP)
Hi TightDog77
HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: /upload/storage/v1/b/models/o?uploadType=resumable (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2633)')))
This seems like a network error to GCP (basically the GCP python package throws it)
Are you always getting this error? is this something new ?
Thanks @doru! BTW if you are running code from outside the trains repo, do you still get the double package?
Nothing that can't be worked around but for automation I don't think creating a TriggerScheduler with an existing name should be allowed
DangerousDragonfly8 I think I understand, basically you are saying that the fact a user can create two triggers with the same name can create some confusion ?
It also sucks a bit that each TriggerScheduler will run in its own pod in kubernetes.
Actually this depends on how you spin it up, and you can actually spin a single services agent running multiple...
total size 5.34 GB, 1 chunked stored (average size 5.34 GB)
PanickyAnt52 The issue is that the Dataset will not break up files (it will package a large folder into multiple zip files, but it will not split a single file).
The upload itself is limited by the HTTP interface (i.e. 2GB file size limit)
I would just encode it into multiple Arrow files
does that make sense ?
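For example, something along these lines (just a rough sketch, assuming pyarrow plus the clearml Dataset API; the chunk size, file names and dataset/project names are placeholders):
```
import pyarrow as pa
import pyarrow.ipc as ipc
from clearml import Dataset

# split one big table into several smaller Arrow files, so no single file
# hits the HTTP upload size limit (the chunk size here is arbitrary)
table = pa.table({"x": list(range(1_000_000))})
rows_per_file = 100_000
for i, start in enumerate(range(0, table.num_rows, rows_per_file)):
    chunk = table.slice(start, rows_per_file)
    with pa.OSFile(f"data_part_{i:03d}.arrow", "wb") as sink:
        with ipc.new_file(sink, chunk.schema) as writer:
            writer.write_table(chunk)

# register the folder of Arrow files as a clearml Dataset
ds = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
ds.add_files(path=".", wildcard="data_part_*.arrow")
ds.upload()
ds.finalize()
```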
Hi OutrageousSheep60
Do you mean something like:
https://github.com/allegroai/clearml/tree/master/examples/datasets
?
I have to admit, mounting it to a different drive is a good reason to bring this feature back. The reasoning was that it means the agent needs to make sure it manages them (e.g. multiple agents running on the same machine)
Awesome! Any chance you feel like contributing it? I'm sure people would be thrilled 🙂
Hi SucculentBeetle7
Sure check the latest implementation, it now has "start" and "start_remotely" 🙂
Hi RoundMosquito25
This is a bit old but probably a good start:
https://clear.ml/blog/stacking-up-against-the-competition/
tl;dr
ClearML advantages (at least a few I can think of)
Scales way better
Enables out-of-the-box experiment orchestration (i.e. remote execution etc.)
Data management
Nicer UI
Full RestAPI
Full MLOps platform
Model serving
Query-able model repository
Probably more 🙂
OddAlligator72 let's separate the two issues:
Continue reporting from a previous iteration
Retrieving a previously stored checkpoint
Now for the details:
Are you referring to a scenario where you execute your code manually (i.e. without the trains-agent) ?
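If it helps, this is roughly what I have in mind (a minimal sketch assuming the current clearml API; the task id, project and task names below are placeholders):
```
from clearml import Task

# continue reporting into an existing task instead of starting a new one
# (pass a specific task id, or True to continue the last task you ran)
task = Task.init(
    project_name="examples",
    task_name="resume training",
    continue_last_task="aabbccddeeff00112233445566778899",
)

# grab a checkpoint that a previous run registered as an output model
previous = Task.get_task(task_id="aabbccddeeff00112233445566778899")
checkpoint_path = previous.models["output"][-1].get_local_copy()
print(checkpoint_path)
```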
FYI: if you need to query stuff you can always look directly in the RestAPI:
https://github.com/allegroai/clearml/blob/master/clearml/backend_api/services/v2_9/projects.py
https://allegro.ai/clearml/docs/rst/references/clearml_api_ref/index.html
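For example, with the APIClient wrapper (a rough sketch; the project name is a placeholder, and the fields follow the REST API schema):
```
from clearml.backend_api.session.client import APIClient

client = APIClient()
# list projects matching a name, then the most recently updated tasks in the first one
projects = client.projects.get_all(name="examples")
if projects:
    tasks = client.tasks.get_all(project=[projects[0].id], order_by=["-last_update"])
    for t in tasks:
        print(t.id, t.name, t.status)
```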
StaleKangaroo85 check https://demoapp.trains.allegro.ai/projects/0e152d03acf94ae4bb1f3787e293a9f5/experiments/193ac2bced184c49a57658fceb4bd7f9/info-output/metrics/plots?columns=type&columns=name&columns=status&columns=project.name&columns=user.name&columns=started&columns=last_update&columns=last_iteration&order=last_update on the demo server, seems okay to me...
Hi GreasyPenguin14
Quick question, any reason not to use a 2D scatter ? or a histogram (or any other non time-series plot)?
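e.g. something like this with the Logger (a quick sketch; the titles, series names and random data are just placeholders):
```
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="scatter and histogram")
logger = task.get_logger()

# 2D scatter: an Nx2 array of (x, y) points
points = np.random.randn(100, 2)
logger.report_scatter2d(
    title="my scatter", series="run A", iteration=0,
    scatter=points, xaxis="x", yaxis="y", mode="markers",
)

# histogram of a single series of values
values = np.random.randn(1000)
logger.report_histogram(
    title="my histogram", series="values", iteration=0, values=values,
)
```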
Hi PlainSquid19 could you add a bit more information? Are you running trains-agent? Is it in docker/venv mode? What are the trains/trains-agent/trains-server versions?
@<1523701523954012160:profile|ShallowCormorant89> can you verify it is reproducible in 1.9.3 ? because if it is I'd like to fix that 🙂
will it be possible for us to configure the "new run" button so that it always clones from a particular pipeline ?
What do you mean by "particular pipeline" ? by default it will clone the last successful one, and by right clicking a specific one you can run a copy of that one. what am I missing ?
Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that, as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, and this is why it is "missing" the failed test (because, as mentioned, all exceptions are caught). This should be fixed in the next RC
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
@<1556450111259676672:profile|PlainSeaurchin97> did you check this example?
None
PompousParrot44 the fundamental difference is that artifacts are uploaded manually (i.e. a user will specifically "ask" to upload an artifact), models are logged automatically and a user might not want them uploaded (imagine debugging sessions, or testing).
By adding the 'upload_uri' argument, you can specify to trains that you want all models to be automatically uploaded (not just logged).
Now here is the nice thing, when running using the trains-agent, you can have:
Always upload the mod...
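A minimal sketch of what I mean (assuming the current clearml API, where this is the output_uri argument of Task.init; the bucket path and names are placeholders):
```
from clearml import Task

# ask for automatically logged models (e.g. framework checkpoints) to also be
# uploaded, not just registered; the bucket path is a placeholder
task = Task.init(
    project_name="examples",
    task_name="train with auto-upload",
    output_uri="s3://my-bucket/models",  # or True to use the default files server
)
```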
BTW: get_tasks has project_name argument, I would just use it 🙂
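Something like this (the project name is a placeholder):
```
from clearml import Task

# filter on the server side instead of fetching everything and filtering locally
tasks = Task.get_tasks(project_name="examples")
for t in tasks:
    print(t.id, t.name)
```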