Nothing that can't be worked around, but for automation I don't think creating a TriggerScheduler with an existing name should be allowed
DangerousDragonfly8 I think I understand: basically you are saying that the fact a user can create two triggers with the same name can cause confusion?
It also sucks a bit that each TriggerScheduler will run in its own pod in Kubernetes.
Actually this depends on how you spin it; you can actually spin up a single services agent running multiple...
MelancholyBeetle72 it would be great if you could also open an issue on Trains and reference the PyTorch Lightning issue, could you?
So this should be easier to implement, and would probably be safer.
You can basically query all the workers (i.e. agents) and check whether they are running a Task; if they have not been for a while, remove the "protection flag" (see the sketch below)
wdyt?
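A minimal sketch of that query (assuming the APIClient workers endpoint reports a task field for busy workers; the "protection flag" handling itself is whatever your automation uses):

from clearml.backend_api.session.client import APIClient

client = APIClient()
for worker in client.workers.get_all():
    # a worker currently executing a Task reports it under `task`
    if getattr(worker, "task", None) is None:
        # idle: this is where your automation would clear its "protection flag"
        print("worker", worker.id, "is idle")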
StickyBlackbird93 the agent is supposed to resolve the correct version of PyTorch based on the CUDA version in the container. It sounds like for some reason it fails. Can you provide the log of the Task that failed? Are you running the agent in docker mode, or inside a docker container?
Change to add_missing_installed_packages=False here, and see if you still end up with the git diff:
https://github.com/allegroai/clearml/blob/1f82b0c4010799be6157f5c845c7f6ac48e71c0c/clearml/backend_interface/task/populate.py#L158
Yes the "epoch_loss" is the training epoch loss (as expected I assume).
I thought that was just the loss reported at the end of the train epoch via TF
It is, isn't that what you are seeing?
I thought the dataset was only linked to the fileserver and not to the specific URL used to upload it.
ShinyRabbit94 yep, exactly! The idea is that you can actually use any storage solution (S3/GS etc.); the file server is just the default one 🙂
ShinyWhale52 any time 🙂
Feel free to follow up with more questions
Hi AbruptWorm50
the second "epoch loss" is the scalar for the "validation" process (see "validation: epoch loss" series is actually the TF file/folder prefix automatically added)
Make sense ?
HurtWoodpecker30 could it be you hit a limit of some sort?
your account has 2FA enabled and you must use a personal access token instead of a password.
I'm assuming you have created the personal access token and used it, not the password.
Hi LethalDolphin75
I think you are right, there isn't one (although I remember a discussion about it...)
Anyhow, it should be very easy to implement; just inherit from:
https://github.com/allegroai/clearml/blob/400c6ec103d9f2193694c54d7491bb1a74bbe8e8/clearml/automation/parameters.py#L111
And return the power of the parent value here:
https://github.com/allegroai/clearml/blob/400c6ec103d9f2193694c54d7491bb1a74bbe8e8/clearml/automation/parameters.py#L146
And
https://github.com/allegroai/...
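A minimal sketch of what that could look like, assuming the class at the first link is UniformParameterRange and that get_value() returns a {name: value} dict (the class name and base parameter here are illustrative):

from clearml.automation.parameters import UniformParameterRange

class LogUniformParameterRange(UniformParameterRange):
    # sample the exponent uniformly, then return base ** exponent
    def __init__(self, name, min_value, max_value, base=10.0, **kwargs):
        super().__init__(name, min_value=min_value, max_value=max_value, **kwargs)
        self.base = base

    def get_value(self):
        # "return the power of the parent value"
        return {name: self.base ** value for name, value in super().get_value().items()}

Depending on the optimizer strategy you use, to_list() may need the same treatment.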
Hi AdventurousWalrus90
Thank you for the kind words! 🙂
/home/usr_338436_ulta_com/.clearml/venvs-builds/3.7/.gitignore
so this is the error on the agent?
Notice there is no need to upgrade the server, only the ClearML Python package
BTW: we are now adding "dataset chunks" for more efficient large dataset storage
SmarmySeaurchin8
When running in "dev" mode (i.e. writing the code) only packages imported directly are registered under "installed packages" , then when the agent is executing the experiment, it will update back the entire environment (including derivative packages etc.)
That said, you can set detect_with_pip_freeze to true (in trains.conf) and it will basically store the entire pip freeze (example below).
https://github.com/allegroai/trains/blob/f8ba0495fb3af1f99732fdffbbccd2fa992934a4/docs/trains.c...
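For reference, a minimal trains.conf snippet (assuming the flag sits under sdk.development, as in the linked config file):

sdk {
  development {
    # store the full `pip freeze` output instead of only directly-imported packages
    detect_with_pip_freeze: true
  }
}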
Hi SolidSealion72
"/tmp" contained alot of artifacts from ClearML past runs (1.6T in our case).
How did you end up with 1.6 TB of artifacts there? What are the workflows on that machine? At least in theory, there should not be any leftovers in the tmp folder after the process is completed.
from clearml.backend_api.session.client import APIClient

c = APIClient()
c.projects.update(project="project-id-here", system_tags=[])
(also I'm a bit newer to this world, what's wrong with OpenShift?)
It's the most difficult Kubernetes flavor to work with 🙂
we've already tried that but it didn't really change ...
Can you provide the full log, as well as how you created the pods?
PompousBeetle71 if this is argparse and the type is defined, the trains-agent will pass the equivalent in the same type; with str that amounts to ''. Make sense?
What are you seeing?
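A quick sketch of that type handling (the argument names are made up for illustration): argparse casts whatever string the agent passes back through the declared type:

import argparse

parser = argparse.ArgumentParser()
# because the types are declared, values restored as text are cast back:
parser.add_argument("--batch-size", type=int, default=32)  # "64" -> 64
parser.add_argument("--name", type=str, default="")        # an empty value stays ''
args = parser.parse_args(["--batch-size", "64"])
assert isinstance(args.batch_size, int) and args.name == ""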
how can I turn off git diff uploading?
Sure, see here
None
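One way to do this, assuming the store_uncommitted_code_diff flag from the clearml.conf template is the relevant switch (worth verifying against the docs):

sdk {
  development {
    # do not collect/upload the uncommitted git diff with the task
    store_uncommitted_code_diff: false
  }
}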
Hi SmugLizard24
The question is what is the reason for the issue?
That is a good question, could it be out of memory? (trying to compress or send the file in one chunk?)
Hi @<1674588542971416576:profile|SmarmyGorilla62>
You mean on your Elastic / Mongo local disk storage?
Hi MinuteCamel2
Can I disable it from automatically uploading model checkpoints to ClearML servers?
Maybe this one can help :)
https://www.youtube.com/watch?v=etGjxOKG9lo
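In code, one option is to disable the framework auto-logging when initializing the task (a sketch; the project/task names are placeholders and the framework key depends on what saves your checkpoints):

from clearml import Task

task = Task.init(
    project_name="examples",           # placeholder
    task_name="no-checkpoint-upload",  # placeholder
    # disable automatic binding for the framework that saves your checkpoints
    auto_connect_frameworks={"pytorch": False},
)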
I deleted all of the models from my ClearML project but I still receive this message. Do you know why?
It might take a few hours to update... 🙂
MysteriousBee56 what do you mean by "save Scalars on the machine"? All metrics are sent to the trains server. You can later retrieve them from code if you need (see the sketch below).
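For example, a minimal retrieval sketch (the task ID is a placeholder):

from clearml import Task

task = Task.get_task(task_id="<your-task-id>")  # placeholder ID
# returns {graph_title: {series_name: {"x": [...], "y": [...]}}}
scalars = task.get_reported_scalars()
print(scalars)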
Oh what if the script is in the container already?
Hmm, the idea of clearml is that the container is a "base environment" and code is "injected"; this makes sure it is easy to reuse.
The easiest way is to add an "entry point" script that just calls the existing script inside the container.
You can have this initial Python script on your local machine; then, when you call clearml-task
it will upload the local "entry point" script directly to the Task, and then on the remote machin...
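A sketch of such an entry point (the in-container script path is a made-up example, and the clearml-task flags are the ones I'd check with clearml-task --help):

# entry_point.py -- lives on your local machine
import subprocess

# call the script that already exists inside the container (path is hypothetical)
subprocess.run(["python", "/opt/app/train.py"], check=True)

and then something like: clearml-task --project examples --name wrapped-run --script entry_point.py --docker my-image:latest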