Do I need to create a brand new dataset with a new name that inherits from the original?
Yes, you just create a new version, specify the parent one, add changes and close it.
If you later need to, you can squash versions (same idea as git squash). Makes sense?
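For example, a minimal sketch (the dataset name/project are placeholders):

from clearml import Dataset

# create a child version that inherits from an existing dataset
parent = Dataset.get(dataset_name='my_dataset', dataset_project='my_project')  # hypothetical names
child = Dataset.create(
    dataset_name='my_dataset',
    dataset_project='my_project',
    parent_datasets=[parent.id],
)
child.add_files('/path/to/changed/files')  # add only the changes
child.upload()
child.finalize()  # "close" the version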
Although ideally I'd like to tell it exactly where to unzip it.
Ohh you can use .get_mutable_local_copy()
It will unzip it to a specific folder
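Something like this (names and target path are just examples):

from clearml import Dataset

ds = Dataset.get(dataset_name='my_dataset', dataset_project='my_project')  # hypothetical names
# extract the dataset into an exact folder of your choosing
ds.get_mutable_local_copy(target_folder='/data/my_dataset')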
Maybe you can also check --version, that returns the help menu.
What do you mean? --version on clearml-task?
PungentLouse55 could you test with 0.15.2rc0 and see if there is any difference?
And you are calling Task.init? And the scalars show under Scalars, but the images are not under Debug Samples?
Hi RoundSeahorse20
Try the following, let me know if it worked:

import logging

clear_logger = logging.getLogger('clearml.metrics')
clear_logger.setLevel(logging.ERROR)
The driver script (the one that initializes the models and the training sequence) was not in the git repo; besides that one, everything is.
Yes, there is an issue when you have both a git repo and a completely untracked file: since clearml can store either a standalone script or a git repository, the mix of the two is not actually supported. Does that make sense?
Ohh yes, if the execution script is not in git but a git repo exists, it will not add it (it will add it if it is a tracked file, via the uncommitted changes section).
ZanyPig66, in order to expand the support to your case, can you explain exactly which files are in git and which are not?
However, this one should be a feature to work on, and should be fairly easy to implement.
Feel free to add it as a GitHub issue 🙂
Main challenge is understanding what needs to be added as "uncommitted changes"
in ... issues a delete command to the ClearML API server, ...
Almost; it issues the boto S3 delete commands (directly to the S3 server, not through the clearml-server).
And that I need to enter an AWS key/secret in the profile page of the web app here?
correct
What is the best approach to update the package if we have frequent updates to this common code?
Since this package has an indirect effect on the model endpoint, I would package it with the preprocessing code of the endpoint.
Each server updates its own local copy, and it will make sure it can take it and deploy it hand over hand without breaking its ability to serve these endpoints.
the "wastefulness" of holding multiple copies is negligible when comparing to a situation where everyone ...
task.mark_completed()
You have that at the bottom of the script; never call it on your own running process, it will kill the actual process.
So what is going on is that you are marking your own process for termination; it then terminates itself, leaving the interpreter, and this is the reason for the errors you are seeing.
The idea of mark_* is to mark an external Task, forcefully.
By just completing your process with exit code 0 (i.e. no error), the Task will be marked as completed anyhow; no need to call...
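A minimal sketch of the intended use (the task ID is a placeholder):

from clearml import Task

# mark an *external* task (e.g. a stuck one) as completed, forcefully
stuck = Task.get_task(task_id='aabbcc112233')  # hypothetical ID
stuck.mark_completed()

# inside your own script nothing special is needed:
# exiting with code 0 marks the running Task as completed automatically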
I think the limit is a few GB, I'm not sure, I'll have to check
And yes the oldest experiments will be deleted first (with the exception of published experiments, they will be deleted last)
You need to mount it to ~/clearml.conf
(i.e. /root/clearml.conf)
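For example: docker run -v $HOME/clearml.conf:/root/clearml.conf <your-image> (where <your-image> is whatever container you normally run).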
FiercePenguin76
So running the Task.init from the jupyter-lab works, but running the Task.init from the VSCode notebook does not work?
BitingKangaroo95 nice work 🙂
I think that what did it was changing the sshd_config so that it allows port forwarding, agent forwarding and X11 forwarding.
But just in case: it might be that there was a pre-existing SSH identity on your machine, and hence the error.
Clearing known_hosts under ~/.ssh is also something I would try 🙂
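If you only want to drop the single offending entry, ssh-keygen -R <hostname> removes that host's key from known_hosts.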
Hi SmallDeer34
Is the Dataset in clearml-data? If it is, then Dataset.get().get_local_copy() will get you a cached local copy of the entire dataset.
If it is not, then you can use StorageManager.get_local_copy(url_here) to download the dataset.
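Roughly like this (names and URL are placeholders):

from clearml import Dataset, StorageManager

# dataset managed by clearml-data: returns a cached read-only copy
path = Dataset.get(dataset_name='my_dataset', dataset_project='my_project').get_local_copy()

# plain file on remote storage (S3/GS/Azure/http): download and cache it
path = StorageManager.get_local_copy(remote_url='s3://my-bucket/my_dataset.zip')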
- Any argparse parser is automatically logged (and can later be overridden from the UI). Specifically, HfArgumentParser will be automatically logged https://github.com/huggingface/transformers/blob/e43e11260ff3c0a1b3cb0f4f39782d71a51c0191/examples/pytorc...
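For instance, a minimal sketch (project/task names are hypothetical):

from clearml import Task
import argparse

task = Task.init(project_name='examples', task_name='argparse demo')
parser = argparse.ArgumentParser()
parser.add_argument('--lr', type=float, default=0.001)
args = parser.parse_args()  # captured automatically; editable from the UI when cloning the task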
so other processes can use it
This is why there is a model repository: so you can query the last model created, or search by name or tag, or query the Task that created it and then, via the Task, get the model and its location.
This is a stable way to make sure your application code (the one using the model) will get to use stable models regardless of the training processes.
I would add a Tag to the model and then search based on the project and the tag, wdyt?
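Something along these lines (project and tag values are placeholders):

from clearml import Model

# query the model repository by project and tag
models = Model.query_models(project_name='my_project', tags=['production'])  # hypothetical values
# pick the model you need and fetch its weights locally
local_path = models[0].get_local_copy()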
AstonishingRabbit13 so is it working now?
The only thing that's missing is some plots on the clearml server (app): when I go to the details of the training I cannot see the confusion matrix, for example (but it exists on the bucket).
How do you report the confusion matrix? (I might have an idea about what the difference is)
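If you report it explicitly, it would look roughly like this (names and values are placeholders):

from clearml import Task
import numpy as np

task = Task.init(project_name='examples', task_name='cm demo')  # hypothetical names
cm = np.array([[50, 2], [3, 45]])  # dummy 2x2 confusion matrix
task.get_logger().report_confusion_matrix(
    title='confusion matrix', series='validation', matrix=cm, iteration=0,
)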
Is there something else in the conf that I should change?
I'm assuming the google credentials?
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/docs/clearml.conf#L113
And you are seeing a bunch of the GS SSL errors?
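For reference, the relevant clearml.conf section looks roughly like this (the path is a placeholder):

google.storage {
    credentials_json: "/path/to/credentials.json"
}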
One additional thing to notice: Docker will not actually limit the "view of the memory", it will just kill the container if you exceed the memory limit; this is a limitation of the Docker runtime.
Error 101 : Inconsistent data encountered in document: document=Output, field=model
Okay, this points to a migration issue from 0.17 to 1.0.
First try to upgrade to 1.0, then to 1.0.2.
(I would also upgrade a single apiserver instance first; once it is done, you can spin up the rest.)
Makes sense?
Can you share the modified helm/yaml?
Did you run any specific migration script after the upgrade?
How many apiserver instances do you have?
How did you configure the elastic container? Is it booting?
ResponsiveCamel97
could you attach the full log?
I can probably have a python script that checks if there are any tasks running/pending; if there are none, it runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, waits until it is finished, and then restarts the clearml-server. Wdyt?
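A rough sketch of that script (the EBS volume ID and the docker-compose invocation are placeholders):

import subprocess
import boto3
from clearml import Task

# any tasks still running or queued?
busy = Task.get_tasks(task_filter={'status': ['in_progress', 'queued']})
if not busy:
    subprocess.run(['docker-compose', 'down'], check=True)  # stop the clearml-server
    ec2 = boto3.client('ec2')
    snap = ec2.create_snapshot(VolumeId='vol-0123456789abcdef0')  # hypothetical EBS volume
    waiter = ec2.get_waiter('snapshot_completed')
    waiter.wait(SnapshotIds=[snap['SnapshotId']])  # block until the snapshot is done
    subprocess.run(['docker-compose', 'up', '-d'], check=True)  # restart the clearml-server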
I'm pretty sure there is a nice way, let me check something