BTW:
I have very small text files that make up a dataset and compression seems to take most of the upload time
How long does it take? And how come it is not smaller in size?
Is this a config on your side, or something I can change if we had the enterprise version?
Yes, this is one of the things you can configure
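One of the knobs is the compression used when uploading the dataset files. A minimal sketch, assuming Dataset.upload() accepts a compression argument (the project/dataset names and folder path below are placeholders):
` # Upload very small text files without spending time on compression.
# ZIP_STORED packs the files without deflating them (speed over size).
from zipfile import ZIP_STORED
from clearml import Dataset

ds = Dataset.create(dataset_name="tiny-text-files", dataset_project="examples")
ds.add_files(path="./data")
ds.upload(compression=ZIP_STORED)
ds.finalize() `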
Hi @<1523701868901961728:profile|ReassuredTiger98>
Anyone here with any idea why my service tasks get aborted when going to sleep?
I think I understand the issue, clearml==1.4.0
try running with the latest clearml (1.10.x)
It will keep pinging the backend "I'm alive" so the backend does not think this process is dead (which I suspect is what happened: after 2 hours the backend basically set the Task to aborted because it "thought" it was killed)
OutrageousSheep60 before I can answer, maybe you can explain why "zipping" them does not fit your workflow?
Hmm SuccessfulKoala55 any chance the nginx http was pushed to v1.1 on the latest cloud helm chart?
Hi @<1724235687256920064:profile|LonelyFly9>
So, I noticed that with the REST API at least, the /tasks.get_all endpoint appears to have an undocumented maximum page size of 500.
Yeah, otherwise the request size might be too big, but you have pagination:
page (integer, optional, minimum value: 0) - Page number, returns a specific page out of the resulting list of tasks
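A minimal paging sketch, assuming the APIClient wrapper shipped with the clearml package and that tasks.get_all accepts page / page_size:
` # Page through tasks.get_all in chunks of 500 (the apparent maximum page size).
from clearml.backend_api.session.client import APIClient

client = APIClient()
page, page_size = 0, 500
all_tasks = []
while True:
    batch = client.tasks.get_all(page=page, page_size=page_size)
    all_tasks.extend(batch)
    if len(batch) < page_size:
        break
    page += 1
print("fetched", len(all_tasks), "tasks") `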
Hi RoughTiger69
unfortunately, the model was serialized with a different module structure - it was originally placed in a (root) module called model ....
Is this like a pickle issue?
Unfortunately, this doesn't work inside clear.ml since there is some mechanism that overrides the import mechanism using import_bind.__patched_import3
What error are you getting? (meaning why isn't it working)
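(Not clearml-specific, but the usual workaround for a pickle saved under a different root module is to alias the old module name before loading; a sketch, where my_pkg.model and model.pkl are placeholders for the actual layout:)
` # Alias the original root module name so pickle can resolve e.g. "model.MyNet".
import pickle
import sys
import my_pkg.model  # the module where the class currently lives (placeholder)

sys.modules["model"] = my_pkg.model
with open("model.pkl", "rb") as f:
    net = pickle.load(f) `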
Hi @<1547028031053238272:profile|MassiveGoldfish6>
The issue I am running into is that this command does not give me the dataset version number that shows up in the UI.
Oh no, I think you are correct, it will not return the version per dataset (I will make sure we add it)
But with the dataset ID you can grab all the properties: Dataset.get(dataset_id="aabbcc").version
wdyt?
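For example, a minimal sketch (the dataset ID is a placeholder):
` from clearml import Dataset

ds = Dataset.get(dataset_id="aabbcc")
print(ds.id, ds.version)  # the version string shown in the UI `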
I want to inject a bash command after the repo has been cloned (and maybe even after the venv has been installed).
LazyTurkey38 the created venv inherits from the system environment, so in theory you can do all the installation on the system python and the created venv will just inherit the packages, no?
(btw: just to clarify, there is only one entry point for the custom bash script and that is before everything, so users can configure the container before the agent starts)
Yes, I can communicate with the server; I managed to put tasks in the queue and retrieve them, as well as run tasks with metrics reporting
Through the UI or python code ?
Hi UnevenHorse85
As far as I understand, users use the logins and passwords specified in config/apiserver.conf to access the webserver UI, and the key/secret key from their local ~/clearml.conf to access the apiserver.
Correct
access apiserver. What is the use of all the other security keys?
To be able to configure the SDK client (i.e. clearml package) from OS environment and not clearml.conf file
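For example, a minimal sketch of configuring the client from environment variables only (the URLs and credentials are placeholders); in practice you would usually export these in the shell or container environment before the process starts:
` import os

# Must be set before the first clearml session/Task is created
os.environ["CLEARML_API_HOST"] = "https://api.clear.ml"
os.environ["CLEARML_WEB_HOST"] = "https://app.clear.ml"
os.environ["CLEARML_FILES_HOST"] = "https://files.clear.ml"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access_key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret_key>"

from clearml import Task
task = Task.init(project_name="examples", task_name="env-config-check") `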
if project_name is None and Task.current_task() is not None:
    project_name = Task.current_task().get_project_name()
This should have fixed it, no?
Hi PompousParrot44
So do you mean something like:
` task_model_a = Task.get_task(task_id='id_a')
task_model_b = Task.get_task(task_id='id_b')
model_a_file = task_model_a.models['output'][-1].get_local_copy()
model_b_file = task_model_b.models['output'][-1].get_local_copy() `
JuicyFox94
NICE!!! this is exactly what I had in mind.
BTW: you do not need to put the default values there; it reads the defaults from the package itself (trains-agent/trains) and uses the conf file as overrides, so this section only needs to contain the parts that matter (like cache location, credentials, etc.)
And is there an easy way to get all the metrics associated with a project?
Metrics are per Task, but you can get the min/max/last of all the tasks in a project. Is that it?
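A minimal sketch, assuming Task.get_tasks() and the get_last_scalar_metrics() helper (the project name is a placeholder):
` from clearml import Task

# get_last_scalar_metrics() returns {title: {series: {"last": .., "min": .., "max": ..}}}
for task in Task.get_tasks(project_name="examples"):
    for title, series in task.get_last_scalar_metrics().items():
        for name, values in series.items():
            print(task.id, title, name, values.get("last"), values.get("min"), values.get("max")) `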
I might have an idea, could you test with:
` from clearml import Task
Task._report_subprocess_enabled = False
...
real code here `
Hi DeliciousBluewhale87
Hmm, good question.
Basically the idea is that if you have an ingestion service on the pods (i.e. as part of the yaml template used by the k8s glue), you can specify to the glue what the exposed ports are, so it knows (1) the maximum number of instances it can spin, e.g. one per port, and (2) it will set the external port number on the Task, so that the running agent/code is aware of the exposed port.
A use case for it would be combining the clearml-session with the k8s gl...
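Rough illustration only: the K8sIntegration parameter names below (ports_mode, num_of_services) are taken from the clearml-agent k8s glue example and should be treated as assumptions, not a verified API:
` from clearml_agent.glue.k8s import K8sIntegration

k8s = K8sIntegration(
    ports_mode=True,       # one external port per spun-up instance (assumption)
    num_of_services=10,    # upper bound on concurrently exposed instances (assumption)
)
k8s.k8s_daemon(queue="default")  # serve tasks from this queue (assumption) `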
Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine the repository is on?
Yes this repo is downloaded into the agent, so your code has access to it
You mean to add the extra index url?
you could use :
https://github.com/allegroai/clearml-agent/blob/5f0d51d485629e9dfc2d826622524461e3fcae8a/docs/clearml.conf#L63
BattyLion34
Maybe something inside the task is different?!
Could you run these lines and send me the result:
from clearml import Task
print(Task.get_task(task_id='failing task id').export_task())
print(Task.get_task(task_id='working task id').export_task())
Okay I found the issue ( I think),
If the images are reported very quickly, it will "decide" you are about to override the previous one (i.e. 101 -> overwriting 0, which makes sense; the bug was that it would disable the 101 from uploading and not the 0)
Test fix:
in /backend_interface/metrics/events.py, line 292, change:
` last_count = self._get_metric_count(self.metric, self.variant, next=False)
if abs(self._count - last_count) > int(self._file_history_size):
...
SoreDragonfly16 notice that if you abort a task in the web UI, it will do exactly what you described: print a message and quit the process. Any chance someone did that?
ModelCheckpoint('best_model', save_best_only=True)
That worked for me now, what's the diff
I think this is the discussion you are after:
https://clearml.slack.com/archives/C01H5VAUZ8R/p1612452197004900?thread_ts=1612273112.002400&cid=C01H5VAUZ8R
Hi DilapidatedDucks58 ,
Are you running in docker or venv mode?
Do the works share a folder on the host machine?
It might be a syncing issue (not directly related to the trains-agent, but to the fact that you have 4 processes trying to simultaneously access the same resource)
BTW: the next trains-agent RC will have a flag (default off) for torch-nightly repository support