yey working 🙂
Getting the last checkpoint can be done via:
Task.get_task(task_id='aabbcc').models['output'][-1]
It does, tested 🙂 but you should test as well
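For completeness, a short sketch of downloading that last checkpoint locally (the task id is illustrative):
from clearml import Task

# grab the last output model of the task and fetch its weights file
model = Task.get_task(task_id='aabbcc').models['output'][-1]
local_path = model.get_local_copy()
print(local_path)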
PungentLouse55 hmmm
Do you have an idea on how we could quickly reproduce it?
SubstantialElk6
Notice that if you are using a manual setup, the default is "secure: false"; you have to change it to "secure: true":
https://github.com/allegroai/clearml-agent/blob/176b4a4cdec9c4303a946a82e22a579ae22c3355/docs/clearml.conf#L251
I still wonder how no one noticed ... (maybe 100 unique title/series reports is a relatively high threshold)
Hi SubstantialElk6
UnicodeEncodeError: 'ascii' codec can't encode characters in position 296-297: ordinal not in range(128)
I'm assuming this is the usual UTF8 missing from the container.
Can you try to launch it with PYTHONIOENCODING=utf-8 ?
If I try to connect a dictionary of type dict[str, list] with task.connect, when retrieving this dictionary with
Wait, this should work out of the box, do you have any specific example?
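For reference, a minimal sketch of connecting such a dictionary (names and values are illustrative):
from clearml import Task

task = Task.init(project_name='examples', task_name='connect dict')
config = {'layers': [64, 128, 256], 'metrics': ['loss', 'acc']}
task.connect(config)
# when executed by an agent, values edited in the UI are reflected back into config
print(config)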
Hi PompousParrot44
You can check the cleanup service example.
It sleeps for 24 hours then spins up and does its thing.
You can always launch these service tasks on the services queue; its purpose is to run such services on the trains-server as additional CPU services. They will also be registered as service nodes, so you have visibility into which services are running.
In order to clone a task and wait for its completion, use TrainsJob:
https://github.com/allegroai/trains/blob/65a4a...
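A rough sketch of the idea, assuming the TrainsJob interface from the linked source (queue name and task id are illustrative):
from trains.automation.job import TrainsJob

# clone the template task and enqueue the clone
job = TrainsJob(base_task_id='aabbcc')
job.launch(queue_name='default')
job.wait()  # blocks until the cloned task completes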
What are user properties?
Think of them as parameters you can add post-execution, which you can also add to the Task table (i.e. customize columns)
How can I add parameters?
task.set_user_properties([{"name": "backbone", "description": "network type", "value": "great"}])
Hi JitteryCoyote63, when you run the trains-agent it tells you where it puts the logs; it's a temp auto-generated filename, usually under /tmp:
Running TRAINS-AGENT daemon in background mode, writing stdout/stderr to /tmp/.trains_agent_daemon_out4uahki3i.txt
to enable access to the s3 bucket. In this case I wonder how clearml sdk gets access to the s3 bucket if it relies on secret access key and access key id.
Right, basically someone needs to configure the "regular" environment variables for boto to use the IAM role. clearml basically uses boto, so it should be transparent. Does that make sense? How do you spin up the job on the k8s cluster, and how do you configure it?
Since these are temp credentials we need to use the sessi...
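For reference, these are the standard boto environment variables that clearml's boto client picks up transparently (values are placeholders); with temporary STS credentials the session token is needed as well:
import os

os.environ['AWS_ACCESS_KEY_ID'] = '<key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret>'
os.environ['AWS_SESSION_TOKEN'] = '<token>'  # only needed for temporary credentials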
PompousParrot44 unfortunately not yet 🙂
But the gist is :
MongoDB stores experiment data (i.e. execution parameters, git ref etc.)
ElasticSearch stores results (i.e. metrics, console logs, debug image links, etc.)
Does that help?
What about the epochs though? Is there a recommended number of epochs when you train on that new batch?
I'm assuming you are also using the "old" images?
The main factor here is the ratio between the previously used data and the newly added data; you might also want to oversample (i.e. train more on) the new data vs. the old data. Make sense?
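A toy sketch of that oversampling idea (all numbers and names are illustrative):
import random

old_data = ['old_sample_%d' % i for i in range(100)]
new_data = ['new_sample_%d' % i for i in range(10)]
oversample = 3  # each new sample appears ~3x more often than an old one
mixed = old_data + new_data * oversample
random.shuffle(mixed)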
PompousParrot44 now that I think about it, you might be able to limit the CPU affinity, would that help?
GreasyLeopard35 from the implementation:
https://github.com/allegroai/clearml/blob/fcad50b6266f445424a1f1fb361f5a4bc5c7f6a3/clearml/automation/parameters.py#L215
Which basically returns "self.base" (default: 10) to the power of the selected value: 10**-3 = 0.001
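For example (a sketch, assuming the LogUniformParameterRange signature from the linked file; the parameter name is illustrative):
from clearml.automation import LogUniformParameterRange

# samples the exponent in [-3, -1], i.e. values in [10**-3, 10**-1] = [0.001, 0.1]
lr_range = LogUniformParameterRange('learning_rate', min_value=-3, max_value=-1)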
So how would I get a negative value ?
So the naming is a by-product of the many TB files created (one per experiment); if you add different naming to the TB files, then this is what you'll be seeing in the UI. Make sense?
Hi @<1533620191232004096:profile|NuttyLobster9>, base_task_factory is a function that gets the node definition and returns a Task to be enqueued,
pseudo code looks like:
from clearml import Task, PipelineController

def my_node_task_factory(node: PipelineController.Node) -> Task:
    task = Task.create(...)  # create the Task based on the node definition (details elided)
    return task
Make sense ?
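And a hedged usage sketch, assuming add_step accepts the factory callable as described above (names are illustrative):
from clearml import PipelineController

pipe = PipelineController(name='pipeline', project='examples', version='1.0.0')
pipe.add_step(name='step1', base_task_factory=my_node_task_factory)
pipe.start()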
Hi ShallowCormorant89
Can you verify the http link is valid? Can you download it from code on your machine (i.e. not via an agent)? Maybe port 8081 is blocked from the agent machine to the server?
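A quick way to check from code (the URL is a placeholder for your actual artifact link):
import requests

r = requests.get('http://<clearml-server>:8081/path/to/artifact', timeout=10)
print(r.status_code, len(r.content))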
and this path should follow a Linux folder structure, not a single file like the current .zip.
I like where this is going 🙂
So are we thinking like a "shared" folder where the data is kept "warm", and a single source of truth where the packaged zip file is stored (like object storage, e.g. S3)?
Yep, automatically moving a tag
No, but you can get the last created/updated one with that tag (so I guess the same?)
I meant like the best artifacts.
So artifacts can be retrieved like a dict:
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts_retrieval.py
Task.get_task(project_name='examples', task_name='artifacts example').artifacts['name']
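And a short sketch of pulling the artifact content itself (the artifact name is illustrative):
from clearml import Task

task = Task.get_task(project_name='examples', task_name='artifacts example')
obj = task.artifacts['name'].get()  # deserialize the artifact object
path = task.artifacts['name'].get_local_copy()  # or download it as a local file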
The idea of queues is to balance between not giving users too much freedom on the one hand, and allowing for maximum flexibility & control on the other.
The granularity offered by K8s (and as you specified) is sometimes way too detailed for a user. For example: I know I want 4 GPUs, but 100GB disk-space? No idea, just give me 3 levels to choose from (if any; actually I would prefer a default that is large enough, since this is by definition for temp cache only), and the same argument goes for the number of CPUs.
Ch...
Hi @<1569858449813016576:profile|JumpyRaven4>, could you test the fix? Just pull & run:
allegroai/clearml-serving-triton:1.3.1
allegroai/clearml-serving-inference:1.3.1
But I do not know how it can help me :(
In your code itself, after the Task.init call, add:
task.set_initial_iteration(0)
See reply here:
https://github.com/allegroai/clearml/issues/496#issuecomment-980037382
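Putting it together, a minimal sketch (project/task names are illustrative):
from clearml import Task

task = Task.init(project_name='examples', task_name='continue training')
task.set_initial_iteration(0)  # reset the iteration offset when continuing a task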
Hi @<1545216070686609408:profile|EnthusiasticCow4>
My biggest concern is what happens if the TaskScheduler instance is shut down.
good question, follow up, what happens to the cron service machine if it fails?!
TaskScheduler instance is shutdown.
And yes you are correct if someone stops the TaskScheduler instance it is the equivalent of stopping the cron service...
btw: we are working on moving some of the cron/triggers capabilities to the backend , it will not be as flexi...
Hi @<1578193378640662528:profile|MoodySeaurchin4>
but is it possible to log some metrics too, like rmse or the likes? If so, how would you do it?
Sure, I'm assuming this is part of the output? If not, this means it is part of your code, and if that is the case then yes, you should use collect_custom_statistics_fn
`collect_custom_statistics_fn({'rmse'...
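A rough sketch of where that call would live, assuming the standard clearml-serving Preprocess class layout (the rmse lookup is illustrative):
class Preprocess(object):
    def postprocess(self, data, state, collect_custom_statistics_fn=None):
        if collect_custom_statistics_fn:
            # report a custom metric alongside the built-in statistics
            collect_custom_statistics_fn({'rmse': float(data['rmse'])})
        return data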