Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion.
This is a good point! I'll make sure we stress it (BTW: it will work with elevated credentials, but probably not recommended)
I'm glad to hear 🙂
If you can reproduce it, let me know
MoodyCentipede68 seems you did not pass any configuration (OS env or conf file), so it does not know how to find the server and authenticate. Makes sense?
that's the entire repo link ? not something like https://github.com/ ... ?
PompousParrot44, so you mean like a base conda env?
Configuring trains-agent to use conda is done here:
https://github.com/allegroai/trains-agent/blob/699d13bbb34649c7e5337b4187cda59b7fa6fd33/docs/trains.conf#L44
Then for every experiment trains-agent will create a new conda environment based on the requirements of that experiment.
You can tell it to inherit the base conda env (or the one it is running from, I think) by setting system_site_packages: true https://github.com/allegroai/tr...
I'm trying to get a task to run using a specific docker image and to source a bash script before execution of the python script.
Are you running the agent in docker mode? If so, you should be able to see the output of your bash script first thing in the log
(and it will appear in the docker CMD)
SoreDragonfly16 could you reproduce the issue?
What's your OS? trains versions?
GiddyTurkey39
BTW: you can always add the missing package via code: Task.add_requirements('torch', optional_version)
What I want is to manually provide a name to each series equal to the subject name (Subject 1, Subject 2, etc.)
They appear as they are reported to TB. I think this is a PyTorchLightning thing... If you look at the TB output produced, you will get the same naming scheme, no?
TrickySheep9 is this a conda package or a wheel you are installing manually ?
OSError: [Errno 28] No space left on device
Hi PreciousParrot26
I think this says it all 🙂 there is no more storage left to run all those subprocesses
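If it helps, here is a minimal sketch of checking free space before launching anything, using only the standard library. The helper name `has_free_space` and the threshold are made up for illustration; this is not something the SDK does for you.

```python
import shutil


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least `min_free_bytes` free.

    Hypothetical helper for illustration only.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3  # arbitrary threshold, pick what fits your jobs
    if not has_free_space(".", one_gib):
        print("Less than 1 GiB free, launching subprocesses will likely fail")
```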
btw:
I am curious about why a ThreadPool of 16 threads is gathered,
This is the maximum number of simultaneous jobs it will try to launch (it will launch more once a launch is done; notice this is the launching, not the actual execution), but this is just a way to limit it.
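To make the "limit on simultaneous launches" idea concrete, here is a rough standard-library sketch. It is not the actual launcher code; `ConcurrencyProbe` and `launch_jobs` are hypothetical names, and the sleep is only a stand-in for spawning a job.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Hypothetical stand-in for a job launcher that records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def launch(self, job_id):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # stand-in for the time it takes to launch a job
        with self._lock:
            self._active -= 1
        return job_id


def launch_jobs(job_ids, launch, max_workers=16):
    """Launch every job, but never more than `max_workers` launches at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the input order; queued jobs start as workers free up
        return list(pool.map(launch, job_ids))


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    launch_jobs(range(50), probe.launch, max_workers=16)
    print("peak simultaneous launches:", probe.peak)
```

The pool size only caps how many launches overlap; all 50 jobs still get launched.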
When I passed specific arguments (for example --steps) it ignored them...
script.py test blah1 blah2 blah3 42
Is this how it is intended to be used ?
Hi PanickyMoth78
` torch.save(net.state_dict(), PATH)  # auto-uploads to GCS
# get all the models from the Task
output_models = Task.current_task().models["output"]
# get the last one
last_model = output_models[-1]
# set meta-data
last_model.set_metadata(key="my key", value="my value", v_type="str") `
Hi AmusedPanda8
I think the project name is ./model_training/trained_models/yolov8n-TEST_OKTODELETE/ and for some reason you have "." as a project?
(Notice that nested projects are automatically created based on the project name, with '/' as the separator.)
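Just to illustrate why a leading "./" would end up as a "." project, here is a small sketch of splitting a name on '/'. This is plain Python, not the server's implementation, and both helper names are hypothetical.

```python
def split_project_path(project_name):
    """Split a project name into nested project levels on '/'.

    Illustration only; the real server-side logic may differ.
    """
    return [part for part in project_name.split("/") if part]


def strip_dot_levels(project_name):
    """Drop '.' levels so './a/b/' becomes 'a/b' (hypothetical clean-up helper)."""
    return "/".join(part for part in split_project_path(project_name) if part != ".")


if __name__ == "__main__":
    name = "./model_training/trained_models/yolov8n-TEST_OKTODELETE/"
    print(split_project_path(name))
    print(strip_dot_levels(name))
```

With the name from the message above, the first level of the split is ".", which matches the stray "." project.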
MelancholyElk85 what are you trying to change? Maybe there is a better way?
BTW: if you do step_base_task.export_task() you can use the parts that you need in the dict and pass them to the task_overrides argument of add_step (you might need to flatten the nested arguments with '.', and thinking about it, maybe we should do that automatically?!)
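A minimal sketch of the flattening step mentioned above, in plain Python. The helper name `flatten_overrides` and the example keys are assumptions for illustration; check which keys your exported dict really contains before passing anything on.

```python
def flatten_overrides(nested, parent_key="", sep="."):
    """Flatten a nested dict into {'a.b.c': value} form using `sep` as the separator."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict) and value:
            # recurse into non-empty dicts, carrying the dotted prefix along
            flat.update(flatten_overrides(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


if __name__ == "__main__":
    # hypothetical subset of an exported task dict
    exported_part = {"script": {"branch": "main", "entry_point": "train.py"}}
    print(flatten_overrides(exported_part))
```

The flattened dict is what you would then hand to task_overrides.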
Hi JitteryCoyote63, report_frequency_sec=30. controls how frequently monitoring events are sent to the server; the default is every 30 seconds (you can change the UI display to wall-time to review). You can change it to 180 so it will only send an event every 3 minutes (for example).
sample_frequency_per_sec is the sampling frequency it uses internally; it then averages the results over the course of the report_frequency_sec time window and sends the averaged result on the repo...
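Roughly, the sample-then-average behaviour can be pictured like this. It is a plain-Python sketch, not the monitor's actual code; `average_reports` is a hypothetical name and it assumes the samples arrive as a simple list.

```python
def average_reports(samples, sample_frequency_per_sec, report_frequency_sec):
    """Group raw samples into report windows and return one average per window.

    Sketch under assumptions: samples are evenly spaced and already collected.
    """
    samples_per_report = max(1, int(sample_frequency_per_sec * report_frequency_sec))
    reports = []
    for start in range(0, len(samples), samples_per_report):
        window = samples[start:start + samples_per_report]
        reports.append(sum(window) / len(window))
    return reports


if __name__ == "__main__":
    # 1 sample per second, one report every 3 seconds
    print(average_reports([1, 2, 3, 4, 5, 6], 1, 3))
```

So a larger report_frequency_sec means fewer events, each one averaging more samples.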
I am symlinking the .clearml directory to a NAS server and this is perhaps part of the problem.
Yep, that sounds about right. It uses the POSIX file system for its internal lock mechanisms (multi-process locks), and my guess is that the NAS for some reason does not support it...
I think it is on the JWT token the session gets from the server
a bit of a hack but should work 🙂
session = task.session # or Task._get_default_session()
my_user_id = session.get_decoded_token(session.token)['identity']['user']
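For context, this is roughly what decoding a JWT payload looks like with only the standard library. It is a sketch that does not verify the signature, and the helper names are hypothetical; the 'identity' / 'user' layout is taken from the snippet above, not from any spec.

```python
import base64
import json


def decode_jwt_payload(token):
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a token of the form header.payload.signature")
    payload_b64 = parts[1]
    padding = "=" * (-len(payload_b64) % 4)  # JWT segments drop base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))


def make_demo_token(payload):
    """Build an unsigned demo token, only so the decoder can be tried offline."""
    def encode(obj):
        raw = json.dumps(obj).encode("utf-8")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

    return ".".join([encode({"alg": "none"}), encode(payload), "sig"])


if __name__ == "__main__":
    token = make_demo_token({"identity": {"user": "demo-user-id"}})
    print(decode_jwt_payload(token)["identity"]["user"])
```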
where is the port? why https ?
Will they get ordered ascending or descending?
Good point, I'll check the docs... but I think they do not specify
https://clear.ml/docs/latest/docs/references/sdk/task#taskget_tasks
From the code it seems the order is not guaranteed.
You can, however, pass '-last_update' in order_by, which will give you the latest updated first
` task_filter = {
    'page_size': 2,
    'page': 0,
    'order_by': ['last_metrics.{}.{}'.format(title, series), '-last_update']
}
Task.get_tasks(...
bash: line 1: 1031 Aborted (core dumped)
FloppySwallow46 seems like the processes crashed,
Oh I see, the pipeline controller itself (not the components) is the one with the repo.
To fix that, add the following at the top of the script:
` from clearml import Task
Task.force_store_standalone_script()
@PipelineDecorator.pipeline(...) `
That should do the trick
Hi RotundHedgehog76
we have issues with clearml-agent when using standalone mode. ...
What is the use case for standalone mode? is this venv or docker mode?
ShallowCat10 try something similar to this one. Do notice that it might take a while to get all the task objects, so I would start with a single one 🙂
`
from trains import Task
tasks = Task.get_tasks(project_name='my_project')
for task in tasks:
    scalars = task.get_reported_scalars()
    for x, y in zip(scalars['title']['original_series']['x'], scalars['title']['original_series']['y']):
        task.get_logger().report_scalar(title='title', series='new_series', value=y, iteration=...