Reputation
Badges 1
25 × Eureka!And you have the exact same folder structure / content, and server A/B give a different set of experiments ?
(is serverB empty, meaning no experiments at all?)
Could it be in a python at_exit event ?
btw: what's the OS and python version?
task.mark_completed()
You have that at the bottom of the script, never call it on yourself, it will kill the actual process.
So what is going on you are marking your own process for termination, then it terminates itself leaving the interpreter and this is the reason for the errors you are seeing
The idea of mark_* is to mark an external Task, forcefully.
By just completing your process with exit code (0) (i.e. no error) the Task will be marked as completed anyhow, no need to call...
HealthyStarfish45 if I understand correctly the trains-agent is running as daemon (i.e. automatically pulling jobs and executes them), the only point might be cancelling a daemon will cause the Task executed by that daemon to be canceled as well.
Other than that, sounds great!
I was thinking mainly about AWS.
Meaning S3?
Hi SharpHedgehog60
Task type is another way to declare the type of processing the Task performs.
Later you can filter based on the Task type (like you would with a Tag).
For example Datasets are always of a Type "data processing"
Thanks ElegantCoyote26 I'll look into it. Seems like someone liked our automagical approach 🙂
Interesting use case, do you already have the connect_configuration in the code? or do we need to somehow create it ?
is it normal that it's slower than my device even though the agent is much more powerful than my device? or because it is just a simple code
Could be the agent is not using the GPU for some reason?
WackyRabbit7 basically starting v1.1 if you are running code without any configuration file, you will get an error (in contrast to previous versions where it defaulted to the demo-server)
Hmm maybe this is the issue, :
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense, conda is saying pytorch=1.8 needs cudatoolkit <10.2/10.3 but actually it needs cudatoolkit 11.1
Hi GrotesqueOctopus42
In theory it can be built, the main hurdle is getting elk/mongo/redis containers for arm64 ...
Hi WittyOwl57
I think what happens is it auto-logs the joblib load/save calls, these calls track models used/created by the code, and attach them to the model repository representing these model.
I'm assuming there are multiple load/save , and there are multiple model instances pointing to the same local file "file:///tmp/..." . The earning basically says it is re-registering existing models.
Make sense ?
Hi @<1533620191232004096:profile|NuttyLobster9>base_task_factory is a function that gets the node definition and returns a Task to be enqueued ,
pseudo code looks like:
def my_node_task_factory(node: PipelineController.Node) -> Task:
task = Task.create(...)
return task
Make sense ?
task.connect is two way, it does everything for you:base_params = dict(param1=123, param2='text') task.connect(base_params) print(base_params)If you run this code manually, then print is exactly what you initialized base_params with. But when the agent is running it, it will take the values from the UI (including casting to the correct type), so print will result in values/types from the UI.
Make sense ?
Hi ContemplativeCockroach39
Seems like you are running the exact code as in the git repo:
Basically it points you to the exact repository https://github.com/allegroai/clearml and the script examples/reporting/pandas_reporting.py
Specifically:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/reporting/pandas_reporting.py
Yes, but only with git clone 🙂
It is not stored on ClearML, this way you can work with the experiment manager without explicitly giving away all your code 😉
So basically take data from tensorboard read it, and report it to the cloud ?
Hi JumpyPig73
import data from old experiments into the dashboard.
what do you mean by "old experiments" ?
I guess this is doable:
You can get the entire set of scalars like as pandas DF: https://www.tensorflow.org/tensorboard/dataframe_api
(another example: https://stackoverflow.com/a/45899735 )
Then iterate over the different runs and create + report scalars)
` from clearml import Task
for run in runs:
task = Task.create_task(...)
logger = task.get_logger()
not real code, just example:
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
for step, val in zip(step_nums...
If I have access to the logs, python env and git commits, is there an API to log those to the experiments too?
Sure:task.update_task see here:
https://clear.ml/docs/latest/docs/references/sdk/task#update_task
example:task.update_task(task_data={'script': {'branch': 'new_branch', 'repository': 'new_repo'}})The easiest way to get all the different sections (they should be relatively self explanatory) is calling task.export_task() which returns a dict with all the fields yo...
ConfusedPig65 could you send the full log (console) of this execution?
Sorry @<1689446563463565312:profile|SmallTurkey79> just notice your reply
Hmm so I know the enterprise version has a built-in support for slurm, which would remove the need to deploy agents on the slurm cluster.
What you can do is on the SLURM login server (i.e. a machine that can run sbatch), write a simple script that pulls the Task ID from the queue and calls sbatch with clearml-agent execute --id <task_id_here> , would this be agood solution
CloudyHamster42 you mean that when you set sdk.metrics.tensorboard_single_series_per_graph to True and you rerun the experiment, you are still getting multiple series on the same graph?
What's your Trains version?
trains-agent doesn't run the clone, it is pip...
basically calling "pip install git+https://..."
Not sure you can pass extra arguments
Also, this is not a setup problem, otherwise it would have seen consistently failing ... this actually looks like a network issue.
The only thing I can think of is retrying to install if we get network error (not sure whats the exit code of pip though (maybe 9?)
quick update 1.0.2 will be ready in an hour, apologies 😞