Hi @<1523701070390366208:profile|CostlyOstrich36> , this is what our devops engineer said:
the proxy-body-size limitation crashed for the Clearml api, for WEB and FileServer I set it to unlimited, but for the API I didn't change it.
Server (see screenshot). Thanks!
I don't see any console errors
It seems so, yes. I'm not the one who did the server migration, but as a user I believe this is when I started noticing the issue for new datasets created after the migration.
I actually have a question about your original code snippet, @<1556450111259676672:profile|PlainSeaurchin97> . I have been trying to figure out a way to access the task object when running remotely so that I can instantiate the logger, but when I tried task_id = os.getenv("CLEARML_TASK_ID"), it’s returning None. I also tried Task.current_task() and also got None back. What is the recommended way to access the Task object from within the remote agent?
Hi @<1523701205467926528:profile|AgitatedDove14> , sorry for the delayed reply. So what you’re saying is to first kick off a new run and then rename the underlying Pipeline Task, which will cause that particular run to become a new pipeline name? But you have to do this only after you’ve started the run.
What would be most ideal would be to be able to right-click on a pipeline run and have a “clone” option, like you can with a task, where you can start a new run with a new name in a single ...
Hi Max, thanks very much for your message! I understand what you’re saying now, though I suppose this is not my issue since I’m not setting any of the decorator values with variables. I’ll post a query in the main channel with code snippets to see if anyone has ideas. Thank you!
Here's my example script:
from random import randint

from clearml import Task

if __name__ == "__main__":
    task: Task = Task.init(
        project_name="clearml-examples", task_name="try-to-make-logging-work"
    )
    task.execute_remotely(queue_name="5da90f42dd4c40edab972a4bef8eab04")
    logger = task.get_logger()
    for i in range(10):
        logger.report_scalar("example plot", series="random", value=randint(0, 100), iteration=i)
Hi Martin, I see. That makes sense, though I would have expected the behavior to be the same when running remotely the first time as well. In any case, this solved the issue for me. Thanks for looking at it.
Yes, that did make it work in this case, thank you.
@<1523701205467926528:profile|AgitatedDove14> : FYI here it is None
To be clear, Task.init() was called initially. I had to call it again later in the code in order to get the current task object instead of Task.current_task(), which only seems to work locally. That's the part that is not intuitive.
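A minimal sketch of that workaround, assuming only what is described above: try Task.current_task() first and call Task.init() again if it returns None. The helper name current_task_or_init and the injectable task_cls parameter are hypothetical; they exist only so the fallback logic can be exercised without a ClearML server.

```python
def current_task_or_init(project_name, task_name, task_cls=None):
    """Return the current task, calling init() again if current_task() gives None."""
    if task_cls is None:
        # Imported lazily; requires the clearml package to be installed.
        from clearml import Task as task_cls

    task = task_cls.current_task()
    if task is None:
        # Fallback reported in this thread: a second Task.init() call
        # returned a usable task object when running in the agent.
        task = task_cls.init(project_name=project_name, task_name=task_name)
    return task
```

The returned object can then be used as usual, for example task.get_logger().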
Hi @<1523701205467926528:profile|AgitatedDove14> , on the resource logging: I tried with a sleep test and it works when I'm running it from my local machine, but when I run remotely in an agent, I do not see resource logging.
And, similarly, with tensorboard logging, it works fine when running from my machine, but not when running remotely in an agent. For this, I've decided to just re-write the logging code to use ClearML's built-in logging methods, which work fine in the agent. Would stil...
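A rough sketch of what reporting through ClearML's built-in logger can look like, assuming the metrics are available as a plain dict per iteration. The helper name report_metrics and the default title "metrics" are made up for illustration; the only ClearML call used is logger.report_scalar(title, series, value, iteration), the same one as in the example script in this thread.

```python
from typing import Mapping


def report_metrics(logger, metrics: Mapping[str, float], iteration: int,
                   title: str = "metrics") -> int:
    """Report each metric as one series under a shared plot title.

    `logger` is expected to be the object returned by task.get_logger().
    Returns the number of scalars reported.
    """
    count = 0
    for name, value in metrics.items():
        logger.report_scalar(title, series=name, value=float(value), iteration=iteration)
        count += 1
    return count
```

In a training loop this would be called once per step, for example report_metrics(task.get_logger(), {"loss": loss}, iteration=step).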
Unfortunately, it's turning out to be quite time consuming to manually remove all of the private info in here. Is there a particular section of the log that would be useful to see? I can try to focus on just sharing that part.
Hi @<1523701205467926528:profile|AgitatedDove14> , CLEARML_TASK_ID is set inside the agent's process, which is how I was able to get the task by running Task.get_task(environ["CLEARML_TASK_ID"]). However, I believe I've sorted out how to make both the resource logging and the tensorboard logging work in the agent. It seems that using Task.current_task() to get the task object does not work when running remotely, but calling Task.init() again does work. And after having called ...
No, I'm not seeing that "Dataset Content" section. We have some older datasets that were copied from a prior server deployment that do have the section and it appears in the UI.
That could happen with any task when it’s cloned. To be honest, the cron and trigger schedulers probably deserve their own UI panel since they operate differently than other tasks. Ideally, a user would be able to add and remove jobs from the schedulers purely through the UI.
Okay, I take it back. os.getenv("CLEARML_TASK_ID") does work. I forgot to rebuild my container after making the change. Thanks for bringing this option to my attention!
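A minimal sketch of that lookup, assuming the agent exports CLEARML_TASK_ID into the process environment as described in this thread. The helper names resolve_task_id and get_remote_task are hypothetical, and clearml is imported lazily so that the ID lookup can be checked without the package installed.

```python
import os
from typing import Mapping, Optional


def resolve_task_id(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the task ID from CLEARML_TASK_ID, or None if it is unset or empty."""
    env = os.environ if env is None else env
    return env.get("CLEARML_TASK_ID") or None


def get_remote_task(env: Optional[Mapping[str, str]] = None):
    """Fetch the Task for the ID found in the environment, or None if no ID is set."""
    task_id = resolve_task_id(env)
    if task_id is None:
        return None
    # Imported lazily; requires the clearml package and a reachable server.
    from clearml import Task
    return Task.get_task(task_id=task_id)
```

Returning None when the variable is missing lets the same code run locally, where the variable may not be set.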
Okay, so I discovered that setting -e CLEARML_AGENT_PACKAGE_PYTORCH_RESOLVE=none solves the issue.
That said, if someone could explain to me why this error was occurring and why it only happens in the case of cloning, I'd love to understand. Thanks!
Hi @<1523701205467926528:profile|AgitatedDove14> , sure. I just need to scrub them of any sensitive info, then I'll post to this thread. Thanks for your reply.
Correction: it works when I am running the code in my local VSCode session. I still don't get resource logging when I run in an agent. 🤔 And on a similar topic, I have a separate task that is logging metrics with tensorboard. When running locally, I see the metrics appear in the "scalars" tab in ClearML, but when running in an agent, nothing. Any suggestions on where to look?
Hi @<1523701205467926528:profile|AgitatedDove14> , thanks. So the code to be executed by the task needs to be provided to the Task.create() method as script=some/path.py, or does it work to have something like this?
def my_node_task_factory(node: PipelineController.Node) -> Task:
    task = Task.create(...)
    my_function()
    return task
@<1523701225533476864:profile|ObedientDolphin41> , I was searching for anyone having an issue like mine and found this thread. I have created a simple pipeline using decorators, and when I try to clone it in the UI, I get that "base_task_id is empty" error. It works fine when triggered programmatically from my machine. I’m wondering if you could elaborate on how you utilized the get_configuration_object and set_configuration_object methods to solve this? In my case, I’m not setting a...
Hi @<1523701070390366208:profile|CostlyOstrich36> , I would expect the loss_func parameter to be FocalLoss instead of ['FocalLoss', 'FocalLoss', 'FocalLoss', 'FocalLoss'] (and the same for the validation_split_name parameter). I will try to put together an example, though it might take a little time before I can do it.
I believe you should be able to set the queue_name parameter to None to accomplish this.
Ahhh okay, thank you. Perhaps in the future, it would be great to allow this from the UI as well?