If I remove security_group_ids and leave just subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
(Even if I explicitly call my_task.close())
So in my use case each step would create a (potentially big) folder and store it as an artifact. The last step should "merge" all the previous folders. The idea is to split the work among multiple machines (in parallel). I would like to avoid these potentially big folder artifacts also being stored in the pipeline task, because that one will be running on the services queue of the clearml-server instance, which will definitely not have enough space to handle all of them
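Here is a minimal sketch of what I have in mind (task names and ids are illustrative, assuming the steps are plain Tasks and the merge step knows the step task ids):
` from pathlib import Path
from clearml import Task

# --- in each worker step (runs on its own machine) ---
step_task = Task.init(project_name="my_project", task_name="step")
out_dir = Path("chunk_output")  # the potentially big folder produced by this step
# the folder is zipped and uploaded to the file server,
# so the pipeline task itself never has to store it
step_task.upload_artifact(name="chunk", artifact_object=out_dir)

# --- in the final merge step ---
merge_task = Task.init(project_name="my_project", task_name="merge")
step_ids = ["<step_task_id_1>", "<step_task_id_2>"]  # illustrative: collected from the pipeline
local_folders = [
    Task.get_task(task_id=tid).artifacts["chunk"].get_local_copy()
    for tid in step_ids
]
# merge the downloaded folders here `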
I will try that and keep you updated
Sure yes! As you can see I just added the block
` logging:
  driver: "json-file"
  options:
    max-size: "200k"
    max-file: "10" `
to all services. Also in this docker-compose I removed the external binding of the ports for mongo/redis/es
I have the same problem, not only with subprojects but with all projects: I get this blank overview tab as shown in the screenshot. It only worked for one project that I created one or two weeks ago under 0.17
Hi AgitatedDove14, thanks for the answer! I will try adding multiprocessing_context='forkserver' to the DataLoader. In the issue you linked, nirraviv mentioned that forkserver was slower and shared a link to another issue https://github.com/pytorch/pytorch/issues/15849#issuecomment-573921048 where someone implemented a fast variant of the DataLoader to overcome the speed problem.
Did you experience any performance drop using forkserver? If yes, did you test the variant suggested i...
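For context, this is roughly how I plan to plug it in (dataset and sizes are just placeholders):
` import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(100.0).unsqueeze(1))
    # 'forkserver' spawns workers from a clean server process, which avoids
    # fork-related deadlocks at the cost of slower worker startup
    loader = DataLoader(
        dataset,
        batch_size=10,
        num_workers=4,
        multiprocessing_context="forkserver",
    )
    for batch in loader:
        pass  # training step would go here `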
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1
ok, now I actually remember why I used _update_requirements instead of add_requirements: the first overwrites all the others, the latter only adds to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
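To illustrate the difference (package names are placeholders, and _update_requirements is a private API so I am not sure the signature is stable):
` from clearml import Task

# public API: appends to the auto-detected packages,
# must be called before Task.init
Task.add_requirements("my-package", "1.0.0")

task = Task.init(project_name="test", task_name="test")

# private API: replaces the detected packages entirely,
# which is what I want since my deps live in setup.py
task._update_requirements(["my-package==1.0.0"]) `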
Since it fails on the first machine (clearml-server), I try to run it on another, on-prem machine (also used as an agent)
CostlyOstrich36 Were you able to reproduce it? That's rather annoying
I want the clearml-agent/instance to stop right after the experiment/training is "paused" (experiment marked as stopped + artifacts saved)
what about the stacktrace of the error?
` Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2] `
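If it helps, a quick boto3 reproduction of what I think is going on: RunInstances expects a zone with a letter suffix, not a bare region name (AMI id is a placeholder):
` import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")
ec2.run_instances(
    ImageId="ami-xxxxxxxx",  # placeholder
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    # "eu-west-2" alone is a region; an availability zone needs the
    # letter suffix, e.g. "eu-west-2a", otherwise InvalidParameterValue
    Placement={"AvailabilityZone": "eu-west-2a"},
) `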
But I can do:
` $ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.backends.cudnn.version()
8005 `
To help you debug this: in the /dashboard endpoint, all the projects were still there, but empty (no experiments inside). No experiments were archived either.
Very cool! "Run two trains-agent daemons, one per GPU on the same machine, with default Nvidia/CUDA Docker": this is close to my use case, I would just like to run these two daemons without docker. Would that be possible? I should just remove the --docker nvidia/cuda param, right?
Still failing with the same error
Hi CostlyOstrich36, one more observation: it looks like when I don't open the experiment in the webUI before it is finished, then I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don't see all the logs.
So it looks like there is a caching effect (the logs are retrieved only once, when I open the experiment for the first time, and not afterwards, or only rarely). Is that possible?
Yes! Not a strong use case though; rather, I wanted to ask if it was supported somehow
But if the task is now running on an agent, isn't it a possible source of conflict? I would expect that after calling Task.enqueue(exit=True), the local task is closed and no processes related to it are running
Thanks a lot AgitatedDove14 !
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO it could be related to the size of the repo. Might it be a good idea to clone with --depth 1 in the agents?
Or more generally, try to catch this error and retry a few times?
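Something like this hypothetical helper is what I have in mind (shallow clone plus a few retries with backoff):
` import subprocess
import time

def clone_with_retries(repo_url, dest, attempts=3):
    # hypothetical helper: shallow clone, retrying transient failures
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(
                ["git", "clone", "--depth", "1", repo_url, dest],
                check=True,
            )
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff `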
Hoo I found:
` user@trains-agent-1: ps -ax
 5199 ?  Sl  29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
 6096 ?  Sl  30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached `
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server (such as Task.init) are called?
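To make the question concrete, here is what I mean (my assumption is that the config is only loaded lazily):
` from clearml import Task  # does this already require clearml.conf?

# or is the config only needed here, when talking to the server?
task = Task.init(project_name="test", task_name="test") `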
Yes, it did spin up two instances for the same task
Thanks for your inputs, I will try that! For completion, here is how I retrieve the parameters:
` from trains import Task

task = Task.init("test", "test")
# the parent task is the one that enqueued this task
parent_task = Task.get_task(task.parent)
task.get_logger().report_text(task.get_parameters())
# the artifact name is passed through the task parameters
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get() `