
Sure, it's because of a very annoying bug that I shared in this https://clearml.slack.com/archives/CTK20V944/p1648647503942759, that I couldn't solve so far.
I'm not sure you can downgrade that easily ...
Yeah, that's what I thought, that's a bit of a pain for me now. I hope I can find a way to fix the bug somehow
I am not sure I can do both operations at the same time (migration + splitting), do you think it's better to do splitting first or migration first?
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
Could also be related to https://allegroai-trains.slack.com/archives/CTK20V944/p1597928652031300
for some reason when cloning task A, trains sets an old commit in task B. I tried to recreate task A to enforce a new task id and new commit id, but still the same issue
It seems that around here, a Task that is created using Task.init remotely in the main process gets its output_uri parameter ignored
here is the function used to create the task:
`
def schedule_task(parent_task: Task,
                  task_type: str = None,
                  entry_point: str = None,
                  force_requirements: List[str] = None,
                  queue_name="default",
                  working_dir: str = ".",
                  extra_params=None,
                  wait_for_status: bool = False,
                  raise_on_status: Iterable[Task.TaskStatusEnum] = (Task.TaskStatusEnum.failed, Task.Ta...
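To make the symptom concrete, a minimal sketch (project, task name and bucket are hypothetical) of the output_uri being passed at creation time and then assigned explicitly on the task object afterwards as a possible workaround:
`
from clearml import Task

# hypothetical example: output_uri passed to Task.init seems to be ignored
# when the task is created remotely in the main process
task = Task.init(project_name="examples", task_name="scheduled_child",
                 output_uri="s3://my_bucket")

# possible workaround: assign the destination explicitly on the task object
task.output_uri = "s3://my_bucket"
`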
Hi SoggyFrog26, https://github.com/allegroai/clearml/blob/master/docs/datasets.md
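For quick reference, a minimal sketch of the Dataset workflow that doc walks through (dataset name, project and paths here are hypothetical):
`
from clearml import Dataset

# create a dataset version, add local files, upload and close it
ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.add_files("/path/to/local/data")
ds.upload()
ds.finalize()

# later, fetch a local copy from anywhere
local_path = Dataset.get(dataset_name="my_dataset",
                         dataset_project="datasets").get_local_copy()
`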
extra_configurations = {'SubnetId': "<subnet-id>"}
with brackets right?
amazon linux
and in the logs:
`
agent.worker_name = worker1
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = ==20.2.3
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.torch_nightly = false
agent.venvs_dir = /...
Is it safe to turn off replication while a reindex operation is happening? The reindexing is rather slow and I am wondering if turning off replication will speed up the process.
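To be clear about which knob I mean, something along these lines via the Elasticsearch index settings API (host and index name are made up here, and whether this is safe mid-reindex is exactly my question):
`
import requests

ES = "http://localhost:9200"
INDEX = "my_index"

# drop replication for the index while the reindex runs ...
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 0}})

# ... and restore it once the reindex has finished
requests.put(f"{ES}/{INDEX}/_settings",
             json={"index": {"number_of_replicas": 1}})
`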
Yes, that's what it looks like. Somehow when you clone the experiment repo, you correctly set the git creds in the url, but when the dependencies are installed, the git creds are not taken into account
There is a pinned GitHub thread on https://github.com/allegroai/clearml/issues/81, seems to be the right place?
AgitatedDove14 Same problem with clearml==1.1.5rc2, I also tried with backend==gloo, still the same problem
line 13 is empty
See my answer in the issue - I am not using docker
Sure, just sent you a screenshot in PM
I get the following error:
But I see in the agent logs:
`
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', ...
Yes, thanks! In my case, I was actually using TrainsSaver from pytorch-ignite with a local path, then I understood looking at the code that under the hood it actually changed the output_uri of the current task, that's why my previous_task.output_uri = "s3://my_bucket" had no effect (it was placed BEFORE the training)
the latest version, but I think it's normal: I set the TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine
Answering myself: Yes, Task.set_base_docker
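For completeness, a minimal sketch of that call (the image name is just an example):
`
from clearml import Task

task = Task.init(project_name="examples", task_name="docker_base")
# tell the agent which docker image to use when running this task
task.set_base_docker("nvidia/cuda:11.0-runtime-ubuntu20.04")
`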
RTFM!!!