There is no way to create an artifact/model/dataset without a task, right? So just always inherit from the parent task, and if the task is cloned, change the user to the user who did the clone.
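A minimal sketch of what I mean, assuming every artifact hangs off some Task (project/task names below are placeholders):
` from clearml import Task

# Artifacts/models/datasets are always registered under a Task; there is no task-less upload.
task = Task.init(project_name="examples", task_name="artifact_owner")

# The artifact inherits the task's project and user.
task.upload_artifact(name="stats", artifact_object={"rows": 100}) `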
(just out of my own interest: how much does the enterprise version diverge from the open source version? Is it just extended, or are there core changes in the enterprise version?)
Let me check again.
Maybe something like this is how it is intended to be used?
` # run_with_clearml.py
def get_main_task():
    # Factory that creates the task to be executed
    task = Task.create(project_name="my_project", task_name="my_experiment", script="main_script.py")
    return task

def run_standalone(task_factory):
    Task.enqueue(task_factory())

def run_in_pipeline(task_factory):
    pipe = PipelineController()
    pipe.add_step(preprocess, ...)
    pipe.add_step(base_task_factory=task_factory, ...)
    pipe.add_step(postprocess, ...)
    pipe.start()

if...
So with pipeline decorators can I implement this logic?
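If I understand the decorator API correctly, a rough sketch could look like this (the step/function names are made up, only the PipelineDecorator calls are from the docs):
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["data"], cache=False)
def preprocess():
    ...

@PipelineDecorator.component(return_values=["result"])
def train(data):
    ...

@PipelineDecorator.pipeline(name="my_pipeline", project="my_project", version="0.1")
def run_pipeline():
    data = preprocess()
    result = train(data)

if __name__ == "__main__":
    # Run the controller locally; each step still gets its own ClearML task.
    PipelineDecorator.run_locally()
    run_pipeline() `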
It seems like the services-docker is always started with Ubuntu 18.04, even when I use task.set_base_docker("continuumio/miniconda:latest -v /opt/clearml/data/fileserver/:{}".format(file_server_mount))
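A sketch of the same call with image and arguments split, assuming a clearml version that has the keyword form of set_base_docker:
` # Assumption: newer clearml versions accept docker_image / docker_arguments separately.
task.set_base_docker(
    docker_image="continuumio/miniconda:latest",
    docker_arguments="-v /opt/clearml/data/fileserver/:{}".format(file_server_mount),
) `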
It seems like clearml removes the dev... from torch == 1.14.0.dev20221205+cu117 in the /tmp/ cached requirements.txt
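A possible workaround sketch, assuming Task.add_requirements keeps the full version string when called before Task.init (not verified against this behavior):
` from clearml import Task

# Pin the exact nightly build explicitly before Task.init so the captured
# requirements keep the .dev/+cu117 suffix.
Task.add_requirements("torch", "==1.14.0.dev20221205+cu117")
task = Task.init(project_name="examples", task_name="nightly_torch") `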
[root@dc01deffca35 elasticsearch]# curl `
{
"cluster_name" : "clearml",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 10,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 10,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_nu...
Can you ping me when it is updated in None so I can update my installation?
Could be a clean log after restart. Unfortunately, I restarted the server right away 😞 I am going to post again with the appropriate logs if it happens another time.
args is similar to what is shown in print(args) when executed remotely.
So missing args that are not specified are not None as intended; they just do not exist in args at all. And command is a list instead of a single str.
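A small sketch of what I mean (the attribute names mirror my script; getattr is just a generic way to show the missing-attribute case):
` args = parser.parse_args()

# Locally: args.command exists and defaults to None when not given.
# Remotely: the attribute is missing entirely, so plain access raises AttributeError.
command = getattr(args, "command", None)

# And when it is present after remote execution, it comes back as a list,
# e.g. ["train"], instead of the single string "train".
if isinstance(command, list):
    command = command[0] `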
AnxiousSeal95 Thanks a lot. Seems to be working fine for me. I see the clearml-agent version that pip installs in the docker is now fixed to the host version 🙂 PyTorch Nightly is also installed correctly now!
Okay, this seems to work fine.
Also here is how I run my experiments right now, so I can execute them locally and remotely:
` # Initialize ClearML Task
task = (
    Task.init(
        project_name="examples",
        task_name=args.name,
        output_uri=True,
    )
    if track_remote or enqueue
    else None
)

# Execute remotely via ClearML
if enqueue is not None and not running_remotely():
    if enqueue == "None":
        queue_name = None
        task.reset()
...
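For completeness, a sketch of the local-vs-remote switch this pattern boils down to (the queue name and the execute_remotely arguments are assumptions, not the original code):
` from clearml import Task

task = Task.init(project_name="examples", task_name="demo", output_uri=True)

queue_name = "default"  # assumption: whatever the --enqueue argument resolved to
if queue_name and Task.running_locally():
    # Stop the local run and re-launch this script on an agent pulling from queue_name.
    task.execute_remotely(queue_name=queue_name, clone=False, exit_process=True)

# From here on the script runs either locally (no queue) or inside the agent. `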
I see. I was just wondering what the general approach is. I think PyTorch used to ship the pip package without CUDA packaged into it. So with conda it was nice to only install CUDA in the environment and not the host. But with pip, you had to use the host version as far as I know.
Yes, I am also talking about agents on different machines. I had two agents on the server machine, which also seem to have been killed. The ones on different machines kept working until 1 or 2 minutes after the clearml-server restarted.
Also, clearml-agent at version 1.5 does not look for nightly builds at the correct indexes, even if torch_nightly is set to true in clearml.conf
Looking in indexes: https://pypi.org/simple , https://download.pytorch.org/whl/cu117/
Yea, correct! No problem. Uploading such large artifacts as I am doing seems to be an absolute edge case 🙂
Shows some logs, but nothing of relevance I think. Only infos and warnings about deprecated stuff that is still used ;D ...
Here is some code that shows exactly what goes wrong. I do local execution only. It seems not to be related to remote execution as I thought, but more related to clearml.Task:
` args = parser.parse_args()
print(args) # FIRST OUTPUT
command = args.command
enqueue = args.enqueue
track_remote = args.track_remote
preset_name = args.preset
type_name = args.type
environment_name = args.environment
nvidia_docker = args.nvidia_docker
# Initialize ClearML Tas...
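A minimal, self-contained repro sketch of the comparison (the nargs choice is an assumption to mirror the list behavior; the point is the before/after print around Task.init):
` import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--command", nargs="*", default=None)
parser.add_argument("--preset", default=None)
args = parser.parse_args()

print(args)  # FIRST OUTPUT: unspecified args show up as None

task = Task.init(project_name="examples", task_name="args_repro")

print(args)  # SECOND OUTPUT: compare here, this is where the difference shows up `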
btw: I am pretty sure this used to work, but then stopped working some time ago.
Then I could also do this:
` # My custom very special use case
task = Task()
task = task.load_statedict(await Task.load_or_create(task_name))
await task.synchronize()
await run_code_analysis()
task.add_requirement("myreq")
await task.synchronize() `
I don't know actually. But the PyTorch documentation says it can make a difference: https://pytorch.org/docs/stable/distributions.html#torch.distributions.distribution.Distribution.set_default_validate_args
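For reference, a tiny sketch of toggling it (this mirrors the linked PyTorch API; the distribution used is just an example):
` import torch
from torch.distributions import Distribution, Normal

# Disable argument validation globally for all distributions.
Distribution.set_default_validate_args(False)

# With validation off, invalid parameters are no longer checked eagerly,
# which can change behavior and performance slightly.
dist = Normal(loc=torch.zeros(3), scale=torch.ones(3))
print(dist.sample()) `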
Yea, but before in my original setup the config file was filled. I just added some lines to the config and now the error is back.
Maybe there is something wrong with my setup. Conda confuses me sometimes.