line 13 is empty 🤔
Still investigating, task.data.last_iteration is correct (equal to engine.state["iteration"] ) when I resume the training
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars using multi processes, because if my ignite callback is the following:
` def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().r...
Configuration:
` {
    "resource_configurations": {
        "v100": {
            "instance_type": "g4dn.2xlarge",
            "availability_zone": "us-east-1a",
            "ami_id": "ami-05e329519be512f1b",
            "ebs_device_name": "/dev/sda1",
            "ebs_volume_size": 100,
            "ebs_volume_type": "gp3",
            "key_name": "key.name",
            "security_group_ids": [
                "sg-asd"
            ],
            "is_spot": false,
            "extra_configura...
(Just to know if I should wait a bit or go with the first solution)
Are you planning to add a server-backup service task in the near future?
So the new EventsIterator is responsible for the bug.
Is there a way for me to easily force the WebUI to always use the previous endpoint (v1.7)? I saw in the diff changes v1.1.0 > v1.2.0 that ES version was bumped to 7.16.2. I am using an external ES cluster, and its version is still 7.6.2. Can it be that the incompatibility comes from here? I’ll update the cluster to make sure it’s not the case
with my hack yes, without, no
Yea so I assume that training my models using docker will be slightly slower so I'd like to avoid it. For the rest using docker is convenient
for some reason when cloning task A, trains sets an old commit in task B. I tried to recreate task A to enforce a new task id and new commit id, but still the same issue
It indeed has the old commit, so they match, no problem actually 🙂
Hi SuccessfulKoala55 , AgitatedDove14 ,
I updated to 1.4.0 (Web UI shows: WebApp: 1.5.0-186 • Server: 1.5.0-186 • API: 2.18 )
Unfortunately the bug is still there 😞
I don’t see errors in the console anymore though!
I had another look and modified a events.get_task_logs request with a super old timestamp to try to retrieve all logs, this returned me only the few logs already displayed in the console. So I think the problem doesn’t come from the WebUI, but from the...
Alright, thanks for the answer! Seems legit then 🙂
I came up with the same code, thanks for the fast answer (yes having a setter for that would be cool!)
AgitatedDove14 I now tested with a real experiment, it works, but I saw two issues:
It first doesn't detect torch, downloads it, but then says that it is already installed, so it doesn't install it.
One of the dependencies of my repository is another repository (repo-2 in the logs). Both my repositories require numpy . When installing the first repository, it says Requirement already satisfied: numpy in /home/workeruser/.local/lib/python3.6/site-packages . Correct. But then it says `...
I think it comes from the web UI of the version 1.2.0 of clearml-server, because I didn’t change anything else
Hi SuccessfulKoala55 , thanks for the idea! the function isn’t called with atexit.register() though, maybe the way the agent kills the task is not supported by atexit
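For context on the atexit point above: Python only runs atexit hooks on a normal interpreter shutdown, not when the process is killed by a signal that Python does not handle. How the agent actually stops the task is not stated here, so the following is only a sketch under the assumption that it sends SIGTERM; `save_state` and `exit_on_sigterm` are hypothetical names. If the process is killed with SIGKILL, nothing like this can help, since that signal cannot be caught.

```python
import atexit
import signal
import sys


def save_state():
    # Hypothetical cleanup; replace with whatever must run before exit.
    print("cleanup ran")


def exit_on_sigterm(signum, frame):
    # Raising SystemExit lets the interpreter shut down normally,
    # which is what triggers the registered atexit hooks.
    sys.exit(0)


atexit.register(save_state)
# Must be called from the main thread. Assumes the task is stopped
# with SIGTERM, which is not confirmed in this thread.
signal.signal(signal.SIGTERM, exit_on_sigterm)
```

With the handler installed, a SIGTERM turns into a regular exit and `save_state` gets a chance to run; without it, the default SIGTERM action terminates the process immediately and skips the hooks.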
Could you please point me to the relevant component? I am not familiar with typescript unfortunately 😞
I want the clearml-agent/instance to stop right after the experiment/training is “paused” (experiment marked as stopped + artifacts saved)
AgitatedDove14 I finally solved it: The problem was --network='host' should be --network=host
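One possible explanation for the quoting issue above, stated as an assumption and not confirmed in this thread: if the extra docker arguments are split on whitespace and handed to docker without going through a shell, the quotes are never stripped and docker receives the literal value `'host'`. The sketch below only illustrates that difference in plain Python; it does not show how the agent parses its arguments.

```python
import shlex

raw = "--network='host'"

# Plain whitespace splitting keeps the quotes as part of the value.
naive = raw.split()

# Shell-style splitting removes the quotes, as a shell would.
shell_like = shlex.split(raw)

print(naive)       # ["--network='host'"]
print(shell_like)  # ['--network=host']
```

Writing `--network=host` without quotes gives the same result in both cases, which is why it is the safer form.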
Not sure about that, I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback 🙂
mmh it looks like what I was looking for, I will give it a try 🙂
The task is created using Task.clone() yes
AgitatedDove14 Is it fixed with trains-server 0.15.1?
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to ssh in the machine, will keep you informed 😄
I will go for lunch actually 😄 back in ~1h
I am confused now because I see that in the master branch, the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?