Should I open an issue in the GitHub clearml-agent repo?
It could be: I am running the ClearML AWS autoscaler in an EC2 instance that has IAM roles allowing it to create/delete instances, but I get: Warning! exception occurred: An error occurred (UnauthorizedOperation) when calling the RunInstances operation: You are not authorized to perform this operation. Encoded authorization failure message: ...
I suspect that since the agent is running in docker mode, the boto3 lib doesn't automatically get the right permissions from the EC2 instance. To...
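Something like this is what I have in mind to verify it (just a rough sketch of my own, not something from the autoscaler code): from inside the container, ask STS which identity boto3 actually resolves.
` # Rough check, assuming boto3 is installed inside the container and the
# EC2 instance metadata endpoint is reachable from the docker network.
import boto3

sts = boto3.client("sts")
# Prints the account/ARN boto3 resolved; fails if no credentials are picked up.
print(sts.get_caller_identity()) `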
DeterminedCrab71 This is the behaviour of holding Shift while selecting in Gmail; if ClearML could reproduce this, that would be perfect!
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified number_of_shards: 4 for the events-log-d1bd92a3b039400cbafc60a7a5b1e52b index. I then copied the documents from the old ES using the _reindex API. The index is 7.5GB on one shard.
Now I see that this index on the new ES cluster is ~19.4GB. The index is divided into the 4 shards, but each shard is between 4.7GB and 5GB! I was expecting to have the same index size as in the previous e...
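For reference, this is roughly how I am comparing the per-shard sizes (a quick sketch against the plain REST API; es_host is just a placeholder):
` # Rough sketch, assuming the ES REST endpoint is reachable at es_host.
import requests

es_host = "http://localhost:9200"  # placeholder
index = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Per-shard document count and on-disk store size.
print(requests.get(
    f"{es_host}/_cat/shards/{index}",
    params={"v": "true", "h": "index,shard,prirep,docs,store"},
).text)

# A POST {index}/_forcemerge?max_num_segments=1 might shrink freshly
# reindexed shards, but that is just my guess here. `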
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one does not exist / is not accessible anymore, I replaced it with the one that is shown in the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title that is: ami-04c0416d6bd8e4b1f
Yes, in the Task being executed in the agents, I have:
` from trains import Task
task = Task.init(...)
task.get_logger().report_text(str(task.get_parameters())) `
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
Hi @<1523701087100473344:profile|SuccessfulKoala55> I was able to find the issue: I was creating a queue and a worker subprocess that were not properly cleaned up
Doing it the other way around works:
` cfg = OmegaConf.create(read_yaml(conf_yaml_path))
config = task.connect(cfg)
type(config)
<class 'omegaconf.dictconfig.DictConfig'> `
` with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader) `
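(For completeness, the read_yaml helper written out as a self-contained function; same one-liner as above, assuming PyYAML is installed:)
` # Assumes PyYAML (the yaml package) is available.
import yaml

def read_yaml(path):
    # Load a YAML file into a plain Python dict/list structure.
    with open(path, "r") as stream:
        return yaml.load(stream, Loader=yaml.FullLoader) `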
And so in the UI, in the Workers & Queues tab, I randomly see one of the two experiments for the worker that is running both experiments
` python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached --gpus 1 > ~/trains-agent.startup.log 2>&1 `
That was also my feeling! But I thought that spawning the trains-agent from a conda env would isolate me from the CUDA drivers on the system
Oh wow! Is it possible to not specify a remote task? (If I am working with Task.set_offline(True))
very cool, good to know, thanks SuccessfulKoala55
SuccessfulKoala55 I tried to set up the clearml-agent on a different machine and now I get a different error message in the logs:
` Warning: could not locate requested Python version 3.6, reverting to version 3.6
clearml_agent: ERROR: Python executable with version '3.6' defined in configuration file, key 'agent.default_python', not found in path, tried: ('python3.6', 'python3', 'python') `
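Just to rule out PATH issues, a quick sanity check I can run on that machine (a minimal sketch of my own, nothing agent-specific; it only checks the interpreter names listed in the error):
` # Check which of the interpreters the agent tries are actually on PATH.
import shutil

for name in ("python3.6", "python3", "python"):
    print(name, "->", shutil.which(name)) `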
Alright, I have a follow-up question then: I used the param --user-folder "~/projects/my-project", but any change I make is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to the folder on the machine. Is it possible to do so?
Sure yes! As you can see I just added the block:
` logging:
  driver: "json-file"
  options:
    max-size: "200k"
    max-file: "10" `
to all services. Also in this docker-compose I removed the external binding of the ports for mongo/redis/es
but then why do I have to do task.connect_configuration(read_yaml(conf_path))._to_dict()? Why not simply task.connect_configuration(read_yaml(conf_path))? I mean, what is the benefit of returning a ProxyDictPostWrite instead of a dict?
ExcitedFish86 I have several machines with different CUDA driver/runtime versions, that is why you might be confused, as I am referring to one or another
I can ssh into the agent and:
` source /trains-agent-venv/bin/activate
(trains_agent_venv) pip show pyjwt
Version: 1.7.1 `
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only deleted ~20M documents out of 320M documents in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents, so I can delete them. I would like to aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
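Something like this is what I have in mind (a rough sketch with the plain REST API; I am assuming the event documents carry a task field holding the experiment ID, and es_host is a placeholder):
` # Rough sketch: count documents per experiment via a terms aggregation.
# Assumes the events documents have a "task" field with the experiment ID.
import requests

es_host = "http://localhost:9200"  # placeholder
index = "events-training_debug_image-xyz"

query = {
    "size": 0,
    "aggs": {
        "per_task": {
            # Top 50 experiments by document count.
            "terms": {"field": "task", "size": 50}
        }
    },
}
resp = requests.post(f"{es_host}/{index}/_search", json=query)
for bucket in resp.json()["aggregations"]["per_task"]["buckets"]:
    print(bucket["key"], bucket["doc_count"]) `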
Hi AgitatedDove14 , I upgraded to 1.3.1 and the bug of missing logs in the console is still there…
I made another recording so that you can understand what it is about:
I enqueue a task
the task starts, but the logs shown in the console are very sparse
I scroll up and down to try to fetch the missing logs, without success
I download the logs, open the file, and there I see the full logs