
(Broken link in the migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
So I changed ebs_device_name = "/dev/sda1", and now I correctly get the 100 GB EBS volume mounted on /. All good 👍
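For reference, a minimal sketch of the relevant autoscaler resource configuration (HOCON). The field names follow the trains/clearml AWS autoscaler example; the resource name, instance type, AMI and zone are placeholders:

```
resource_configurations {
  # illustrative resource name and instance type
  gpu_machine {
    instance_type: "p3.2xlarge"
    is_spot: false
    availability_zone: "us-east-1b"
    ami_id: "ami-0123456789abcdef0"   # placeholder
    ebs_device_name: "/dev/sda1"      # the fix from the message above
    ebs_volume_size: 100              # -> the 100 GB volume mounted on /
    ebs_volume_type: "gp2"
  }
}
```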
I think this is because this API is not available in elastic 5.6
SuccessfulKoala55 Thanks to that I was able to identify the most expensive experiments. How can I count the number of documents for a specific series? I.e., I suspect that the loss, which is logged every iteration, is responsible for most of the logged documents, and I want to make sure of that
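One way to count documents for a single series is Elasticsearch's `_count` endpoint. This is a sketch: the index pattern (events-training_stats_scalar-*) and the task/metric/variant field names are assumptions based on trains-server's scalar events index, so adjust them to your deployment:

```
POST /events-training_stats_scalar-*/_count
{
  "query": {
    "bool": {
      "must": [
        { "term": { "task": "<task_id>" } },
        { "term": { "metric": "loss" } },
        { "term": { "variant": "loss" } }
      ]
    }
  }
}
```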
Stopping the server, editing the docker-compose.yml file to add the logging section to all services, and restarting the server freed 10 GB of Docker logs
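For reference, the logging section mentioned above looks like this in docker-compose.yml; the json-file driver with rotation limits is standard Docker Compose, and the sizes are illustrative:

```yaml
services:
  apiserver:
    # ... existing service definition ...
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files
```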
Ok, I could reproduce with Firefox and Chromium. Steps:
1. Add creds (either via the popup or in the settings)
2. Go to /settings/webapp-configuration -> creds should be there
3. Hit F5 -> creds are gone
I hit enter too fast ^^
Installing them globally via $ pip install numpy opencv torch actually installs locally, with the warning: "Defaulting to user installation because normal site-packages is not writeable". The installation therefore goes to ~/.local/lib/python3.6/site-packages instead of the default location. Will this still be considered global site-packages and still be included in experiment envs? From what I tested, it does
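A quick way to check where a user install actually lands and whether it is on the interpreter's path, using only the standard library:

```python
import site
import sys

# Directory used for "user installs" (the ~/.local path from the pip warning)
print(site.getusersitepackages())

# The global site-packages directories a normal install would use
print(site.getsitepackages())

# Anything importable must be on sys.path; both kinds of site-packages show up here
print([p for p in sys.path if p.endswith("site-packages")])
```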
Hi CostlyOstrich36 , I mean insert temporary access keys
I just read it: I do have trains version 0.16, and the experiment was created with that version
ok, thanks SuccessfulKoala55 !
Or even better: would it be possible to have a support for HTML files as artifacts?
Some context: I am trying to log an HTML file and I would like it to be easily accessible for preview
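In the meantime, an HTML file can at least be attached as a regular artifact. A sketch using Task.upload_artifact (a standard trains/clearml call, but it needs a running server, so treat this as pseudocode here):

```
# sketch, not runnable without a trains/clearml server
from trains import Task

task = Task.init(project_name="examples", task_name="html artifact")
# upload an existing HTML file as an artifact; it can then be downloaded from the UI
task.upload_artifact(name="report", artifact_object="report.html")
```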
AgitatedDove14 WOW, thanks a lot! I will dig into that 🚀
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
If I remove security_group_ids and just leave subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)
Hi SuccessfulKoala55 , there it is > https://github.com/allegroai/clearml-server/issues/100
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
Yes, it works now! Yay!
I think we should switch back, and have a configuration option to control which mechanism the agent uses, wdyt?
That sounds great!
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false
I need to call task.set_initial_iteration(0)
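For reference, the first setting goes into the SDK configuration file (trains.conf / clearml.conf, HOCON syntax):

```
sdk {
  development {
    # report metrics from the main process instead of a forked subprocess
    report_use_subprocess: false
  }
}
```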
So it looks like the agent, from time to time thinks it is not running an experiment
For some reason the configuration object gets updated at runtime to:
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""
Now I know which experiments have the most metrics. I want to downsample these metrics by 10, i.e. only keep iterations that are multiples of 10. How can I query (to delete) only the documents whose iteration does not end with 0?
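One way to do the downsampling on the Elasticsearch side is `_delete_by_query` with a script that drops every iteration that is not a multiple of 10. This is a sketch: the index pattern and the task/iter field names are assumptions based on trains-server's scalar events index:

```
POST /events-training_stats_scalar-*/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "task": "<task_id>" } },
        { "script": { "script": "doc['iter'].value % 10 != 0" } }
      ]
    }
  }
}
```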
Could be also related to https://allegroai-trains.slack.com/archives/CTK20V944/p1597928652031300
AgitatedDove14 I finally solved it: the problem was --network='host', which should be --network=host
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to S3. Boto3 is able to retrieve the files from the S3 bucket
the latest version, but I think it's normal: I set TRAINS_WORKER_ID="trains-agent:$DYNAMIC_INSTANCE_ID", where DYNAMIC_INSTANCE_ID is the ID of the machine
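A sketch of how that worker ID can be composed in shell. The instance ID is hard-coded here so the snippet runs anywhere; in practice it would come from the EC2 instance metadata endpoint, as shown in the comment:

```shell
# In practice: DYNAMIC_INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
DYNAMIC_INSTANCE_ID="i-0abc123def456"   # placeholder instance ID
export TRAINS_WORKER_ID="trains-agent:${DYNAMIC_INSTANCE_ID}"
echo "$TRAINS_WORKER_ID"
```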