CostlyOstrich36 Were you able to reproduce it? That’s rather annoying 😅
Hi CostlyOstrich36 , one more observation: it looks like when I don’t open the experiment in the webUI before it is finished, then I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don’t see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or only rarely) afterwards. Is that possible?
Hi SuccessfulKoala55 , will I be able to update all references to the old s3 bucket using this command?
CostlyOstrich36 yes, when I scroll up, a new events.get_task_log request is fired and the response doesn't contain any logs (but it should)
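As a side check, the same endpoint can be queried directly with the clearml APIClient to see whether the backend itself returns the missing lines. A minimal sketch below; the task ID is a placeholder and the batch_size / navigate_earlier parameters are assumptions mirroring what the UI sends:
```python
# Sketch: query events.get_task_log directly, bypassing the Web UI.
# Assumes ~/clearml.conf holds valid API credentials.
from clearml.backend_api.session.client import APIClient

client = APIClient()
res = client.events.get_task_log(
    task="94jfk2479851047c18f1fa60c1364b871",  # placeholder task ID
    batch_size=500,         # assumed parameter, mirrors the UI's paging
    navigate_earlier=True,  # assumed parameter, page backwards in time
)
print(len(res.events))  # compare with what the console view actually shows
```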
Hi CostlyOstrich36 , this weekend I took a look at the diffs with the previous version ( https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0# ) and I saw several changes related to the scrolling/logging:
apiserver/bll/event/log_events_iterator.py
apiserver/bll/event/events_iterator.py
apiserver/config/default/services/_mongo.conf
apiserver/database/model/base.py
apiserver/services/events.py
I suspect that one of these changes might be responsible ...
CostlyOstrich36 I updated both agents to 1.1.2 and still got the same problem unfortunately. Since I can download the full log file from the Web UI, I guess the agents are reporting correctly?
Could it be that Elasticsearch does not return all the requested logs when it is queried from the WebUI to display them in the console?
Now that I think about it, I remember that the changelog of clearml-server 1.2.0 lists the following:
Fix UI Workers & Queues and Experiment Table pages ...
SmugDolphin23 Actually adding agent.python_binary didn't work, it was not read by the clearml agent (in the logs dumped by the agent, agent.python_binary = (no value))
what about the stacktrace of the error: Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2] ?
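For what it's worth, a minimal boto3 sketch (placeholder AMI and instance type) of the distinction this error is about: the region is set on the EC2 client, while the availability zone goes into Placement, so passing a region name like eu-west-2 where an AZ (e.g. eu-west-2a) is expected yields exactly this InvalidParameterValue:
```python
# Sketch only: placeholder values, not the autoscaler's actual code.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")  # region goes here
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.medium",         # placeholder instance type
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "eu-west-2a"},  # an AZ, not the bare region "eu-west-2"
)
```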
Does what you suggested here > https://github.com/allegroai/trains-agent/issues/18#issuecomment-634551232 also apply to containers used by the services queue?
region is empty, I never entered it and it worked
ok, what is your problem then?
That gave me
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda running python3
Building Task 94jfk2479851047c18f1fa60c1364b871 inside docker: ubuntu:18.04
Starting docker build
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
(btw, yes I adapted to use Task.init(..., output_uri=...))
Alright, thanks for the answer! Seems legit then 🙂
File "devops/valid.py", line 80, in valid(parse_args) File "devops/valid.py", line 41, in valid valid_task.output_uri = args.artifacts File "/data/.trains/venvs-builds/3.6/lib/python3.6/site-packages/trains/task.py", line 695, in output_uri ", check configuration file ~/trains.conf".format(value)) ValueError: Could not get access credentials for 's3://ml-artefacts' , check configuration file ~/trains.conf
AgitatedDove14 This seems to be consistent even if I specify the absolute path to /home/user/trains.conf
Nice, the preview param will do 🙂 btw, I love the new docs layout!
I did that recently - what are you trying to do exactly?
AMI ami-08e9a0e4210f38cb6, region: eu-west-1a
Default would be venv, and only use docker if an image is passed. Use case: not having to duplicate all queues to accept both docker and venv agents on the same instances
So either I specify agent.python_binary: python3.8 in the clearml-agent config as you suggested, or I force the task locally to run with python3.8 using task.data.script.binary
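A sketch of the first option, i.e. pinning the interpreter in the agent's own configuration (assuming the agent reads ~/clearml.conf on its machine):
```
# ~/clearml.conf on the agent machine (sketch)
agent {
    # interpreter the agent uses to build the task's virtualenv
    python_binary: python3.8
}
```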
AgitatedDove14 In theory yes, there is no downside; in practice, running an app inside Docker inside a VM might introduce slowdowns. I guess it's on me to check whether this slowdown is negligible or not
Try to spin up an instance of that type manually in that region to see if it is available
did you try with another availability zone?
I would probably leave it to the ClearML team to answer you; I am not using the UI app and for me it worked just fine with different regions. Maybe check the permissions of the key/secrets?
So when I create a task locally using `task = Task.init(project_name=config.get("project_name"), task_name=config.get("task_name"), task_type=Task.TaskTypes.training, output_uri="s3://my-bucket")`, the artifact is correctly logged remotely, but when I create the task remotely (from an agent) the artifact is logged locally (on the agent machine, not on s3)
AgitatedDove14 I now tested with a real experiment, it works, but I saw two issues:
It first doesn't detect torch, downloads it, but then says that it is already installed so it doesn't install it. One of the dependencies of my repository is another repository (repo-2 in the logs). Both my repositories require numpy. When installing the first repository, it says Requirement already satisfied: numpy in /home/workeruser/.local/lib/python3.6/site-packages. Correct. But then it says ...