And the agent default runtime mode is docker, correct?
Actually the default is venv mode; to run in docker mode, add --docker to the command line.
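For example (queue name is hypothetical):
```
clearml-agent daemon --queue default --docker
```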
So I could install all my system dependencies in my own docker image?
Correct, inside the docker container it will inherit all the preinstalled packages, but it will also install any missing ones (based on the Task requirements, i.e. the "installed packages" section).
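If it helps, a minimal sketch of pinning your own image on a task (the image name is hypothetical):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="custom-docker")
# when an agent running in --docker mode picks this task up, it runs inside this image
task.set_base_docker("my-registry/my-image:latest")  # hypothetical image
```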
Also, what is the purpose of the `aws` block in the clearml.c...
Hi @<1562610699555835904:profile|VirtuousHedgehong97>
I think you need to upgrade your self-hosted clearml-server, could that be the case?
Notice this is only when:
- Using Conda as the package manager in the agent (see the config sketch below)
- The requested python version is already installed (multiple python version installations on the same machine/container are supported)
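A minimal clearml.conf sketch for the first condition, assuming the standard agent config layout:
```
agent {
    package_manager {
        # use conda instead of the default pip
        type: conda
    }
}
```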
Okay this seems correct...
Can you share both yaml files (server & serving) and env file?
@<1595587997728772096:profile|MuddyRobin9> are you sure it was able to spin up the EC2 instance? Which clearml autoscaler version are you running?
We were able to find a stable, free, open source, multiplatform way to do this
You mean to move the data from the gdrive to object storage ? or to just mount the gdrive ?
Okay, wait, I'll see if I can come up with something.
OK - the issue was the firewall rules that we had.
Nice!
But now there is an issue with the "Setting up connection to remote session" step.
OutrageousSheep60 this is just a warning, basically saying we are using the default signed SSH server key (has nothing to do with the random password, just the identifying key being used for the remote ssh session)
Bottom line, I think you have everything working 🙂
CooperativeFox72 we are aware of Pool throwing an exception that causes things to hang. The fix will be deployed in 0.16 (due to be released tomorrow).
Do you have code to reproduce it, so I can verify the fix solves the issue?
One way to circumvent this, btw, would be to also add/use the --python flag for virtualenv.
Notice that when creating the venv, the command used is basically `pythonx.y -m virtualenv ...`
By definition this will create a new venv based on the python that executes virtualenv.
With all that said, it might be that there is a bug in virtualenv, and in some cases it does not adhere to this restriction.
I thought this is the issue on the thread you linked, did I miss something ?
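For example, a minimal sketch of forcing the interpreter with the --python flag (paths are hypothetical):
```
python3 -m virtualenv --python=/usr/bin/python3.9 /tmp/my_venv
```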
Hi WackyRabbit7
Yes, we definitely need to work on wording there ...
"Dynamic" means you register a pandas object that you are constantly logging into while training, think for example the image files you are feeding into the network. Then Trains will make sure it is constantly updated & uploaded so you have a way to later verify/compare different runs and detect dataset contemplation etc.
"Static" is just, this is my object/file upload and store it as an artifact for me ...
Make sense ?
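If it helps, a minimal sketch of the two modes (project/artifact names are hypothetical; the same API exists in the current clearml package):
```
import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="artifacts-demo")

# "Dynamic": register a DataFrame you keep appending to during training;
# it is periodically re-uploaded so later runs can be compared
samples = pd.DataFrame(columns=["image_file", "label"])
task.register_artifact(name="training samples", artifact=samples)

# "Static": upload an object once and store it as an artifact
task.upload_artifact(name="config snapshot", artifact_object={"lr": 0.001})
```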
It might be broken for me; as I said, the program works without offline mode, but with offline mode it gets interrupted and shows the results from above.
How could I reproduce this issue ?
But there might be another issue in between of course - any idea how to debug?
I think I missed this one, what exactly is the issue ?
Pseudo-ish code:
1. Create the pipeline:
```
pipeline = Task.create(..., task_type="controller")
pipeline.mark_started()
print(pipeline.id)
```
2. Launch step A (pass arguments via command line argument / os environment):
```
task = Task.init(...)
pipeline_id = os.environ['MY_MAIN_PIPELINE']
pipeline_task = Task.get_task(task_id=pipeline_id)
# send some metrics / reports etc.
pipeline_task.get_logger().report_scalar(...)
pipeline_task.get_logger().report_text(...)
```
wdyt? (obviously you need to somehow pass th...
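If it helps, a minimal sketch of the launcher side (the env var name is taken from the snippet above; project/task names are hypothetical):
```
import os
from clearml import Task

# create the pipeline controller
pipeline = Task.create(project_name="examples", task_name="my pipeline", task_type="controller")
pipeline.mark_started()

# expose the pipeline id so a locally launched step A can report into it
os.environ['MY_MAIN_PIPELINE'] = pipeline.id
```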
No, should be fine... Let me see if I can get a Windows box 🙂
EnviousStarfish54 could you send the conda / pip environment?
Maybe that's the diff between machine A/B ?
Is there any documentation on versioning for Datasets?
You mean how to select the version name ?
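For example, a minimal sketch (assuming a recent clearml where Dataset.create accepts an explicit dataset_version; names are hypothetical):
```
from clearml import Dataset

ds = Dataset.create(
    dataset_project="examples",
    dataset_name="my-dataset",
    dataset_version="1.2.0",  # explicit version name
)
```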
DefeatedCrab47 yes that is correct. I actually meant if you see it in TensorBoard's UI 🙂
Anyhow, if it is there, you should find it under the Task's Results > Debug Samples.
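For reference, a minimal sketch of manually reporting an image so it shows up under Debug Samples (the file path is hypothetical):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="debug-samples")
task.get_logger().report_image(
    title="validation",
    series="sample",
    iteration=0,
    local_path="sample.png",  # hypothetical local image file
)
```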
The task pod (experiment) started reaching out to an IP associated with malicious activity. The IP was associated with 1000+ domain names. The activity was identified in AWS guard duty with a high severity level.
BoredHedgehog47 What is the pod container itself ?
EDIT:
Are you suggesting the default "ubuntu:18.04" is somehow contaminated ?
https://hub.docker.com/layers/library/ubuntu/18.04/images/sha256-d5c260797a173fe5852953656a15a9e58ba14c5306c175305b3a05e0303416db?context=explore
TenseOstrich47 notice:
```
task.logger.report_matplotlib_figure(
    title=f"Performance Heatmap - {name}",
    series="Device Brand Predictions",
    iteration=0,
    figure=figure,
    report_image=True,  # <- note this
)
```
report_image=True means it will be uploaded as an image, not a plot (like imshow); the default is False, which would put it under the Plots section.
Could you add a few prints and see where it hangs? There's no reason for it to hang (even the plot upload is done ...
Hi @<1545216070686609408:profile|EnthusiasticCow4>
The auto detection of clearml is based on the actual imported packages, not the requirements.txt of your entire python environment. This is why some of them are missing.
That said you can always manually add them
```
Task.add_requirements("hydra-colorlog")  # optional: add version="1.2.0"
task = Task.init(...)
```
(notice: Task.add_requirements must be called before Task.init)
Hi VexedCat68
can you supply more details on the issue ? (probably the best is to open a github issue, and have all the details there, so we have better visibility)
wdyt?
The versions don't need to match, any combination will work.
"is removed from the experiment list?"
You mean archived?
BTW: I think we had a better example, I'll try to look for one
Hi FierceHamster54
Do I need to instantiate a task inside my component? Seems a bit redundant...
Yes, so the idea is that the Task (along with the code) will be automatically linked with the output model, for better traceability.
That said, you can "import" a model into the system (i.e. it was created somewhere else and you want to register it) with InputModel.import_model
https://clear.ml/docs/latest/docs/clearml_sdk/model_sdk#importing-models
I guess "Input" from that perspecti...
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace) for the task when queuing the task and the agents pick these tasks if they have the requested resources. With this, the user need not think about which queue to send the task to. The users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation ?