and then run my script from terminal normally... (in the case of the environment variable I passed it before the python command)
Then, by default, it is the home folder (`~/.clearml`) that is running out of free space
I agree, but setting the agent’s env variable TMPDIR didn’t seem to have any effect (check the log above, it is still using /tmp)
oh sorry, my bad. Then you probably need to define all the OS environment variables for the Python temp folder for the agent (the Task process itself is a child process, so it will inherit them)
TMPDIR=/new/tmp TMP=/new/tmp TEMP=/new/tmp clearml-agent daemon ...
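A minimal sketch of why setting these variables on the agent should reach the task: Python's tempfile module consults TMPDIR, TEMP and TMP, and a child process inherits the environment it is started with. The helper name child_tempdir is hypothetical, and a throwaway directory stands in for /new/tmp; this only illustrates the inheritance, it is not how the agent itself is implemented.

```python
import os
import subprocess
import sys
import tempfile


def child_tempdir(new_tmp):
    """Return the temp folder a child Python process resolves when
    TMPDIR, TMP and TEMP all point at new_tmp (hypothetical helper)."""
    # Copy the current environment and override the three temp-folder variables.
    env = dict(os.environ, TMPDIR=new_tmp, TMP=new_tmp, TEMP=new_tmp)
    # The child inherits env and asks tempfile where it would write.
    out = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()


if __name__ == "__main__":
    # A throwaway directory stands in for /new/tmp.
    with tempfile.TemporaryDirectory() as stand_in:
        print("child process temp folder:", child_tempdir(stand_in))
```

Note that tempfile only accepts a directory that exists and is writable, so the target folder has to be created before the agent starts.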
To give more context, he is running a hyperparameter optimization script that internally clones a base task, runs it with certain params, and checks whether a metric increases or decreases. The error is raised when the agent tries to run this task.
ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
clearml_agent: ERROR: Could not install task requirements!
Command '['~/.clearml/venvs-builds/3.8/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsot4de9w6.txt']' returned non-zero exit status 1.
Is it possible that the agent is somehow limiting the space for the environment creation @<1523701087100473344:profile|SuccessfulKoala55> ?
Because if he runs the same command in a console the install works
Sure @<1523701087100473344:profile|SuccessfulKoala55> ! Here it is!
@<1523703080200179712:profile|NastySeahorse61> how are you running the agent? What is the command line? And how are you passing the environment variable you mentioned?
I also suggested changing the TMPDIR env variable, since /tmp/ didn’t have a lot of space.
agent.environment.TMPDIR = ****
is it ok to see **** instead of the actual path?
oh ok, I was wondering if this could have been an issue: agent.venvs_cache.free_space_threshold_gb = 2.0
line 120 says unmark to enable venv caching (it comes commented by default, but since I’m copying my conf it isn’t commented there)
# ClearML SDK configuration file
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server:
web_server:
files_server:
# Credentials are generated using the webapp,
# Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
credentials {"access_key": REMOVED, "secret_key": REMOVED}
}
sdk {
# ClearML - default SDK configuration
storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.clearml/cache"
# default_cache_manager_size: 100
}
direct_access: [
# Objects matching are considered to be available for direct access, i.e. they will not be downloaded
# or cached, and any download request will return a direct reference.
# Objects are specified in glob format, available for url and content_type.
{ url: "file://*" } # file-urls are always directly referenced
]
}
metrics {
# History size for debug files per metric/variant. For each metric/variant combination with an attached file
# (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
# X files are stored in the upload destination for each metric/variant combination.
file_history_size: 100
# Max history size for matplotlib imshow files per plot title.
# File names for the uploaded images will be recycled in such a way that no more than
# X images are stored in the upload destination for each matplotlib plot title.
matplotlib_untitled_history_size: 100
# Limit the number of digits after the dot in plot reporting (reducing plot report size)
# plot_max_num_digits: 5
# Settings for generated debug images
images {
format: JPEG
quality: 87
subsampling: 0
}
# Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
tensorboard_single_series_per_graph: false
}
network {
metrics {
# Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
# a specific iteration
file_upload_threads: 4
# Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
# being sent for upload
file_upload_starvation_warning_sec: 120
}
iteration {
# Max number of retries when getting frames if the server returned an error (http code 500)
max_retries_on_server_error: 5
# Backoff factory for consecutive retry attempts.
# SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
retry_backoff_factor_sec: 10
}
}
aws {
s3 {
# S3 credentials, used for read/write access by various SDK elements
# Default, used for any bucket not specified below
region: ""
# Specify explicit keys
key: ""
secret: ""
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
# Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
extra_args: {}
credentials: [
# specifies key/secret credentials to use when handling s3 urls (read or write)
# {
# bucket: "my-bucket-name"
# key: "my-access-key"
# secret: "my-secret-key"
# },
# {
# # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
# host: "my-minio-host:9000"
# key: "12345678"
# secret: "12345678"
# multipart: false
# secure: false
# }
]
}
boto3 {
pool_connections: 512
max_multipart_concurrency: 16
}
}
google.storage {
# # Default project and credentials file
# # Will be used when no bucket configuration is found
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# pool_connections: 512
# pool_maxsize: 1024
# # Specific credentials per bucket and sub directory
# credentials = [
# {
# bucket: "my-bucket"
# subdir: "path/in/bucket" # Not required
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# },
# ]
}
azure.storage {
# containers: [
# {
# account_name: "clearml"
# account_key: "secret"
# # container_name:
# }
# ]
}
log {
# debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
null_log_propagate: false
task_log_buffer_capacity: 66
# disable urllib info and lower levels
disable_urllib3_info: true
}
development {
# Development-mode options
# dev task reuse window
task_reuse_time_window_in_hours: 72.0
# Run VCS repository detection asynchronously
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
support_stopping: true
# Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
default_output_uri: ""
# Default auto generated requirements optimize for smaller requirements
# If True, analyze the entire repository regardless of the entry point.
# If False, first analyze the entry point script, if it does not contain other to local files,
# do not analyze the entire repository.
force_analyze_entire_repo: false
# If set to true, *clearml* update message will not be printed to the console
# this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
suppress_update_message: false
# If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
detect_with_pip_freeze: false
# Log specific environment variables. OS environments are listed in the "Environment" section
# of the Hyper-Parameters.
# multiple selected variables are supported including the suffix '*'.
# For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
# This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
# Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
log_os_environments: []
# Development mode worker
worker {
# Status report period in seconds
report_period_sec: 2
# ping to the server - check connectivity
ping_period_sec: 30
# Log all stdout & stderr
log_stdout: true
# Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
# Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
console_cr_flush_period: 10
# compatibility feature, report memory usage for the entire machine
# default (false), report only on the running process and its sub-processes
report_global_mem_used: false
}
}
# Apply top-level environment section from configuration into os.environ
apply_environment: false
# Top-level environment section is in the form of:
# environment {
# key: value
# ...
# }
# and is applied to the OS environment as `key=value` for each key/value pair
# Apply top-level files section from configuration into local file system
apply_files: false
# Top-level files section allows auto-generating files at designated paths with a predefined contents
# and target format. Options include:
# contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
# format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
# base64-encoded contents string, otherwise ignored
# path: the target file's path, may include ~ and inplace env vars
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# Example:
# files {
# myfile1 {
# contents: "The quick brown fox jumped over the lazy dog"
# path: "/tmp/fox.txt"
# }
# myjsonfile {
# contents: {
# some {
# nested {
# value: [1, 2, 3, 4]
# }
# }
# }
# path: "/tmp/test.json"
# target_format: json
# }
# }
}
Well, the agent actually can't limit this space even if we wanted to 🙂
Thanks so much for all your help @<1523701205467926528:profile|AgitatedDove14> @<1523702868694011904:profile|AbruptCow41> @<1523701087100473344:profile|SuccessfulKoala55>
agree, but setting the agent’s env variable TMPDIR
I think this needs to be passed to the docker with -e TMPDIR=/new/tmp as additional container args
wdyt?
any idea what could be the issue @<1523701087100473344:profile|SuccessfulKoala55> ?
Can you share the agent's/task full log when running this task?
@<1523703080200179712:profile|NastySeahorse61> so glad you managed to solve it 🎊 🚀
Hi @<1523701087100473344:profile|SuccessfulKoala55> . I'm trying to run an optimization task, based on a previous experiment. I ran the agent like this:
clearml-agent daemon --queue my_queue -d
Then the only other option is that /tmp is out of space (pip uses it to uncompress the .whl files, then it deletes them)
wdyt?
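One way to check both suspects (the temp folder and the ~/.clearml home folder) is to ask the OS how much space is free on the filesystem behind each path. This is a rough sketch using only the standard library; the function names free_gb and report are made up for illustration.

```python
import os
import shutil
import tempfile


def free_gb(path):
    """Free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 1024 ** 3


def report(paths):
    """Map each existing path to its free space in GiB; skip missing paths."""
    result = {}
    for raw in paths:
        path = os.path.expanduser(raw)
        if os.path.exists(path):
            result[path] = free_gb(path)
    return result


if __name__ == "__main__":
    # The temp folder Python resolves right now, plus the agent's default home folder.
    for path, gb in report([tempfile.gettempdir(), "~/.clearml"]).items():
        print(f"{path}: {gb:.1f} GiB free")
```

Running it as the same user that runs the agent matters, since the agent may see a different temp folder or home folder than an interactive shell does.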
The specific problem I have is this error message:
ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
clearml_agent: ERROR: Could not install task requirements!
But I'm sure I have space... I wanted to use a predefined virtualenv as a workaround, but maybe I can fix the original problem
I had tried adding those environment variables, but not in the agents 🙈
can you share your clearml.conf file (remove the critical information first)?
I don’t see an agent section there 😕
Can you move your current clearml.conf file to another location and run clearml-agent init?
Hi Lema Gabriel, thank you very much for your answer. I'm just using the defaults... Should I change something in the configuration?
I'll attach my config just in case
@<1523703080200179712:profile|NastySeahorse61> / @<1523702868694011904:profile|AbruptCow41>
Is there a way to keep each task from creating a new environment?
You can just define CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
it will just use whatever you have there (notice it will totally ignore requirements.txt and "installed packages" on the Task)
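As a sketch, the variable can be combined with the daemon command quoted earlier in this thread. The helper agent_launch_spec is hypothetical, and the code only builds and prints the command and environment rather than starting the agent.

```python
import os


def agent_launch_spec(queue, extra_env=None):
    """Build the command and environment for starting the agent with
    environment installation skipped (hypothetical helper)."""
    env = dict(os.environ)
    # Variable named in the thread: the agent uses the existing Python environment as-is.
    env["CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL"] = "1"
    env.update(extra_env or {})
    # Same daemon command as used earlier in the thread.
    cmd = ["clearml-agent", "daemon", "--queue", queue, "-d"]
    return cmd, env


if __name__ == "__main__":
    cmd, env = agent_launch_spec("my_queue")
    print("CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=" + env["CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL"],
          " ".join(cmd))
```

Passing cmd and env to subprocess.run would start the agent with the variable set; the same effect comes from prefixing the variable to the command in a shell.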
BTW I would recommend turning on the venv caching. This is per docker/python/packages caching, so the next time you use the exact same requirements it just pulls the environment from the cache and attaches it to the container.
Un-comment this line