psweep.psweep module#

exception psweep.psweep.PsweepHashError[source]#

Bases: TypeError

add_note()#

Exception.add_note(note) – add a note to the exception

with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

psweep.psweep.capture_logs_wrapper(pset, func, capture_logs, db_field='_logs')[source]#

Capture and redirect stdout and stderr produced in func().

Note the limitations mentioned in [1]:

Note that the global side effect on sys.stdout means that this context manager is not suitable for use in library code and most threaded applications. It also has no effect on the output of subprocesses. However, it is still a useful approach for many utility scripts.

So if users rely on playing with sys.stdout/stderr in func(), then they should not use this feature and take care of logging themselves.

[1] https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout

Return type:

dict

psweep.psweep.check_calc_dir(calc_dir, df)[source]#

Check calc dir for consistency with database.

Assuming dirs are named:

<calc_dir>/<pset_id1>
<calc_dir>/<pset_id2>
...

check if we have matching dirs to _pset_id values in the database.

psweep.psweep.df_ensure_dtypes(df, fill_value=<NA>)[source]#

Make sure that all of df’s columns have dtype object. Convert any pd.isna() values to fill_value.

This is part of our attempt to prevent pandas from doing type inference and conversion.

psweep.psweep.df_extract_dicts(df, py_types=False)[source]#

Convert df’s rows to dicts.

Parameters:
  • df (DataFrame)

  • py_types (bool) – If True, let Pandas (Series.to_dict()) decide types. It tries to return Python native types (e.g. it converts pd.NA to None.) Else, try to preserve types as they are in df.

Return type:

Sequence[dict]

psweep.psweep.df_extract_params(df, py_types=False)[source]#

Extract params (list of psets) from df.

Same as df_extract_dicts(), but limit columns to kind="pset" (see filter_cols()). This will reproduce the params fed to run() when following the prefix/postfix convention (see _get_col_filter()), meaning that the pset hashes will be the same.

Parameters:
  • df (DataFrame)

  • py_types (bool) – see df_extract_dicts()

Return type:

Sequence[dict]

Examples

>>> import psweep as ps
>>> from numpy.random import rand
>>> params=ps.pgrid(ps.plist("a", [1,2,3]), ps.plist("b", [77,88]))
>>> params
[{'a': 1, 'b': 77},
 {'a': 1, 'b': 88},
 {'a': 2, 'b': 77},
 {'a': 2, 'b': 88},
 {'a': 3, 'b': 77},
 {'a': 3, 'b': 88}]
>>> df=ps.run(func=lambda pset: dict(result_=rand()), params=params, save=False)
>>> ps.df_extract_params(df)
[{'a': 1, 'b': 77},
 {'a': 1, 'b': 88},
 {'a': 2, 'b': 77},
 {'a': 2, 'b': 88},
 {'a': 3, 'b': 77},
 {'a': 3, 'b': 88}]
psweep.psweep.df_extract_pset(df, pset_id, py_types=False)[source]#

Extract a single pset dict for pset_id from df.

This is a convenience function doing just

>>> df_extract_row(df, pset_id=pset_id, py_types=py_types, kind="pset")
Parameters:
  • df (DataFrame)

  • pset_id (str)

  • py_types (bool) – see df_extract_dicts()

Return type:

dict

psweep.psweep.df_extract_row(df, pset_id, kind=None, py_types=False)[source]#

Extract a single row dict for pset_id from df.

When kind is given, limit columns to this (see filter_cols()).

Parameters:
  • df (DataFrame)

  • pset_id (str)

  • kind (str) – see filter_cols()

  • py_types (bool) – see df_extract_dicts()

Return type:

dict

psweep.psweep.df_filter_conds(df, conds, op='and')[source]#

Filter DataFrame using bool arrays/Series/DataFrames in conds.

Fuse all bool sequences in conds using op. For instance, if op="and", then we logical-and them, which is equivalent to

>>> df[conds[0] & conds[1] & conds[2] & ...]

but conds can be programmatically generated while the expression above would need to be changed by hand if conds changes.

Parameters:
  • df (DataFrame) – DataFrame

  • conds (Sequence[Sequence[bool]]) – Sequence of bool masks, each of length len(df).

  • op (str) – Bool operator, used as numpy.logical_{op}, e.g. “and”, “or”, “xor”.

Return type:

DataFrame

Examples

>>> df=pd.DataFrame({'a': arange(10), 'b': arange(10)+4})
>>> c1=df.a > 3
>>> c2=df.b < 9
>>> c3=df.a % 2 == 0
>>> df[c1 & c2 & c3]
   a  b
4  4  8
>>> ps.df_filter_conds(df, [c1, c2, c3])
   a  b
4  4  8
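
The programmatic fuse described above can be sketched with functools.reduce over numpy.logical_{op} (sketch name is ours):

```python
import functools

import numpy as np
import pandas as pd

def df_filter_conds_sketch(df, conds, op="and"):
    # Fuse any number of bool masks with numpy.logical_{op}, then index df.
    mask = functools.reduce(getattr(np, f"logical_{op}"), conds)
    return df[np.asarray(mask)]
```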
psweep.psweep.df_print(df, index=False, special_cols=None, prefix_cols=False, cols=[], skip_cols=[])[source]#

Print DataFrame, by default without the index and prefix columns such as _pset_id.

Similar logic as in bin/psweep-db2table, without tabulate support but with more features (skip_cols for instance).

Column names are always sorted, so the order of names in e.g. cols doesn’t matter.

Parameters:
  • df (DataFrame)

  • index (bool) – include DataFrame index

  • prefix_cols (bool) – include all prefix columns (_pset_id etc.), we don’t support skipping user-added postfix columns (e.g. result_)

  • cols (Sequence[str]) – explicit sequence of columns, overrides prefix_cols when prefix columns are specified

  • skip_cols (Sequence[str]) – skip those columns instead of selecting them (like cols would), use either this or cols; overrides prefix_cols when prefix columns are specified

Examples

>>> import pandas as pd
>>> df=pd.DataFrame(dict(a=rand(3), b=rand(3), _c=rand(3)))
>>> df
          a         b        _c
0  0.373534  0.304302  0.161799
1  0.698738  0.589642  0.557172
2  0.343316  0.186595  0.822023
>>> ps.df_print(df)
       a        b
0.373534 0.304302
0.698738 0.589642
0.343316 0.186595
>>> ps.df_print(df, prefix_cols=True)
       a        b       _c
0.373534 0.304302 0.161799
0.698738 0.589642 0.557172
0.343316 0.186595 0.822023
>>> ps.df_print(df, index=True)
          a        b
0  0.373534 0.304302
1  0.698738 0.589642
2  0.343316 0.186595
>>> ps.df_print(df, cols=["a"])
       a
0.373534
0.698738
0.343316
>>> ps.df_print(df, cols=["a"], prefix_cols=True)
       a       _c
0.373534 0.161799
0.698738 0.557172
0.343316 0.822023
>>> ps.df_print(df, cols=["a", "_c"])
       a       _c
0.373534 0.161799
0.698738 0.557172
0.343316 0.822023
>>> ps.df_print(df, skip_cols=["a"])
       b
0.304302
0.589642
0.186595
psweep.psweep.df_read(fn, fmt='pickle', **kwds)[source]#

Read DataFrame from file fn. See df_write().

psweep.psweep.df_to_json(df, **kwds)[source]#

Like df.to_json but with defaults for orient, date_unit, date_format, double_precision.

Parameters:
  • df (DataFrame) – DataFrame to convert

  • kwds – passed to df.to_json()

Return type:

str

psweep.psweep.df_update_pset_cols(df, pset_cols, fill_value=<NA>, copy=False)[source]#

Make sure that df has at least the columns in pset_cols. If not, add the missing columns, filled with fill_value. Always refresh _pset_hash.

Return type:

DataFrame

psweep.psweep.df_update_pset_hash(df, copy=False)[source]#

Add or update _pset_hash column.

Return type:

DataFrame

psweep.psweep.df_write(fn, df, fmt='pickle', **kwds)[source]#

Write DataFrame to disk.

Parameters:
  • fn (str) – filename

  • df (DataFrame) – DataFrame to write

  • fmt ({'pickle', 'json'})

  • kwds – passed to pickle.dump() or df_to_json()

Return type:

None

psweep.psweep.filter_cols(cols, kind='pset')[source]#

Filter database field names (“columns”) by type.

Here we use the default package-wide naming convention (see _get_col_filter()).

Parameters:

kind (str) –

  • pre, prefix: things like _pset_id

  • post, postfix: results like result_

  • pset: neither of the above, params like a, b

Return type:

Sequence[str]
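
Judging from the examples (_pset_id, result_), the convention appears to be a leading underscore for prefix cols and a trailing underscore for postfix cols. A sketch under that assumption (not psweep's actual implementation):

```python
def filter_cols_sketch(cols, kind="pset"):
    # Assumed convention: prefix cols start with "_" (e.g. _pset_id),
    # postfix cols end with "_" (e.g. result_), pset cols are the rest.
    if kind in ("pre", "prefix"):
        return [c for c in cols if c.startswith("_")]
    if kind in ("post", "postfix"):
        return [c for c in cols if c.endswith("_")]
    if kind == "pset":
        return [c for c in cols if not (c.startswith("_") or c.endswith("_"))]
    raise ValueError(f"unknown kind: {kind}")
```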

psweep.psweep.filter_params_dup_hash(params, hashes)[source]#

Return params with psets whose hash is not in hashes.

Use pset["_pset_hash"] if present, else calculate hash on the fly.

Parameters:
  • params (Sequence[dict])

  • hashes (Sequence[str])

Return type:

Sequence[dict]

psweep.psweep.filter_params_unique(params)[source]#

Reduce params to unique psets.

Use pset["_pset_hash"] if present, else calculate hash on the fly.

Parameters:

params (Sequence[dict])

Return type:

Sequence[dict]

psweep.psweep.flatten_dict(dct, join_str='_')[source]#

Flatten nested dict.

Will string-convert keys and join using join_str.

Return type:

dict

Examples

>>> ps.flatten_dict(dict(a=1, b=dict(c=2, d={23: 42})))
{'a': 1, 'b_c': 2, 'b_d_23': 42}
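
The flattening above can be sketched recursively (sketch name is ours):

```python
def flatten_dict_sketch(dct, join_str="_"):
    # Recursively flatten; keys are string-converted and joined with join_str.
    out = {}
    for key, val in dct.items():
        if isinstance(val, dict):
            for sub_key, sub_val in flatten_dict_sketch(val, join_str).items():
                out[f"{key}{join_str}{sub_key}"] = sub_val
        else:
            out[str(key)] = val
    return out
```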
psweep.psweep.func_wrapper(pset, func, *, tmpsave=False, verbose=False, simulate=False)[source]#

Add those prefix fields (e.g. _time_utc) to pset which can be determined at call time.

Call func on exactly one pset. Return updated pset built from pset.update(func(pset)). Do verbose printing.

Return type:

dict

psweep.psweep.intspace(*args, dtype=<class 'numpy.int64'>, **kwds)[source]#

Like np.linspace but round to integers.

The returned array may be shorter than num if rounding to integers produces duplicates.

Parameters:
  • *args – Same as np.linspace

  • **kwds – Same as np.linspace
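
A minimal sketch (our naming; whether psweep deduplicates exactly this way is an assumption):

```python
import numpy as np

def intspace_sketch(*args, dtype=np.int64, **kwds):
    # Round a linspace grid to integers and drop duplicates created by
    # rounding, hence the result can be shorter than num.
    return np.unique(np.round(np.linspace(*args, **kwds)).astype(dtype))
```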

psweep.psweep.itr(func)[source]#

Wrap func to allow passing args not as sequence.

Assuming func() requires a sequence as input: func([a,b,c]), allow passing func(a,b,c).

Return type:

Callable
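
The wrapper can be sketched as a small decorator (sketch name is ours):

```python
def itr_sketch(func):
    # If called with several positional args, pack them into a tuple;
    # a single arg is assumed to already be the sequence.
    def wrapper(*args):
        return func(args[0] if len(args) == 1 else args)
    return wrapper
```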

psweep.psweep.itr2params(loops)[source]#

Transform the (possibly nested) result of a loop over plists (or whatever has been used to create psets) to a proper list of psets by flattening and merging dicts.

Examples

>>> a = ps.plist('a', [1,2])
>>> b = ps.plist('b', [77,88])
>>> c = ps.plist('c', ['const'])
>>> # result of loops
>>> list(itertools.product(a,b,c))
[({'a': 1}, {'b': 77}, {'c': 'const'}),
 ({'a': 1}, {'b': 88}, {'c': 'const'}),
 ({'a': 2}, {'b': 77}, {'c': 'const'}),
 ({'a': 2}, {'b': 88}, {'c': 'const'})]
>>> # flatten into list of psets
>>> ps.itr2params(itertools.product(a,b,c))
[{'a': 1, 'b': 77, 'c': 'const'},
 {'a': 1, 'b': 88, 'c': 'const'},
 {'a': 2, 'b': 77, 'c': 'const'},
 {'a': 2, 'b': 88, 'c': 'const'}]
>>> # also more nested stuff is no problem
>>> list(itertools.product(zip(a,b),c))
[(({'a': 1}, {'b': 77}), {'c': 'const'}),
 (({'a': 2}, {'b': 88}), {'c': 'const'})]
>>> ps.itr2params(itertools.product(zip(a,b),c))
[{'a': 1, 'b': 77, 'c': 'const'},
 {'a': 2, 'b': 88, 'c': 'const'}]

Notes

When merging dicts, we don’t allow dicts to have same keys, as in

>>> [{'a': 1, 'b': 23}, {'a': 77, 'c': 'const'}]

since this would lead to unexpected results, for example when people use pgrid() to merge together params of previous studies to create params for a new study.

psweep.psweep.logspace(start, stop, num=50, offset=0, log_func=<ufunc 'log10'>, **kwds)[source]#

Like numpy.logspace but

  • start and stop are not exponents but the actual bounds

  • tuneable log scale strength

Control the strength of the log scale by offset, where we use by default log_func=np.log10 and base=10 and return np.logspace(np.log10(start + offset), np.log10(stop + offset)) - offset. offset=0 is equal to np.logspace(np.log10(start), np.log10(stop)). Higher offset values result in more evenly spaced points.

Parameters:
  • start – same as in np.logspace

  • stop – same as in np.logspace

  • num – same as in np.logspace

  • **kwds – same as in np.logspace

  • offset – Control strength of log scale.

  • log_func (Callable) – Must match base (pass that as part of **kwds). Default is base=10 as in np.logspace and so log_func=np.log10. If you want a different base, also provide a matching log_func, e.g. base=e, log_func=np.log.

Examples

Effect of different offset values:

>>> from matplotlib import pyplot as plt
>>> from psweep import logspace
>>> import numpy as np
>>> for ii, offset in enumerate([1e-16,1e-3, 1,2,3]):
...     x=logspace(0, 2, 20, offset=offset)
...     plt.plot(x, np.ones_like(x)*ii, "o-", label=f"{offset=}")
>>> plt.legend()
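
The documented formula transcribes directly (sketch name is ours):

```python
import numpy as np

def logspace_sketch(start, stop, num=50, offset=0, log_func=np.log10, **kwds):
    # Offset-shifted log grid; offset=0 reduces to
    # np.logspace(np.log10(start), np.log10(stop)).
    return np.logspace(log_func(start + offset), log_func(stop + offset),
                       num=num, **kwds) - offset
```

Note that start=0 requires offset > 0, since log_func(0) is not finite.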
psweep.psweep.makedirs(path)[source]#

Create path recursively, no questions asked.

Return type:

None

psweep.psweep.merge_dicts(args, *, allow_dup_keys=True)[source]#

Start with an empty dict and update with each arg dict left-to-right.

Parameters:
  • args (Sequence[dict]) – dicts to merge

  • allow_dup_keys – Whether to allow later dicts to overwrite entries of former ones.

Return type:

dict
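
A minimal sketch of the left-to-right update (our naming; the real function may raise a different exception type on duplicates):

```python
def merge_dicts_sketch(args, *, allow_dup_keys=True):
    # Start empty and update left-to-right; later dicts win on duplicate
    # keys unless allow_dup_keys=False, in which case we raise.
    out = {}
    for dct in args:
        dup = set(out) & set(dct)
        if dup and not allow_dup_keys:
            raise ValueError(f"duplicate keys: {dup}")
        out.update(dct)
    return out
```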

psweep.psweep.pgrid(plists)[source]#

Convenience function for the most common loop: nested loops with itertools.product: ps.itr2params(itertools.product(a,b,c,...)).

Parameters:

plists (Sequence[Sequence[dict]]) – List of plist() results. If more than one, you can also provide plists as args, so pgrid(a,b,c) instead of pgrid([a,b,c]).

Return type:

Sequence[dict]

Examples

>>> a = ps.plist('a', [1,2])
>>> b = ps.plist('b', [77,88])
>>> c = ps.plist('c', ['const'])
>>> # same as pgrid([a,b,c])
>>> ps.pgrid(a,b,c)
[{'a': 1, 'b': 77, 'c': 'const'},
 {'a': 1, 'b': 88, 'c': 'const'},
 {'a': 2, 'b': 77, 'c': 'const'},
 {'a': 2, 'b': 88, 'c': 'const'}]
>>> ps.pgrid(zip(a,b),c)
[{'a': 1, 'b': 77, 'c': 'const'},
 {'a': 2, 'b': 88, 'c': 'const'}]

Notes

  • For a single plist arg, you have to use pgrid([a]). pgrid(a) won’t work. However, this edge case (passing one plist to pgrid) is not super useful, since

    >>> a=ps.plist("a", [1,2,3])
    >>> a
    [{'a': 1}, {'a': 2}, {'a': 3}]
    >>> ps.pgrid([a])
    [{'a': 1}, {'a': 2}, {'a': 3}]
    
  • When merging dicts in itr2params(), we don’t allow dicts to have same keys, as in

    >>> [{'a': 1, 'b': 23}, {'a': 77, 'c': 'const'}]
    

    since this would lead to unexpected results, for example when people use pgrid() to merge together params of previous studies to create params for a new study.
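
The plain product-and-merge case can be sketched as below (our naming; nested inputs like zip(a, b), which itr2params() flattens in psweep, are omitted here):

```python
import itertools

def pgrid_sketch(*plists):
    # Cartesian product of plists, merging the single-key dicts of each
    # combination into one pset.
    seqs = plists[0] if len(plists) == 1 else plists
    return [{k: v for dct in combo for k, v in dct.items()}
            for combo in itertools.product(*seqs)]
```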

psweep.psweep.plist(name, seq)[source]#

Create a list of single-item dicts holding the parameter name and a value.

>>> plist('a', [1,2,3])
[{'a': 1}, {'a': 2}, {'a': 3}]
psweep.psweep.prep_batch(params, *, calc_templ_dir='templates/calc', machine_templ_dir='templates/machines', git=False, write_pset=False, template_mode='jinja', **kwds)[source]#

Write files based on templates.

Parameters:
  • params (Sequence[dict]) – See run()

  • calc_templ_dir (str) – Dir with templates.

  • machine_templ_dir (str) – Dir with templates.

  • git (bool) – Use git to commit local changes.

  • write_pset (bool) – Write the input pset to <calc_dir>/<pset_id>/pset.pk.

  • template_mode (str) – ‘dollar’ or ‘jinja’

  • **kwds – Passed to run().

Returns:

The database built from params.

Return type:

DataFrame

psweep.psweep.pset_hash(dct, method='sha1', raise_error=True, **kwds)[source]#

Reproducible hash of a dict for usage in database (hash of a pset).

We implement the convention to ignore prefix fields (book-keeping) and postfix fields (results). You can pass skip_prefix_cols / skip_postfix_cols to change that (see _get_col_filter()).
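
The idea can be sketched as hashing a deterministic serialization of only the pset-kind fields. This is illustrative only: psweep's actual serialization may differ, so these hashes are not compatible with a real psweep database.

```python
import hashlib
import json

def pset_hash_sketch(dct, method="sha1"):
    # Drop prefix ("_...") and postfix ("..._") fields, serialize the rest
    # deterministically, and hash. NOT psweep's actual serialization.
    payload = {k: v for k, v in dct.items()
               if not (k.startswith("_") or k.endswith("_"))}
    data = json.dumps(payload, sort_keys=True).encode()
    return getattr(hashlib, method)(data).hexdigest()
```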

psweep.psweep.run(func, params, df=None, poolsize=None, dask_client=None, save=True, tmpsave=False, verbose=False, calc_dir='calc', simulate=False, database_basename='database.pk', backup=False, git=False, skip_dups=False, capture_logs=None, fill_value=<NA>)[source]#

Call func for each pset in params. Populate a DataFrame with rows from each call func(pset).

Parameters:
  • func (Callable) – must accept one parameter: pset (a dict {'a': 1, 'b': 'foo', ...}), return either an update to pset or a new dict; the result will be processed as pset.update(func(pset))

  • params (Sequence[dict]) – each dict is a pset {'a': 1, 'b': 'foo', ...}

  • df (DataFrame) – append rows to this DataFrame, if None then either create new one or read existing database file from disk if found

  • poolsize (int) –

    • None : use serial execution

    • int : use multiprocessing.Pool (even for poolsize=1)

  • dask_client – A dask client. Use this or poolsize.

  • save (bool) – save final DataFrame to <calc_dir>/<database_basename> (pickle format only), default: “calc/database.pk”, see also calc_dir and database_basename

  • tmpsave (bool) – save the result dict from each pset.update(func(pset)) from each pset to <calc_dir>/tmpsave/<run_id>/<pset_id>.pk (pickle format only), the data is a dict, not a DataFrame row

  • verbose (Union[bool, Sequence[str]]) –

    • bool : print the current DataFrame row

    • sequence : list of DataFrame column names, print the row but only those columns

  • calc_dir (str) – Dir where calculation artifacts can be saved if needed, such as dirs per pset <calc_dir>/<pset_id>. Will be added to the database in _calc_dir field.

  • simulate (bool) – run everything in <calc_dir>.simulate, don’t call func, i.e. save what the run would create, but without the results from func, useful to check if params are correct before starting a production run

  • database_basename (str) – <calc_dir>/<database_basename>, default: “database.pk”

  • backup (bool) – Make backup of <calc_dir> to <calc_dir>.bak_<timestamp>_run_id_<run_id> where <run_id> is the latest _run_id present in df

  • git (bool) – Use git to commit all files written and changed by the current run (_run_id). Make sure to create a .gitignore manually before if needed.

  • skip_dups (bool) – Skip psets whose hash is already present in df. Useful when repeating (parts of) a study.

  • capture_logs (str) – {‘db’, ‘file’, ‘db+file’, None} Redirect stdout and stderr generated in func() to database (‘db’) column _logs, file <calc_dir>/<pset_id>/logs.txt, or both. If None then do nothing (default). Useful for capturing per-pset log text, e.g. print() calls in func will be captured.

  • fill_value – NA value used for missing values in the database DataFrame.

Returns:

The database built from params.

Return type:

DataFrame
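
The heart of run() can be sketched as follows (our naming; ids, hashes, saving, pools, dask and log capture are all omitted):

```python
import pandas as pd

def run_sketch(func, params):
    # Call func per pset, merge its result dict into the pset, and collect
    # the rows into an object-dtype DataFrame.
    rows = []
    for pset in params:
        row = dict(pset)  # don't mutate the caller's dict
        row.update(func(row))
        rows.append(row)
    return pd.DataFrame(rows, dtype=object)
```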

psweep.psweep.stargrid(const, vary, vary_labels=None, vary_label_col='_vary', skip_dups=True)[source]#

Helper to create a specific param sampling pattern.

Vary params in a “star” pattern (and not a full pgrid) around constant values (middle of the “star”).

Parameters:
  • const (dict) – constant params

  • vary (Sequence[Sequence[dict]]) – list of plists

  • vary_labels (Sequence[str]) – database col names for parameters in vary

  • skip_dups – filter duplicate psets (see Notes below)

Return type:

Sequence[dict]

Notes

skip_dups: When creating a star pattern, duplicate psets can occur. By default try to filter them out (using filter_params_unique()) but ignore hash calculation errors and return non-reduced params in that case.

Examples

>>> import psweep as ps
>>> const=dict(a=1, b=77, c=11)
>>> a=ps.plist("a", [1,2,3,4])
>>> b=ps.plist("b", [77,88,99])
>>> c=ps.plist("c", [11,22,33,44])
>>> ps.stargrid(const, vary=[a, b])
[{'a': 1, 'b': 77, 'c': 11},
 {'a': 2, 'b': 77, 'c': 11},
 {'a': 3, 'b': 77, 'c': 11},
 {'a': 4, 'b': 77, 'c': 11},
 {'a': 1, 'b': 88, 'c': 11},
 {'a': 1, 'b': 99, 'c': 11}]
>>> ps.stargrid(const, vary=[a, b], skip_dups=False)
[{'a': 1, 'b': 77, 'c': 11},
 {'a': 2, 'b': 77, 'c': 11},
 {'a': 3, 'b': 77, 'c': 11},
 {'a': 4, 'b': 77, 'c': 11},
 {'a': 1, 'b': 77, 'c': 11},
 {'a': 1, 'b': 88, 'c': 11},
 {'a': 1, 'b': 99, 'c': 11}]
>>> ps.stargrid(const, vary=[a, b], vary_labels=["a", "b"])
[{'a': 1, 'b': 77, 'c': 11, '_vary': 'a'},
 {'a': 2, 'b': 77, 'c': 11, '_vary': 'a'},
 {'a': 3, 'b': 77, 'c': 11, '_vary': 'a'},
 {'a': 4, 'b': 77, 'c': 11, '_vary': 'a'},
 {'a': 1, 'b': 88, 'c': 11, '_vary': 'b'},
 {'a': 1, 'b': 99, 'c': 11, '_vary': 'b'}]
>>> ps.stargrid(const, vary=[ps.itr2params(zip(a,c)),b], vary_labels=["a+c", "b"])
[{'a': 1, 'b': 77, 'c': 11, '_vary': 'a+c'},
 {'a': 2, 'b': 77, 'c': 22, '_vary': 'a+c'},
 {'a': 3, 'b': 77, 'c': 33, '_vary': 'a+c'},
 {'a': 4, 'b': 77, 'c': 44, '_vary': 'a+c'},
 {'a': 1, 'b': 88, 'c': 11, '_vary': 'b'},
 {'a': 1, 'b': 99, 'c': 11, '_vary': 'b'}]
>>> ps.stargrid(const, vary=[ps.pgrid([zip(a,c)]),b], vary_labels=["a+c", "b"])
[{'a': 1, 'b': 77, 'c': 11, '_vary': 'a+c'},
 {'a': 2, 'b': 77, 'c': 22, '_vary': 'a+c'},
 {'a': 3, 'b': 77, 'c': 33, '_vary': 'a+c'},
 {'a': 4, 'b': 77, 'c': 44, '_vary': 'a+c'},
 {'a': 1, 'b': 88, 'c': 11, '_vary': 'b'},
 {'a': 1, 'b': 99, 'c': 11, '_vary': 'b'}]
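
The star pattern itself can be sketched as below (our naming; this corresponds to skip_dups=False, with labels and duplicate filtering omitted):

```python
def stargrid_sketch(const, vary):
    # Vary one group of parameters at a time, keeping everything else at
    # its const value.
    params = []
    for group in vary:
        for dct in group:
            params.append({**const, **dct})
    return params
```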
psweep.psweep.system(cmd, **kwds)[source]#

Call shell command.

Parameters:
  • cmd (str) – shell command

  • kwds – keywords passed to subprocess.run

Return type:

CompletedProcess
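
A sketch with assumed defaults (our naming; psweep's actual defaults for subprocess.run may differ):

```python
import subprocess

def system_sketch(cmd, **kwds):
    # Run cmd through the shell and raise on non-zero exit; both defaults
    # are assumptions and can be overridden via kwds.
    kwds.setdefault("shell", True)
    kwds.setdefault("check", True)
    return subprocess.run(cmd, **kwds)
```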