# Data Structure

## Experimental design space $X_{space}$

### Basic Syntax

Each input variable is defined according to its variable type and domain. A continuous variable is specified by a variable name followed by lower and upper bounds:

> Param_Continuous('varName', lower_bound, upper_bound)

A discrete variable is specified by a variable name followed by the list of possible levels in string format (the order is meaningful for ordinal variables):

> Param_Categorical('varName', ['level 1', 'level 2', 'level 3', ...])

An example list of input parameter specifications covering commonly used variable types (continuous, categorical, and ordinal):

```python
from obsidian.parameters import Param_Continuous, Param_Categorical, Param_Ordinal

params = [
    Param_Continuous('Temperature', -10, 30),
    Param_Continuous('Concentration', 10, 150),
    Param_Continuous('Enzyme', 0.01, 0.30),
    Param_Categorical('Variant', ['MRK001', 'MRK002', 'MRK003']),
    Param_Ordinal('StirRate', ['Low', 'Medium', 'High']),
]
```

The $X_{space}$ is then specified as a `ParamSpace` class object, initialized from the list of parameters:

```python
from obsidian import ParamSpace

X_space = ParamSpace(params)
```

The `ParamSpace` object can be exported to dictionary format to facilitate saving to JSON files and reloading for future use:

```python
import json

with open('X_space.json', 'w') as f:
    X_space_dict = X_space.save_state()
    json.dump(X_space_dict, f)

with open('X_space.json', 'r') as f:
    X_space_dict = json.load(f)
X_space_reload = ParamSpace.load_state(X_space_dict)
```

In addition, the `ParamSpace` class provides various instance methods for input variable transformation; these are called implicitly during optimization and do not need to be accessed directly by the user.

### Additional Variable Types

* Continuous observational variable

For example, an entire time course was measured during the experiment, and data at all timepoints ranging from 0 to 10 are used for fitting.
During optimization, however, we are only interested in improving the results at a fixed time point of 6.

```python
from obsidian.parameters import Param_Observational

Param_Observational(name='Time', min=0, max=10, design_point=6)
```

* Discrete numerical variable

```python
from obsidian.parameters import Param_Discrete_Numeric

Param_Discrete_Numeric('LightStage', [1, 2, 3, 4, 5])
```

* Task variable

Only one special 'task' categorical variable is allowed, for encoding multiple tasks. A distinct response will be predicted for each task.

```python
from obsidian.parameters import Task

Task('TaskVar', ['Task_A', 'Task_B', 'Task_C', 'Task_D'])
```

## Initial experimental conditions, or seed experiments $X_0$

When starting the APO workflow from scratch, the initial experimental conditions are usually generated by random sampling or design-of-experiments algorithms. For example, generate six input conditions $X_0$ according to the previously specified $X_{space}$ using the Latin hypercube sampling (LHS) method:

```python
from obsidian.experiment import ExpDesigner

designer = ExpDesigner(X_space, seed=0)
X0 = designer.initialize(m_initial=6, method='LHS')

print(X0.to_markdown())
```

|    |   Temperature |   Concentration |    Enzyme | Variant   | StirRate   |
|---:|--------------:|----------------:|----------:|:----------|:-----------|
|  0 |      13.3333  |         68.3333 | 0.2275    | MRK003    | High       |
|  1 |       6.66667 |        115      | 0.0825    | MRK003    | Low        |
|  2 |      26.6667  |         45      | 0.0341667 | MRK002    | Medium     |
|  3 |      20       |         91.6667 | 0.275833  | MRK001    | Low        |
|  4 |      -6.66667 |         21.6667 | 0.179167  | MRK002    | Medium     |
|  5 |       0       |        138.333  | 0.130833  | MRK001    | High       |

The `designer` returns experimental conditions as a pandas dataframe, which is the default data format across `obsidian` functions.
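To build intuition for what LHS does, here is a minimal pure-Python sketch for a single continuous variable (an illustration only, not the `ExpDesigner` implementation): the domain is split into $m$ equal strata and exactly one point is drawn from each, so the samples cover the whole range more evenly than plain random sampling.

```python
import random

def lhs_1d(lower, upper, m, rng):
    """Draw one uniform sample from each of m equal strata of [lower, upper]."""
    width = (upper - lower) / m
    # one draw inside each stratum: stratum i spans [lower + i*width, lower + (i+1)*width)
    points = [lower + (i + rng.random()) * width for i in range(m)]
    rng.shuffle(points)  # randomize the run order, as a designer would
    return points

rng = random.Random(0)
# e.g. six samples over the 'Temperature' domain (-10, 30) declared above
samples = lhs_1d(-10, 30, 6, rng)
```

Sorting the six samples recovers one point per stratum, which is the defining property of a Latin hypercube design in one dimension.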
## Experimental outcome variable(s) $Y$

### Basic Syntax

Similar to the `ParamSpace` object for input variables, the `Target` class handles the specification and preprocessing of experimental outcome variables. For each outcome measurement, there are three essential arguments:

* name: Variable name, a required input from the user.
* f_transform: Transformation function for preprocessing the raw response values, to facilitate numerical computation during optimization.
  - 'Identity': (default) No transformation
  - 'Standard': Normalization to zero mean and unit standard deviation
  - 'Logit_MinMax': Logit transformation, with the range or scale automatically calculated from the data
  - 'Logit_Percentage': Assuming the input response is a percentage ranging between 0 and 100, apply a logit transformation with scale 1/100
* aim: Either 'max' (default) or 'min', which specifies the desired direction of improvement.

Currently, only continuous outcome values are handled. Depending on the number of outcomes, define a single `Target` object or a list of multiple objects:

```python
from obsidian import Target

target = Target(name='Yield', f_transform='Logit_Percentage', aim='max')

target_multiple = [
    Target(name='Yield', f_transform='Logit_Percentage', aim='max'),
    Target(name='Cost', f_transform='Standard', aim='min')
]
```

### Example

To demonstrate the usage of the `Target` class, we simulate a single-task experimental outcome $y_0$ using the previously generated $X_0$ and the analytical function 'shifted_parab'.
```python
from obsidian.experiment import Simulator
from obsidian.experiment.benchmark import shifted_parab

simulator = Simulator(X_space, shifted_parab, name='Yield')
y0 = simulator.simulate(X0)

print(y0.to_markdown())
```

|    |   Yield |
|---:|--------:|
|  0 | 47.8147 |
|  1 | 62.5599 |
|  2 | 60.7972 |
|  3 | 39.1121 |
|  4 | 83.0833 |
|  5 | 52.2631 |

If $y_0$ is entered manually, it should be a pandas dataframe with the same variable name 'Yield' as specified in the `target` definition. When the 'transform_f' function is called with 'fit=True' during the optimization workflow, the raw response is saved as an attribute of the `target` object:

```python
y_transformed = target.transform_f(y0, fit=True)
type(target.f_raw)  # torch.Tensor
```

The `Target` object, as well as the input response 'f_raw' (if it exists), can be exported to dictionary format to facilitate saving to JSON files and reloading for future use:

```python
import json

with open('target.json', 'w') as f:
    target_dict = target.save_state()
    json.dump(target_dict, f)

with open('target.json', 'r') as f:
    target_dict = json.load(f)
target_reload = Target.load_state(target_dict)
```

## Use the campaign object to manage data

The `Campaign` class object acts as the central hub, seamlessly connecting all components of the APO workflow, including data management, the optimizer, and the experimental designer. It is the recommended approach, offering a more streamlined workflow than using each component separately. Here is an example of creating a `Campaign` object and adding the initial dataset to its 'data' attribute:

```python
import pandas as pd

from obsidian.campaign import Campaign

data_Iter0 = pd.concat([X0, y0], axis=1)
my_campaign = Campaign(X_space, target, seed=0)
my_campaign.add_data(data_Iter0)
```

The 'add_data' method appends each new batch of data to a single pandas dataframe, tagged with an incremental integer 'Iteration'.
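The bookkeeping can be sketched in plain pandas (an illustrative sketch of the assumed behavior, not the `Campaign` source; the column names and values here are made up):

```python
import pandas as pd

batches = []

def add_batch(new_data: pd.DataFrame) -> pd.DataFrame:
    """Tag a batch with the next iteration index and return the stacked data."""
    new_data = new_data.copy()
    new_data['Iteration'] = len(batches)  # 0 for the first batch, then 1, 2, ...
    batches.append(new_data)
    return pd.concat(batches, ignore_index=True)

iter0 = pd.DataFrame({'Temperature': [13.3, 6.7], 'Yield': [47.4, 61.4]})
iter1 = pd.DataFrame({'Temperature': [20.0], 'Yield': [55.0]})

add_batch(iter0)
data = add_batch(iter1)
print(data['Iteration'].tolist())  # [0, 0, 1]
```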
The new data should be a dataframe containing both the input experimental conditions and the target outcomes.

There are various ways to retrieve data from a `Campaign`:

```python
print(my_campaign.data.to_markdown())
```

|   Observation ID |   Temperature |   Concentration |    Enzyme | Variant   | StirRate   |   Yield |   Iteration |
|-----------------:|--------------:|----------------:|----------:|:----------|:-----------|--------:|------------:|
|                0 |      13.3333  |         68.3333 | 0.2275    | MRK003    | High       | 47.4471 |           0 |
|                1 |       6.66667 |        115      | 0.0825    | MRK003    | Low        | 61.3989 |           0 |
|                2 |      26.6667  |         45      | 0.0341667 | MRK002    | Medium     | 63.6213 |           0 |
|                3 |      20       |         91.6667 | 0.275833  | MRK001    | Low        | 43.4116 |           0 |
|                4 |      -6.66667 |         21.6667 | 0.179167  | MRK002    | Medium     | 84.5542 |           0 |
|                5 |       0       |        138.333  | 0.130833  | MRK001    | High       | 51.8577 |           0 |

and

```python
my_campaign.X
my_campaign.y
```
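Conceptually, these accessors amount to column selection on the combined dataframe. A minimal sketch with made-up data (not the `obsidian` source): the input columns come from the $X_{space}$ definition and the outcome columns from the `Target` definition.

```python
import pandas as pd

# a toy stand-in for the combined campaign dataframe
data = pd.DataFrame({
    'Temperature': [13.3333, 6.66667],
    'Variant': ['MRK003', 'MRK003'],
    'Yield': [47.4471, 61.3989],
    'Iteration': [0, 0],
})

X_cols = ['Temperature', 'Variant']  # input names, as in the X_space definition
y_cols = ['Yield']                   # outcome names, as in the Target definition

X = data[X_cols]
y = data[y_cols]
print(X.shape, y.shape)  # (2, 2) (2, 1)
```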