# Data Structure

## Experimental design space $X_{space}$

### Basic Syntax

Each input variable is defined according to its variable type and domain. A continuous variable is specified by a variable name followed by lower and upper bounds:

> Param_Continuous('varName', lower_bound, upper_bound)

A discrete variable is specified by a variable name followed by the list of possible levels in string format (the order is meaningful for ordinal variables):

> Param_Categorical('varName', ['level 1', 'level 2', 'level 3', ...])

An example list of input parameter specifications covering commonly used variable types (continuous, categorical, and ordinal):

```python
from obsidian.parameters import Param_Continuous, Param_Categorical, Param_Ordinal

params = [
    Param_Continuous('Temperature', -10, 30),
    Param_Continuous('Concentration', 10, 150),
    Param_Continuous('Enzyme', 0.01, 0.30),
    Param_Categorical('Variant', ['MRK001', 'MRK002', 'MRK003']),
    Param_Ordinal('StirRate', ['Low', 'Medium', 'High']),
]
```

The $X_{space}$ is then specified as a `ParamSpace` class object, initialized from the list of parameters:

```python
from obsidian import ParamSpace

X_space = ParamSpace(params)
```

The `ParamSpace` object can be exported to dictionary format to facilitate saving to JSON files and reloading for future use:

```python
import json

with open('X_space.json', 'w') as f:
    X_space_dict = X_space.save_state()
    json.dump(X_space_dict, f)

with open('X_space.json', 'r') as f:
    X_space_dict = json.load(f)
X_space_reload = ParamSpace.load_state(X_space_dict)
```

In addition, the `ParamSpace` class provides various instance methods for input variable transformation; these are called implicitly during optimization and do not need to be accessed directly by the user.

### Additional Variable Types

* Continuous observational variable

For example, an entire time course was measured during the experiment, and data at all timepoints ranging from 0 to 10 are used for fitting.
During optimization, however, we are only interested in improving the results at a fixed time point of 6.

```python
from obsidian.parameters import Param_Observational

Param_Observational(name='Time', min=0, max=10, design_point=6)
```

* Discrete numerical variable

```python
from obsidian.parameters import Param_Discrete_Numeric

Param_Discrete_Numeric('LightStage', [1, 2, 3, 4, 5])
```

* Task variable

Only one special 'task' categorical variable is allowed, for encoding multiple tasks. A distinct response will be predicted for each task.

```python
from obsidian.parameters import Task

Task('TaskVar', ['Task_A', 'Task_B', 'Task_C', 'Task_D'])
```

## Initial experimental conditions, or seed experiments $X_0$

When starting the APO workflow from scratch, the initial experimental conditions are usually generated by random sampling or design-of-experiments algorithms. For example, generate six input conditions $X_0$ according to the previously specified $X_{space}$ using the Latin hypercube sampling (LHS) method:

```python
from obsidian.experiment import ExpDesigner

designer = ExpDesigner(X_space, seed=0)
X0 = designer.initialize(m_initial=6, method='LHS')

print(X0.to_markdown())
```

|    |   Temperature |   Concentration |    Enzyme | Variant   | StirRate   |
|---:|--------------:|----------------:|----------:|:----------|:-----------|
|  0 |      13.3333  |         68.3333 | 0.2275    | MRK003    | High       |
|  1 |       6.66667 |        115      | 0.0825    | MRK003    | Low        |
|  2 |      26.6667  |         45      | 0.0341667 | MRK002    | Medium     |
|  3 |      20       |         91.6667 | 0.275833  | MRK001    | Low        |
|  4 |      -6.66667 |         21.6667 | 0.179167  | MRK002    | Medium     |
|  5 |       0       |        138.333  | 0.130833  | MRK001    | High       |

The `designer` returns experimental conditions as a pandas dataframe, which is the default data format across `obsidian` functions.
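To build intuition for what LHS does, here is a minimal pure-Python sketch for a single continuous variable (an illustration only, not the `ExpDesigner` implementation): the domain is split into $m$ equal strata and exactly one point is drawn from each, so the samples cover the whole range more evenly than plain random sampling.

```python
import random

def lhs_1d(lower, upper, m, rng):
    """Draw one uniform sample from each of m equal strata of [lower, upper]."""
    width = (upper - lower) / m
    # one draw inside each stratum: stratum i spans [lower + i*width, lower + (i+1)*width)
    points = [lower + (i + rng.random()) * width for i in range(m)]
    rng.shuffle(points)  # randomize the run order, as a designer would
    return points

rng = random.Random(0)
# e.g. six samples over the 'Temperature' domain (-10, 30) declared above
samples = lhs_1d(-10, 30, 6, rng)
```

Sorting the six samples recovers one point per stratum, which is the defining property of a Latin hypercube design in one dimension.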
## Experimental outcome variable(s) $Y$

### Basic Syntax

Similar to the `ParamSpace` object for input variables, the `Target` class handles the specification and preprocessing of experimental outcome variables. For each outcome measurement, there are three essential arguments:

* name: Variable name, a required input from the user.
* f_transform: Transformation function for preprocessing the raw response values, to facilitate numerical computation during optimization.
  - 'Identity': (default) No transformation
  - 'Standard': Normalization to zero mean and unit standard deviation
  - 'Logit_MinMax': Logit transformation, with the range or scale automatically calculated from the data
  - 'Logit_Percentage': Assuming the input response is a percentage ranging between 0 and 100, apply a logit transformation with scale 1/100
* aim: Either 'max' (default) or 'min', which specifies the desired direction of improvement.

Currently, only continuous outcome values are handled. Depending on the number of outcomes, define a single `Target` object or a list of multiple objects:

```python
from obsidian import Target

target = Target(name='Yield', f_transform='Logit_Percentage', aim='max')

target_multiple = [
    Target(name='Yield', f_transform='Logit_Percentage', aim='max'),
    Target(name='Cost', f_transform='Standard', aim='min')
]
```

### Example

To demonstrate the usage of the `Target` class, we simulate a single-task experimental outcome $y_0$ using the previously generated $X_0$ and the analytical function 'shifted_parab'.
```python
from obsidian.experiment import Simulator
from obsidian.experiment.benchmark import shifted_parab

simulator = Simulator(X_space, shifted_parab, name='Yield')
y0 = simulator.simulate(X0)

print(y0.to_markdown())
```

|    |   Yield |
|---:|--------:|
|  0 | 47.8147 |
|  1 | 62.5599 |
|  2 | 60.7972 |
|  3 | 39.1121 |
|  4 | 83.0833 |
|  5 | 52.2631 |

If $y_0$ is entered manually, it should be a pandas dataframe with the same variable name 'Yield' as specified in the `target` definition. When the 'transform_f' function is called with 'fit=True' during the optimization workflow, the raw response is saved as an attribute of the `target` object:

```python
y_transformed = target.transform_f(y0, fit=True)
type(target.f_raw)  # torch.Tensor
```

The `Target` object, as well as the input response 'f_raw' (if it exists), can be exported to dictionary format to facilitate saving to JSON files and reloading for future use:

```python
import json

with open('target.json', 'w') as f:
    target_dict = target.save_state()
    json.dump(target_dict, f)

with open('target.json', 'r') as f:
    target_dict = json.load(f)
target_reload = Target.load_state(target_dict)
```

## Use the campaign object to manage data

The `Campaign` class object acts as the central hub, seamlessly connecting all components of the APO workflow, including data management, the optimizer, and the experimental designer. It is the recommended approach, offering a more streamlined workflow than using each component separately. Here is an example of creating a `Campaign` object and adding the initial dataset to its 'data' attribute:

```python
import pandas as pd

from obsidian.campaign import Campaign

data_Iter0 = pd.concat([X0, y0], axis=1)
my_campaign = Campaign(X_space, target, seed=0)
my_campaign.add_data(data_Iter0)
```

The 'add_data' method appends each new batch of data to a single pandas dataframe, tagged with an incremental integer 'Iteration'.
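The bookkeeping can be sketched in plain pandas (an illustrative sketch of the assumed behavior, not the `Campaign` source; the column names and values here are made up):

```python
import pandas as pd

batches = []

def add_batch(new_data: pd.DataFrame) -> pd.DataFrame:
    """Tag a batch with the next iteration index and return the stacked data."""
    new_data = new_data.copy()
    new_data['Iteration'] = len(batches)  # 0 for the first batch, then 1, 2, ...
    batches.append(new_data)
    return pd.concat(batches, ignore_index=True)

iter0 = pd.DataFrame({'Temperature': [13.3, 6.7], 'Yield': [47.4, 61.4]})
iter1 = pd.DataFrame({'Temperature': [20.0], 'Yield': [55.0]})

add_batch(iter0)
data = add_batch(iter1)
print(data['Iteration'].tolist())  # [0, 0, 1]
```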
The new data should be a dataframe containing both the input experimental conditions and the target outcomes.

There are various ways to retrieve data from a `Campaign`:

```python
print(my_campaign.data.to_markdown())
```

|   Observation ID |   Temperature |   Concentration |    Enzyme | Variant   | StirRate   |   Yield |   Iteration |
|-----------------:|--------------:|----------------:|----------:|:----------|:-----------|--------:|------------:|
|                0 |      13.3333  |         68.3333 | 0.2275    | MRK003    | High       | 47.4471 |           0 |
|                1 |       6.66667 |        115      | 0.0825    | MRK003    | Low        | 61.3989 |           0 |
|                2 |      26.6667  |         45      | 0.0341667 | MRK002    | Medium     | 63.6213 |           0 |
|                3 |      20       |         91.6667 | 0.275833  | MRK001    | Low        | 43.4116 |           0 |
|                4 |      -6.66667 |         21.6667 | 0.179167  | MRK002    | Medium     | 84.5542 |           0 |
|                5 |       0       |        138.333  | 0.130833  | MRK001    | High       | 51.8577 |           0 |

and

```python
my_campaign.X
my_campaign.y
```
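Conceptually, these accessors amount to column selection on the combined dataframe. A minimal sketch with made-up data (not the `obsidian` source): the input columns come from the $X_{space}$ definition and the outcome columns from the `Target` definition.

```python
import pandas as pd

# a toy stand-in for the combined campaign dataframe
data = pd.DataFrame({
    'Temperature': [13.3333, 6.66667],
    'Variant': ['MRK003', 'MRK003'],
    'Yield': [47.4471, 61.3989],
    'Iteration': [0, 0],
})

X_cols = ['Temperature', 'Variant']  # input names, as in the X_space definition
y_cols = ['Yield']                   # outcome names, as in the Target definition

X = data[X_cols]
y = data[y_cols]
print(X.shape, y.shape)  # (2, 2) (2, 1)
```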