Parameter Setting
Our main GI algorithm script has many parameter settings available. Here we give a brief summary of the different parameters, the behaviours they control, and how to use them.
| Parameter | Type | Description |
|---|---|---|
| run_number | Integer | ID number for the current run, provided by the user |
| run_name | String | ID name for the current run, provided by the user |
| populationSize | Integer | Number of individuals in the population |
| generationCount | Integer | Number of generations to train for in the current run |
| sourceFile | String | File containing the original function code to be optimized |
| outFolder | String | Directory for output files |
| codeFile | String | File name given to the temporary code saved out during code assignment |
| function | String | Name of the function to be optimized |
| prefunction | String | File containing the component code preceding the function |
| postfunction | String | File containing the component code following the function |
| performanceTests | List of Strings | List of data files used to test the performance of evolved individuals |
| mutPercentage | Decimal | Percentage of the population in each generation to have a mutation applied |
| hgtPercentage | Decimal | Percentage of the population in each generation to have horizontal gene transfer (HGT) applied |
| mutWeights | List of Decimals | Probabilities of the different mutations being selected: insert before, insert after, modify, delete. The array is normalised, so it does not need to sum to 1. |
| metrics | List of Integers | Passed to code assessment to dictate the metric used to measure the assessment results |
| metric_interval | Integer | When multiple metrics are given, the frequency (in generations) with which to switch to the next metric |
| dynamic_weights | Boolean | If true, switches in the alternative mutation weights based on metric_interval |
| altMutWeights | List of Decimals | Alternative mutation weights, in the same format as mutWeights |
| performance_interval | Integer | Frequency (in generations) at which the performance of the individuals is tested on unseen data |
| elite | Integer | Number of top individuals to transfer directly into the next generation |
| current_metric | Integer | Initially set to the first metric in the metrics list; stores the current metric during the run |
| trainingFile | String | Data file used for training individuals |
| staticWords | Integer | Number of lines to take from the start of the data file |
| sampleWords | Integer | Number of lines to sample at random from the data file with uniform probability |
| synthWords | Integer | Number of words/lines to generate based on length and character frequency |
| synthFromSample | Boolean | Generate new words from the sampled words rather than the full file |
| log_all | Boolean | If true, every individual's code is saved; if false, only the best individual of each generation is saved. False is recommended, as the alternative has very high storage requirements. |
| currentWeights | List of Decimals | The mutation weights currently in use by the system; these change between generations when alternative weights are enabled |
| testerFile | String | The compiled ".o" file for the tester to be used |
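As an illustration of how these parameters fit together, the sketch below collects them into a single hypothetical configuration. All values are invented, and the container format (here a Python dict) is an assumption; only the parameter names come from the table above. Runtime fields such as current_metric and currentWeights are maintained by the script rather than set by the user, so they are omitted.

```python
# Hypothetical parameter set for one run; all values are illustrative only.
params = {
    "run_number": 1,
    "run_name": "hash_speed_run",
    "populationSize": 100,
    "generationCount": 50,
    "sourceFile": "original_hash.c",   # function code to be optimized
    "outFolder": "results/run_001/",
    "codeFile": "tmp_individual.c",    # temporary code file written each assessment
    "function": "hash",
    "prefunction": "pre.c",            # component code preceding the function
    "postfunction": "post.c",          # component code following the function
    "performanceTests": ["unseen_words_1.txt", "unseen_words_2.txt"],
    "mutPercentage": 0.5,              # half the population mutates each generation
    "hgtPercentage": 0.2,
    "mutWeights": [1, 1, 2, 1],        # insert before, insert after, modify, delete
    "metrics": [0, 1],                 # metric IDs understood by the tester
    "metric_interval": 10,             # switch to the next metric every 10 generations
    "dynamic_weights": True,
    "altMutWeights": [1, 1, 1, 3],
    "performance_interval": 5,         # test on unseen data every 5 generations
    "elite": 2,                        # top 2 individuals carried over unchanged
    "trainingFile": "words.txt",
    "staticWords": 0,
    "sampleWords": 500,
    "synthWords": 500,
    "synthFromSample": True,
    "log_all": False,                  # True has very high storage requirements
    "testerFile": "tester.o",
}
```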
Metrics
Our existing tester for hash functions supports three different metrics:
- Speed: \(s_n\), the measured clock time plus time penalties for individual \(n\)
- Speed and Short Code Length: \(s_n + 5l_n\), where \(l_n\) is the token length of the individual
- Speed and Long Code Length: \((s_n - (s_n \bmod 10)) + 5(200 - l_n)\), i.e. the speed rounded down to the nearest 10, plus a penalty that shrinks as the code approaches 200 tokens
Time penalties are 10 milliseconds each; this is enough to overcome timing noise and to affect the overall results, without being so large that a single error would cause the code to fail completely.
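As a worked check on these formulas, the sketch below computes the score for each metric from a measured speed and token count. The function name and signature are ours for illustration; in practice the compiled tester computes these scores internally.

```python
def fitness(speed_ms: float, tokens: int, metric: int) -> float:
    """Compute the fitness of one individual under the three metrics above.

    speed_ms: clock time plus any 10 ms error penalties (s_n)
    tokens:   token length of the individual's code (l_n)
    metric:   0 = speed, 1 = speed + short code, 2 = speed + long code
    """
    if metric == 0:
        return speed_ms                           # s_n
    if metric == 1:
        return speed_ms + 5 * tokens              # s_n + 5 * l_n
    if metric == 2:
        # Round the speed down to the nearest 10 ms, then penalise
        # code shorter than 200 tokens.
        return (speed_ms - speed_ms % 10) + 5 * (200 - tokens)
    raise ValueError(f"unknown metric {metric}")
```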
Data Treatments
These parameters are designed to allow four different data treatments:

- Static
- Sample
- Synthetic
- Synthetic and Sample mix
In each case we are referring to what is done with a pre-existing list of "words".

A static treatment draws words from the start of the file, so it returns the same set of words every generation. A sample instead draws words at random from the whole file, with an equal probability of selecting any word; this gives a different set of words every generation, with possible overlap depending on the size of the original data file. The synthetic treatment analyses the whole file of words for word length and character usage, then uses those distributions to generate a set of random string "words", giving a more distinct set of data for each generation. For a synthetic and sample mix, we take a smaller sample, use it to build the length and character-usage distributions, and then generate the rest of the "words" from those distributions. This gives not just a distinct set of words but a mix of real words and random strings, with a different letter distribution each time.
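To make the four treatments concrete, here is a minimal sketch of each one. The helper names and the use of Python's random module are our assumptions for illustration; the real script may implement the treatments differently.

```python
import random
from collections import Counter


def static_words(lines, n):
    """Static: the same first n lines every generation."""
    return lines[:n]


def sample_words(lines, n):
    """Sample: n lines drawn uniformly at random each generation."""
    return random.choices(lines, k=n)


def synth_words(lines, n):
    """Synthetic: random strings matching the length and character
    frequency distributions observed in the source lines."""
    lengths = [len(w) for w in lines]
    char_counts = Counter("".join(lines))
    alphabet = list(char_counts.keys())
    weights = list(char_counts.values())
    return [
        "".join(random.choices(alphabet, weights=weights,
                               k=random.choice(lengths)))
        for _ in range(n)
    ]


def synth_and_sample(lines, n_sample, n_synth):
    """Mix: draw a smaller sample, then synthesise further words
    from the sample's own distributions."""
    sample = sample_words(lines, n_sample)
    return sample + synth_words(sample, n_synth)
```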