Parameter Setting

Our main GI algorithm script has many different parameter settings available. Here we give a brief summary of the parameters and behaviours available and how to use them.

| Parameter | Type | Description |
| --- | --- | --- |
| run_number | Integer | ID number for the current run, provided by the user |
| run_name | String | ID name for the current run, provided by the user |
| populationSize | Integer | Number of individuals in the population |
| generationCount | Integer | Number of generations to train for in the current run |
| sourceFile | String | File containing the original function code to be optimized |
| outFolder | String | Directory for output files |
| codeFile | String | File name given to the temporary code saved out during code assignment |
| function | String | Name of the function to be optimized |
| prefunction | String | File containing the component code preceding the function |
| postfunction | String | File containing the component code following the function |
| performanceTests | List of Strings | List of data files used to test the performance of evolved individuals |
| mutPercentage | Decimal | Percentage of the population in each generation to have a mutation applied |
| hgtPercentage | Decimal | Percentage of the population in each generation to have horizontal gene transfer (HGT) applied |
| mutWeights | List of Decimals | Probability of each mutation type being selected: insert before, insert after, modify, delete. The list is normalised internally, so it does not have to sum to 1 |
| metrics | List of Integers | Passed to code assessment to dictate which metric is used to measure the assessment results |
| metric_interval | Integer | When multiple metrics are given, the number of generations between switches to the next metric |
| dynamic_weights | Boolean | If true, switches in the alternative mutation weights on the metric_interval schedule |
| altMutWeights | List of Decimals | Alternative mutation weights, in the same format as mutWeights |
| performance_interval | Integer | Number of generations between tests of the individuals' performance on unseen data |
| elite | Integer | Number of top individuals transferred directly into the next generation |
| current_metric | Integer | Initially set to the first metric in the metrics list; stores the current metric during the run |
| trainingFile | String | Data file used for training individuals |
| staticWords | Integer | Number of lines taken from the start of the data file |
| sampleWords | Integer | Number of lines sampled uniformly at random from the data file |
| synthWords | Integer | Number of words/lines generated based on length and character frequency |
| synthFromSample | Boolean | Generate new words from the sampled words rather than the full file |
| log_all | Boolean | If true, every individual's code is saved; if false, only the best individual of each generation is saved. False is recommended, as the alternative has very high storage requirements |
| currentWeights | List of Decimals | The weights currently in use by the system; these change between generations if alternative weights are in use |
| testerFile | String | The compiled “.o” file for the tester to be used |
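
For concreteness, the parameters might be collected into a settings block like the following sketch. All file names and values here are illustrative, and the real script's input format may differ; the last two lines also show why mutWeights need not sum to 1, since the weights are normalised before use.

```python
# Hypothetical parameter settings -- values and file names are
# illustrative only, not taken from a real run.
params = {
    "run_number": 1,
    "run_name": "hash_opt_demo",
    "populationSize": 100,
    "generationCount": 50,
    "sourceFile": "hash.c",
    "outFolder": "results/",
    "codeFile": "tmp_code.c",
    "function": "hash",
    "prefunction": "pre.c",
    "postfunction": "post.c",
    "performanceTests": ["words_test1.txt", "words_test2.txt"],
    "mutPercentage": 0.5,
    "hgtPercentage": 0.5,
    "mutWeights": [1, 1, 2, 1],   # insert before, insert after, modify, delete
    "metrics": [1, 2],
    "metric_interval": 10,
    "dynamic_weights": True,
    "altMutWeights": [1, 1, 1, 2],
    "performance_interval": 5,
    "elite": 2,
    "trainingFile": "words_train.txt",
    "staticWords": 0,
    "sampleWords": 500,
    "synthWords": 0,
    "synthFromSample": False,
    "log_all": False,
    "testerFile": "tester.o",
}

# mutWeights is normalised, so it does not have to sum to 1:
norm = [w / sum(params["mutWeights"]) for w in params["mutWeights"]]
```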

Metrics

Our existing tester for hash functions supports three different metrics:

  1. Speed - \(s_n\), just the clock time plus time penalties for individual \(n\)

  2. Speed and Short Code Length - \(s_n + 5l_n\), where \(l_n\) is the token length of individual \(n\)

  3. Speed and Long Code Length - \((s_n - (s_n \bmod 10)) + 5(200 - l_n)\)

Time penalties are 10 milliseconds each; this is enough to overcome the timing noise and to affect the overall results, without being so large that a single error would completely fail the code.
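
The three formulas above can be sketched as plain functions. This is an illustrative reimplementation, not the tester's actual code; here \(s_n\) is the measured time in milliseconds (already including any penalties) and \(l_n\) is the token length.

```python
# Illustrative sketch of the three metrics (not the tester's real code).
# s_n: measured time in ms, including time penalties.
# l_n: token length of the individual's code.

def metric_speed(s_n):
    # Metric 1: clock time plus time penalties only.
    return s_n

def metric_short_code(s_n, l_n):
    # Metric 2: speed plus 5 per token, favouring shorter code.
    return s_n + 5 * l_n

def metric_long_code(s_n, l_n):
    # Metric 3: speed truncated down to the nearest 10 ms, plus
    # 5 per token under a 200-token budget, favouring longer code.
    return (s_n - (s_n % 10)) + 5 * (200 - l_n)
```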

Data treatments

These parameters are designed to allow for four different data treatments:

  1. Static

  2. Sample

  3. Synthetic

  4. Synthetic and Sample mix

In this case we are referring to what we do with a pre-existing list of “words”.

A static treatment draws words from the start of the file, so it returns the same set of words every generation.

A sample treatment draws words at random from the whole file, with an equal probability of selecting any word. This gives a different set of words every generation, with possible overlap between generations depending on the size of the original data file.

The synthetic treatment analyses the whole file of words for word length and character usage; those distributions are then used to generate a set of random string “words”, giving a more distinct set of data for each generation.

For a synthetic and sample mix, we take a smaller sample, use that sample to build the length and character-usage distributions, and then generate the rest of the “words” from those distributions. This gives not just a distinct set of words, but a mix of real words and random strings, with a different letter distribution each time.
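
The four treatments can be sketched as follows, assuming the data file has already been read into a list of words. The function names and signatures here are illustrative, not the script's actual API:

```python
import random
from collections import Counter

def static_words(words, n):
    # Static: the first n words -- identical every generation.
    return words[:n]

def sample_words(words, n, rng=random):
    # Sample: n distinct words drawn uniformly at random -- a different
    # set each generation, with possible overlap between generations.
    return rng.sample(words, n)

def synth_words(words, n, rng=random):
    # Synthetic: build length and character-frequency distributions
    # from the given words, then generate n random-string "words".
    lengths = [len(w) for w in words]
    counts = Counter("".join(words))
    alphabet = list(counts)
    weights = [counts[c] for c in alphabet]
    return ["".join(rng.choices(alphabet, weights=weights, k=rng.choice(lengths)))
            for _ in range(n)]

def mixed_words(words, n_sample, n_synth, rng=random):
    # Mix: take a smaller sample, then synthesise the remaining
    # words from that sample's distributions.
    sample = rng.sample(words, n_sample)
    return sample + synth_words(sample, n_synth, rng)
```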