Natural Language
This utility component was built to aid in data synthesis.
readFile()
Inputs: char filename[]
Output: String wordlist[]
readFile() takes a file name and reads the file line by line, converting the characters on each line to a String object and appending it to a list, which is returned at the end.
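The line-by-line read described above can be sketched in Python as follows. This is only an illustration: the name read_file, the UTF-8 encoding, and the choice to strip line endings while keeping blank lines are assumptions, not details confirmed by this component.

```python
def read_file(filename: str) -> list[str]:
    """Return one string per line of the file (hypothetical Python analogue of readFile())."""
    wordlist = []
    # Assumption: the file is UTF-8 text with one word per line.
    with open(filename, encoding="utf-8") as handle:
        for line in handle:
            # Assumption: line endings are not part of the word.
            wordlist.append(line.rstrip("\r\n"))
    return wordlist
```

The returned list can then be passed to a distribution step or used directly as a word list.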
distributionFromFile()
Inputs: char filename[]
Output: [Distribution length, Distribution letter]
This function takes the file name and returns a list containing the distribution of word lengths and the distribution of character (letter) usage.
distribution()
Inputs: String wordlist[]
Outputs: [Distribution length, Distribution letter]
Takes a wordlist, which can be the output of readFile(), counts word lengths and letter occurrences, and generates the two distributions from those counts.
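A minimal Python sketch of the counting step, assuming each distribution is stored as a mapping from item to proportion (the actual Distribution type is not described here). The name distribution mirrors the function above; the dict representation is hypothetical.

```python
from collections import Counter


def distribution(wordlist):
    """Return [length_dist, letter_dist], each a dict mapping item -> proportion."""
    # Count how many words have each length, and how often each character occurs.
    length_counts = Counter(len(word) for word in wordlist)
    letter_counts = Counter(ch for word in wordlist for ch in word)

    total_words = sum(length_counts.values())
    total_letters = sum(letter_counts.values())

    # Normalise counts to proportions (assumption: proportions, not raw counts).
    length_dist = {n: c / total_words for n, c in length_counts.items()}
    letter_dist = {ch: c / total_letters for ch, c in letter_counts.items()}
    return [length_dist, letter_dist]
```

For the list ["ab", "abc", "b"], each of the lengths 1, 2 and 3 gets proportion 1/3, and the letters a, b and c get 1/3, 1/2 and 1/6 respectively.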
synthDataFromFile()
Inputs: char file[], int no_words
Outputs: String wordlist[]
Takes a file name and a number of words to generate. It computes the word-length and letter distributions for the file and, based on those, generates the requested number of random strings with the same length and letter distributions.
synthDataFromStringList()
Inputs: String list[], int no_words
Outputs: String wordlist[]
Takes a list of words and a number of words to generate. It computes the word-length and letter distributions for the list and, based on those, generates the requested number of random strings with the same length and letter distributions.
synthDataFromDistributions()
Inputs: Distributions dists[], int no_words
Outputs: String wordlist[]
Takes the list of distributions, which can be the output of distribution() or distributionFromFile(), and a number of words to generate. It generates the requested number of random strings with the same length and letter distributions.
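One way the generation step might work, sketched in Python. It assumes the distributions are dicts of item -> proportion, that each word's length is sampled first, and that letters are then drawn independently of one another; none of these details are confirmed by the description above. The rng parameter is a hypothetical addition for reproducibility.

```python
import random


def synth_data_from_distributions(dists, no_words, rng=None):
    """Generate no_words random strings from [length_dist, letter_dist]."""
    rng = rng or random.Random()
    length_dist, letter_dist = dists

    lengths, length_weights = list(length_dist), list(length_dist.values())
    letters, letter_weights = list(letter_dist), list(letter_dist.values())

    words = []
    for _ in range(no_words):
        # Sample a word length, then that many letters (assumed independent draws).
        n = rng.choices(lengths, weights=length_weights, k=1)[0]
        words.append("".join(rng.choices(letters, weights=letter_weights, k=n)))
    return words
```

Because letters are drawn independently in this sketch, the output matches the length and letter frequencies only on average and does not preserve letter order or letter pairs from the source words.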
printDist()
Inputs: Distribution dist
Prints each "item: proportion" pair in the distribution. Useful for checking for issues in the data.
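A short Python sketch of such a printout, assuming a distribution is a dict of item -> proportion. The helper format_dist, the sorting by item, and the four-decimal formatting are hypothetical choices made for illustration.

```python
def format_dist(dist):
    """Return one 'item: proportion' line per entry, sorted by item."""
    return [f"{item}: {proportion:.4f}" for item, proportion in sorted(dist.items())]


def print_dist(dist):
    """Print the 'item: proportion' pairs (hypothetical analogue of printDist())."""
    for line in format_dist(dist):
        print(line)
```

Scanning the output for unexpected items, such as punctuation or whitespace characters in a letter distribution, is a quick way to spot problems in the input data.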