Hello world in nitroproc
Essentially, you write scripts and pass them to nitroproc. Each script contains one or more instructions (for example merging, sorting, or subsetting your data, among many others). Let's see, for example, how you would sort a dataset and store the result as a CSV file.
Download csv data file (300M)
In this case we will sort a file containing 12,277,440 observations and 4 variables by two keys; most software will have problems processing a file of this size. The first column has several missing values.
file specifies the input file we want to sort.
coltypes specifies the column types we expect (check the documentation for the available types).
In this case we expect the first two columns to be dates in the dd/mm/yyyy format, the third column to be integers, and the final one to be strings.
headers is an optional argument that gives the column headers (if the input file already contains the headers in its first row, omit this parameter).
order specifies whether each key should be sorted in ascending or descending order.
by specifies the columns to sort by.
outname specifies the name of the output file.
out_first_row specifies whether the headers should be saved in the first row of the output.
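To make the effect of by and of the dd/mm/yyyy column type concrete, here is a minimal Python sketch of the same kind of two-key sort on a handful of in-memory rows. It is not nitroproc code and says nothing about nitroproc's script syntax; the names parse_date, sort_key and sort_rows, the sample rows, and the choice to place missing dates first in ascending order are all assumptions made for illustration.

```python
from datetime import date, datetime

def parse_date(text):
    """Parse a dd/mm/yyyy string; return None for a missing (empty) value."""
    text = text.strip()
    return datetime.strptime(text, "%d/%m/%Y").date() if text else None

def sort_key(row):
    # Sort by the first two columns (both dates).
    # Missing dates are placed first: an assumption made for this sketch.
    d1, d2 = parse_date(row[0]), parse_date(row[1])
    return (d1 is not None, d1 or date.min, d2 is not None, d2 or date.min)

def sort_rows(rows):
    """Return the rows sorted in ascending order by the two date keys."""
    return sorted(rows, key=sort_key)

# Hypothetical sample: two date columns, an integer column, a string column.
rows = [
    ["05/03/2014", "01/01/2015", "7", "b"],
    ["", "02/01/2015", "3", "a"],
    ["05/03/2014", "31/12/2014", "1", "c"],
    ["01/02/2013", "10/10/2016", "9", "d"],
]
result = sort_rows(rows)
for row in result:
    print(row)
```

The row with the empty first column comes out first, and the two rows sharing 05/03/2014 are ordered by the second key.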
Before executing the program, you need to make sure that the folder C:/nitroproc exists. This folder will be used for internal calculations.
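If you drive nitroproc from Python, the folder can be created from the same code. A minimal sketch using only the standard library; the helper name ensure_work_folder is hypothetical, and the default path is the one given above.

```python
from pathlib import Path

def ensure_work_folder(path="C:/nitroproc"):
    """Create the working folder if it does not exist yet and return it."""
    folder = Path(path)
    # exist_ok=True makes the call safe to repeat on every run.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Calling ensure_work_folder() once before launching nitroproc is enough; calling it again does nothing if the folder is already there.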
Now, there are two options:
1. Open nitroproc.exe and enter the path of the file containing the script you want to process. This is probably the easier option.
2. The other option is the one you would typically use when calling nitroproc programmatically from other software such as Python or R: open the command prompt, go to the folder where nitroproc is installed, and type nitroproc.exe C:/mylocation/script1.txt, where the argument passed to nitroproc is the path to the script you want to process. Note: when calling nitroproc programmatically, your external code would first write the script, and would then make a system call that executes nitroproc with the script path as its argument, just as we do here.
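The programmatic route in option 2 could look roughly like this in Python. This is a sketch under assumptions, not an official wrapper: the function names build_command and run_script are hypothetical, the script text is supplied by the caller (nitroproc's script syntax is not reproduced here), and the only command-line usage relied on is the one shown above, nitroproc.exe followed by the script path.

```python
import subprocess
from pathlib import Path

def build_command(script_path, exe="nitroproc.exe"):
    """Return the argument list for: nitroproc.exe <script path>."""
    return [exe, str(script_path)]

def run_script(script_text, script_path, install_dir):
    """Write the script to disk, then call nitroproc on it.

    script_text is the nitroproc script itself, written by the caller.
    install_dir is the folder where nitroproc.exe is installed.
    """
    script_path = Path(script_path)
    script_path.write_text(script_text)
    exe = str(Path(install_dir) / "nitroproc.exe")
    # cwd mirrors "go to the folder where nitroproc is installed".
    return subprocess.run(build_command(script_path, exe), cwd=install_dir)
```

Whether nitroproc reports failures through its exit code is not stated here, so the sketch does not check it; inspect the log file after the call to see whether any error was detected.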
output file: this contains the output, in this case sorted by the two columns we specified.
log file: this file is created with the same name and path as your script, but with a log extension. It is a very useful tool for seeing how many variables and observations were read, whether any error was detected, and how long the run took. Of course, in a real example we would use many instructions in the same script, and all of them would be displayed here.
The first line displays the instruction that was read. The log then displays other information about how that instruction was parsed. The blocksize was not defined in the script, so it was assigned the default value of 1,000,000. It is the maximum size of the internal vectors used to sort. Any value is fine here: with a small one the process takes longer but runs smoothly, while with a large one it is faster but may hit a memory allocation exception. The default of 1,000,000 typically implies an overall use of about 400 MB of RAM.
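As a back-of-the-envelope aid for choosing a blocksize, the figures above can be turned into a rough estimate. The sketch below rests on two assumptions that are not confirmed by nitroproc's documentation: that memory use grows linearly with the blocksize, and that the data is processed in blocks of at most blocksize rows. The function names are hypothetical.

```python
import math

DEFAULT_BLOCKSIZE = 1_000_000   # default value stated in the text
DEFAULT_MEMORY_MB = 400         # typical RAM use at the default, per the text

def estimated_memory_mb(blocksize):
    """Rough RAM estimate, ASSUMING memory grows linearly with blocksize."""
    return DEFAULT_MEMORY_MB * blocksize / DEFAULT_BLOCKSIZE

def blocks_needed(rows, blocksize):
    """Number of blocks, ASSUMING the data is split into blocks of at most blocksize rows."""
    return math.ceil(rows / blocksize)

print(estimated_memory_mb(250_000))          # 100.0
print(blocks_needed(12_277_440, 1_000_000))  # 13
```

Under these assumptions, a quarter of the default blocksize would need roughly 100 MB, and the example file would be handled in 13 blocks at the default setting.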
You can then see the number of observations read and written. As you can see, it took around 18 minutes to sort the file by two date keys. A very nice feature is that the log automatically prints the first observations of the result. This is very useful when building very long scripts, where you tend to lose sight of the structure of the data at any given point (because the first column has missing values, you can see empty values for the first variable). Finally, the log prints the number of instructions processed and some other general statistics.
logtracer file: similar to the log file, this file is created when the script is executed. It has the same name as the script, but with a .logtracer extension. Like the log file, it can be opened using any text editor, such as Notepad. This file contains the logical structure of your program and is very useful when executing multiple instructions, because it shows how the different instructions relate to each other. For example, you might execute a where statement to produce file A, then run some unrelated instructions, then do a summary (group-by sum) to produce file B, and finally merge A and B together. This file will show you how that A-B merge was executed and which intermediate instructions led to it: in this case a where statement and a summary() one. Since this is a very simple script, we see just one instruction.
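When nitroproc is called from other code, it helps to know where these two files will appear. A small Python sketch that derives their paths from the script path; the helper name companion_files is hypothetical, and it assumes the script's own extension is replaced (script1.txt gives script1.log) rather than appended to.

```python
from pathlib import PureWindowsPath

def companion_files(script_path):
    """Return the expected (log, logtracer) paths for a given script path."""
    script = PureWindowsPath(script_path)
    # ASSUMPTION: the script's extension is replaced, not appended to.
    log = script.with_suffix(".log")
    tracer = script.with_suffix(".logtracer")
    return log.as_posix(), tracer.as_posix()

log_path, tracer_path = companion_files("C:/mylocation/script1.txt")
print(log_path)
print(tracer_path)
```

If your installation names the files differently, adjust the two suffix lines accordingly.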