Willows
Citation
- H Zhang, M Wang and X Chen. Willows: A Memory Efficient Tree and Forest Construction Software. BMC Bioinformatics. 10:130, 2009.
- H Zhang and B Singer. Recursive Partitioning in Health Sciences. Springer, 1999.
- H Zhang, C Yu, and B Singer, B. Cell and Tumor Classification using Gene Expression Data: Construction of Forests. Proceedings of the National Academy of Sciences USA, 100: 4168-4172, 2003.
- L Breiman L. Random forests. Machine Learning, 45: 5-32, 2001.
Condition of Use
Willows can be used and distributed free of charge under the following terms and conditions.
Any research using this program or the methods or ideas behind it should acknowledge the use of Willows, and cite at least the first reference in the Citation section.
THIS SOFTWARE IS PROVIDED BY THE COLLABORATIVE CENTER FOR STATISTICS IN SCIENCE AT YALE UNIVERSITY AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COLLABORATIVE CENTER FOR STATISTICS IN SCIENCE AT YALE UNIVERSITY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Versions
Methodology
Input File Formats
(I) Main input file
Willows supports input files in the same simple format as in RTREE: the first line indicates the variable type (response, nominal or ordinal) with no particular order. R or r implies the response; D or "d" implies deleting that covariate from the analysis; O or "o" implies a continuous or an ordinal covariate; N or "n" implies nominal covariate. The columns should be separated by spaces or tabs (any number of these is allowed). In the predictors matrix, NA or . represent missing value. Response can only be integers, missing value in response column is not allowed. An improvement from the RTREE is that it reads an extra line for the name of each covariate.
We explain this file using the sample.txt for the general classification purpose, and sample_SNP.txt for the use involving SNPs as predictors. The format of sample_SNP.txt is the same as sample.txt, except that the SNP predictors must be coded as 0, 1, or 2 if available, and NA or . (without the quotes) if it is missing.
The first row in sample.txt and sample_SNP.txt is
r o o o o o o o o o o o o o o o o o o o o |
The first letter r means that the first column represents the response, and the other o specifies that all predictors are ordinal and none of the predictors will be excluded.
The second row in sample.txt and sample_SNP.txt is
a a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 ... |
The first letter a is the name for the response column, and a1 specifies the name of each covariate.
The data begin from the second row. Specifically, the second row in sample.txt is as follows:
0 0.2006632 0.588495 0.5081931 0.1813249 ... |
and the second row in sample_SNP.txt is as follows:
0 1 0 1 0 0 2 1 1 1 1 2 2 0 0 2 1 2 0 1 1 |
The remaining rows are organized similarly.
(II) Additional input file
To use the prediction function that predicts an outcome based on the predictors, additional data files are needed as inputs for prediction.
The format of additional input files is the same as the main input file with one exception. The response column is not required. If it is included, it will be ignored for the prediction purpose.
Output file formats
All output files are all saved in local drives for future review and analysis. They can be imported back to Willows at any time.
(I) .inf file.
The .inf file describes the datasets and the tree structures.
Part I: Summary of the data set
It describes the number of covariates, the types of the covariates, the frequencies or ranges of the covariates
Part II: Information for tree reconstruction
Each tree is described in by one row. In each row, the 1st number is the serial ID of the tree. The remaining numbers describe the nodes in the tree.
An internal node is characterized as follows: (1) the node number within the tree (node 1 is the root node); (2) the sequence number for the left daughter node (when there is no daughter node, the node is terminal, and in this case, 0 is used to indicate a terminal node); (3) the sequence number of the splitting variable; and (4) the corresponding splitting value (a numerical splitting value means that the splitting variable is ordinal and a set of integers means that the splitting variable is nominal).
A terminal node is characterized as follows: (1) the node number within the tree (node 1 is the root node); (2) 0 (because it has no daughter node); and (3) the frequency counts of the response variable (one for each level of the response).
(II) .importance file
The .importance file records the out-of-bag error and the raw importance scores of each variable calculated by the random forest method.
The format is as follows:
The 1st line: the out of bag error rate.
The 2nd line: the number of samples
The 3rd line: string: index importance
The 4th line and the rest:
Column 1: variable index
Column 2: importance score
(III) .count file
The .count file records the frequency counts that each predictor is used for splitting in all trees of a deterministic forest.
The format is as follows:
The 1st line: number of variables selected
The 2nd line: string: index count
The 3rd line and the rest:
Column 1: variable index
Column 2: the count of a variable appearing as a splitting variable in the forest
(IV) .pre file
.pre file records the prediction results of the test datasets.
The format is described below:
The 1st line: the number of samples for prediction
The 2nd line through the end: the predicted responses. For a forest, one prediction is made for each sample. For multiple trees (RTREE), one prediction is made by each of the trees and organized in one column.
Downloads
Executables
- Willows_Windows.exe: Self-extracting archive for installation on Windows system.
- Willows_Linux32.tar.gz: GZIP-compressed Tar file containing executable on 32-bit Linux system. (The javax.swing package is required to be installed in your system before running Willows.)
- Willows_Linux64.tar.gz: GZIP-compressed Tar file containing executable on 64-bit Linux system. (The javax.swing package is required to be installed in your system before running Willows.)
Sample Files
- sample.txt: A sample for the input file.
- sample_SNP.txt: A sample for inputting SNP data.
- test.txt: A sample for prediction.
-
test_SNP.txt: A sample for prediction with SNP data.
- rtree_sample.txt.inf: Classification tree output file - data description and tree structure.
- rtree_test.txt.pre: Classification tree output file - prediction results.
- rtree_SNP_sample_SNP.txt.inf: Classification tree output file with SNP data - data description and tree structure.
-
rtree_SNP_test_SNP.txt.pre: Classification tree output file with SNP data - prediction results.
- rforest_sample.txt.inf: Random forest output file - data description and tree structure.
- rforest_sample.txt.importance: Random forest output file - variable importance score.
- rforest_test.txt.pre: Random forest output file - prediction results.
- rforest_SNP_sample_SNP.txt.inf: Random forest output file with SNP data - data description and tree structure.
- rforest_SNP_sample_SNP.txt.importance: Random forest output file with SNP data - variable importance score.
-
rforest_SNP_test_SNP_.txt.pre: Random forest output file with SNP data - prediction results.
- dforest_sample.txt.inf: Deterministic forest output file - data description and tree structure.
- dforest_sample.txt.count: Deterministic forest output file - variable count.
- dforest_test.txt.pre: Deterministic forest output file - prediction results.
- dforest_SNP_sample_SNP.txt.inf: Deterministic forest output file with SNP data - data description and tree structure.
- dforest_SNP_sample_SNP.txt.count: Deterministic forest output file with SNP data - variable count.
-
dforest_SNP_test_SNP.txt.pre: Deterministic forest output file with SNP data - prediction results.
Running Willows
a. Running Willows with user-friendly graphical interface on Windows or Linux.
1. After downloading the executables for either Windows or Linux according to your operating system, and extracting all files from the archive, double click on Willows_win.jar to start the program.
2. To analyze a dataset, click on File menu and click on load file to add an input file.
Please note that if you are running Willows with the GUI in Linux, please make sure there is no white space in the file names or paths. White spaces in file names and paths will cause a Wrong number of parameters error.
Click Run button to perform the computation. At any time you can click the Stop button to stop the calculation.
6. After the computation is complete, Willows displays the results in new windows.
i. Trees structure
Click <<, <, >, >> to go to the first tree, the previous tree, the next tree, and the last tree, respectively. Users can also enter a specific tree number and browse it. The Save button saves the current tree structure into a png file. The following window shows the first six layers of a tree, which is in general the most important part of the tree.
Internal and terminal nodes are represented by ellipsoids and rectangles, respectively. The frequency counts of the response are displayed inside each node, and the splitting variable and the corresponding thresholds are provided for internal nodes.
b. Running Willows from the command line in Windows.
After downloading the executables for Windows and extracting all files from the archive, there are six executable files: rtree.exe, rtree_SNP.exe, rforest.exe, rforest_SNP.exe, dforest.exe and dforest_SNP.exe in the folder, representing the classification tree, random forest, deterministic forest methods and their memory compression versions, respectively.
To run any of these programs, open the command line window, go to the folder where the programs and files are saved, and run the executable program. The details of the parameters are listed below.
Please note that if the files are not located in the same folder as the executable, use the full path instead of just the file names. Please make sure if there is white space in the file names or paths, the names or paths should be quoted.
exe file | parameter |
rtree
rtree_SNP |
-t input file name
-s 1 (default, for entropy impurity) or 2 (gini impurity) -p test file name -r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included). -minnode a positive number representing minimum size of terminal nodes (default 1). |
rforest
rforest_SNP |
-t input file name
-s 1 (default, for entropy impurity) or 2 (gini impurity) -p test file name -r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included). -n the number of trees in a forest (default 1000) -m the number of allowable variables to split a node (default log2(number of variables)) minnode a positive number representing minimum size of terminal nodes (default 1). |
dforest
dforest_SNP |
-t input file name
-s 1 (default, for entropy impurity) or 2 (gini impurity) -p test file name -r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included). -o the number of top splits for the root node (default 20) -n the number of top splits for the internal nodes (the root node excluded) (default 3). minnode a positive number representing minimum size of terminal nodes (default 1). |
c. Running Willows from the command line in Linux.
After downloading the executables for Linux and extracting all files using command: tar xvf Willows_Linux32.tar.gz or tar xvf Willows_Linux64.tar.gz, there are six executable files: rtree, rtree_SNP, rforest, rforest_SNP, dforest and dforest_SNP in the folder, representing the classification tree, random forest, deterministic forest methods and their memory compression versions, respectively. The input parameters are the same as those described above.
Please note that if the files are not located in the same folder as the executables, use the full path instead of just the file names. Please make sure there is no white space in the file names or paths. White spaces in file names and paths will cause a Wrong number of parameters error.