
Update instructions

Markus Borg 5 years ago
Commit
db31c0e028
1 file changed, 27 additions, 28 deletions
code/README.md

@@ -20,17 +20,17 @@ All scripts and compilations has been tested on Linux and Mac.
 To use Docker, one needs Docker installed as well.
 
 # Table of Contents
-1. [Usage SZZ algorithm](#szz_usage)
-2. [SZZ with Docker](#szz_docker)
-3. [Feature Extraction](#feat_extract)
+1. [Running SZZ Unleashed](#szz_usage)
+2. [SZZ Unleashed with Docker](#szz_docker)
+3. [Example Application: Training a Classifier for Just-in-Time Bug Prediction](#feat_extract)
 4. [Authors](#authors)
 
-## Usage SZZ algorithm <a name="szz_usage"></a>
+## Running SZZ Unleashed <a name="szz_usage"></a>
 The figure shows the SZZ Unleashed workflow, i.e., running three Python scripts followed by executing the final jar-file.
 
 ![SZZ Unleashed workflow](/workflow.png)
 
-### Grab issues ###
+### Fetch issues ###
 To get issues, one needs a bug tracking system. As an example, the Jenkins project uses [JIRA](https://issues.jenkins-ci.org).
 From there, it is possible to fetch issues that we can then link to bug-fixing commits.
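 
 For illustration, a minimal sketch (not the project's fetch script) of querying resolved Jenkins bug issues through the standard JIRA REST API; the JQL filter and the output file name are assumptions:
 
 ```python
 # Sketch: fetch one batch of resolved bug issues from the Jenkins JIRA.
 # The JQL filter and "issues.json" are illustrative assumptions.
 import json
 import requests
 
 JIRA_URL = "https://issues.jenkins-ci.org/rest/api/2/search"
 JQL = "project = JENKINS AND issuetype = Bug AND resolution = Fixed"
 
 def fetch_issues(start_at=0, max_results=50):
     params = {"jql": JQL, "startAt": start_at, "maxResults": max_results}
     response = requests.get(JIRA_URL, params=params, timeout=30)
     response.raise_for_status()
     return response.json()["issues"]
 
 with open("issues.json", "w") as f:
     json.dump(fetch_issues(), f, indent=2)
 ```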
 
@@ -99,7 +99,7 @@ way and it includes duplicates when it comes to both introducers and fixes. A
 fix can be made several times and an introducer could be responsible for many
 fixes.
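 
 Since the list is unordered and contains duplicates, a short post-processing sketch may help; it assumes each entry is a [fix, introducer] pair of commit SHA-1s:
 
 ```python
 # Sketch: deduplicate (fix, introducer) pairs from the SZZ output.
 # Assumption: the file is a JSON list of two-element SHA-1 lists.
 import json
 
 with open("fix_and_bug_introducing_pairs.json") as f:
     pairs = json.load(f)
 
 unique_pairs = {tuple(pair) for pair in pairs}
 introducers = {intro for _fix, intro in unique_pairs}
 print(f"{len(unique_pairs)} unique pairs, {len(introducers)} unique introducers")
 ```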
 
-## Use Docker to generate fix_and_bug_introducing_pairs.json. <a name="szz_docker"></a>
+## Use SZZ Unleashed with Docker <a name="szz_docker"></a>
 
 There exists a *Dockerfile* in the repository. It contains, in chronological order, all the steps needed to generate the **fix\_and\_bug\_introducing\_pairs.json**. Simply run this command in the directory where the Dockerfile is located:
 
@@ -123,9 +123,8 @@ Note that the temporary container must be running while the *docker cp* command
 docker ps
 ```
 
-## Feature Extraction <a name="feat_extract"></a>
-Now that the potential bug-introducing commits has been identified, the
-repository can be mined for features.
+## Example Application: Training a Classifier for Just-in-Time Bug Prediction <a name="feat_extract"></a>
+To illustrate what the output from SZZ Unleashed can be used for, we show how to train a classifier for just-in-time bug prediction, i.e., predicting whether individual commits are bug-introducing or not. We now have a set of bug-introducing commits and a set of correct commits. We proceed by representing individual commits by a set of features, based on previous research on bug prediction.
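+
+A minimal sketch of turning the two sets into labeled training data; it assumes the SZZ output is a JSON list of [fix, introducer] pairs and that all_commits.txt (a hypothetical file) lists one SHA-1 per line:
+
+```python
+# Sketch: label each commit as bug-introducing (1) or correct (0).
+import json
+
+with open("fix_and_bug_introducing_pairs.json") as f:
+    introducers = {intro for _fix, intro in json.load(f)}
+
+# all_commits.txt is an assumed file with one commit SHA-1 per line.
+with open("all_commits.txt") as f:
+    commits = [line.strip() for line in f if line.strip()]
+
+labels = {sha: int(sha in introducers) for sha in commits}
+```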
 
 ### Code Churns ###
 The simplest features are the code churns. These are easily extracted by
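 
 To illustrate the idea (this is not the project's own script), a sketch computing a basic churn, lines added plus deleted, for a single commit with plain git:
 
 ```python
 # Sketch: lines added plus deleted in one commit, via `git show --numstat`.
 import subprocess
 
 def churn(repo, sha):
     out = subprocess.run(
         ["git", "-C", repo, "show", "--numstat", "--format=", sha],
         capture_output=True, text=True, check=True).stdout
     added = deleted = 0
     for line in out.splitlines():
         if not line.strip():
             continue
         a, d, _path = line.split("\t", 2)
         if a != "-":  # binary files are reported as "-"
             added += int(a)
             deleted += int(d)
     return added + deleted
 ```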
@@ -151,32 +150,32 @@ To extract the diffusion features, just run:
 `python assemble_diffusion_features.py --repository <path_to_repo> --branch <branch>`
 
 ### Experience Features ###
-Maybe the most uncomfortable feature group. The experience features are the
-features that measures how much experience a developer has, both how recent
-but also how much experience the developer has overall with the code.
+This is perhaps the most sensitive feature group. The experience features
+measure how much experience a developer has, calculated from both overall
+activity in the repository and recent activity (see the sketch after the list below).
 
 The features are:
 
 1. Overall experience.
 2. Recent experience.
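 
 A minimal sketch of the two measures, assuming the history is available as chronologically ordered (author, date) tuples; the 90-day recency window is an assumption, and the actual script builds a graph over the repository instead:
 
 ```python
 # Sketch: overall and recent experience counts for one author.
 from datetime import timedelta
 
 def experience(history, author, now):
     """history: list of (author, commit_date) tuples in chronological order."""
     overall = sum(1 for a, _ in history if a == author)
     recent = sum(1 for a, d in history
                  if a == author and now - d <= timedelta(days=90))
     return overall, recent
 ```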
 
-The script builds a graph to keep track of each authors experience. So the intial
+The script builds a graph to keep track of each author's experience. The initial
 run is:
 `python assemble_experience_features.py --repository <repo_path> --branch <branch> --save-graph`
 
-This will result in a graph which the script could use for future analysis
+This results in a graph that the script below uses for future analysis.
 
 To rerun the analysis without generating a new graph, just run:
 `python assemble_experience_features.py --repository <repo_path> --branch <branch>`
 
 ### History Features ###
-The history are as follows:
+The history is represented by the following features (see the sketch after the list below):
 
 1. The number of authors in a file.
 2. The time between contributions made by the author.
 3. The number of unique changes since the last commit.
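 
 As an illustration of the first measure (the actual script saves such metadata in a graph), a sketch with plain git:
 
 ```python
 # Sketch: number of distinct author emails that have touched a file.
 import subprocess
 
 def author_count(repo, path):
     out = subprocess.run(
         ["git", "-C", repo, "log", "--format=%ae", "--", path],
         capture_output=True, text=True, check=True).stdout
     return len(set(out.split()))
 ```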
 
-The same as with the experience features, the script must initially generate a graph
+Analogous to the experience features, the script must initially generate a graph
 where the file metadata is saved.
 `python assemble_history_features.py --repository <repo_path> --branch <branch> --save-graph`
 
@@ -184,22 +183,22 @@ To rerun the script without generating a new graph, use:
 `python assemble_history_features.py --repository <repo_path> --branch <branch>`
 
 ### Purpose Features ###
-The purpose feature is just a single feature and that is if the commit is a fix o
-not. To extract it use:
+The purpose feature is just a binary feature representing whether a commit is a fix or
+not. This feature can be extracted by running:
 
 `python assemble_purpose_features.py --repository <repo_path> --branch <branch>`
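 
 To illustrate the idea only, a keyword-based sketch; the keyword list is an assumption, not necessarily how the script derives the label:
 
 ```python
 # Sketch: 1 if the commit message suggests a bug fix, else 0.
 import re
 
 def is_fix(message):
     return int(bool(re.search(r"\b(fix(e[sd])?|bug|defect)\b", message, re.IGNORECASE)))
 ```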
 
 ### Coupling ###
-A more complex number of features are the coupling features. These indicates
+A more complex group of features are the coupling features. These indicate
 how strong the relation is between files and modules for a revision. This means
-that two files can have a realtion even though they don't have a realtion
-inside the source code itself. So by mining these, features that gives
-indications in how many files that a commit actually has made changes to are
+that two files can have a relation even though they don't have a relation
+inside the source code itself. By mining these, we obtain features that
+indicate how many files a commit has actually changed.
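 
 A sketch of the underlying logical-coupling idea, assuming the history is available as a list of sets of file paths changed together (the actual mining is done by code-maat, see below):
 
 ```python
 # Sketch: count how often each pair of files changes in the same commit.
 from collections import Counter
 from itertools import combinations
 
 def co_change_counts(commits):
     counts = Counter()
     for files in commits:
         counts.update(combinations(sorted(files), 2))
     return counts
 ```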
 
-The mining is made by a docker image containing the tool code-maat.
+The mining is done with a Docker image containing the tool code-maat.
 
-These features takes long time to extract but is mined using:
+Note that calculating these features is time-consuming. They are extracted by:
 
 ```python
 python assemble_features.py --image code-maat --repo-dir <path_to_repo> --result-dir <path_to_write_result>
@@ -210,16 +209,16 @@ It is also possible to specify which commits to analyze. This is done with the
 CLI option `--commits <path_to_file_with_commits>`. The format of this file is
 simply one commit SHA-1 per line.
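 
 Such a file can, for example, be produced from the repository itself (a sketch; commits.txt is an assumed name):
 
 ```python
 # Sketch: write all SHA-1s reachable from HEAD, one per line.
 import subprocess
 
 shas = subprocess.run(["git", "-C", "<path_to_repo>", "rev-list", "HEAD"],
                       capture_output=True, text=True, check=True).stdout
 with open("commits.txt", "w") as f:
     f.write(shas)
 ```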
 
-If the analyzation is made by several docker containers, one has to specify
+If the analysis is split across several Docker containers, one has to specify
 the `--assemble` option. This will collect and store
 all results in a single directory.
 
-The script is capable of checking if the are any commits that haven't been
+The script can check if there are any commits that haven't been
 analyzed. To do that, specify the `--missing-commits` option.
 
 ## Classification ##
-Now that data has been assembled the training and testing of the ML model can
-be made. To do this, simply run the model script in the model directory:
+Now that all features have been extracted, the machine learning classifier can
+be trained and tested. In this example, we train a random forest classifier. To do this, run the model script in the model directory:
 ```python
 python model.py train
 ```
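 
 For reference, a minimal sketch of what such training could look like with scikit-learn; the feature file name and column names are assumptions, not the script's actual interface:
 
 ```python
 # Sketch: train and evaluate a random forest on assembled features.
 import pandas as pd
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.model_selection import train_test_split
 
 data = pd.read_csv("features.csv")          # assumed file name
 X = data.drop(columns=["commit", "label"])  # assumed column names
 y = data["label"]
 
 X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.2, random_state=42)
 clf = RandomForestClassifier(n_estimators=100, random_state=42)
 clf.fit(X_train, y_train)
 print("Test accuracy:", clf.score(X_test, y_test))
 ```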