Bcl Conversion for CASAVA
Compatibility: Illumina HiSeq/GAIIx Integration v5.5, Illumina HiSeq 3000/4000 Integration
The script described in this article does not support bcl2fastq2 Conversion Software.
Illumina sequencing instruments generate per-cycle base call (Bcl) files as their primary sequencing output. However, since many downstream analysis applications use per-read FASTQ files, it is often necessary to convert the data in an Illumina sequencer Bcl file into FASTQ format.
In BaseSpace Clarity LIMS, the bclConversion script facilitates this conversion process for CASAVA. The script is configured on the Bcl Conversion and Demultiplexing step.
The bclConversion script invokes a second script, CASAVA’s configureBclToFastq.pl, which creates the CASAVA configuration and demultiplexing Makefiles for the run. A subsequent make command then uses the generated Makefiles to convert the Bcl files into FASTQ format.
The bclConversion script attaches the resulting Demultiplex_Stats.htm file (containing calculated metrics) and HTML-format link files (containing links to the demultiplexed sample directories) to result file output placeholders configured on the step. In addition, the script parses metrics from the results into the LIMS.
The bclConversion script has three modes of operation: FULL, SIMULATE, and HARVEST. These are controlled using the -m parameter (see Parameters and usage), which must be set for the script to run.
• | FULL - This is the default setting. FULL mode performs the following: |
– | Information assembly: Loads information from the LIMS and prepares it for use. |
– | Automated kickoff: Runs the configureBclToFastq.pl script and the make command. |
– | Result harvest: Attaches Demultiplex_Stats.htm and the HTML link files and parses metrics. |
• | SIMULATE - This mode performs information assembly and then logs the Bcl configure and make commands to the log file. It does not execute commands or perform result harvest. |
• | HARVEST - This mode performs information assembly and result harvest. It does not execute commands. |
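The relationship between the three modes can be summarized in a small lookup. The following is a minimal sketch only; the phase labels and the phases_for helper are hypothetical names chosen for illustration and are not part of the bclConversion script.

```python
from enum import Enum


class Mode(Enum):
    FULL = "FULL"
    SIMULATE = "SIMULATE"
    HARVEST = "HARVEST"


# Phase labels are illustrative names for the behaviours described above.
PHASES = {
    Mode.FULL: ("information_assembly", "automated_kickoff", "result_harvest"),
    Mode.SIMULATE: ("information_assembly", "log_commands"),
    Mode.HARVEST: ("information_assembly", "result_harvest"),
}


def phases_for(mode_arg="FULL"):
    """Return the ordered phases for a -m value; raises ValueError if unknown."""
    return PHASES[Mode(mode_arg)]
```

Every mode starts with information assembly; only FULL executes the configure and make commands.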
The Bcl Conversion and Demultiplexing automation is configured on a button on the Record Details screen of the Bcl Conversion and Demultiplexing step. The automation command line invokes the bclConversion script.
1. | The user invokes the script by clicking the Bcl conversion and demultiplexing button on the Record Details screen of the Bcl Conversion and Demultiplexing step. |
2. | The script does the following: |
• | Finds the corresponding sequencing step for the current flow cell and loads information from the LIMS. |
• | Runs configureBclToFastq.pl to produce the Makefiles. |
• | Runs the make command to convert the Bcl files to FASTQ files. |
• | Looks for the Flowcell_demux_summary.xml and Demultiplex_Stats.htm files. |
• | Parses the Flowcell_demux_summary.xml file to retrieve the sample names. |
• | Stores the raw stats in memory. |
• | Searches the Unaligned project directories in the Bcl Conversion output directory for generated sample folders, and retrieves the sample locations containing the FASTQ reads. |
3. | The script then: |
• | Processes each sample in the LIMS and attaches the corresponding HTML link file (sample names must match). |
• | Calculates sample stats and stores them in result file output UDFs for each FASTQ folder link. |
4. | Finally, the script attaches the Demultiplex_Stats.htm and log file shared outputs to the step. |
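The search for generated sample folders described in step 2 can be sketched as follows. This is an illustration, not the script's actual code: it assumes the CASAVA 1.8 output layout of Project_<name>/Sample_<name> folders beneath the output directory, and find_sample_dirs is a hypothetical helper name.

```python
from pathlib import Path


def find_sample_dirs(unaligned_dir):
    """Map sample name -> folder for Project_*/Sample_* under unaligned_dir."""
    samples = {}
    for sample_dir in sorted(Path(unaligned_dir).glob("Project_*/Sample_*")):
        # Skip stray files whose names happen to match the pattern.
        if sample_dir.is_dir():
            samples[sample_dir.name[len("Sample_"):]] = sample_dir
    return samples
```

In this sketch, a sample name that appears in more than one project folder would keep only the last folder found, so a real implementation would need to key on project as well.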
On the network file share that CASAVA accesses, ensure the correct sample sheet is present in the run directory under <run directory>/Data/Intensities/BaseCalls.
If it is not present, download the correct sample sheet from the Cluster Generation step in the LIMS. The downloaded file must be named SampleSheet.csv.
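A quick way to confirm the sample sheet is in place before starting the step is sketched below. The helper names are hypothetical; only the directory layout and file name come from the requirement above.

```python
from pathlib import Path


def sample_sheet_path(run_dir):
    """Return the expected sample sheet location inside a run directory."""
    return Path(run_dir) / "Data" / "Intensities" / "BaseCalls" / "SampleSheet.csv"


def has_sample_sheet(run_dir):
    """True if SampleSheet.csv exists at the expected location."""
    return sample_sheet_path(run_dir).is_file()
```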
On the Record Details screen of the Bcl Conversion and Demultiplexing step:
1. | Review the Bcl parameters provided, and edit as required. |
2. | Click Bcl conversion and demultiplexing. |
3. | Once the script has completed: |
• | Demultiplex Stats and Script Log Details files are generated and attached to the step. |
• | HTML link files are attached to each demultiplexed sample. |
• | Parsed metrics are updated on the step. |
• | QC flags are set on the samples. |
The following table defines the parameters used by the bclConversion script.
Parameter | Description | Notes |
u, username | LIMS username (Required) | |
p, password | LIMS password (Required) | |
i, processURI | LIMS process URI (Required) | |
c, casavaInstallPath | CASAVA install location (Required) | e.g. /opt/casava-1.8.2 |
d, demuxStats | Demultiplex_Stats file placeholder LIMS ID (Required) | |
e, logFileName | Log file placeholder LIMS ID (Required) | |
m, mode | Script operation mode (Optional) | Options: FULL (default), SIMULATE, or HARVEST. See Script modes for details. |
t, instrumentType | The instrument type prefix of the database properties used by the script (Required) | e.g., ‘hiseqga’ or ‘hiseq3k4k’. If you are upgrading to the Illumina HiSeq/GAIIx Integration Package v5.5 or later, you will need to manually set this parameter. See Upgrade note. |
Below is an example automation command showing the script in use as part of the HiSeq 2500 integration package.
bash -c "/opt/gls/clarity/bin/java -Xms256m -Xmx1024m -jar /opt/gls/hiseqgaii-extensions.jar script:bclConversion \
-i {processURI:v2} \
-u {username} \
-p {password} \
-c {udf:CASAVA} \
-t 'hiseqga' \
-d {compoundOutputFileLuid0} \
-e {compoundOutputFileLuid1}"
Note that the -Xms and -Xmx options set the initial and maximum Java heap size available to the script (256 MB and 1024 MB in this example).
Note also that the value provided for the CASAVA install location, -c {udf:CASAVA}, is replaced with the value of the CASAVA step UDF at run time.
For more information, see the following articles in the LIMS Documentation:
• | Adding and Configuring Automations (Automations section) |
• | LIMS v4.1 and v4.2: Passing Parameters via EPP (Integrations via EPP section) |
• | LIMS v5: Passing Files and Tokens to Third-Party Programs (Automations section) |
Upgrade note: Setting the instrumentType parameter value
In new installations of the following integration packages, the instrumentType parameter value is already set in the out-of-the-box configuration:
• | Illumina HiSeq/GAIIx Integration Package v5.5 |
• | Illumina HiSeq 3000/4000 Integration Package v1 and later |
The following tables list the step and result file UDFs configured in the out-of-the-box configuration.
To view/configure these fields:
1. | In the Operations Interface, click Configuration > Details > Fields. |
2. | In the Show Fields for drop-down list, select the Bcl Conversion and Demultiplexing step. |
UDF | Type | Default Preset Value | Notes |
CASAVA (Required) | Single-line Text | /opt/CASAVA | Configure one preset entry for each CASAVA install directory - e.g., /illumina/casava/CASAVA-1.8.2. This allows users to safely select their desired CASAVA version. |
Folder name for demultiplexed output (Required) | Single-line Text | Unaligned | |
Maximum clusters per FASTQ file (Required) | Numeric | 4000000 | Min: 0; Max: 16000000. Minimum value changed from 1000000 to 0 to support the CASAVA 1.8.2 ability to turn off this check. Use of a value < 1000000 on CASAVA versions prior to 1.8.2 will result in an error. |
Number of CPU cores to use? (Required) | Numeric | 1 | Presets: 1;2;3;4;5;6;7;8 |
Ignore Missing Stats | Check Box | | |
Ignore Missing Bcl | Check Box | | |
Tiles | Single-line Text | | |
Use Bases Mask | Single-line Text | Defaults of Y*,I*,I* and Y*,I*,I*,Y* added to support CASAVA 1.8.2 | In CASAVA versions prior to 1.8.2, use of the presets will result in an error since these versions do not support or recognize dual indexing. |
With Failed Reads | Check Box | Cannot be used with CASAVA versions prior to 1.8.2. | |
Adapter Sequence | Single-line Text | Cannot be used with CASAVA versions prior to 1.8.2. | |
Number of allowed mismatches per index? (Required) | Single-line Text | 0 | Presets: 0;1;0,0;0,1;1,0;1,1. Note: A single-value preset must be used for CASAVA versions prior to 1.8.2. See Rules and constraints. |
Additional Options | Single-line Text | Additional options provided to configureBclToFastq.pl when creating the Makefiles. | |
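To show how these step UDFs could translate into the configure and make commands, here is a minimal sketch that only builds the argument lists and does not execute anything. The mapping from UDF to command-line option is an assumption, not the script's documented behaviour; the option names follow the CASAVA 1.8 user guide and should be checked against the installed version. The build_commands name, the bin/ location of configureBclToFastq.pl, and the use of make -C are likewise illustrative.

```python
import shlex
from pathlib import Path


def build_commands(casava_path, basecalls_dir, output_dir, udfs):
    """Return (configure_argv, make_argv) built from step UDF values."""
    configure = [
        str(Path(casava_path) / "bin" / "configureBclToFastq.pl"),
        "--input-dir", str(basecalls_dir),
        "--output-dir", str(output_dir),
        "--fastq-cluster-count", str(int(udfs["Maximum clusters per FASTQ file"])),
        "--mismatches", str(udfs["Number of allowed mismatches per index?"]),
    ]
    # Optional text UDFs are passed only when populated (assumed mapping).
    for udf, option in (("Use Bases Mask", "--use-bases-mask"),
                        ("Tiles", "--tiles"),
                        ("Adapter Sequence", "--adapter-sequence")):
        if udfs.get(udf):
            configure += [option, str(udfs[udf])]
    # Check Box UDFs become bare flags (assumed mapping).
    for udf, flag in (("Ignore Missing Stats", "--ignore-missing-stats"),
                      ("Ignore Missing Bcl", "--ignore-missing-bcl"),
                      ("With Failed Reads", "--with-failed-reads")):
        if udfs.get(udf):
            configure.append(flag)
    # Free-text extras are split the way a shell would split them.
    configure += shlex.split(udfs.get("Additional Options") or "")
    cores = str(int(udfs["Number of CPU cores to use?"]))
    make = ["make", "-C", str(output_dir), "-j", cores]
    return configure, make
```

Building argument lists rather than a single command string avoids quoting problems with values such as Y*,I*,I*, which a shell would otherwise try to expand.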
To view/configure these fields:
1. | In the Operations Interface, click Configuration > Details > Fields. |
2. | In the Show Fields for drop-down list, select ResultFile. |
UDF | Type | Other Information |
Yield PF (Gb) | Numeric | 2 decimal places |
%PF | Numeric | 1 decimal place |
# Reads | Numeric | 0 decimal places |
% of Raw Clusters Per Lane | Numeric | 1 decimal place |
% Perfect Index Read | Numeric | 1 decimal place |
% One Mismatch Reads (Index) | Numeric | 1 decimal place |
% Bases >=Q30 | Numeric | 1 decimal place |
Ave Q Score* | Numeric | 1 decimal place |
*Deprecated UDF: still displayed in the interface, but no longer supported.
Note the following:
• | The script will round the values of these UDFs to 15 significant digits. |
• | Display of these values is configured in the Configuration > Details > Fields dialog, as described above. |
• | These ResultFile UDFs are associated with the output type Output per Input per Reagent Label. Simply put, this means ‘demultiplexed’: one file per indexed sample in the pool. |
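Rounding to 15 significant digits, as mentioned in the first note above, differs from rounding to a fixed number of decimal places. The sketch below shows one common way to do it; round_sig is a hypothetical helper and is not taken from the script.

```python
def round_sig(value, digits=15):
    """Round value to the given number of significant digits."""
    # The 'g' format keeps `digits` significant digits in total.
    return float(f"{value:.{digits}g}")
```

For example, round_sig(0.1 + 0.2) removes the floating-point noise in the 17th digit, while the decimal places shown in the interface are still governed by the field configuration.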
To allow substitution for network mounts, the following database properties must be configured. The values of the Folder name for demultiplexed output UDF on the Bcl Conversion and Demultiplexing step and the Output Folder UDF on the Sequencing step are searched and replaced using these properties:
• | <instrumenttype>.bcl.netPathPrefixSearch |
• | <instrumenttype>.bcl.netPathPrefixReplace |
Note that the properties are prefixed with the instrumentType parameter value that is passed to the script, for example, ‘hiseq3k4k.bcl.netPathPrefixSearch’.
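The effect of the two properties can be pictured as a simple prefix substitution. This is a sketch under the assumption that only a leading match is replaced; the substitute_prefix name and the example paths in the usage note are hypothetical.

```python
def substitute_prefix(path, search, replace):
    """Swap a leading `search` prefix of `path` for `replace`."""
    if search and path.startswith(search):
        return replace + path[len(search):]
    # Paths that do not start with the search prefix are left untouched.
    return path
```

With a search value of //fileserver/runs and a replace value of /mnt/runs, an Output Folder of //fileserver/runs/RUN_0001 would become /mnt/runs/RUN_0001.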
To determine the Output Folder and Run ID UDF values of the Sequencing step that is run prior to Bcl conversion, the sequenceProcessName property is used:
• | seqservice.sequenceProcessName |
• | Clustering is not supported. |
• | Bcl conversion does not support multiple input containers. |
• | If multiple Sequencing steps are found, the most recently completed run will be used. |
• | Sample information from Bcl conversion is associated with a particular file output in the LIMS using the reagent index. Because of this, each sample in the pool must only have one index applied (this may be a dual index). If no index is applied, the sample must be labelled with "NoIndex" in the LIMS. |
• | The sample sheet file used in Bcl conversion and demultiplexing must be placed in the following directory within the run folder: /Data/Intensities/BaseCalls/ |
• | The sample sheet must be named SampleSheet.csv |
• | The sample sheet must not contain illegal characters for project and sample names. Illegal characters are the space character and the following: ? ( ) [ ] / \ = + < > : ; " ' , * ^ | and |
• | The file names Flowcell_demux_summary.xml and Demultiplex_Stats.htm must not be changed. |
• | For the script to run, the mode parameter (-m) must be set to one of the three options: FULL, SIMULATE, or HARVEST. See Script modes for details. |
• | Using the With Failed Reads and Adapter Sequence UDFs with CASAVA versions prior to 1.8.2 will result in an error. |
• | CASAVA versions prior to 1.8.2 do not support dual indexing. Consequently, the Number of allowed mismatches per index? UDF will result in an error if a preset with two values is used. |
• | The following UDFs must be populated on the Bcl Conversion and Demultiplexing step: |
– | CASAVA (Single-line Text) |
– | Folder name for demultiplexed output (Single-line Text) |
– | Maximum clusters per FASTQ file (Numeric) |
– | Number of CPU cores to use? (Numeric) |
– | Number of allowed mismatches per index? (Single-line Text) |
• | The Bcl Conversion and Demultiplexing step requires the following Illumina Sequencing step UDFs: |
– | Output Folder (Single-line Text) |
– | Run ID (Single-line Text) |
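The naming constraint on project and sample names listed above can be checked before the sample sheet is used. The following sketch covers the space character and the punctuation from ? through ^ in that list; the helper names are hypothetical, and the character set should be extended to match the CASAVA documentation for the installed version.

```python
# Space plus the punctuation from ? through ^ in the constraint list.
ILLEGAL_CHARS = set(" ?()[]/\\=+<>:;\"',*^")


def illegal_chars_in(name):
    """Return the sorted illegal characters found in a project or sample name."""
    return sorted(set(name) & ILLEGAL_CHARS)


def is_valid_name(name):
    """True if the name is non-empty and free of the listed characters."""
    return bool(name) and not illegal_chars_in(name)
```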
• | CASAVA v1.8.2 User Guide | Bcl Conversion and Demultiplexing |