Bcl Conversion for CASAVA

Compatibility: Illumina HiSeq/GAIIx Integration v5.5, Illumina HiSeq 3000/4000 Integration

The script described in this article does not support bcl2fastq2 Conversion Software.

Overview

Illumina sequencing instruments generate per-cycle base call (Bcl) files as their primary sequencing output. However, since many downstream analysis applications use per-read FASTQ files, it is often necessary to convert the data in an Illumina sequencer Bcl file into FASTQ format.

In BaseSpace Clarity LIMS, the bclConversion script facilitates this conversion process for CASAVA. The script is configured on the Bcl Conversion and Demultiplexing step.

Script overview

The bclConversion script invokes a second script - CASAVA’s configureBclToFastq.pl. This creates the required CASAVA configure and demultiplexing Makefiles, which configure the run. An additional make command takes in the generated Makefiles and executes calculations that convert the Bcl files into FASTQ format.

The bclConversion script attaches the resulting Demultiplex_Stats.htm file (containing calculated metrics) and HTML-format link files (containing links to the demultiplexed sample directories) to result file output placeholders configured on the step. In addition, the script parses metrics from the results into the LIMS.
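The two-stage CASAVA invocation described above can be sketched as follows. This is a minimal illustration, not the bclConversion implementation: the function name `build_casava_commands`, the `bin/` location of configureBclToFastq.pl under the install path, the placement of the output folder inside the run directory, and the mapping of step UDFs to options are all assumptions. The option names themselves (`--input-dir`, `--output-dir`, `--sample-sheet`, `--fastq-cluster-count`, `--mismatches`) are CASAVA 1.8 options.

```python
from pathlib import PurePosixPath


def build_casava_commands(casava_install, run_dir, output_folder="Unaligned",
                          cluster_count=4000000, mismatches="0", cores=1):
    """Return the configure and make commands as argument lists.

    Nothing is executed here; the lists could be logged or handed to
    subprocess.run() by the caller. All paths are assumptions.
    """
    run = PurePosixPath(run_dir)
    base_calls = run / "Data" / "Intensities" / "BaseCalls"
    output_dir = run / output_folder  # assumed to sit inside the run directory

    # Stage 1: configureBclToFastq.pl writes the Makefiles into output_dir.
    configure = [
        str(PurePosixPath(casava_install) / "bin" / "configureBclToFastq.pl"),
        "--input-dir", str(base_calls),
        "--output-dir", str(output_dir),
        "--sample-sheet", str(base_calls / "SampleSheet.csv"),
        "--fastq-cluster-count", str(cluster_count),
        "--mismatches", mismatches,
    ]

    # Stage 2: make runs the generated Makefiles, using the requested cores.
    make = ["make", "-C", str(output_dir), "-j", str(cores)]
    return configure, make


configure_cmd, make_cmd = build_casava_commands(
    "/opt/casava-1.8.2", "/data/run1", cores=4)
print(" ".join(configure_cmd))
print(" ".join(make_cmd))
```

Returning argument lists rather than a single shell string avoids quoting problems if a path contains unusual characters.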

Script modes

The bclConversion script has three modes of operation: FULL, SIMULATE, and HARVEST. These are controlled using the -m parameter (see Parameters and usage), which must be set for the script to run.

• FULL - This is the default setting. FULL mode performs the following:
  – Information assembly: Loads information from the LIMS and prepares it for use.
  – Automated kickoff: Runs the configureBclToFastq.pl script and the make command.
  – Result harvest: Attaches Demultiplex_Stats.htm and the HTML link files and parses metrics.
• SIMULATE - This mode performs information assembly and then logs the Bcl configure and make commands to the log file. It does not execute commands or perform result harvest.
• HARVEST - This mode performs information assembly and result harvest. It does not execute commands.
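The relationship between the three modes and the phases they run can be summarized in a small lookup table. The sketch below is illustrative only; the names `MODE_PHASES` and `phases_for` are hypothetical and do not come from the script.

```python
# Phases performed by each mode, as described in the list above.
MODE_PHASES = {
    "FULL": ("information assembly", "automated kickoff", "result harvest"),
    "SIMULATE": ("information assembly", "log commands only"),
    "HARVEST": ("information assembly", "result harvest"),
}


def phases_for(mode="FULL"):
    """Return the phases for a mode; FULL is used when no mode is given."""
    try:
        return MODE_PHASES[mode.upper()]
    except KeyError:
        raise ValueError(
            f"Unknown mode {mode!r}; expected FULL, SIMULATE, or HARVEST")


for name in MODE_PHASES:
    print(name, "->", ", ".join(phases_for(name)))
```

Only FULL mode includes the kickoff phase, so it is the only mode that actually runs configureBclToFastq.pl and make.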

How it works

The Bcl Conversion and Demultiplexing automation is configured on a button on the Record Details screen of the Bcl Conversion and Demultiplexing step. The automation command line invokes the bclConversion script.

The user invokes the script by clicking the Bcl conversion and demultiplexing button on the Record Details screen of the Bcl Conversion and Demultiplexing step.

The script does the following:

• Finds the corresponding sequencing step for the current flow cell and loads information from the LIMS.
• Runs configureBclToFastq.pl to produce the make files.
• Runs the make command to convert the Bcl files to FASTQ files.
• Looks for the Flowcell_demux_summary.xml and Demultiplex_Stats.htm files.
• Parses the Flowcell_demux_summary.xml file to retrieve the sample names.
• Stores the raw stats in memory.
• Searches the Unaligned project directories in the Bcl Conversion output directory for generated sample folders, and retrieves the sample locations containing the FASTQ reads.

The script then:

• Processes each sample in the LIMS and attaches the corresponding HTML link file (sample names must match).
• Calculates sample stats and stores them in result file output UDFs for each FASTQ folder link.

Finally, the script attaches the Demultiplex_Stats.htm and log file shared outputs to the step.
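The search for generated sample folders can be pictured with the following sketch. It assumes the standard CASAVA 1.8 output layout, in which each project directory is named Project_<name> and each sample directory inside it is named Sample_<name>. The function name `find_sample_dirs` is hypothetical, and the real script may locate the folders differently.

```python
from pathlib import Path


def find_sample_dirs(unaligned_dir):
    """Map sample name -> sample directory under an Unaligned output folder.

    Assumes the Project_<name>/Sample_<name> layout; directories outside
    Project_* folders are ignored.
    """
    found = {}
    for sample_dir in sorted(Path(unaligned_dir).glob("Project_*/Sample_*")):
        if sample_dir.is_dir():
            sample_name = sample_dir.name[len("Sample_"):]
            found[sample_name] = sample_dir
    return found
```

Because the lookup is keyed by sample name, a sample whose name in the LIMS differs from the name in the output folder would not be matched, which is why the names must agree.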

User interaction

On the network file share that CASAVA accesses, ensure the correct sample sheet is present in the run directory under <run directory>/Data/Intensities/BaseCalls.

If not present, download the correct sample sheet from the Cluster Generation step in the LIMS. This downloaded file must be named SampleSheet.csv.

On the Record Details screen of the Bcl Conversion and Demultiplexing step:

Review the Bcl parameters provided, and edit as required.

Click Bcl conversion and demultiplexing.

Once the script has completed:

• Demultiplex Stats and Script Log Details files are generated and attached to the step.
• HTML link files are attached to each demultiplexed sample.
• Parsed metrics are updated on the step.
• QC flags are set on the samples.

Parameters and usage

The following table defines the parameters used by the bclConversion script.

Parameter	Description	Notes
u, username	LIMS username (Required)
p, password	LIMS password (Required)
i, processURI	LIMS process URI (Required)
c, casavaInstallPath	CASAVA install location (Required)	e.g. /opt/casava-1.8.2
d, demuxStats	Demultiplex_Stats file placeholder LIMS ID (Required)
e, logFileName	Log file placeholder LIMS ID (Required)
m, mode	Script operation mode (Optional)	Options: FULL (default), SIMULATE, or HARVEST. See Script modes for details.
t, instrumentType	The instrument type prefix of the database properties used by the script (Required)	e.g., ‘hiseqga’ or ‘hiseq3k4k’. If you are upgrading to the Illumina HiSeq/GAIIx Integration Package v5.5 or later, you must set this parameter manually. See Upgrade note.

Usage

Below is an example automation command showing the script in use as part of the HiSeq 2500 integration package.

bash -c "/opt/gls/clarity/bin/java -Xms256m -Xmx1024m -jar /opt/gls/hiseqgaii-extensions.jar script:bclConversion -i {processURI:v2} -u {username} -p {password} -c {udf:CASAVA} -t 'hiseqga' -d {compoundOutputFileLuid0} -e {compoundOutputFileLuid1}"

Note that the -Xms and -Xmx options set the initial and maximum Java heap size available to the script (256 MB and 1024 MB in this example).

Note also that the value provided for the CASAVA install location, -c {udf:CASAVA}, is replaced with the value of the CASAVA step UDF at run time.

For more information, see the following articles in the LIMS Documentation:

• Adding and Configuring Automations (Automations section)
• LIMS v4.1 and v4.2: Passing Parameters via EPP (Integrations via EPP section)
• LIMS v5: Passing Files and Tokens to Third-Party Programs (Automations section)

Upgrade note: Setting the instrumentType parameter value

In new installations of the following integration packages, the instrumentType parameter value is already set in the out-of-the-box configuration:

• Illumina HiSeq/GAIIx Integration Package v5.5
• Illumina HiSeq 3000/4000 Integration Package v1 and later

Configuration

The following tables list the step and result file UDFs configured in the out-of-the-box configuration.

Step UDFs

To view/configure these fields:

In the Operations Interface, click Configuration > Details > Fields.

In the Show Fields for drop-down list, select the Bcl Conversion and Demultiplexing step.

UDF	Type	Default Preset Value	Notes
CASAVA (Required)	Single-line Text	/opt/CASAVA	Configure one preset entry for each CASAVA install directory - e.g., /illumina/casava/CASAVA-1.8.2. This allows users to safely select their desired CASAVA version.
Folder name for demultiplexed output (Required)	Single-line Text	Unaligned
Maximum clusters per FASTQ file (Required)	Numeric	4000000	Min: 0; Max: 16000000. The minimum value was changed from 1000000 to 0 to support the CASAVA 1.8.2 ability to turn off this check. Using a value < 1000000 with CASAVA versions prior to 1.8.2 results in an error.
Number of CPU cores to use? (Required)	Numeric	1	Presets: 1;2;3;4;5;6;7;8
Ignore Missing Stats	Check Box
Ignore Missing Bcl	Check Box
Tiles	Single-line Text
Use Bases Mask	Single-line Text	Defaults of Y,I,I* and Y,I,I,Y added to support CASAVA 1.8.2	In CASAVA versions prior to 1.8.2, use of the presets will result in an error since these versions do not support or recognize dual indexing.
With Failed Reads	Check Box		Cannot be used with CASAVA versions prior to 1.8.2.
Adapter Sequence	Single-line Text		Cannot be used with CASAVA versions prior to 1.8.2.
Number of allowed mismatches per index? (Required)	Single-line Text	0	Presets: 0;1;0,0;0,1;1,0;1,1. Note: A single-value preset must be used for CASAVA versions prior to 1.8.2. See Rules and constraints.
Additional Options	Single-line Text		Additional options provided to configureBclToFastq.pl when creating the make files.

ResultFile UDFs

To view/configure these fields:

In the Operations Interface, click Configuration > Details > Fields.

In the Show Fields for drop-down list, select ResultFile.

UDF	Type	Other Information
Yield PF (Gb)	Numeric	2 decimal places
%PF	Numeric	1 decimal place
# Reads	Numeric	0 decimal places
% of Raw Clusters Per Lane	Numeric	1 decimal place
% Perfect Index Read	Numeric	1 decimal place
% One Mismatch Reads (Index)	Numeric	1 decimal place
% Bases >=Q30	Numeric	1 decimal place
Ave Q Score*	Numeric	1 decimal place

*Deprecated UDF - currently still displays in the interface, but no longer supported

Note the following:

• The script will round the values of these UDFs to 15 significant digits.
• Display of these values is configured in the Configuration > Details > Fields dialog, as described above.
• These ResultFile UDFs are associated with the output type Output per Input per Reagent Label. Simply put, this means ‘demultiplexed’ - one file per indexed sample in the pool.
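To illustrate what rounding to 15 significant digits means in practice, here is a minimal sketch. The helper name `round_significant` is hypothetical, and the script's own rounding code may differ; the example only shows the effect on typical floating-point values.

```python
def round_significant(value, digits=15):
    """Round a number to the given count of significant digits."""
    # The 'g' presentation type formats with a fixed number of significant
    # digits; converting back to float yields the rounded value.
    return float(f"{value:.{digits}g}")


print(round_significant(0.1 + 0.2))  # floating-point noise is removed
print(round_significant(1 / 3))
```

Rounding at 15 significant digits removes binary floating-point noise without affecting the displayed precision, which is set separately for each UDF (for example, 1 or 2 decimal places).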

Property Configuration

To allow substitution for network mounts, the following database properties must be configured. The values of the Folder name for demultiplexed output UDF on the Bcl Conversion and Demultiplexing step and the Output Folder UDF on the Sequencing step are searched and replaced using these properties:

• <instrumenttype>.bcl.netPathPrefixSearch
• <instrumenttype>.bcl.netPathPrefixReplace

Note that the properties are prefixed with the instrumentType parameter value that is passed to the script, for example, ‘hiseq3k4k.bcl.netPathPrefixSearch’.
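The effect of the search and replace properties can be sketched as a simple prefix substitution. This is an illustration under assumptions: the function name `substitute_net_path` and the example paths are hypothetical, and the sketch assumes the search value is matched only at the start of the path, as the property names suggest.

```python
def substitute_net_path(path, search_prefix, replace_prefix):
    """Swap a leading network path prefix for the local mount point.

    Paths that do not start with search_prefix are returned unchanged.
    """
    if search_prefix and path.startswith(search_prefix):
        return replace_prefix + path[len(search_prefix):]
    return path


# Hypothetical values for <instrumenttype>.bcl.netPathPrefixSearch/Replace.
print(substitute_net_path("//fileserver/runs/160101_run",
                          "//fileserver/runs", "/mnt/runs"))
```

With these example values, a run folder recorded by the instrument as //fileserver/runs/160101_run would be resolved as /mnt/runs/160101_run on the server that runs the script.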

To determine the Output Folder and Run ID UDF values of the Sequencing step that is run prior to Bcl conversion, the sequenceProcessName property is used:

• seqservice.sequenceProcessName

Rules and Constraints

• Clustering is not supported.
• Bcl conversion does not support multiple input containers.
• If multiple Sequencing steps are found, the most recently completed run will be used.
• Sample information from Bcl conversion is associated with a particular file output in the LIMS using the reagent index. Because of this, each sample in the pool must only have one index applied (this may be a dual index). If no index is applied, the sample must be labelled with "NoIndex" in the LIMS.
• The sample sheet file used in Bcl conversion and demultiplexing must be placed in the following directory within the run folder: /Data/Intensities/BaseCalls/
• The sample sheet must be named SampleSheet.csv
• The sample sheet must not contain illegal characters for project and sample names. Illegal characters are the space character and the following: ? ( ) [ ] / \ = + < > : ; " ' , * ^ | and
• The file names Flowcell_demux_summary.xml and Demultiplex_Stats.htm must not be changed.
• For the script to run, the mode parameter (-m) must be set to one of the three options: FULL, SIMULATE, or HARVEST. See Script modes for details.
• Using the With Failed Reads and Adapter Sequence UDFs with CASAVA versions prior to 1.8.2 will result in an error.
• CASAVA versions prior to 1.8.2 do not support dual indexing. Consequently, the Number of allowed mismatches per index? UDF will result in an error if a preset with two values is used.
• The following UDFs must be populated on the Bcl Conversion and Demultiplexing step:
  – CASAVA (Single-line Text)
  – Folder name for demultiplexed output (Single-line Text)
  – Maximum clusters per FASTQ file (Numeric)
  – Number of CPU cores to use? (Numeric)
  – Number of allowed mismatches per index? (Single-line Text)
• The Bcl Conversion and Demultiplexing step requires the following Illumina Sequencing step UDFs:
  – Output Folder (Single-line Text)
  – Run ID (Single-line Text)
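The sample sheet naming rule above lends itself to a quick check before the step is run. The sketch below is not part of the integration; it scans the project and sample name columns for the illegal characters spelled out in the rule. The column names SampleID and SampleProject follow the CASAVA 1.8 sample sheet layout, and the function names are hypothetical.

```python
import csv
import io

# Only the characters written out in the rule above, plus the space character.
ILLEGAL_CHARS = set(" ?()[]/\\=+<>:;\"',*^|")


def illegal_characters(name):
    """Return the illegal characters found in a name, sorted."""
    return sorted(set(name) & ILLEGAL_CHARS)


def check_sample_sheet(text):
    """Return (line number, column, offending characters) for each problem."""
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        for column in ("SampleID", "SampleProject"):
            bad = illegal_characters(row.get(column) or "")
            if bad:
                problems.append((line_number, column, "".join(bad)))
    return problems
```

An empty result means no listed illegal character was found in either column; it does not confirm that the sheet is otherwise valid for CASAVA.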

Vendor Information

• CASAVA v1.8.2 User Guide | Bcl Conversion and Demultiplexing