Specifying Input FASTQ Files to Space Ranger

The spaceranger count pipeline requires FASTQ files as input, which typically come from running spaceranger mkfastq, a 10x-aware convenience wrapper for bcl2fastq. However, it is possible to use FASTQ files from other sources, such as Illumina's bcl2fastq or BCL Convert, a published dataset, or our bamtofastq.

For experiments with Gene Expression (GEX) data, the arguments available for specifying which FASTQ files spaceranger count should use are listed below.

Argument	Brief Description
`--fastqs`	Required for GEX analysis. The folder containing the FASTQ files to be analyzed. Generally, this is the `fastq_path` folder generated by `spaceranger mkfastq`. If the files are in multiple folders, for instance because one library was sequenced across multiple flow cells, supply a comma-separated list of paths.
`--libraries`	Required for GEX + protein expression (PEX) analysis. Path to a `libraries.csv` file declaring input libraries. See GEX + PEX Analysis page for details. Cannot be combined with `--fastqs` or `--sample`.
`--sample`	Optional. Sample name to analyze. This is as specified in the sample sheet supplied to `mkfastq` or `bcl2fastq`. Multiple names may be supplied as a comma-separated list, in which case they are treated as one sample.
`--lanes`	Optional. Lanes associated with this sample. Defaults to all lanes.

For multiomic experiments, separate libraries for the GEX and PEX reads are generated. In this case, you must construct a CSV file indicating the input data folder, sample name, and library type of each input library, then pass this file to spaceranger count using the --libraries flag. Please see the GEX + PEX Analysis page for details on how to construct the libraries.csv file.
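
As a rough illustration of the CSV described above, the sketch below writes a two-row libraries file with Python's standard `csv` module. The column names (`fastqs`, `sample`, `library_type`), the library type strings, and all paths and sample names are assumptions made for this example; the GEX + PEX Analysis page is the authority on the exact format.

```python
import csv
import io

# Hypothetical folders and sample names. The column names and the
# library_type strings are assumptions, not confirmed by this page.
rows = [
    {"fastqs": "/PATH/TO/GEX_FASTQS", "sample": "MySample_GEX",
     "library_type": "Gene Expression"},
    {"fastqs": "/PATH/TO/PEX_FASTQS", "sample": "MySample_PEX",
     "library_type": "Antibody Capture"},
]


def render_libraries_csv(rows):
    """Return the CSV text for the given library rows, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["fastqs", "sample", "library_type"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


libraries_csv = render_libraries_csv(rows)
print(libraries_csv)
```

The resulting text would be saved as libraries.csv and its path passed with `--libraries`, without `--fastqs` or `--sample`.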

There are many ways to invoke bcl2fastq and mkfastq, resulting in a wide range of potential file names and locations as the output.

To serve as input for spaceranger, FASTQ files should conform to the naming conventions of bcl2fastq and mkfastq:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

Where Read Type is one of:

I1: Sample index read (optional)
I2: Sample index read (optional)
R1: Read 1
R2: Read 2
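
One way to check whether a file name follows this convention is a regular expression. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the sample-order field is `S` followed by digits (the listings on this page show both `S1` and `S2`) and that the lane is zero-padded to three digits.

```python
import re

# Assumed reading of the convention: <sample>_S<order>_L<lane>_<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<order>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[IR][12])_001\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return (sample, lane, read_type), or None if the name does not conform."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return match["sample"], int(match["lane"]), match["read"]


print(parse_fastq_name("MySample_S1_L001_I1_001.fastq.gz"))  # ('MySample', 1, 'I1')
print(parse_fastq_name("sample_from_lims.fq.gz"))            # None
```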

The FASTQ files are specified by providing the path to the folder containing them (via the --fastqs argument) and then optionally restricting the selection by specifying the samples and/or lanes of interest.
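
The effect of restricting by sample and lane can be pictured as a filter over the file names in the folder. The sketch below is not Space Ranger's implementation; it is a hypothetical helper that mimics the idea, so you can preview which conforming files a given sample and lane choice would cover.

```python
import re

FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>[IR][12])_001\.fastq\.gz$"
)


def select_fastqs(filenames, samples=None, lanes=None):
    """Keep conforming names, optionally restricted to given samples and lanes.

    samples: iterable of sample names, or None for every sample.
    lanes: iterable of integer lane numbers, or None for every lane.
    """
    wanted_samples = set(samples) if samples is not None else None
    wanted_lanes = set(lanes) if lanes is not None else None
    selected = []
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # name is not in the bcl2fastq/mkfastq convention
        if wanted_samples is not None and match["sample"] not in wanted_samples:
            continue
        if wanted_lanes is not None and int(match["lane"]) not in wanted_lanes:
            continue
        selected.append(name)
    return selected


names = [
    "test_sample1_S1_L001_R1_001.fastq.gz",
    "test_sample1_S1_L002_R1_001.fastq.gz",
    "test_sample2_S2_L001_R1_001.fastq.gz",
    "notes.txt",
]
print(select_fastqs(names, samples=["test_sample1"], lanes=[1]))
# ['test_sample1_S1_L001_R1_001.fastq.gz']
```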

Finding the right FASTQ files to process and the right arguments to process those files as desired can be confusing. To assist users, this page illustrates examples of how to handle common scenarios involving different FASTQ file folder hierarchies or naming conventions.

Where are your FASTQ files?

In an output folder from mkfastq or bcl2fastq (fastq_path) and:
- In a subdirectory next to Reports and Stats folders, with expected sample name prefixes.
- In the same directory as the Reports and Stats folders.
In a different folder:
- I don't see Reports or Stats anywhere. The files are named like "MySample_S1_L001_I1_001.fastq.gz"

How are they named?

Consistent with bcl2fastq/mkfastq, e.g. MySample_S1_L001_I1_001.fastq.gz (see above).
Like "read-I1_si-AAAAAAAA_lane-001-chunk-001.fastq.gz".
Unlike any of the above examples.

The FASTQs are in an output folder from mkfastq or bcl2fastq, in a subdirectory next to Reports and Stats folders, with expected sample name prefixes.

How did I get here?


MKFASTQ_ID
├── MAKE_FASTQS_CS
└── outs
    └── fastq_path
        ├── HFLC5BBXX
        │    ├── test_sample1
        │    │   ├── test_sample1_S1_L001_I1_001.fastq.gz
        │    │   ├── test_sample1_S1_L001_I2_001.fastq.gz
        │    │   ├── test_sample1_S1_L001_R1_001.fastq.gz
        │    │   ├── test_sample1_S1_L001_R2_001.fastq.gz
        │    │   ├── test_sample1_S1_L002_I1_001.fastq.gz
        │    │   ├── test_sample1_S1_L002_I2_001.fastq.gz
        │    │   ├── test_sample1_S1_L002_R1_001.fastq.gz
        │    │   ├── test_sample1_S1_L002_R2_001.fastq.gz
        │    │   ├── test_sample1_S1_L003_I1_001.fastq.gz
        │    │   ├── test_sample1_S1_L003_I2_001.fastq.gz
        │    │   ├── test_sample1_S1_L003_R1_001.fastq.gz
        │    │   └── test_sample1_S1_L003_R2_001.fastq.gz
        │    └── test_sample2
        │        ├── test_sample2_S2_L001_I1_001.fastq.gz
        │        ├── test_sample2_S2_L001_I2_001.fastq.gz
        │        ├── test_sample2_S2_L001_R1_001.fastq.gz
        │        ├── test_sample2_S2_L001_R2_001.fastq.gz
        │        ├── test_sample2_S2_L002_I1_001.fastq.gz
        │        ├── test_sample2_S2_L002_I2_001.fastq.gz
        │        ├── test_sample2_S2_L002_R1_001.fastq.gz
        │        ├── test_sample2_S2_L002_R2_001.fastq.gz
        │        ├── test_sample2_S2_L003_I1_001.fastq.gz
        │        ├── test_sample2_S2_L003_I2_001.fastq.gz
        │        ├── test_sample2_S2_L003_R1_001.fastq.gz
        │        └── test_sample2_S2_L003_R2_001.fastq.gz
        ├── Reports
        ├── Stats
        ├── Undetermined_S0_L001_I1_001.fastq.gz
        ├── Undetermined_S0_L001_I2_001.fastq.gz
        ...
        └── Undetermined_S0_L003_R2_001.fastq.gz

If you ran bcl2fastq directly, then the output root folder is where fastq_path is in the hierarchy above.

"Expected sample name prefixes" means you have one set of FASTQ files per sample, prefixed with the name of the sample as it appears in the simple CSV layout file or IEM samplesheet.

For more information on the naming conventions, visit Illumina's support site or refer to the bcl2fastq User Guide. The scenario where your files do not conform to the naming convention is described in a different section later on this page.

The table below describes the arguments you pass into the pipeline to target the right FASTQ files in this scenario. Be sure to substitute the capitalized text as appropriate.

Situation	Argument + Value
`mkfastq`	`--fastqs=MKFASTQ_ID/outs/fastq_path`
`mkfastq`, multiple flow cells	`--fastqs=MKFASTQ_ID/outs/fastq_path1,MKFASTQ_ID/outs/fastq_path2`
bcl2fastq directly	`--fastqs=/PATH/TO/bcl2fastq_output`
Process `test_sample1` from all lanes (`mkfastq`)	`--fastqs=MKFASTQ_ID/outs/fastq_path` \ `--sample=test_sample1`
Process `test_sample1` from lane 1 only (`mkfastq`)	`--fastqs=MKFASTQ_ID/outs/fastq_path` \ `--sample=test_sample1` \ `--lanes=1`
Process `test_sample1` and `test_sample2` as a single merged sample (`mkfastq`)	`--fastqs=MKFASTQ_ID/outs/fastq_path` \ `--sample=test_sample1,test_sample2`
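
If you assemble the command line from a script, the situations in the table reduce to joining values with commas. The helper below is a hypothetical sketch that builds only the FASTQ-related arguments; every other argument that spaceranger count needs is deliberately left out, and joining several lanes with commas is an assumption, since this page only shows a single lane.

```python
def fastq_args(fastq_dirs, samples=None, lanes=None):
    """Build the --fastqs / --sample / --lanes arguments as a list of strings.

    fastq_dirs: one folder per flow cell.
    samples: sample names to treat as one merged sample, or None.
    lanes: integer lane numbers, or None for all lanes (comma-joining is assumed).
    """
    args = ["--fastqs=" + ",".join(fastq_dirs)]
    if samples:
        args.append("--sample=" + ",".join(samples))
    if lanes:
        args.append("--lanes=" + ",".join(str(lane) for lane in lanes))
    return args


# test_sample1, lane 1 only
print(fastq_args(["MKFASTQ_ID/outs/fastq_path"], samples=["test_sample1"], lanes=[1]))
# ['--fastqs=MKFASTQ_ID/outs/fastq_path', '--sample=test_sample1', '--lanes=1']
```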

The FASTQs are in an output folder from mkfastq or bcl2fastq, in the same directory as the Reports and Stats folders.

How did I get here?

An Illumina Experiment Manager-formatted samplesheet was used with either no entry or a blank entry for the Sample_Project column. Your hierarchy looks similar to this:


fastq_path
├── Reports
├── Stats
├── test_sample_S1_L001_I1_001.fastq.gz
├── test_sample_S1_L001_I2_001.fastq.gz
├── test_sample_S1_L001_R1_001.fastq.gz
├── test_sample_S1_L001_R2_001.fastq.gz
├── test_sample_S1_L002_I1_001.fastq.gz
├── test_sample_S1_L002_I2_001.fastq.gz
├── test_sample_S1_L002_R1_001.fastq.gz
├── test_sample_S1_L002_R2_001.fastq.gz
├── test_sample_S1_L003_I1_001.fastq.gz
├── test_sample_S1_L003_I2_001.fastq.gz
├── test_sample_S1_L003_R1_001.fastq.gz
├── test_sample_S1_L003_R2_001.fastq.gz
├── Undetermined_S0_L001_I1_001.fastq.gz
├── Undetermined_S0_L001_I2_001.fastq.gz
...
└── Undetermined_S0_L003_R2_001.fastq.gz

This is correct; use the same arguments as if the FASTQs were organized into subfolders within the output folder.

Situation	Argument + Value
`mkfastq`	`--fastqs=MKFASTQ_ID/outs/fastq_path`
bcl2fastq directly	`--fastqs=/PATH/TO/bcl2fastq_output`
Process `test_sample` from all lanes (`mkfastq`)	`--fastqs=MKFASTQ_ID/outs/fastq_path` \ `--sample=test_sample`
Process `test_sample` from lane 1 only (`mkfastq`)	`--fastqs=MKFASTQ_ID/outs/fastq_path` \ `--sample=test_sample` \ `--lanes=1`

The FASTQs are in a different folder; I don't see Reports or Stats anywhere. The files are named like MySample_S1_L001_I1_001.fastq.gz.

How did I get here?

It is likely that the FASTQ files have been transferred from either a mkfastq or bcl2fastq run into another folder. They still retain the names assigned by bcl2fastq, each of which combines sample name, sample order, lane, read type, and chunk. Your file hierarchy looks similar to this:


PROJECT_FOLDER
├── MySample_S1_L001_I1_001.fastq.gz
├── MySample_S1_L001_I2_001.fastq.gz
├── MySample_S1_L001_R1_001.fastq.gz
├── MySample_S1_L001_R2_001.fastq.gz
├── MySample_S1_L002_I1_001.fastq.gz
├── MySample_S1_L002_I2_001.fastq.gz
├── MySample_S1_L002_R1_001.fastq.gz
└── MySample_S1_L002_R2_001.fastq.gz

This is correct; since the files are named according to the bcl2fastq standard, use the same arguments as if the FASTQs were organized into a flow cell folder or mkfastq output folder.

Situation	Argument + Value
No filtering according to sample or lane	`--fastqs=/PATH/TO/PROJECT_FOLDER`
Process `MySample` from all lanes	`--fastqs=/PATH/TO/PROJECT_FOLDER` \ `--sample=MySample`
Process `MySample` from lane 1 only	`--fastqs=/PATH/TO/PROJECT_FOLDER` \ `--sample=MySample` \ `--lanes=1`
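
When files have been copied into a flat folder, it can help to list which sample names and lanes are actually present before choosing `--sample` and `--lanes` values. The sketch below is a hypothetical helper that scans one folder (not its subfolders) and assumes the naming convention described earlier on this page; the demonstration uses a temporary folder with empty placeholder files.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_[IR][12]_001\.fastq\.gz$"
)


def samples_and_lanes(folder):
    """Map each sample name found directly in `folder` to its sorted lane numbers."""
    found = defaultdict(set)
    for path in Path(folder).iterdir():
        match = FASTQ_NAME.match(path.name)
        if match and path.is_file():
            found[match["sample"]].add(int(match["lane"]))
    return {sample: sorted(lanes) for sample, lanes in found.items()}


# Demonstration on a throwaway folder that mimics PROJECT_FOLDER above.
with tempfile.TemporaryDirectory() as project_folder:
    for lane in ("L001", "L002"):
        for read in ("I1", "I2", "R1", "R2"):
            Path(project_folder, f"MySample_S1_{lane}_{read}_001.fastq.gz").touch()
    summary = samples_and_lanes(project_folder)

print(summary)  # {'MySample': [1, 2]}
```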

The FASTQs are named like read-I1_si-AAAAAAAA_lane-001-chunk-001.fastq.gz.

How did I get here?

The 10x demux pipeline was used to demultiplex the flow cell instead of mkfastq. This pipeline has been deprecated, but its output can still be used to run spaceranger count. Your file hierarchy likely contains many files, named like this:


demux_id
├── BCL_PROCESSOR_CS
└── outs
    └── fastq_path
        ├── read-I1_si-AAAAAAAA_lane-001-chunk-001.fastq.gz
        ├── read-I2_si-AAAAAAAA_lane-001-chunk-001.fastq.gz
        ...
        ├── read-I1_si-TTTTTTTT_lane-002-chunk-001.fastq.gz
        ├── read-I2_si-TTTTTTTT_lane-002-chunk-001.fastq.gz
        ├── read-I1_si-X_lane-002-chunk-001.fastq.gz
        ├── read-I2_si-X_lane-002-chunk-001.fastq.gz
        ├── read-RA_si-AAAAAAAA_lane-001-chunk-001.fastq.gz
        ...
        ├── read-RA_si-TTTTTTTT_lane-002-chunk-001.fastq.gz
        └── read-RA_si-X_lane-002-chunk-001.fastq.gz

To ingest the correct FASTQ files from a demux run, you need to know the 10x sample dual-index associated with your sample. That selects the correct files from the sample indices in your folder:

Situation	Argument + Value
No filtering according to sample or lane	`--fastqs=/PATH/TO/PROJECT_FOLDER`
Process sample associated with SI-TT-A1	`--fastqs=/PATH/TO/PROJECT_FOLDER` \ `--indices=SI-TT-A1`
Process sample associated with SI-TT-A1, lane 1 only	`--fastqs=/PATH/TO/PROJECT_FOLDER` \ `--indices=SI-TT-A1` \ `--lanes=1`
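
To see which sample index sequences and lanes a demux output folder contains, the file names can be split into their fields. The pattern below is inferred from the file listing above rather than from a published specification, so treat it as an assumption; the function name is hypothetical.

```python
import re

# Inferred layout: read-<read>_si-<index>_lane-<lane>-chunk-<chunk>.fastq.gz
DEMUX_NAME = re.compile(
    r"^read-(?P<read>[A-Z0-9]+)_si-(?P<index>[A-Z]+)"
    r"_lane-(?P<lane>\d{3})-chunk-(?P<chunk>\d{3})\.fastq\.gz$"
)


def parse_demux_name(filename):
    """Return the fields of a demux-style name as a dict, or None if it does not match."""
    match = DEMUX_NAME.match(filename)
    if match is None:
        return None
    return {
        "read": match["read"],
        "index": match["index"],
        "lane": int(match["lane"]),
        "chunk": int(match["chunk"]),
    }


print(parse_demux_name("read-RA_si-TTTTTTTT_lane-002-chunk-001.fastq.gz"))
# {'read': 'RA', 'index': 'TTTTTTTT', 'lane': 2, 'chunk': 1}
```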

The FASTQs are not named like any of the above examples.

How did I get here?

It is likely that you received files that were processed through a proprietary LIMS, which employs its own naming conventions.

10x pipelines need files named in the bcl2fastq or demux convention in order to run properly. You need to determine which file corresponds to which sample and which read type, likely by consulting your sequencing core or the individual who demultiplexed your flow cell.

It is highly likely that these files were initially processed with bcl2fastq. Once you have tracked down their origin, rename the files to the following format:

[Sample Name]_S1_L00[Lane Number]_[Read Type]_001.fastq.gz

Where Read Type is one of:

I1: Sample index read (optional)
I2: Sample index read (optional)
R1: Read 1
R2: Read 2

Notice that I1 and I2 are optional because it is possible to have datasets without index reads at all if sequencing was performed without sample multiplexing. After you have renamed those files into that format, use the following arguments:

Situation	Argument + Value
No filtering according to sample or lane	`--fastqs=/PATH/TO/PROJECT_FOLDER`
Process `SAMPLENAME` from all lanes	`--fastqs=/PATH/TO/PROJECT_FOLDER` \ `--sample=SAMPLENAME`
Process `SAMPLENAME` from lane 1 only	`--fastqs=/PATH/TO/PROJECT_FOLDER` \ `--sample=SAMPLENAME` \ `--lanes=1`
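
Once you know which file corresponds to which sample, lane, and read type, the renaming itself can be scripted. The sketch below is one possible approach, not a 10x tool: the function names and the LIMS-style input names are hypothetical, and the mapping from old names to (sample, lane, read type) must come from your sequencing core. It defaults to a dry run so the plan can be reviewed before any file is touched.

```python
import tempfile
from pathlib import Path

READ_TYPES = {"I1", "I2", "R1", "R2"}


def target_name(sample, lane, read_type):
    """Build a name in the format shown above (sample order S1, chunk 001)."""
    if read_type not in READ_TYPES:
        raise ValueError(f"unknown read type: {read_type}")
    return f"{sample}_S1_L{lane:03d}_{read_type}_001.fastq.gz"


def rename_fastqs(folder, mapping, dry_run=True):
    """Rename files in `folder`; mapping is {old name: (sample, lane, read_type)}.

    Returns the list of (old name, new name) pairs. With dry_run=True nothing is renamed.
    """
    plan = []
    for old_name, (sample, lane, read_type) in mapping.items():
        new_name = target_name(sample, lane, read_type)
        plan.append((old_name, new_name))
        if not dry_run:
            Path(folder, old_name).rename(Path(folder, new_name))
    return plan


# Demonstration with made-up LIMS-style names in a throwaway folder.
mapping = {
    "lims_0042_a.fq.gz": ("SAMPLENAME", 1, "R1"),
    "lims_0042_b.fq.gz": ("SAMPLENAME", 1, "R2"),
}
with tempfile.TemporaryDirectory() as project_folder:
    for old_name in mapping:
        Path(project_folder, old_name).touch()
    plan = rename_fastqs(project_folder, mapping, dry_run=False)
    renamed = sorted(path.name for path in Path(project_folder).iterdir())

print(renamed)
# ['SAMPLENAME_S1_L001_R1_001.fastq.gz', 'SAMPLENAME_S1_L001_R2_001.fastq.gz']
```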
