# README PPMI-Phase1-2 IR1 Analysis Release

sncRNA-seq of PPMI whole blood samples: release of PPMI Phase1-2 data (Interim Release 1)

## Overview

This README describes the data in Interim Release 1 (IR1) of
"small RNA transcriptome sequencing of PPMI whole blood samples release of Phase1-2 data". This release contains the output of data processing and does not represent a full analysis. It includes counts, quantification data, alignment (BAM) files and trimmed sequencing (FASTQ) files. Samples with obvious QC issues (low read counts, low mapping rates) have been marked as failures. Additional QC and analysis are ongoing.

Samples were sequenced at HudsonAlpha's (http://www.hudsonalpha.org)
Genomic Services Lab (https://gsl.hudsonalpha.org/index) on an Illumina NovaSeq6000.

All samples were prepared using the Bioo smRNA library prep kit.

## Naming convention

### Regex: ([a-zA-Z0-9_-]*)\.{0,1}

Sample names are period delimited with the following fields:

    FIELD 1: study and release version (PPMI-IR1)
    FIELD 2: individualID or PATNO (e.g. 3385)
    FIELD 3: visit (e.g. V08)
    FIELD 4: sampleID (e.g. PP00175168)
    FIELD 5: sequencing facility ID (e.g. 5104-SL-0001)
    FIELD 6+: optional compression tar/gz and additional file sub-dividers

Files have been grouped by analysis (mapped_vs_bacteria_and_viruses, mapped_vs_hg38, mapped_vs_mirbase, mapped_vs_other_sncrnas, trimmed, counts).

### Example hg38 mapping file:

mapped_vs_hg38/PPMI-IR1.3127.BL.3157977.5628-SL-2031.bam
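
The naming convention can be illustrated by splitting the example file name above on periods. This is a minimal sketch in Python, not part of the release; the names `SampleFile` and `parse_sample_filename` are hypothetical and were chosen only for this illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Tuple


class SampleFile(NamedTuple):
    """Fields 1-5 of the naming convention, plus any trailing sub-dividers."""
    release: str      # FIELD 1: study and release version, e.g. PPMI-IR1
    patno: str        # FIELD 2: individualID or PATNO
    visit: str        # FIELD 3: visit, e.g. BL or V08
    sample_id: str    # FIELD 4: sampleID
    facility_id: str  # FIELD 5: sequencing facility ID
    suffixes: Tuple[str, ...]  # FIELD 6+: extensions and other sub-dividers


def parse_sample_filename(path: str) -> SampleFile:
    """Split a period-delimited sample file name into its fields."""
    name = PurePosixPath(path).name  # drop the analysis directory, if any
    parts = name.split(".")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 period-delimited fields: {name!r}")
    return SampleFile(*parts[:5], tuple(parts[5:]))


fields = parse_sample_filename(
    "mapped_vs_hg38/PPMI-IR1.3127.BL.3157977.5628-SL-2031.bam"
)
print(fields.patno, fields.visit, fields.suffixes)
```

Splitting on periods works because the individual fields use only letters, digits, hyphens and underscores, as the regex above indicates.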

### Example raw counts file:

counts/mirna_quantification_matrix_raw.csv

### Example RPM-normalized counts file:

counts/mirna_quantification_matrix_rpm_norm.csv

Count files for all sncRNAs have been generated by miRMaster 1.0 and are available in the counts/ directory.

### Directory Structure

sncRNA/
├── README
├── mapped_vs_bacteria_and_viruses
│   ├── *.bam       # alignment files
│   └── *.bam.bai   # alignment index files
├── mapped_vs_hg38
│   ├── *.bam       # alignment files
│   └── *.bam.bai   # alignment index files
├── mapped_vs_mirbase
│   ├── *.bam       # alignment files
│   └── *.bam.bai   # alignment index files
├── mapped_vs_other_sncrnas
│   ├── *.bam       # alignment files
│   └── *.bam.bai   # alignment index files
├── trimmed
│   └── *.fastq.gz  # trimmed sequencing files
└── counts
    ├── mirna_quantification_matrix_raw.csv.gz        # Raw read counts for all miRNAs of miRBase v22
    ├── mirna_quantification_matrix_rpm_norm.csv.gz   # Reads per million normalized read counts for all miRNAs of miRBase v22
    ├── mirna_quantification_matrix_rpmmm_norm.csv.gz # Reads per million mapped to miRNA normalized read counts for all miRNAs of miRBase v22
    ├── sncrna_quantification_matrix_raw.csv.gz       # Raw read counts for all sncRNAs of miRBase v22, Ensembl ncRNA 85, GtRNAdb 2.0 and piRBase 1.0
    ├── sncrna_quantification_matrix_rpm_norm.csv.gz  # Reads per million normalized read counts for all sncRNAs of miRBase v22, Ensembl ncRNA 85, GtRNAdb 2.0 and piRBase 1.0
    ├── all_quantification_matrix_raw.csv.gz          # Raw read counts for all sncRNAs included in sncrna_quantification_matrix, as well as raw read counts for all Bacteria and Viruses of NCBI Refseq 74
    └── all_quantification_matrix_rpm_norm.csv.gz     # Reads per million normalized read counts for all sncRNAs included in sncrna_quantification_matrix, as well as RPM normalized read counts for all Bacteria and Viruses of NCBI Refseq 74
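
The relationship between the raw and the RPM-style matrices can be sketched as follows. This README does not state the exact denominators used by miRMaster, so the choices here are assumptions: by default the sample's counts are scaled by their own sum (roughly the "reads per million mapped to miRNA" idea when applied to the miRNA matrix), and a different total can be passed explicitly. The function name `rpm_normalize` is hypothetical.

```python
from typing import Dict, Optional


def rpm_normalize(
    counts: Dict[str, float], total_reads: Optional[float] = None
) -> Dict[str, float]:
    """Scale one sample's raw counts to reads per million.

    counts      -- mapping of reference name to raw read count for one sample
    total_reads -- denominator; ASSUMPTION: defaults to the sum of the counts,
                   which may differ from the totals miRMaster actually used
    """
    denominator = sum(counts.values()) if total_reads is None else total_reads
    if denominator == 0:
        # Avoid division by zero for samples without any counted reads.
        return {ref: 0.0 for ref in counts}
    return {ref: count * 1_000_000 / denominator for ref, count in counts.items()}


# Toy numbers, not taken from the release.
sample = {"hsa-miR-486-5p": 3, "hsa-miR-16-5p": 1}
print(rpm_normalize(sample))
print(rpm_normalize(sample, total_reads=8))
```

Values recomputed this way need not match the released matrices exactly, since the released files may be based on a different read total.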

## Analysis notes
Analysis completed by the Chair for Clinical Bioinformatics of Saarland University

### Genome and Databases Info
GRCh38.p10 limited to reference chromosomes, miRBase v22, Ensembl ncRNA 85, NCBI RefSeq 74 Bacteria and Viruses, GtRNAdb 2.0, piRBase 1.0

### FASTQ Generation
The raw sequence image files from the Illumina NovaSeq6000, in the form of BCL, were converted to FASTQ format using bcl2fastq v1.8.4 and checked to ensure that the quality scores do not deteriorate at the read ends.

### Genome alignment:
Input: trimmed reads
Aligner: Bowtie 1.1.2
Options: -v 0 -m 100 --fullref -S

### miRNA alignment:
Input: trimmed and collapsed reads
Aligner: Bowtie 1.1.2
Options: -v 2 -a --best --strata --norc --fullref -S

### Other sncRNA alignment:
Input: trimmed and collapsed reads
Aligner: Bowtie 1.1.2
Options: -v 0 -a --norc --fullref -S

### Bacteria and viruses alignment:
Input: trimmed and collapsed reads that did not map to the genome with at most 1 mismatch
Aligner: Bowtie 1.1.2
Options: -v 0 -a --fullref -S
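
The four alignment steps above differ only in their Bowtie options, which can be collected in one place. The sketch below assembles argument lists for Bowtie 1 using the usual `bowtie [options] <index> <reads> <output>` order. Only the option strings come from this README; the index prefix, the read file and the output path are hypothetical placeholders, and any input-format options the original pipeline may have used are not listed here and are not added.

```python
import shlex
from typing import List

# Option strings as documented in the alignment sections above.
BOWTIE_OPTIONS = {
    "genome": "-v 0 -m 100 --fullref -S",
    "mirna": "-v 2 -a --best --strata --norc --fullref -S",
    "other_sncrna": "-v 0 -a --norc --fullref -S",
    "bacteria_viruses": "-v 0 -a --fullref -S",
}


def bowtie_command(step: str, index_prefix: str, reads: str, output_sam: str) -> List[str]:
    """Build an argument list for one alignment step without running it."""
    options = shlex.split(BOWTIE_OPTIONS[step])
    return ["bowtie", *options, index_prefix, reads, output_sam]


# Hypothetical paths, for illustration only.
cmd = bowtie_command("mirna", "indexes/mirbase_v22", "sample.collapsed.fa", "sample.sam")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting problems with unusual file names.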

### Counts file format:
Each file is tab-separated and has the format: Reference\tSAMPLE1\tSAMPLE2\t...\tSAMPLEN

The reference column stores the identifier or name of the molecule (e.g. hsa-miR-486-5p, piR-hsa-12423, tRNA-Ala-AGC-2-1). 
Molecules of Ensembl ncRNA 85 are stored in the format: <ENST_ID>|<Gene_Symbol> (e.g. ENST00000362423.1|SNORA21).
Bacteria and Viruses are stored in the format: <Refseq description>|<Refseq_ID> (e.g. Burkholderia mallei SAVP1 chromosome I|NC_008785.1).
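
A counts file in this layout can be read with the Python standard library alone. The following is a minimal sketch under the format described above; the function names `parse_counts`, `read_counts` and `split_reference` are hypothetical, and splitting on the last `|` is an assumption that holds for the two examples given.

```python
import csv
import gzip
from typing import Dict, List, Optional, TextIO, Tuple


def parse_counts(handle: TextIO) -> Tuple[List[str], Dict[str, Dict[str, float]]]:
    """Parse a tab-separated matrix: Reference, SAMPLE1, ..., SAMPLEN."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is the reference name
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        matrix[row[0]] = dict(zip(samples, (float(v) for v in row[1:])))
    return samples, matrix


def read_counts(path: str) -> Tuple[List[str], Dict[str, Dict[str, float]]]:
    """Open a plain or gzip-compressed counts file and parse it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return parse_counts(handle)


def split_reference(reference: str) -> Tuple[str, Optional[str]]:
    """Split '<left>|<right>' names on the last '|'; plain names are returned unchanged."""
    left, sep, right = reference.rpartition("|")
    return (left, right) if sep else (reference, None)


print(split_reference("ENST00000362423.1|SNORA21"))
print(split_reference("hsa-miR-486-5p"))
```

For Ensembl entries the first element is the transcript ID and the second the gene symbol; for bacteria and viruses the first element is the RefSeq description and the second the RefSeq ID.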

### Contacts
Date: 01/20/2020
Authors: Tobias Fehlmann (tobias.fehlmann@ccb.uni-saarland.de), Andreas Keller (andreas.keller@ccb.uni-saarland.de)

