Skip to content

Datasets

This page describes the data and Monte Carlo (MC) simulation samples used in the analysis, and explains how MC events are normalised to the observed luminosity.


1. Observed Data

The analysis uses CMS proton–proton collision data collected during Run 2016G and 2016H, corresponding to an integrated luminosity of:

\[\mathcal{L}_{\text{int}} = 16.39\,\text{fb}^{-1}\]

1.1 Golden JSON Masking

Not all luminosity blocks recorded by CMS are suitable for physics analysis, some periods have sub-detectors switched off or operating in degraded mode. Only events within certified luminosity blocks are used, enforced by filtering against the official CMS Golden JSON file:

Cert_271036-284044_13TeV_Legacy2016_Collisions16_JSON.txt

1.2 Data

Period Dataset CERN Open Data \(\mathcal{L}\ (\text{fb}^{-1})\)
2016G /MuonEG/Run2016G-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD Link 7.65
2016H /MuonEG/Run2016H-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD Link 8.74

2. Monte Carlo Simulation

All simulation samples correspond to the CMS RunIISummer20UL16 (2016 Ultra-Legacy) campaign at \(\sqrt{s} = 13\,\text{TeV}\), using NanoAOD v9 format. Samples are sourced from the CERN Open Data Portal and accessed via XRootD:

root://eospublic.cern.ch//eos/opendata/cms/mc/RunIISummer20UL16NanoAODv9/...

2.1 Signal

Dataset CERN Open Data \(\sigma\) (pb)
GluGluHToWWTo2L2N_M-125_TuneCP5_minloHJJ_13TeV-powheg-jhugen727-pythia8 Link 1.0315

2.2 Backgrounds

Drell-Yan

Dataset CERN Open Data \(\sigma\) (pb)
DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8 Link 6189.39

Top Quark

Dataset CERN Open Data \(\sigma\) (pb)
TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8 Link 87.310
ST_t-channel_top_4f_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8 Link 44.33
ST_t-channel_antitop_4f_InclusiveDecays_TuneCP5_13TeV-powheg-madspin-pythia8 Link 26.38
ST_tW_antitop_5f_inclusiveDecays_TuneCP5_13TeV-powheg-pythia8 Link 35.60
ST_tW_top_5f_inclusiveDecays_TuneCP5_13TeV-powheg-pythia8 Link 35.60
ST_s-channel_4f_leptonDecays_TuneCP5_13TeV-amcatnlo-pythia8 Link 3.360

Fakes (\(W\)+jets, semi-leptonic \(t\bar{t}\))

Dataset CERN Open Data \(\sigma\) (pb)
TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8 Link 364.35
WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8 Link 61526.7

Diboson (WZ, ZZ)

Dataset CERN Open Data \(\sigma\) (pb)
WZTo2Q2L_mllmin4p0_TuneCP5_13TeV-amcatnloFXFX-pythia8 Link 5.5950
WZTo3LNu_mllmin4p0_TuneCP5_13TeV-powheg-pythia8 Link 4.42965
ZZ_TuneCP5_13TeV-pythia8 Link 16.52300

WW

Dataset CERN Open Data \(\sigma\) (pb)
WWTo2L2Nu_TuneCP5_13TeV-powheg-pythia8 Link 12.178

ggWW

Dataset CERN Open Data \(\sigma\) (pb)
GluGluToWWToENEN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToENMN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToENTN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToMNEN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToMNMN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToMNTN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToTNEN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToTNMN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387
GluGluToWWToTNTN_TuneCP5_13TeV_MCFM701_pythia8 Link 0.06387

V+\(\gamma\)

Dataset CERN Open Data \(\sigma\) (pb)
ZGToLLG_01J_5f_TuneCP5_13TeV-amcatnloFXFX-pythia8 Link 58.83
WGToLNuG_TuneCP5_13TeV-madgraphMLM-pythia8 Link 405.271

3. MC Normalisation

MC samples are generated with arbitrary statistics that do not automatically match the data luminosity. Each simulated event is assigned a weight to correct for this:

\[\text{Scale Factor} = \frac{\sigma \times \mathcal{L}_{\text{int}} \times \text{genWeight}}{\sum \text{genWeight}}\]
Symbol Meaning
\(\text{genWeight}\) Per-event generator weight (positive or negative)
\(\sigma\) Process cross section in pb (see tables above)
\(\mathcal{L}_{\text{int}}\) Integrated luminosity: \(16.39\,\text{fb}^{-1}\)
\({\sum \text{genWeight}}\) Sum of all generator weights in the sample

3.1 Sum of Generator Weights

The denominator \({\sum \text{genWeight}}\) must be computed before any selection is applied, using all events in the original dataset. Since MC events can carry negative generator weights (due to NLO subtractions), the sum is not simply equal to the total number of events.

Sum of weights vs. number of events

Always use the sum of genWeight (not raw event counts) in the normalisation denominator. Failing to do this with NLO samples will produce incorrect overall normalisation.

The computation is handled in the xsec_weights.ipynb notebook, which reads the sample file lists and outputs a lookup dictionary of \({\sum \text{genWeight}}\) per sample.

4. File Lists

Sample ROOT file lists are stored under Datasets/:

Datasets/

  • Higgs.txt — Signal
  • WW.txt — Continuum WW
  • ggWW.txt — Loop-induced WW
  • DYtoLL.txt — Drell-Yan
  • Top.txt\(t\bar{t}\) + Single Top
  • Fakes.txt — W+jets, semi-leptonic \(t\bar{t}\)
  • VZ.txt — WZ, ZZ
  • VG.txt\(W+\gamma\), \(Z+\gamma\)

Each file contains XRootD paths in the format:

root://eospublic.cern.ch//eos/opendata/cms/mc/RunIISummer20UL16NanoAODv9/...

Reference

The cross-section numbers are sourced from here.