Module 7 Week B — Integration Task: Summarization & Integrated Evaluation Report
In this integration you will apply a pre-trained encoder–decoder summarization model to 120 tech / entertainment news articles, compute ROUGE (Recall-Oriented Understudy for Gisting Evaluation) against the reference summaries that ship with the dataset (data/tech_news_summaries_reference.csv in your repo), and then write the integrated evaluation report — the synthesis that combines your Lab 7A fine-tuned classifier results, your Lab 7B QA (Question-Answering) results, and this task’s summarization results into a single comparison report. The integrated evaluation report is the Module 7 deliverable.
Due: Thursday end of day. Resubmissions are accepted through Saturday of the assignment week.
What You Will Produce
Committed to your repo:
- `summarize.py` — your implementation of the summarization pipeline.
- `Makefile` — provided in the starter; you use `make summarize`, `make smoke`, and `make clean`.
- Updated `README.md` — 1–2 paragraphs documenting the model id, the tech news corpus version, and the re-run command.
- `summary_predictions.csv` — 120 rows with reference summary, predicted summary, and per-summary ROUGE.
- `summary_metrics.json` — aggregate ROUGE-1, ROUGE-2, ROUGE-L F1 with `n` and the model id.
- `integrated-evaluation-report.md` — the six-section integrated report (the M7 deliverable).
The summarization model is loaded from Hugging Face Hub at runtime. No model file in your repo.
Setup
- Accept the assignment (link added at Step 9c — Classroom setup)
- Clone your repo:
  `git clone <your-repo-url>`
  `cd <your-repo-name>`
- Install dependencies:
  `pip install -r requirements.txt`
  This installs `transformers`, `torch` (CPU build), `pandas`, and `rouge-score`. The first three are already installed; `rouge-score` is the same package you installed for the drill (this is a separate repo, so a fresh `pip install` is needed).
- Create your working branch:
  `git checkout -b integration-7b-summary-eval`
First run:
`summarize.py` constructs a summarization pipeline, which downloads ~250 MB of model weights on first run. Plan ~3 minutes for the first run; subsequent runs use cached weights. The full evaluation on 120 articles completes in ~6–8 minutes on CPU after the model is cached.
What You’re Building
summarize.py runs the full summarization evaluation end-to-end:
get_summarizer_model_name() # provided helper
build_summarizer(...) # TODO (same as drill)
summarize_one(...) # TODO (same as drill, with do_sample=False, num_beams=4)
compute_rouge(...) # TODO (same as drill)
evaluate_summaries(...) # TODO (new — orchestrates the harness)
main() # TODO (orchestration)
The starter summarize.py includes get_summarizer_model_name(), which returns os.environ.get("SUMM_MODEL_FOR_CI") if set, otherwise "sshleifer/distilbart-cnn-6-6". Don’t modify it — the autograder’s smoke test depends on it.
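For reference, the provided helper behaves roughly like the sketch below (it already ships in the starter, so there is nothing to write here — this is just to show what the smoke test relies on):

```python
import os

def get_summarizer_model_name():
    # Starter-provided behavior as described above: honor the CI override
    # when the environment variable is set, otherwise use the default checkpoint.
    return os.environ.get("SUMM_MODEL_FOR_CI", "sshleifer/distilbart-cnn-6-6")
```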
Task 1: Pipeline + Single-Document Summarization
Implement build_summarizer(model_name) and summarize_one(summ, text, max_length=120, min_length=30) exactly as in the drill. Use do_sample=False and num_beams=4 inside summarize_one.
These are the foundation. The autograder verifies them before checking the harness.
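A minimal sketch of what the drill-style implementations can look like, assuming the standard `transformers` pipeline API (your drill code may differ in details; the `truncation=True` flag is an assumption, not part of the task text):

```python
from transformers import pipeline

def build_summarizer(model_name):
    # Construct a Hugging Face summarization pipeline for the given model id.
    # Weights are downloaded from the Hub on first use, then cached locally.
    return pipeline("summarization", model=model_name)

def summarize_one(summ, text, max_length=120, min_length=30):
    # Deterministic decoding as the task specifies: no sampling, 4-beam search.
    # truncation=True (an assumption) guards against articles longer than the
    # model's input limit.
    outputs = summ(
        text,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
        num_beams=4,
        truncation=True,
    )
    return outputs[0]["summary_text"]
```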
Task 2: ROUGE Scoring
Implement compute_rouge(pred, ref) exactly as in the drill — return {"rouge1": float, "rouge2": float, "rougeL": float} with F1 measures, using a RougeScorer with use_stemmer=True. Remember the argument order: scorer.score(reference, predicted).
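A sketch consistent with the description above, using the `rouge-score` package installed in Setup:

```python
from rouge_score import rouge_scorer

def compute_rouge(pred, ref):
    # F1 measures for ROUGE-1/2/L, with stemming enabled.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(ref, pred)  # argument order: score(reference, predicted)
    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure,
    }
```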
Task 3: Evaluate Over the Corpus
Implement evaluate_summaries(summ, articles_df, refs_df):
- `articles_df` has columns `article_id`, `text` (the tech news articles).
- `refs_df` has columns `article_id`, `reference_summary`.
- Join on `article_id`; for each pair, summarize the article and compute ROUGE against the reference.
- Return a dict:
{
    "rouge1": float,   # mean ROUGE-1 F1
    "rouge2": float,   # mean ROUGE-2 F1
    "rougeL": float,   # mean ROUGE-L F1
    "n": int,
    "predictions": [
        {
            "article_id": str,
            "reference_summary": str,
            "predicted_summary": str,
            "rouge1": float,
            "rouge2": float,
            "rougeL": float,
        },
        ...
    ],
}
def evaluate_summaries(summ, articles_df, refs_df) -> dict:
"""
Summarize each article and score against its reference.
Returns aggregate ROUGE-1/2/L F1 plus per-article predictions.
"""
# TODO: join the two DataFrames on article_id
# TODO: iterate, summarize, compute ROUGE
# TODO: aggregate and return
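One possible shape for the body, assuming the `summarize_one` and `compute_rouge` helpers above and pandas DataFrames with the columns listed in this task:

```python
def evaluate_summaries(summ, articles_df, refs_df) -> dict:
    # Join articles to their references on article_id.
    merged = articles_df.merge(refs_df, on="article_id")

    predictions = []
    for row in merged.itertuples(index=False):
        predicted = summarize_one(summ, row.text)
        scores = compute_rouge(predicted, row.reference_summary)
        predictions.append({
            "article_id": str(row.article_id),
            "reference_summary": row.reference_summary,
            "predicted_summary": predicted,
            **scores,
        })

    n = len(predictions)
    return {
        "rouge1": sum(p["rouge1"] for p in predictions) / n,
        "rouge2": sum(p["rouge2"] for p in predictions) / n,
        "rougeL": sum(p["rougeL"] for p in predictions) / n,
        "n": n,
        "predictions": predictions,
    }
```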
Task 4: Orchestrate
Implement main():
- Read input paths from environment variables: `ARTICLES_PATH` (default `data/tech_news_articles.csv`), `REFERENCES_PATH` (default `data/tech_news_summaries_reference.csv`), `OUTPUT_PATH` (default `summary_predictions.csv`).
- Load both CSVs into DataFrames.
- Build the summarizer via `build_summarizer(get_summarizer_model_name())`.
- Run `evaluate_summaries`.
- Write `summary_predictions.csv` (columns: `article_id`, `reference_summary`, `predicted_summary`, `rouge1`, `rouge2`, `rougeL`).
- Write `summary_metrics.json` (`rouge1`, `rouge2`, `rougeL`, `n`, `model`).
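Under those assumptions, `main()` can be a thin orchestration layer like the sketch below (the `summary_metrics.json` filename comes from the deliverables list; everything else follows the steps above):

```python
import json
import os

import pandas as pd

def main():
    # Input/output paths come from environment variables, with the defaults named above.
    articles_path = os.environ.get("ARTICLES_PATH", "data/tech_news_articles.csv")
    references_path = os.environ.get("REFERENCES_PATH", "data/tech_news_summaries_reference.csv")
    output_path = os.environ.get("OUTPUT_PATH", "summary_predictions.csv")

    articles_df = pd.read_csv(articles_path)
    refs_df = pd.read_csv(references_path)

    model_name = get_summarizer_model_name()
    summ = build_summarizer(model_name)
    results = evaluate_summaries(summ, articles_df, refs_df)

    # Per-article predictions (columns follow from the dict keys in evaluate_summaries).
    pd.DataFrame(results["predictions"]).to_csv(output_path, index=False)

    # Aggregate metrics plus the model id.
    metrics = {
        "rouge1": results["rouge1"],
        "rouge2": results["rouge2"],
        "rougeL": results["rougeL"],
        "n": results["n"],
        "model": model_name,
    }
    with open("summary_metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

if __name__ == "__main__":
    main()
```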
Then run:
make summarize
This executes `python summarize.py` against the full 120-article evaluation set; expect ~6–8 minutes on CPU once the model weights are cached.
Task 5: Update the README
Edit README.md to add 1–2 paragraphs documenting:
- The summarization model name (default `sshleifer/distilbart-cnn-6-6`) and a one-line description of what it is.
- The corpus version (120 tech news articles from M6) and the reference summaries file.
- The re-run command (`make summarize`).
This is the minimum reproducibility requirement: anyone reading your repo should be able to re-run your evaluation.
Task 6: The Integrated Evaluation Report — the M7 Deliverable
This is the central artifact. The integrated-evaluation-report.md synthesizes three weeks of measurement:
- Lab 7A — your fine-tuned DistilBERT classifier on AARSynth app reviews. Use the macro-F1 from your `metrics.json`.
- Integration 7A — the same fine-tuned classifier applied to the tech news corpus (domain shift; included for engineering context).
- Lab 7B — pre-trained QA on the curated tech-news QA set. Use EM and token-F1 from your `qa_metrics.json`.
- Integration 7B (this task) — pre-trained summarization on the tech news corpus. Use ROUGE from your `summary_metrics.json`.
Manually paste the relevant numbers from each metrics file into the report’s comparison table. The TA verifies that the report’s numbers match your submitted metrics files.
The report has six required sections. The starter includes a template integrated-evaluation-report.md with these section headers — fill them in.
Section 1: Comparison Table
| Task | Approach | Model | Training cost | Inference cost | Quality metric | Value |
|---|---|---|---|---|---|---|
| Sentiment classification (Lab 7A) | Fine-tuning | DistilBERT | ~30 min CPU + 3K labels | ~50 ms / example | Macro-F1 | (from your metrics.json) |
| Domain transfer of fine-tuned classifier (Integration 7A) | Fine-tuned model on out-of-domain | (same) | (already trained) | ~50 ms / example | Domain-shift judgment | (qualitative — from Integration 7A) |
| Extractive QA (Lab 7B) | Pre-trained inference | distilbert-base-cased-distilled-squad | 0 | ~50 ms / example | EM / token-F1 | (from your qa_metrics.json) |
| Abstractive summarization (Integration 7B) | Pre-trained inference | distilbart-cnn-6-6 | 0 | ~3 sec / example | ROUGE-1/2/L F1 | (from your summary_metrics.json) |
Section 2: Findings
3–5 bullet points characterizing what each approach excels at and where it breaks. Examples of the kind of finding to write:
- “Fine-tuning produced a 0.87 macro-F1 on app-review sentiment but suffered visible domain-shift breakdown on tech news articles (Integration 7A).”
- “Pre-trained QA achieved [your EM] / [your F1] on the curated tech-news QA set; [your failure-mode insight from Lab 7B].”
Make the bullets specific to your numbers.
Section 3: Faithfulness Check (qualitative)
Pick three summaries from summary_predictions.csv (one high-ROUGE, one mid-ROUGE, one low-ROUGE); a quick way to select them is sketched after this list. For each:
- Quote the article and the predicted summary.
- Mark whether the summary is faithful (every claim appears in the article).
- Comment on what ROUGE caught or missed for this summary.
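A quick, hypothetical way to pick the three candidates (sorting on ROUGE-L here is a choice, not a requirement — any of the three ROUGE columns works):

```python
import pandas as pd

# Rank your per-article predictions by ROUGE-L and take the extremes plus the middle.
preds = pd.read_csv("summary_predictions.csv")
ranked = preds.sort_values("rougeL").reset_index(drop=True)

low = ranked.iloc[0]                 # lowest ROUGE-L
mid = ranked.iloc[len(ranked) // 2]  # roughly median ROUGE-L
high = ranked.iloc[-1]               # highest ROUGE-L

for label, row in [("high", high), ("mid", mid), ("low", low)]:
    print(label, row["article_id"], round(row["rougeL"], 3))
```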
This is the section that puts the limits of ROUGE on the page concretely. A high-ROUGE summary that hallucinates is the canonical example to call out — if you find one, this section is gold.
Section 4: Production Decision Matrix
For each of three hypothetical production scenarios, recommend either fine-tuning or pre-trained inference, with one specific sentence justifying the recommendation grounded in your measured numbers.
| Scenario | Recommendation | Justification (1 sentence, tied to your numbers) |
|---|---|---|
| Real-time app-review sentiment dashboard for trading desk | (your call) | (your justification) |
| Internal tech / entertainment news summary digest for a newsroom team | (your call) | (your justification) |
| Domain-expert QA on legal contracts | (your call) | (your justification) |
A weak version of this section says “it depends” everywhere. A strong version specifies what makes it depend — labeled-data availability, latency budget, faithfulness requirement.
Section 5: What You Would Do Differently
One paragraph on what you would change about your approach if you had a labeled summarization dataset for the tech / entertainment news domain. Concrete: would you fine-tune a summarizer? Train a calibration model on top? Invest in a faithfulness audit? The goal is to articulate one engineering investment that would meaningfully change the numbers.
Section 6: Limits of the Evaluation
One paragraph on what these numbers do not tell you. Examples of what to address:
- ROUGE doesn’t capture faithfulness — your faithfulness check (Section 3) addresses some of this, but only on three summaries.
- EM / token-F1 don’t capture calibration — a model can be 80% accurate and 100% confident on every output.
- Latency under load isn’t measured — these are single-request numbers.
Don’t list every possible limit. Pick the one or two that matter most for the production scenarios you discussed in Section 4.
Submitting
When everything is done:
- Stage and commit:
  `git add summarize.py Makefile README.md summary_metrics.json summary_predictions.csv integrated-evaluation-report.md`
  `git commit -m "Complete integration 7B"`
- Push to the remote repository.
- Open a Pull Request from your working branch into `main`. The autograder runs `make smoke` against a tiny fixture (3 articles + same reference file) using a substitute model — this verifies your pipeline is structurally correct end-to-end. It does not re-run your full 120-article evaluation.
Your PR description must include:
- Aggregate ROUGE-1, ROUGE-2, ROUGE-L F1 (two decimals each).
- One-sentence summary of your production decision matrix’s central recommendation (e.g., “Pre-trained inference is sufficient for low-stakes summarization; fine-tuning is justified for the trading desk classifier where labeled data exists.”).
Finally, paste your PR URL into TalentLMS → Module 7 → Integration 7B to submit this assignment.
What’s Next
- Tomorrow (Thursday) — Honors Track only: Stretch — Summarize-then-QA two-step pipeline (extends this integration on branch `stretch-thu-summarize-then-qa`).
- Today (Thursday EOD): Reflection due. Peer review on Lab 7B due. Readiness Check 2 due.
- Module 8 (T11, Sun 31 May 2026): RAG. The QA pipeline you built in Lab 7B becomes the generator side of a retrieval-augmented system. EID-2 (Sun 24 May 2026) is the buffer week between Module 7 and Module 8.