Module 7 Week B — Integration Task: Summarization & Integrated Evaluation Report

In this integration you will apply a pre-trained encoder–decoder summarization model to 120 tech / entertainment news articles, compute ROUGE (Recall-Oriented Understudy for Gisting Evaluation) against the reference summaries that ship with the dataset (data/tech_news_summaries_reference.csv in your repo), and then write the integrated evaluation report — the synthesis that combines your Lab 7A fine-tuned classifier results, your Lab 7B QA (Question-Answering) results, and this task’s summarization results into a single comparison report. The integrated evaluation report is the Module 7 deliverable.

Due: Thursday end of day. Resubmissions are accepted through Saturday of the assignment week.


What You Will Produce

Committed to your repo:

  1. summarize.py — your implementation of the summarization pipeline.
  2. Makefile — provided in the starter; you use make summarize, make smoke, make clean.
  3. Updated README.md — 1–2 paragraphs documenting the model id, the tech news corpus version, and the re-run command.
  4. summary_predictions.csv — 120 rows with reference summary, predicted summary, and per-summary ROUGE.
  5. summary_metrics.json — aggregate ROUGE-1, ROUGE-2, ROUGE-L F1 with n and the model id.
  6. integrated-evaluation-report.md — the six-section integrated report (the M7 deliverable).

The summarization model is loaded from Hugging Face Hub at runtime. No model file in your repo.


Setup

  1. Accept the assignment (link added at Step 9c — Classroom setup)
  2. Clone your repo:
    git clone <your-repo-url>
    cd <your-repo-name>
    
  3. Install dependencies:
    pip install -r requirements.txt
    

This installs transformers, torch (CPU build), pandas, and rouge-score. The first three should already be present from earlier modules; rouge-score is the same package you installed for the drill, but the drill lived in a separate repo, so run the install here anyway.

  4. Create your working branch:
    git checkout -b integration-7b-summary-eval
    

First run: summarize.py constructs a summarization pipeline, which downloads ~250 MB of model weights the first time it executes. Plan ~3 minutes for that run; subsequent runs use cached weights. The full evaluation on 120 articles completes in ~6–8 minutes on CPU once the model is cached.


What You’re Building

summarize.py runs the full summarization evaluation end-to-end:

get_summarizer_model_name()    # provided helper
build_summarizer(...)          # TODO (same as drill)
summarize_one(...)             # TODO (same as drill, with do_sample=False, num_beams=4)
compute_rouge(...)             # TODO (same as drill)
evaluate_summaries(...)        # TODO (new — orchestrates the harness)
main()                         # TODO (orchestration)

The starter summarize.py includes get_summarizer_model_name(), which returns os.environ.get("SUMM_MODEL_FOR_CI") if set, otherwise "sshleifer/distilbart-cnn-6-6". Don’t modify it — the autograder’s smoke test depends on it.
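In other words, the helper is roughly the following (reproduced from the description above for reference; use the starter's version, not a rewrite):

import os

def get_summarizer_model_name():
    # CI override if set, otherwise the default DistilBART checkpoint.
    return os.environ.get("SUMM_MODEL_FOR_CI", "sshleifer/distilbart-cnn-6-6")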


Task 1: Pipeline + Single-Document Summarization

Implement build_summarizer(model_name) and summarize_one(summ, text, max_length=120, min_length=30) exactly as in the drill. Use do_sample=False and num_beams=4 inside summarize_one.
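If your drill version isn't handy, the shape is roughly this (a sketch, not the graded reference; truncation=True is an addition here to guard against over-length articles):

from transformers import pipeline

def build_summarizer(model_name):
    # A Hugging Face summarization pipeline; downloads weights on first use.
    return pipeline("summarization", model=model_name)

def summarize_one(summ, text, max_length=120, min_length=30):
    # Deterministic beam search per the spec: do_sample=False, num_beams=4.
    result = summ(text, max_length=max_length, min_length=min_length,
                  do_sample=False, num_beams=4, truncation=True)
    return result[0]["summary_text"]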

These are the foundation. The autograder verifies them before checking the harness.


Task 2: ROUGE Scoring

Implement compute_rouge(pred, ref) exactly as in the drill — return {"rouge1": float, "rouge2": float, "rougeL": float} with F1 measures, using a RougeScorer with use_stemmer=True. Remember the argument order: scorer.score(reference, predicted).
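A minimal sketch of the shape, using the rouge-score API:

from rouge_score import rouge_scorer

def compute_rouge(pred, ref):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    # Argument order per the rouge-score API: reference first, prediction second.
    scores = scorer.score(ref, pred)
    return {name: scores[name].fmeasure for name in ("rouge1", "rouge2", "rougeL")}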


Task 3: Evaluate Over the Corpus

Implement evaluate_summaries(summ, articles_df, refs_df):

def evaluate_summaries(summ, articles_df, refs_df) -> dict:
    """
    Summarize each article and score against its reference.

    Returns aggregate ROUGE-1/2/L F1 plus per-article predictions.
    """
    # TODO: join the two DataFrames on article_id
    # TODO: iterate, summarize, compute ROUGE
    # TODO: aggregate and return
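If you're unsure how to structure the harness, here is one possible shape (the column names text and reference_summary are assumptions; check the headers of your starter CSVs):

def evaluate_summaries(summ, articles_df, refs_df):
    # Join articles to their references, then score each prediction.
    merged = articles_df.merge(refs_df, on="article_id")
    records = []
    for _, row in merged.iterrows():
        pred = summarize_one(summ, row["text"])
        scores = compute_rouge(pred, row["reference_summary"])
        records.append({"article_id": row["article_id"],
                        "reference_summary": row["reference_summary"],
                        "predicted_summary": pred,
                        **scores})
    n = len(records)
    aggregate = {key: sum(r[key] for r in records) / n
                 for key in ("rouge1", "rouge2", "rougeL")}
    return {"n": n, **aggregate, "predictions": records}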

Task 4: Orchestrate

Implement main(): load the articles and reference CSVs, build the summarizer via get_summarizer_model_name() and build_summarizer(), run evaluate_summaries(), and write summary_predictions.csv plus summary_metrics.json (aggregate ROUGE with n and the model id), as sketched below.
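One sketch of the orchestration (the articles CSV path is a guess; the reference path comes from this brief; it also assumes the evaluate_summaries return shape shown above):

import json
import pandas as pd

def main():
    model_name = get_summarizer_model_name()
    summ = build_summarizer(model_name)
    articles_df = pd.read_csv("data/tech_news_articles.csv")  # hypothetical path
    refs_df = pd.read_csv("data/tech_news_summaries_reference.csv")
    results = evaluate_summaries(summ, articles_df, refs_df)
    pd.DataFrame(results["predictions"]).to_csv("summary_predictions.csv",
                                                index=False)
    metrics = {"model": model_name, "n": results["n"],
               "rouge1": results["rouge1"], "rouge2": results["rouge2"],
               "rougeL": results["rougeL"]}
    with open("summary_metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

if __name__ == "__main__":
    main()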

Then run:

make summarize

This executes python summarize.py against the full 120-article evaluation set. ~6–8 minutes on CPU after the model weights are cached.


Task 5: Update the README

Edit README.md to add 1–2 paragraphs documenting:

  1. the model id (as returned by get_summarizer_model_name()),
  2. the tech / entertainment news corpus version you evaluated against, and
  3. the command to re-run the evaluation (make summarize).

This is the reproducibility-minimum requirement — anyone reading your repo should be able to re-run your evaluation.
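One plausible shape for the paragraph (details are illustrative; substitute your own values):

    Summarization evaluation (Integration 7B): uses sshleifer/distilbart-cnn-6-6,
    loaded from the Hugging Face Hub at runtime, against the 120-article tech /
    entertainment news corpus (references in data/tech_news_summaries_reference.csv;
    note your corpus version). Re-run with: make summarize.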


Task 6: The Integrated Evaluation Report — the M7 Deliverable

This is the central artifact. The integrated-evaluation-report.md synthesizes three weeks of measurement:

  1. your Lab 7A fine-tuned classifier results (metrics.json),
  2. your Lab 7B QA results (qa_metrics.json), and
  3. this task's summarization results (summary_metrics.json).

Manually paste the relevant numbers from each metrics file into the report’s comparison table. The TA verifies that the report’s numbers match your submitted metrics files.

The report has six required sections. The starter includes a template integrated-evaluation-report.md with these section headers — fill them in.

Section 1: Comparison Table

| Task | Approach | Model | Training cost | Inference cost | Quality metric | Value |
| --- | --- | --- | --- | --- | --- | --- |
| Sentiment classification (Lab 7A) | Fine-tuning | DistilBERT | ~30 min CPU + 3K labels | ~50 ms / example | Macro-F1 | (from your metrics.json) |
| Domain transfer of fine-tuned classifier (Integration 7A) | Fine-tuned model on out-of-domain data | (same) | (already trained) | ~50 ms / example | Domain-shift judgment | (qualitative; from Integration 7A) |
| Extractive QA (Lab 7B) | Pre-trained inference | distilbert-base-cased-distilled-squad | 0 | ~50 ms / example | EM / token-F1 | (from your qa_metrics.json) |
| Abstractive summarization (Integration 7B) | Pre-trained inference | distilbart-cnn-6-6 | 0 | ~3 sec / example | ROUGE-1/2/L F1 | (from your summary_metrics.json) |

Section 2: Findings

3–5 bullet points characterizing what each approach excels at and where it breaks. Examples of the kind of finding to write:

  1. Where the fine-tuned classifier's training cost (3K labels, ~30 min CPU) paid off in-domain, and how much of that advantage survived the Integration 7A domain shift.
  2. Where zero-training-cost inference was good enough, and where its latency (~3 sec / example for summarization vs. ~50 ms elsewhere) would bite.

Make the bullets specific to your numbers.

Section 3: Faithfulness Check (qualitative)

Pick three summaries from summary_predictions.csv (one high-ROUGE, one mid-ROUGE, one low-ROUGE). For each:

  1. Read the source article alongside the predicted summary.
  2. Judge faithfulness: does the summary assert anything the article does not support (a hallucination)?
  3. State whether the ROUGE score agrees with your judgment, and why.

This is the section that puts the limits of ROUGE on the page concretely. A high-ROUGE summary that hallucinates is the canonical example to call out — if you find one, this section is gold.
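A quick way to pick the three rows (assumes per-row rouge1, article_id, and predicted_summary columns in your CSV, matching the harness sketch above):

import pandas as pd

df = pd.read_csv("summary_predictions.csv")
ranked = df.sort_values("rouge1")            # low scores first
picks = ranked.iloc[[0, len(df) // 2, -1]]   # low-, mid-, and high-ROUGE rows
print(picks[["article_id", "predicted_summary"]])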

Section 4: Production Decision Matrix

For each of three hypothetical production scenarios, recommend either fine-tuning or pre-trained inference, with one specific sentence justifying the recommendation grounded in your measured numbers.

| Scenario | Recommendation | Justification (1 sentence, tied to your numbers) |
| --- | --- | --- |
| Real-time app-review sentiment dashboard for a trading desk | (your call) | (your justification) |
| Internal tech / entertainment news summary digest for a newsroom team | (your call) | (your justification) |
| Domain-expert QA on legal contracts | (your call) | (your justification) |

A weak version of this section says “it depends” everywhere. A strong version specifies what makes it depend — labeled-data availability, latency budget, faithfulness requirement.

Section 5: What You Would Do Differently

One paragraph on what you would change about your approach if you had a labeled summarization dataset for the tech / entertainment news domain. Concrete: would you fine-tune a summarizer? Train a calibration model on top? Invest in a faithfulness audit? The goal is to articulate one engineering investment that would meaningfully change the numbers.

Section 6: Limits of the Evaluation

One paragraph on what these numbers do not tell you. Examples of what to address:

  1. ROUGE measures n-gram overlap, not faithfulness; Section 3 exists because the two can diverge.
  2. 120 articles from a single domain say little about other domains or article lengths.
  3. Per-example latencies were measured on CPU with cached weights, not under production load.

Don’t list every possible limit. Pick the one or two that matter most for the production scenarios you discussed in Section 4.


Submitting

When everything is done:

  1. Stage and commit:
    git add summarize.py Makefile README.md summary_metrics.json summary_predictions.csv integrated-evaluation-report.md
    git commit -m "Complete integration 7B"
    
  2. Push to the remote repository.
  3. Open a Pull Request from your working branch into main. The autograder runs make smoke against a tiny fixture (3 articles + same reference file) using a substitute model — this verifies your pipeline is structurally correct end-to-end. It does not re-run your full 120-article evaluation.
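The substitute model reaches your code through the SUMM_MODEL_FOR_CI environment variable that get_summarizer_model_name() reads, so you can exercise the same path locally before opening the PR (the model id is a placeholder; any small summarization checkpoint works):

    SUMM_MODEL_FOR_CI=<small-model-id> make smoke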

Your PR description must include:

  1. Aggregate ROUGE-1, ROUGE-2, ROUGE-L F1 (two decimals each).
  2. A one-sentence summary of your production decision matrix’s central recommendation (e.g., “Pre-trained inference is sufficient for low-stakes summarization; fine-tuning is justified for the trading desk classifier where labeled data exists.”).

Finally, paste your PR URL into TalentLMS → Module 7 → Integration 7B to submit the assignment.

What’s Next