Module 7 Week B — Integration Task: Summarization & Integrated Evaluation Report
In this integration you will apply a pre-trained encoder–decoder summarization model to 120 tech / entertainment news articles, compute ROUGE (Recall-Oriented Understudy for Gisting Evaluation) against the reference summaries that ship with the dataset (data/tech_news_summaries_reference.csv in your repo), and then write the integrated evaluation report — the synthesis that combines your Lab 7A fine-tuned classifier results, your Lab 7B QA (Question-Answering) results, and this task’s summarization results into a single comparison report. The integrated evaluation report is the Module 7 deliverable.
Due: Thursday end of day. Resubmissions are accepted through Saturday of the assignment week.
What You Will Produce
Committed to your repo:
- `summarize.py` — your implementation of the summarization pipeline.
- `Makefile` — provided in the starter; you use `make summarize`, `make smoke`, and `make clean`.
- Updated `README.md` — 1–2 paragraphs documenting the model id, the tech news corpus version, and the re-run command.
- `summary_predictions.csv` — 120 rows with reference summary, predicted summary, and per-summary ROUGE.
- `summary_metrics.json` — aggregate ROUGE-1, ROUGE-2, ROUGE-L F1 with `n` and the model id.
- `integrated-evaluation-report.md` — the six-section integrated report (the M7 deliverable).
The summarization model is loaded from Hugging Face Hub at runtime. No model file in your repo.
Setup
- Accept the assignment (link added at Step 9c — Classroom setup)
- Clone your repo:
  `git clone <your-repo-url>`
  `cd <your-repo-name>`
- Install dependencies:
  `pip install -r requirements.txt`
  This installs `transformers`, `torch` (CPU build), `pandas`, and `rouge-score`. The first three are already installed; `rouge-score` is the same package you installed for the drill (this is a separate repo, so a fresh `pip install` is needed).
- Create your working branch:
  `git checkout -b integration-7b-summary-eval`
First run:
`summarize.py` constructs a summarization pipeline, which downloads ~250 MB of model weights on first run. Plan ~3 minutes for the first run; subsequent runs use cached weights. The full evaluation on 120 articles completes in ~6–8 minutes on CPU after the model is cached.
What You’re Building
summarize.py runs the full summarization evaluation end-to-end:
get_summarizer_model_name() # provided helper
build_summarizer(...) # TODO (same as drill)
summarize_one(...) # TODO (same as drill, with do_sample=False, num_beams=4)
compute_rouge(...) # TODO (same as drill)
evaluate_summaries(...) # TODO (new — orchestrates the harness)
main() # TODO (orchestration)
The starter summarize.py includes get_summarizer_model_name(), which returns os.environ.get("SUMM_MODEL_FOR_CI") if set, otherwise "sshleifer/distilbart-cnn-6-6". Don’t modify it — the autograder’s smoke test depends on it.
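For reference, the provided helper behaves roughly like the sketch below (it already ships in the starter, so there is nothing to write here — this is just to show what the smoke test relies on):

```python
import os

def get_summarizer_model_name():
    # Starter-provided behavior as described above: honor the CI override
    # when the environment variable is set, otherwise use the default checkpoint.
    return os.environ.get("SUMM_MODEL_FOR_CI", "sshleifer/distilbart-cnn-6-6")
```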
Task 1: Pipeline + Single-Document Summarization
Implement build_summarizer(model_name) and summarize_one(summ, text, max_length=120, min_length=30) exactly as in the drill. Use do_sample=False and num_beams=4 inside summarize_one.
These are the foundation. The autograder verifies them before checking the harness.
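A minimal sketch of what the drill-style implementations can look like, assuming the standard `transformers` pipeline API (your drill code may differ in details; the `truncation=True` flag is an assumption, not part of the task text):

```python
from transformers import pipeline

def build_summarizer(model_name):
    # Construct a Hugging Face summarization pipeline for the given model id.
    # Weights are downloaded from the Hub on first use, then cached locally.
    return pipeline("summarization", model=model_name)

def summarize_one(summ, text, max_length=120, min_length=30):
    # Deterministic decoding as the task specifies: no sampling, 4-beam search.
    # truncation=True (an assumption) guards against articles longer than the
    # model's input limit.
    outputs = summ(
        text,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
        num_beams=4,
        truncation=True,
    )
    return outputs[0]["summary_text"]
```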
Task 2: ROUGE Scoring
Implement compute_rouge(pred, ref) exactly as in the drill — return {"rouge1": float, "rouge2": float, "rougeL": float} with F1 measures, using a RougeScorer with use_stemmer=True. Remember the argument order: scorer.score(reference, predicted).
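A sketch consistent with the description above, using the `rouge-score` package installed in Setup:

```python
from rouge_score import rouge_scorer

def compute_rouge(pred, ref):
    # F1 measures for ROUGE-1/2/L, with stemming enabled.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(ref, pred)  # argument order: score(reference, predicted)
    return {
        "rouge1": scores["rouge1"].fmeasure,
        "rouge2": scores["rouge2"].fmeasure,
        "rougeL": scores["rougeL"].fmeasure,
    }
```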
Task 3: Evaluate Over the Corpus
Implement evaluate_summaries(summ, articles_df, refs_df):
- `articles_df` has columns `article_id`, `text` (the tech news articles).
- `refs_df` has columns `article_id`, `reference_summary`.
- Join on `article_id`; for each pair, summarize the article and compute ROUGE against the reference.
- Return a dict:
{
    "rouge1": float,   # mean ROUGE-1 F1
    "rouge2": float,   # mean ROUGE-2 F1
    "rougeL": float,   # mean ROUGE-L F1
    "n": int,
    "predictions": [
        {
            "article_id": str,
            "reference_summary": str,
            "predicted_summary": str,
            "rouge1": float,
            "rouge2": float,
            "rougeL": float,
        },
        ...
    ],
}
def evaluate_summaries(summ, articles_df, refs_df) -> dict:
"""
Summarize each article and score against its reference.
Returns aggregate ROUGE-1/2/L F1 plus per-article predictions.
"""
# TODO: join the two DataFrames on article_id
# TODO: iterate, summarize, compute ROUGE
# TODO: aggregate and return
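One possible shape for the body, assuming the `summarize_one` and `compute_rouge` helpers above and pandas DataFrames with the columns listed in this task:

```python
def evaluate_summaries(summ, articles_df, refs_df) -> dict:
    # Join articles to their references on article_id.
    merged = articles_df.merge(refs_df, on="article_id")

    predictions = []
    for row in merged.itertuples(index=False):
        predicted = summarize_one(summ, row.text)
        scores = compute_rouge(predicted, row.reference_summary)
        predictions.append({
            "article_id": str(row.article_id),
            "reference_summary": row.reference_summary,
            "predicted_summary": predicted,
            **scores,
        })

    n = len(predictions)
    return {
        "rouge1": sum(p["rouge1"] for p in predictions) / n,
        "rouge2": sum(p["rouge2"] for p in predictions) / n,
        "rougeL": sum(p["rougeL"] for p in predictions) / n,
        "n": n,
        "predictions": predictions,
    }
```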
Task 4: Orchestrate
Implement main():
- Read input paths from environment variables: `ARTICLES_PATH` (default `data/tech_news_articles.csv`), `REFERENCES_PATH` (default `data/tech_news_summaries_reference.csv`), `OUTPUT_PATH` (default `summary_predictions.csv`).
- Load both CSVs into DataFrames.
- Build the summarizer via `build_summarizer(get_summarizer_model_name())`.
- Run `evaluate_summaries`.
- Write `summary_predictions.csv` (columns: `article_id`, `reference_summary`, `predicted_summary`, `rouge1`, `rouge2`, `rougeL`).
- Write `summary_metrics.json` (`rouge1`, `rouge2`, `rougeL`, `n`, `model`).
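Under those assumptions, `main()` can be a thin orchestration layer like the sketch below (the `summary_metrics.json` filename comes from the deliverables list; everything else follows the steps above):

```python
import json
import os

import pandas as pd

def main():
    # Input/output paths come from environment variables, with the defaults named above.
    articles_path = os.environ.get("ARTICLES_PATH", "data/tech_news_articles.csv")
    references_path = os.environ.get("REFERENCES_PATH", "data/tech_news_summaries_reference.csv")
    output_path = os.environ.get("OUTPUT_PATH", "summary_predictions.csv")

    articles_df = pd.read_csv(articles_path)
    refs_df = pd.read_csv(references_path)

    model_name = get_summarizer_model_name()
    summ = build_summarizer(model_name)
    results = evaluate_summaries(summ, articles_df, refs_df)

    # Per-article predictions (columns follow from the dict keys in evaluate_summaries).
    pd.DataFrame(results["predictions"]).to_csv(output_path, index=False)

    # Aggregate metrics plus the model id.
    metrics = {
        "rouge1": results["rouge1"],
        "rouge2": results["rouge2"],
        "rougeL": results["rougeL"],
        "n": results["n"],
        "model": model_name,
    }
    with open("summary_metrics.json", "w") as f:
        json.dump(metrics, f, indent=2)

if __name__ == "__main__":
    main()
```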
Then run:
make summarize
This executes `python summarize.py` against the full 120-article evaluation set; expect ~6–8 minutes on CPU once the model weights are cached.
Task 5: Update the README
Edit README.md to add 1–2 paragraphs documenting:
- The summarization model name (default `sshleifer/distilbart-cnn-6-6`) and a one-line description of what it is.
- The corpus version (120 tech news articles from M6) and the reference summaries file.
- The re-run command (`make summarize`).
This is the minimum reproducibility requirement: anyone reading your repo should be able to re-run your evaluation.
Task 6: The Integrated Evaluation Report — the M7 Deliverable
This is the central artifact. The integrated-evaluation-report.md synthesizes three weeks of measurement:
- Lab 7A — your fine-tuned DistilBERT classifier on AARSynth app reviews. Use the macro-F1 from your `metrics.json`.
- Integration 7A — the same fine-tuned classifier applied to the tech news corpus (domain shift; included for engineering context).
- Lab 7B — pre-trained QA on the curated tech-news QA set. Use EM and token-F1 from your `qa_metrics.json`.
- Integration 7B (this task) — pre-trained summarization on the tech news corpus. Use ROUGE from your `summary_metrics.json`.
Manually paste the relevant numbers from each metrics file into the report’s comparison table. The TA verifies that the report’s numbers match your submitted metrics files.
The report has six required sections. The starter includes a template integrated-evaluation-report.md with these section headers — fill them in.
Section 1: Comparison Table
| Task | Approach | Model | Training cost | Inference cost | Quality metric | Value |
|---|---|---|---|---|---|---|
| Sentiment classification (Lab 7A) | Fine-tuning | DistilBERT | ~30 min CPU + 3K labels | ~50 ms / example | Macro-F1 | (from your metrics.json) |
| Domain transfer of fine-tuned classifier (Integration 7A) | Fine-tuned model on out-of-domain | (same) | (already trained) | ~50 ms / example | Domain-shift judgment | (qualitative — from Integration 7A) |
| Extractive QA (Lab 7B) | Pre-trained inference | distilbert-base-cased-distilled-squad | 0 | ~50 ms / example | EM / token-F1 | (from your qa_metrics.json) |
| Abstractive summarization (Integration 7B) | Pre-trained inference | distilbart-cnn-6-6 | 0 | ~3 sec / example | ROUGE-1/2/L F1 | (from your summary_metrics.json) |
Section 2: Findings
3–5 bullet points characterizing what each approach excels at and where it breaks. Examples of the kind of finding to write:
- “Fine-tuning produced a 0.87 macro-F1 on app-review sentiment but suffered visible domain-shift breakdown on tech news articles (Integration 7A).”
- “Pre-trained QA achieved [your EM] / [your F1] on the curated tech-news QA set; [your failure-mode insight from Lab 7B].”
Make the bullets specific to your numbers.
Section 3: Faithfulness Check (qualitative)
Pick three summaries from summary_predictions.csv (one high-ROUGE, one mid-ROUGE, one low-ROUGE); a quick way to select them is sketched after this list. For each:
- Quote the article and the predicted summary.
- Mark whether the summary is faithful (every claim appears in the article).
- Comment on what ROUGE caught or missed for this summary.
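A quick, hypothetical way to pick the three candidates (sorting on ROUGE-L here is a choice, not a requirement — any of the three ROUGE columns works):

```python
import pandas as pd

# Rank your per-article predictions by ROUGE-L and take the extremes plus the middle.
preds = pd.read_csv("summary_predictions.csv")
ranked = preds.sort_values("rougeL").reset_index(drop=True)

low = ranked.iloc[0]                 # lowest ROUGE-L
mid = ranked.iloc[len(ranked) // 2]  # roughly median ROUGE-L
high = ranked.iloc[-1]               # highest ROUGE-L

for label, row in [("high", high), ("mid", mid), ("low", low)]:
    print(label, row["article_id"], round(row["rougeL"], 3))
```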
This is the section that puts the limits of ROUGE on the page concretely. A high-ROUGE summary that hallucinates is the canonical example to call out — if you find one, this section is gold.
Section 4: Production Decision Matrix
For each of three hypothetical production scenarios, recommend either fine-tuning or pre-trained inference, with one specific sentence justifying the recommendation grounded in your measured numbers.
| Scenario | Recommendation | Justification (1 sentence, tied to your numbers) |
|---|---|---|
| Real-time app-review sentiment dashboard for trading desk | (your call) | (your justification) |
| Internal tech / entertainment news summary digest for a newsroom team | (your call) | (your justification) |
| Domain-expert QA on legal contracts | (your call) | (your justification) |
A weak version of this section says “it depends” everywhere. A strong version specifies what makes it depend — labeled-data availability, latency budget, faithfulness requirement.
Section 5: What You Would Do Differently
One paragraph on what you would change about your approach if you had a labeled summarization dataset for the tech / entertainment news domain. Concrete: would you fine-tune a summarizer? Train a calibration model on top? Invest in a faithfulness audit? The goal is to articulate one engineering investment that would meaningfully change the numbers.
Section 6: Limits of the Evaluation
One paragraph on what these numbers do not tell you. Examples of what to address:
- ROUGE doesn’t capture faithfulness — your faithfulness check (Section 3) addresses some of this, but only on three summaries.
- EM / token-F1 don’t capture calibration — a model can be 80% accurate and 100% confident on every output.
- Latency under load isn’t measured — these are single-request numbers.
Don’t list every possible limit. Pick the one or two that matter most for the production scenarios you discussed in Section 4.
Submitting
When everything is done:
- Stage and commit:
  `git add summarize.py Makefile README.md summary_metrics.json summary_predictions.csv integrated-evaluation-report.md`
  `git commit -m "Complete integration 7B"`
- Push to the remote repository.
- Open a Pull Request from your working branch into `main`. The autograder runs `make smoke` against a tiny fixture (3 articles + same reference file) using a substitute model — this verifies your pipeline is structurally correct end-to-end. It does not re-run your full 120-article evaluation.
Your PR description must include:
- Aggregate ROUGE-1, ROUGE-2, ROUGE-L F1 (two decimals each).
- One-sentence summary of your production decision matrix’s central recommendation (e.g., “Pre-trained inference is sufficient for low-stakes summarization; fine-tuning is justified for the trading desk classifier where labeled data exists.”).
Finally, paste your PR URL into TalentLMS → Module 7 → Integration 7B to submit this assignment.
What’s Next
- Tomorrow (Thursday) — Honors Track only: Stretch — Summarize-then-QA two-step pipeline (extends this integration on branch `stretch-thu-summarize-then-qa`).
- Today (Thursday EOD): Reflection due. Peer review on Lab 7B due. Readiness Check 2 due.
- Module 8 (T11, Sun 31 May 2026): RAG. The QA pipeline you built in Lab 7B becomes the generator side of a retrieval-augmented system. EID-2 (Sun 24 May 2026) is the buffer week between Module 7 and Module 8.