Introduction#
In this project, we built a small AI-driven application that can provide an initial, guidance-oriented assessment of a student internship report. The purpose was not to create an automatic grading system that delivers final verdicts, but to explore how an external LLM API can become part of a concrete data flow inside an application.
We were given three kinds of material: learning objectives, report requirements, and the description of Dare-Share-Care. Based on that, we had to derive our own rubric, design prompts, build a backend call to a language model, and return a structured response. We also chose to build a simple frontend in React, so the user can either paste text directly or upload a .md or .txt file.
We used a coding agent as part of the development process. That helped us move faster from idea to working prototype, but it did not remove the need for our own decisions. We still had to decide on the rubric, prompt design, output format, and what the system should actually be used for. The coding agent was especially helpful for boilerplate, structure, and fast iteration, while our main contribution was defining the requirements and evaluating the quality of the solution.
Problem and Goal#
The goal of the solution was to build an application that can:
- receive a report text
- apply a rubric derived from the assessment material
- send a prompt to an LLM via API
- receive a response in a structured format
- return feedback that can be used as a starting point for further dialogue
It was important to us that the output should be presented as guidance-oriented feedback, not as a final grade or official assessment.
Our Rubric#
We derived the rubric directly from the three source documents. Instead of trying to assess “everything,” we chose five criteria that covered the central requirements in the material:
- Company context and daily practice: Does the report describe the company, the work context, and the student’s insight into day-to-day operations?
- Tasks, methods, and technical work: Does the report explain concrete tasks, methods, technologies, and technical choices?
- Learning goals and theory-practice link: Does the report clearly show how learning objectives and theory from the education were connected to practice?
- Reflection and personal development: Does the report contain real reflection on development, challenges, and learning?
- Value creation and Dare-Share-Care: Does the report show what value the student created, and how Dare, Share, and Care are demonstrated?
For each criterion, we described what low, medium, and high goal fulfillment looks like. We also assigned weights, so that technical work, learning goals, and reflection carried the greatest importance.
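To give an impression of how this looked in practice, the rubric can be represented as a small data structure that the backend loads and embeds in every request. The sketch below is illustrative only: the weight and the level descriptions are placeholders, not our exact wording.

// rubric.ts — illustrative sketch of the rubric structure (weights and level texts are placeholders)
export interface RubricCriterion {
  id: string;
  title: string;
  question: string;
  weight: number; // relative importance used for the weighted score
  levels: { low: string; medium: string; high: string };
}

export const rubric: RubricCriterion[] = [
  {
    id: "learning_goals_and_theory",
    title: "Learning goals and theory-practice link",
    question: "Does the report clearly show how learning objectives and theory were connected to practice?",
    weight: 0.25, // placeholder value
    levels: {
      low: "Learning objectives are barely mentioned.",
      medium: "Some objectives are linked to practice, others only implicitly.",
      high: "Each objective is explicitly connected to concrete work and reflection."
    }
  }
  // ...the remaining four criteria follow the same shape
];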
Prompt Design#
We worked with two prompts: a system prompt and a user prompt.
System Prompt#
The system prompt defined the model as an academic assistant that should provide an initial assessment of an internship report. It instructed the model to:
- only use the rubric and the provided report text
- not act as an official examiner
- make uncertainty explicit instead of guessing
- return the answer as structured JSON
- keep a constructive and concrete tone
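Paraphrased as code, the system prompt can be kept as a single constant. This is a sketch of the instructions listed above, not our exact wording:

// systemPrompt.ts — paraphrased sketch of the system prompt
export const SYSTEM_PROMPT = [
  "You are an academic assistant giving an initial, guidance-oriented assessment of a student internship report.",
  "Base the assessment only on the rubric and the report text provided in the user message.",
  "You are not an official examiner and must not present the result as a final grade.",
  "When evidence is missing or unclear, state the uncertainty explicitly instead of guessing.",
  "Return the answer strictly as JSON in the requested structure.",
  "Keep the tone constructive and concrete."
].join("\n");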
User Prompt#
The user prompt contained:
- the rubric in structured form
- instructions on how the criteria should be applied
- the report text itself
What worked well was that the rubric was included with every request, because the model then had the criteria explicitly available instead of relying on an implicit understanding.
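Assembled per request, the user prompt might look roughly like this (a sketch that reuses the rubric structure from earlier; the function name and exact wording are illustrative):

// buildUserPrompt.ts — illustrative assembly of the user prompt
import { rubric } from "./rubric";

export function buildUserPrompt(reportText: string): string {
  return [
    "RUBRIC (apply every criterion, use the level descriptions, respect the weights):",
    JSON.stringify(rubric, null, 2),
    "",
    "INSTRUCTIONS: Assess the report against each criterion, quote or paraphrase evidence,",
    "and return one entry per criterion plus strengths, weaknesses, improvement suggestions,",
    "dialogue questions, uncertainties, and a disclaimer.",
    "",
    "REPORT:",
    reportText
  ].join("\n");
}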
Endpoint Design#
We built a small backend with one main endpoint:
POST /api/evaluations
Here, the client sends the report text, and the backend:
- builds the system prompt and user prompt
- sends the request to the OpenAI Responses API
- receives a structured JSON response
- returns the result to the frontend or client
We also added:
GET /health
GET /api/rubric
This made the solution more transparent and easier to test.
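A minimal sketch of how such an endpoint can be wired up, assuming a Node/Express backend and the official openai SDK (simplified; error handling and the response envelope are reduced to the essentials):

// server.ts — minimal sketch of the evaluation endpoint (Express + openai SDK assumed)
import express from "express";
import OpenAI from "openai";
import { rubric } from "./rubric";
import { SYSTEM_PROMPT } from "./systemPrompt";
import { buildUserPrompt } from "./buildUserPrompt";

const app = express();
app.use(express.json({ limit: "1mb" }));
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/api/evaluations", async (req, res) => {
  const { reportText, model = "gpt-4.1-mini" } = req.body ?? {};
  if (!reportText) {
    return res.status(400).json({ error: "reportText is required" });
  }
  try {
    const response = await client.responses.create({
      model,
      input: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user", content: buildUserPrompt(reportText) }
      ]
    });
    // The model is instructed to answer with JSON only, so the text output is parsed directly.
    const evaluation = JSON.parse(response.output_text);
    res.json({ model, generatedAt: new Date().toISOString(), evaluation });
  } catch (err) {
    res.status(502).json({ error: "Evaluation failed", detail: String(err) });
  }
});

app.get("/health", (_req, res) => res.json({ status: "ok" }));
app.get("/api/rubric", (_req, res) => res.json(rubric));

app.listen(3000);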
Example Request and Response#
Request#
{
  "reportText": "insert report text here",
  "model": "gpt-4.1-mini"
}
Response#
{
  "rubricTitle": "Internship report evaluation rubric",
  "model": "gpt-4.1-mini",
  "generatedAt": "2026-04-27T12:00:00.000Z",
  "weightedScore": 3.8,
  "evaluation": {
    "overallLevel": "medium",
    "overallSummary": "The report is generally well written and concrete, but some parts could be more strongly linked to the learning objectives.",
    "criteria": [
      {
        "id": "learning_goals_and_theory",
        "title": "Learning goals and theory-practice link",
        "level": "medium",
        "score": 4,
        "justification": "The report connects several experiences to the education, but not all learning objectives are covered equally clearly.",
        "evidence": [
          "Describes the use of agile working methods",
          "Connects tasks to theories from the education"
        ],
        "improvementSuggestion": "Make it clearer how each learning objective was specifically fulfilled."
      }
    ],
    "strengths": [
      "Concrete descriptions of tasks",
      "Good technical insight",
      "Clear personal reflection"
    ],
    "weaknesses": [
      "Uneven coverage of learning objectives",
      "Some judgments require more explicit documentation"
    ],
    "improvementSuggestions": [
      "Structure the report more clearly around the learning objectives",
      "Add more concrete examples of value creation"
    ],
    "dialogQuestions": [
      "Which learning objective do you think you fulfilled best?",
      "Where did you experience the greatest professional development?"
    ],
    "uncertainties": [
      "It is unclear whether all learning objectives are explicitly covered"
    ],
    "disclaimer": "This is a guidance-oriented AI-based assessment and not a final grading."
  }
}
Frontend#
As mentioned in the introduction, we built a simple frontend in React. Here, the user can:
- paste report text directly into a text field
- upload a markdown or text file
- send the text to the API
- view the structured assessment in a more readable layout
This made the project more complete, because it clearly showed the full flow from input to AI-generated feedback.
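As a sketch of that flow, a single React component can cover all four steps; the component and field names below are illustrative, not the actual implementation:

// ReportEvaluator.tsx — illustrative sketch of the frontend flow
import { useState } from "react";

export function ReportEvaluator() {
  const [reportText, setReportText] = useState("");
  const [result, setResult] = useState<unknown>(null);

  // Read an uploaded .md or .txt file into the text field.
  async function handleFile(file: File) {
    setReportText(await file.text());
  }

  // Send the text to the backend and keep the structured assessment in state.
  async function submit() {
    const res = await fetch("/api/evaluations", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ reportText })
    });
    setResult(await res.json());
  }

  return (
    <div>
      <textarea value={reportText} onChange={(e) => setReportText(e.target.value)} />
      <input type="file" accept=".md,.txt" onChange={(e) => e.target.files && handleFile(e.target.files[0])} />
      <button onClick={submit}>Evaluate</button>
      {result !== null && <pre>{JSON.stringify(result, null, 2)}</pre>}
    </div>
  );
}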
What Worked Well#
What worked best was the combination of a clear rubric and structured output. When the criteria were precise, the responses also became more useful. It especially helped that the model was asked to return feedback per criterion instead of only one large block of text.
We also found value in asking the model to highlight:
- strengths
- weaknesses
- improvement suggestions
- questions for further dialogue
That made the output more useful in an educational context.
What Worked Less Well#
The biggest challenge was that quality still depends heavily on how clearly the report is written. If a student only shows a learning objective implicitly, the model may overlook it or be uncertain about it. That means the AI assessment is not necessarily “wrong,” but it can become uneven if the input is unclear.
Another challenge was that the model can sometimes sound more confident than it should. Because of that, it was important for us to make uncertainty an explicit part of the output format and to continuously emphasize that the result is guidance-oriented.
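Concretely, making uncertainty explicit meant reserving dedicated fields for it in the expected response shape, as in the example response above. Typed out as a sketch:

// evaluation.ts — sketch of the response shape, with uncertainty as a first-class field
export interface EvaluationResult {
  overallLevel: "low" | "medium" | "high";
  overallSummary: string;
  criteria: Array<{
    id: string;
    title: string;
    level: "low" | "medium" | "high";
    score: number;
    justification: string;
    evidence: string[];
    improvementSuggestion: string;
  }>;
  strengths: string[];
  weaknesses: string[];
  improvementSuggestions: string[];
  dialogQuestions: string[];
  uncertainties: string[]; // the model must list what it could not verify in the report
  disclaimer: string; // always present: guidance-oriented, not a final grading
}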
Reflection on Using a Coding Agent#
We used a coding agent during the development process, and it is important to state that openly. It helped us set up the backend, frontend, and file structure quickly, and it made it easier to iterate on the implementation.
At the same time, the project also showed the limitation of such a tool: the agent can help write code, but it cannot decide what makes a good rubric, what is pedagogically responsible, or how an AI assessment should best be presented in an educational context. Those choices still required human judgment.
For us, the coding agent became mainly a productivity tool, not a replacement for design decisions or reflection.
What We Would Improve in the Next Version#
If we were to continue developing the solution, we would like to:
- add the ability to choose between multiple rubrics
- improve error messages for API timeouts or invalid API keys
- display the rubric directly in the frontend
- store previous assessments
- compare output from multiple models or prompts
- make the assessment more traceable by showing quotes or text excerpts from the report behind each judgment
We could also imagine a version where the teacher can adjust criteria and weights without changing the code.
Conclusion#
The project showed that it is relatively easy to get a language model to return an answer, but much harder to design a solution where the answer is actually useful. The most important part of the work was therefore not the API call itself, but the translation from assessment material into rubric, prompt, and structured feedback.
The result was a small, functional prototype that demonstrates how an LLM can be used as part of a larger data flow inside an application. At the same time, it became clear that AI in this context works best as support for reflection and dialogue, not as an automatic grader.
