Introduction#
In this project, we built a small AI-driven application that can provide an initial, guidance-oriented assessment of a student internship report. The purpose was not to create an automatic grading system that delivers final verdicts, but to explore how an external LLM API can become part of a concrete data flow inside an application.
We were given three kinds of material: learning objectives, report requirements, and the description of Dare-Share-Care. Based on that, we had to derive our own rubric, design prompts, build a backend call to a language model, and return a structured response. We also chose to build a simple frontend in React, so the user can either paste text directly or upload a .md or .txt file.
We used a coding agent as part of the development process. That helped us move faster from idea to working prototype, but it did not remove the need for our own decisions. We still had to decide on the rubric, prompt design, output format, and what the system should actually be used for. The coding agent was especially helpful for boilerplate, structure, and fast iteration, while our main contribution was defining the requirements and evaluating the quality of the solution.
Problem and Goal#
The goal of the solution was to build an application that can:
- receive a report text
- apply a rubric derived from the assessment material
- send a prompt to an LLM via API
- receive a response in a structured format
- return feedback that can be used as a starting point for further dialogue
It was important to us that the output should be presented as guidance-oriented feedback, not as a final grade or official assessment.
Our Rubric#
We derived the rubric directly from the three source documents. Instead of trying to assess “everything,” we chose five criteria that covered the central requirements in the material:
- Company context and daily practice: Does the report describe the company, the work context, and the student’s insight into day-to-day operations?
- Tasks, methods, and technical work: Does the report explain concrete tasks, methods, technologies, and technical choices?
- Learning goals and theory-practice link: Does the report clearly show how learning objectives and theory from the education were connected to practice?
- Reflection and personal development: Does the report contain real reflection on development, challenges, and learning?
- Value creation and Dare-Share-Care: Does the report show what value the student created, and how Dare, Share, and Care are demonstrated?
For each criterion, we described what low, medium, and high goal fulfillment looks like. We also assigned weights, so that technical work, learning goals, and reflection carried the greatest importance.
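To give an impression of how this looked in practice, the rubric can be represented as a small data structure that the backend loads and embeds in every request. The sketch below is illustrative only: the weight and the level descriptions are placeholders, not our exact wording.

// rubric.ts — illustrative sketch of the rubric structure (weights and level texts are placeholders)
export interface RubricCriterion {
  id: string;
  title: string;
  question: string;
  weight: number; // relative importance used for the weighted score
  levels: { low: string; medium: string; high: string };
}

export const rubric: RubricCriterion[] = [
  {
    id: "learning_goals_and_theory",
    title: "Learning goals and theory-practice link",
    question: "Does the report clearly show how learning objectives and theory were connected to practice?",
    weight: 0.25, // placeholder value
    levels: {
      low: "Learning objectives are barely mentioned.",
      medium: "Some objectives are linked to practice, others only implicitly.",
      high: "Each objective is explicitly connected to concrete work and reflection."
    }
  }
  // ...the remaining four criteria follow the same shape
];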
Prompt Design#
We worked with two prompts: a system prompt and a user prompt.
System Prompt#
The system prompt defined the model as an academic assistant that should provide an initial assessment of an internship report. It instructed the model to:
- only use the rubric and the provided report text
- not act as an official examiner
- make uncertainty explicit instead of guessing
- return the answer as structured JSON
- keep a constructive and concrete tone
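Paraphrased as code, the system prompt can be kept as a single constant. This is a sketch of the instructions listed above, not our exact wording:

// systemPrompt.ts — paraphrased sketch of the system prompt
export const SYSTEM_PROMPT = [
  "You are an academic assistant giving an initial, guidance-oriented assessment of a student internship report.",
  "Base the assessment only on the rubric and the report text provided in the user message.",
  "You are not an official examiner and must not present the result as a final grade.",
  "When evidence is missing or unclear, state the uncertainty explicitly instead of guessing.",
  "Return the answer strictly as JSON in the requested structure.",
  "Keep the tone constructive and concrete."
].join("\n");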
User Prompt#
The user prompt contained:
- the rubric in structured form
- instructions on how the criteria should be applied
- the report text itself
What worked well was that the rubric was included with every request, because the model then had the criteria explicitly available instead of relying on an implicit understanding.
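Assembled per request, the user prompt might look roughly like this (a sketch that reuses the rubric structure from earlier; the function name and exact wording are illustrative):

// buildUserPrompt.ts — illustrative assembly of the user prompt
import { rubric } from "./rubric";

export function buildUserPrompt(reportText: string): string {
  return [
    "RUBRIC (apply every criterion, use the level descriptions, respect the weights):",
    JSON.stringify(rubric, null, 2),
    "",
    "INSTRUCTIONS: Assess the report against each criterion, quote or paraphrase evidence,",
    "and return one entry per criterion plus strengths, weaknesses, improvement suggestions,",
    "dialogue questions, uncertainties, and a disclaimer.",
    "",
    "REPORT:",
    reportText
  ].join("\n");
}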
Endpoint Design#
We built a small backend with one main endpoint:
POST /api/evaluations
Here, the client sends the report text, and the backend:
- builds the system prompt and user prompt
- sends the request to the OpenAI Responses API
- receives a structured JSON response
- returns the result to the frontend or client
We also added:
GET /health
GET /api/rubric
This made the solution more transparent and easier to test.
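A minimal sketch of how such an endpoint can be wired up, assuming a Node/Express backend and the official openai SDK (simplified; error handling and the response envelope are reduced to the essentials):

// server.ts — minimal sketch of the evaluation endpoint (Express + openai SDK assumed)
import express from "express";
import OpenAI from "openai";
import { rubric } from "./rubric";
import { SYSTEM_PROMPT } from "./systemPrompt";
import { buildUserPrompt } from "./buildUserPrompt";

const app = express();
app.use(express.json({ limit: "1mb" }));
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post("/api/evaluations", async (req, res) => {
  const { reportText, model = "gpt-4.1-mini" } = req.body ?? {};
  if (!reportText) {
    return res.status(400).json({ error: "reportText is required" });
  }
  try {
    const response = await client.responses.create({
      model,
      input: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user", content: buildUserPrompt(reportText) }
      ]
    });
    // The model is instructed to answer with JSON only, so the text output is parsed directly.
    const evaluation = JSON.parse(response.output_text);
    res.json({ model, generatedAt: new Date().toISOString(), evaluation });
  } catch (err) {
    res.status(502).json({ error: "Evaluation failed", detail: String(err) });
  }
});

app.get("/health", (_req, res) => res.json({ status: "ok" }));
app.get("/api/rubric", (_req, res) => res.json(rubric));

app.listen(3000);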
Example Request and Response#
Request#
{
  "reportText": "insert report text here",
  "model": "gpt-4.1-mini"
}
Response#
{
  "rubricTitle": "Internship report evaluation rubric",
  "model": "gpt-4.1-mini",
  "generatedAt": "2026-04-27T12:00:00.000Z",
  "weightedScore": 3.8,
  "evaluation": {
    "overallLevel": "medium",
    "overallSummary": "The report is generally well written and concrete, but some parts could be more strongly linked to the learning objectives.",
    "criteria": [
      {
        "id": "learning_goals_and_theory",
        "title": "Learning goals and theory-practice link",
        "level": "medium",
        "score": 4,
        "justification": "The report connects several experiences to the education, but not all learning objectives are covered equally clearly.",
        "evidence": [
          "Describes the use of agile working methods",
          "Connects tasks to theories from the education"
        ],
        "improvementSuggestion": "Make it clearer how each learning objective was specifically fulfilled."
      }
    ],
    "strengths": [
      "Concrete descriptions of tasks",
      "Good technical insight",
      "Clear personal reflection"
    ],
    "weaknesses": [
      "Uneven coverage of learning objectives",
      "Some judgments require more explicit documentation"
    ],
    "improvementSuggestions": [
      "Structure the report more clearly around the learning objectives",
      "Add more concrete examples of value creation"
    ],
    "dialogQuestions": [
      "Which learning objective do you think you fulfilled best?",
      "Where did you experience the greatest professional development?"
    ],
    "uncertainties": [
      "It is unclear whether all learning objectives are explicitly covered"
    ],
    "disclaimer": "This is a guidance-oriented AI-based assessment and not a final grading."
  }
}
Frontend#
As mentioned in the introduction, we built a simple frontend in React. Here, the user can:
- paste report text directly into a text field
- upload a markdown or text file
- send the text to the API
- view the structured assessment in a more readable layout
This made the project more complete, because it clearly showed the full flow from input to AI-generated feedback.
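As a sketch of that flow, a single React component can cover all four steps; the component and field names below are illustrative, not the actual implementation:

// ReportEvaluator.tsx — illustrative sketch of the frontend flow
import { useState } from "react";

export function ReportEvaluator() {
  const [reportText, setReportText] = useState("");
  const [result, setResult] = useState<unknown>(null);

  // Read an uploaded .md or .txt file into the text field.
  async function handleFile(file: File) {
    setReportText(await file.text());
  }

  // Send the text to the backend and keep the structured assessment in state.
  async function submit() {
    const res = await fetch("/api/evaluations", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ reportText })
    });
    setResult(await res.json());
  }

  return (
    <div>
      <textarea value={reportText} onChange={(e) => setReportText(e.target.value)} />
      <input type="file" accept=".md,.txt" onChange={(e) => e.target.files && handleFile(e.target.files[0])} />
      <button onClick={submit}>Evaluate</button>
      {result !== null && <pre>{JSON.stringify(result, null, 2)}</pre>}
    </div>
  );
}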
What Worked Well#
What worked best was the combination of a clear rubric and structured output. When the criteria were precise, the responses also became more useful. It especially helped that the model was asked to return feedback per criterion instead of only one large block of text.
We also found value in asking the model to highlight:
- strengths
- weaknesses
- improvement suggestions
- questions for further dialogue
That made the output more useful in an educational context.
What Worked Less Well#
The biggest challenge was that quality still depends heavily on how clearly the report is written. If a student only shows a learning objective implicitly, the model may overlook it or be uncertain about it. That means the AI assessment is not necessarily “wrong,” but it can become uneven if the input is unclear.
Another challenge was that the model can sometimes sound more confident than it should. Because of that, it was important for us to make uncertainty an explicit part of the output format and to continuously emphasize that the result is guidance-oriented.
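Concretely, making uncertainty explicit meant reserving dedicated fields for it in the expected response shape, as in the example response above. Typed out as a sketch:

// evaluation.ts — sketch of the response shape, with uncertainty as a first-class field
export interface EvaluationResult {
  overallLevel: "low" | "medium" | "high";
  overallSummary: string;
  criteria: Array<{
    id: string;
    title: string;
    level: "low" | "medium" | "high";
    score: number;
    justification: string;
    evidence: string[];
    improvementSuggestion: string;
  }>;
  strengths: string[];
  weaknesses: string[];
  improvementSuggestions: string[];
  dialogQuestions: string[];
  uncertainties: string[]; // the model must list what it could not verify in the report
  disclaimer: string; // always present: guidance-oriented, not a final grading
}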
Reflection on Using a Coding Agent#
We used a coding agent during the development process, and it is important to state that openly. It helped us set up the backend, frontend, and file structure quickly, and it made it easier to iterate on the implementation.
At the same time, the project also showed the limitation of such a tool: the agent can help write code, but it cannot decide what makes a good rubric, what is pedagogically responsible, or how an AI assessment should best be presented in an educational context. Those choices still required human judgment.
For us, the coding agent became mainly a productivity tool, not a replacement for design decisions or reflection.
What We Would Improve in the Next Version#
If we were to continue developing the solution, we would like to:
- add the ability to choose between multiple rubrics
- improve error messages for API timeouts or invalid API keys
- display the rubric directly in the frontend
- store previous assessments
- compare output from multiple models or prompts
- make the assessment more traceable by showing quotes or text excerpts from the report behind each judgment
We could also imagine a version where the teacher can adjust criteria and weights without changing the code.
Conclusion#
The project showed that it is relatively easy to get a language model to return an answer, but much harder to design a solution where the answer is actually useful. The most important part of the work was therefore not the API call itself, but the translation from assessment material into rubric, prompt, and structured feedback.
The result was a small, functional prototype that demonstrates how an LLM can be used as part of a larger data flow inside an application. At the same time, it became clear that AI in this context works best as support for reflection and dialogue, not as an automatic grader.
