European Portuguese NLP Research (2021–2026)

LLMs

Project	Year	Sizes available	What it is	What it advanced	arXiv paper	arXiv HTML	Hugging Face model / dataset	Hugging Face org	GitHub repo
Albertina PT family	2023	100M, 900M	DeBERTa-based Portuguese foundation encoder with pt-PT and pt-BR variants.	Established a strong open pt-PT encoder baseline and treated European Portuguese as a target variety in its own right.	2305.06721	HTML	Model; GLUE-PTPT dataset	PORTULAN	Not verified
Albertina PT family expansion	2024	100M, 900M, 1.5B	Expanded family of open Portuguese encoders at multiple sizes.	Turned Albertina into an ecosystem rather than a single model.	2403.01897	HTML	1.5B pt-PT model	PORTULAN	Not verified
Gervásio PT family	2024	7B, 8B, 70B	Fully open instruction-tuned decoder model for Portuguese, with pt-PT and pt-BR variants.	One of the earliest fully open Portuguese decoder models developed in Portugal.	2402.18766	HTML	7B pt-PT model	PORTULAN	Not verified
GlórIA	2024	1.3B, 2.7B	Open generative LLM for Portuguese with strong European Portuguese orientation.	Pushed pt-PT into the decoder / LLM era and introduced CALAME-PT.	2402.12969	HTML	Model; CALAME-PT dataset	NOVA-vision-language	rvlopes/GlorIA
MediAlbertina	2024	900M, 1.5B	Domain-adapted European Portuguese medical language model built on Albertina.	Brought pt-PT modeling into the medical domain.	Not verified	Not verified	HF model	portugueseNLP	Not verified
AMALIA	2026	9B	Fully open pt-PT-first LLM paired with native pt-PT evaluation.	Current flagship for European Portuguese LLM work.	2603.26511	HTML	Not verified	Not verified	AMALIA-LLM/AMALIA

Datasets and translation

Project	Year	What it is	What it advanced	arXiv paper	arXiv HTML	Hugging Face model / dataset	Hugging Face org	GitHub repo
From Brazilian Portuguese to European Portuguese	2024	pt-BR → pt-PT translation study with a manually curated gold test set.	Created reusable native evaluation material for an important pt-PT correction task.	2408.07457	HTML	Not verified	Not verified	Not verified
PtBrVarId	2025	Cross-domain dataset for distinguishing European and Brazilian Portuguese.	Improved the curation pipeline needed to separate pt-PT from pt-BR across datasets and models.	2502.14394	HTML	Dataset; Model	liaad	LIAAD/portuguese_vid
Tradutor / PTradutor	2025	Open European Portuguese translation model plus dedicated parallel dataset.	Made pt-BR → pt-PT translation an open, reproducible research problem.	2502.14385	HTML	PTradutor dataset	hugosousa	hmosousa/tradutor; hmosousa/ptradutor
CitiLink-Minutes	2026	Multilayer pt-PT dataset of municipal meeting minutes.	Opened a practical civic-language dataset for European Portuguese.	2602.12137	HTML	Representative HF model	inesctec	INESCTEC/citilink-dataset
CitiLink-Summ	2026	pt-PT summarization dataset for discussion subjects in municipal meeting minutes.	Made summarization on real public-administration text possible in pt-PT.	2602.16607	HTML	HF summarization model	inesctec	INESCTEC/citilink-summ
ClaimPT	2026	European Portuguese dataset for claim detection in news.	Gave pt-PT fact-checking a dedicated research resource built on licensed news content.	2601.19490	HTML	HF model	lfcc	LIAAD/ClaimPT

Benchmarks and evaluation

Project	Year	What it is	What it advanced	arXiv paper	arXiv HTML	Hugging Face model / dataset	Hugging Face org	GitHub repo
DSL-TL / Language Variety Identification with True Labels	2023	Human-annotated benchmark for language variety identification, including pt-PT vs pt-BR.	Fixed a common evaluation flaw: assuming the source of a text reveals its variety.	2303.01490	Not verified	Not verified	Not verified	LanguageTechnologyLab/DSL-TL
CALAME-PT	2024	Portuguese zero-shot language-model benchmark introduced with GlórIA.	Gave Portuguese decoder models a shared evaluation surface.	2402.12969 (GlórIA paper)	HTML	HF dataset	NOVA-vision-language	rvlopes/GlorIA
CAMÕES	2025	Open ASR benchmark covering European Portuguese and other Portuguese varieties.	Gave European Portuguese speech recognition a dedicated open benchmark.	2508.19721	HTML	HF dataset	inesc-id	Not verified
ALBA	2026	Linguistically grounded pt-PT benchmark for generative LLMs across eight linguistic dimensions.	Moved pt-PT evaluation toward native linguistic and cultural fidelity instead of translated proxies.	2603.26516	HTML	Not verified	Not verified	AMALIA-LLM/alba-benchmark
CLARIN-PT-LDB	2026	Open pt-PT LLM leaderboard focused on language, culture, and civility.	Made pt-PT LLM evaluation public and reproducible.	2603.12872	HTML	HF Space	PORTULAN	Not verified
AMALIA Eval Suite	2026	Native pt-PT evaluation suite released inside the AMALIA technical report.	Strengthened the move from translated benchmarks to native pt-PT evaluation.	2603.26511	HTML	Not verified	Not verified	AMALIA-LLM/AMALIA