Many internal server errors (Aleph UI 4.0.2)

Hi,

I’m experiencing many internal server errors almost everywhere in the Aleph UI:

  • login
  • creating investigations
  • listing datasets
  • listing notifications
  • etc.

It happens when my Aleph session stays open for a long time or when I try to log in again. I can’t explain it. The problem is solved by refreshing the Aleph UI (Ctrl+R).

The following traceback was taken at login.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 921, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/aleph/aleph/views/sessions_api.py", line 140, in oauth_callback
    role = handle_oauth(oauth.provider, oauth_token)
  File "/aleph/aleph/oauth.py", line 92, in handle_oauth
    role = Role.by_foreign_id(role_id)
  File "/aleph/aleph/model/role.py", line 175, in by_foreign_id
    return q.first()
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/query.py", line 2743, in first
    return self.limit(1)._iter().first()  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/query.py", line 2842, in _iter
    result: Union[ScalarResult[_T], Result[_T]] = self.session.execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 2262, in execute
    return self._execute_internal(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/session.py", line 2144, in _execute_internal
    result: Result[Any] = compile_state_cls.orm_execute_statement(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/orm/context.py", line 293, in orm_execute_statement
    result = conn.execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1412, in execute
    return meth(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 516, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1635, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1844, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1984, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 2339, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 921, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[SQL: SELECT role.foreign_id AS role_foreign_id, role.name AS role_name, role.email AS role_email, role.type AS role_type, role.api_key AS role_api_key, role.is_admin AS role_is_admin, role.is_muted AS role_is_muted, role.is_tester AS role_is_tester, role.is_blocked AS role_is_blocked, role.password_digest AS role_password_digest, role.reset_token AS role_reset_token, role.locale AS role_locale, role.last_login_at AS role_last_login_at, role.id AS role_id, role.deleted_at AS role_deleted_at, role.created_at AS role_created_at, role.updated_at AS role_updated_at 
FROM role 
WHERE role.deleted_at IS NULL AND role.foreign_id = %(foreign_id_1)s 
 LIMIT %(param_1)s]
[parameters: {'foreign_id_1': 'oidc:<UUID>', 'param_1': 1}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Debug log:

{
    "logger": "aleph",
    "timestamp": "2024-12-30 08:07:07.063865",
    "exception": "Traceback",
    "v": "4.0.2",
    "ua": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "begin_time": "2024-12-30T08:07:07.004322",
    "method": "GET",
    "url": "http://my-aleph-instance/api/2/sessions/callback?state=Nwo1glU8X3m1Q9pNBjzpPr50kyIm2k&session_state=4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b&code=05227ab4-b717-48e9-9263-b3555db66d25.4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b.2e7b7a38-d6b6-44d4-bc6b-d38c50a59395",
    "endpoint": "sessions_api.oauth_callback",
    "session_id": null,
    "locale": "fr",
    "referrer": null,
    "role_id": null,
    "trace_id": "f7f2869a-487f-4a8c-a7f0-f4ffaba816b9",
    "path": "/api/2/sessions/callback?state=Nwo1glU8X3m1Q9pNBjzpPr50kyIm2k&session_state=4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b&code=05227ab4-b717-48e9-9263-b3555db66d25.4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b.2e7b7a38-d6b6-44d4-bc6b-d38c50a59395",
    "ip": "",
    "message": "Exception on /api/2/sessions/callback [GET]",
    "severity": "ERROR"
}
{
    "logger": "aleph.views.base_api",
    "timestamp": "2024-12-30 08:07:07.065063",
    "exception": "Traceback",
    "v": "4.0.2",
    "ua": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0",
    "begin_time": "2024-12-30T08:07:07.004322",
    "method": "GET",
    "url": "http://my-aleph-instance/api/2/sessions/callback?state=Nwo1glU8X3m1Q9pNBjzpPr50kyIm2k&session_state=4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b&code=05227ab4-b717-48e9-9263-b3555db66d25.4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b.2e7b7a38-d6b6-44d4-bc6b-d38c50a59395",
    "endpoint": "sessions_api.oauth_callback",
    "session_id": null,
    "locale": "fr",
    "referrer": null,
    "role_id": null,
    "trace_id": "f7f2869a-487f-4a8c-a7f0-f4ffaba816b9",
    "path": "/api/2/sessions/callback?state=Nwo1glU8X3m1Q9pNBjzpPr50kyIm2k&session_state=4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b&code=05227ab4-b717-48e9-9263-b3555db66d25.4acd2fbd-4a06-4a62-b8f3-0b99796a0d4b.2e7b7a38-d6b6-44d4-bc6b-d38c50a59395",
    "ip": "",
    "message": "InternalServerError: 500 Internal Server Error: The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
    "severity": "ERROR"
}

Have you already seen this error?

Stack:

  • aleph:4.0.2
  • ingest-file:4.0.2
  • aleph-ui-production:4.0.2

It almost looks like the database migrations didn’t run through. Depending on the setup you are running you need to run make upgrade (see Development Environment – Aleph)

Hi,

I already performed an Aleph upgrade, and the error also occurs when deploying a new stack directly in version 4.0.2 (i.e. without upgrading from 4.0.0 to 4.0.2, for example).

I just tested it and got another 500 just now.

Hi,
I got the same issue today when trying to log in. Is it possible that the problem comes from cookies or something that is cached?

Refreshing solves the problem, so it’s not an upgrade inconsistency, is it?

The problem is that I want to automate some ingestion, and I randomly get 500 errors when requesting the API. Here, 10 ingestions worked, then the 11th returned an error of the same type:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[SQL: SELECT pg_catalog.pg_class.relname 
FROM pg_catalog.pg_class JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace 
WHERE pg_catalog.pg_class.relname = %(table_name)s AND pg_catalog.pg_class.relkind = ANY (ARRAY[%(param_1)s, %(param_2)s, %(param_3)s, %(param_4)s, %(param_5)s]) AND pg_catalog.pg_table_is_visible(pg_catalog.pg_class.oid) AND pg_catalog.pg_namespace.nspname != %(nspname_1)s]
[parameters: {'table_name': 'ftm_collection_15', 'param_1': 'r', 'param_2': 'p', 'param_3': 'f', 'param_4': 'v', 'param_5': 'm', 'nspname_1': 'pg_catalog'}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
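Until the root cause is found, an automation script can at least tolerate these intermittent 500s by retrying. The following is only a minimal sketch of that idea and does not fix the underlying connection problem. `ServerError` and `call_with_retry` are hypothetical names, not part of Aleph or any client library; `func` stands for whatever call performs one ingestion request and raises on a 5xx response.

```python
import time


class ServerError(Exception):
    """Hypothetical exception standing in for an HTTP error response."""

    def __init__(self, status):
        super().__init__(f"server returned {status}")
        self.status = status


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on a 5xx ServerError, wait and try again.

    The delay doubles after each failed attempt. The last failure is re-raised.
    """
    for attempt in range(attempts):
        try:
            return func()
        except ServerError as exc:
            # Only retry server-side errors, and give up on the final attempt.
            if exc.status < 500 or attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.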

EDIT:
I think there is an inconsistency in your Aleph upgrade script, or maybe I did something wrong in the upgrade process. I just deployed a fresh instance for testing and I haven’t seen any 500 so far.

Hi o/

Happy new year to the Aleph team!

I think this is a database connection error. When checking the Postgres logs I found many “Connection reset by peer” messages. After a little Googling, I found people who had the same type of error. The solution was to add the pool_pre_ping option (see the docs) to the SQLAlchemy configuration.

This parameter makes the pool test whether a database connection is still valid before using it for a request.
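To make the idea concrete, here is a toy sketch of what a pre-ping does. It is not SQLAlchemy’s implementation: `PrePingPool` is a hypothetical class, `sqlite3` from the standard library stands in for PostgreSQL, and closing the connection by hand stands in for the server dropping it.

```python
import sqlite3


class PrePingPool:
    """Toy single-connection pool that checks liveness before each checkout."""

    def __init__(self, database=":memory:"):
        self.database = database
        self._conn = sqlite3.connect(database)
        self.reconnects = 0

    def _is_alive(self, conn):
        # The "ping": a trivial query that fails if the connection is unusable.
        try:
            conn.execute("SELECT 1")
            return True
        except sqlite3.Error:
            return False

    def checkout(self):
        # Replace a dead connection transparently instead of handing it out.
        if not self._is_alive(self._conn):
            self._conn = sqlite3.connect(self.database)
            self.reconnects += 1
        return self._conn
```

Without the check in `checkout()`, the caller would receive the dead connection and only find out when its real query fails, which matches the 500s described in this thread.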

I added the following to aleph/core.py:

SQLALCHEMY_ENGINE_OPTIONS = {"pool_pre_ping": True}
db = SQLAlchemy(engine_options=SQLALCHEMY_ENGINE_OPTIONS)

So far, I haven’t seen any 500 errors since I made this change. Maybe you want to integrate it into your code?

Update:

We continue to have several errors on our instance, the same error as mentioned before.

Has anyone else encountered the same issue?

Is there anything in the database logs? It looks like the connection to the FTM_STORE database is breaking off.

Yes, we get several connection resets from the FTM_STORE database, but we don’t know why:

could not receive data from client: Connection reset by peer

It’s hard to figure out what exactly is wrong without more details. Either there are some log messages of interest (especially stacktraces) or you can give some more details on your setup, how many workers you have, any settings you tweaked.

If you feel like it, you could also try upgrading to aleph 4.1.0 and ingest-file 4.1.0.

Hi,
Just to be sure: how did you configure multiple workers?
I deployed 4 worker replicas on the stack. Does Aleph manage this replication of workers?
Or is it better to assign more threads to a single worker?

Deploying more aleph-workers or ingest-file instances makes them automatically available to pick up work. Aleph does not “know” how many are around and doesn’t manage them, but the task queue will distribute the load.

We have found scaling to be easier by scaling processes, but you can try the threading support as well. By default it is turned off and documented here.
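As a rough illustration of how a shared task queue spreads work across workers that nobody counts or coordinates, here is a minimal sketch. It is not how Aleph is implemented: the in-memory `queue.Queue` stands in for the real task queue, threads stand in for separate worker processes or containers, and `run_workers` is a hypothetical name.

```python
import queue
import threading


def run_workers(tasks, worker_count):
    """Process tasks with worker_count threads pulling from one shared queue.

    Returns a dict mapping each task to the name of the worker that handled it.
    """
    q = queue.Queue()
    for task in tasks:
        q.put(task)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                handled[task] = name

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Each worker simply takes the next available task, so adding or removing workers changes throughput without any change to the code that enqueues the tasks.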

Thank you for that, I will try both to see how it goes :)
