Modify custom filters of tracking log
A filter was created in #101 (closed) that allows us to modify the lines that end up in the tracking log.
We need to modify that filter so that the lines in the tracking log contain only the minimum identifying information we need. Currently a line looks like this:
2022-06-14 13:23:45,940 INFO 144 [tracking] [user None] [ip x.x.x.x] logger.py:41 - {"name": "/api/user/v1/account/login_session/", "context": {"user_id": null, "path": "/api/user/v1/account/login_session/", "course_id": "", "org_id": ""}, "username": "", "session": "28790d10905791760865f712b543bcaf", "ip": "x.x.x.x", "agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0", "host": "local.overhang.io", "referer": "http://local.overhang.io/login?next=%2Fcourses", "accept_language": "en-US,en;q=0.5", "event": "{\"GET\": {}, \"POST\": {\"email\": [\"admin@totem-project.org\"], \"password\": \"********\"}}", "time": "2022-06-14T13:23:45.939955+00:00", "event_type": "/api/user/v1/account/login_session/", "event_source": "server", "page": null}
As you can see, #101 (closed) already replaced the IP address with x.x.x.x. But we still have the following data:
- user
  - In this example line the "user" is None, but that's because it's a tracking log of a failed login attempt (the first thing I had available when making this issue).
  - We want to replace usernames with something non-identifiable. Question: does it even matter for us, or for Cairn, which username did something? Maybe we can replace all usernames with a bogus value (x) without it affecting the data we collect. The second-best option is to pseudonymize/anonymize the username. We need to research the best ways to do that.
- name -- This is the URL that was tracked. This value can stay as it is.
- context
  - user_id -- Should be removed or anonymized, like user above.
  - path -- no change
  - course_id -- no change
  - org_id -- no change
- username -- Should be anonymized/pseudonymized, same as the user field.
- session -- The session ID can be linked to a browser cookie. Not sure if we want to keep that data: if we replace all usernames/user IDs with just x and can still track individual sessions based on the session ID, that would be pretty neat. In that case we probably still want to hash the session ID so that at least it can't be linked directly to a session cookie on somebody's PC.
- ip -- Preferably fully anonymized (like it is after #101 (closed)), but we could also keep part of the IP if that has some kind of value.
- agent -- I see no reason to keep this, unless we want to know what devices people use to access a course.
- host -- no change
- referer -- no change
- accept_language -- no change
- event -- Maybe we want to take the email address out of the log here.
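
To make the options above concrete, here is a rough sketch of what the scrubbing step could look like. It is not the filter from #101: how that filter is wired into logging isn't shown in this issue, so the sketch only works on the already-parsed JSON payload of a tracking line. The names scrub_tracking_event, pseudonymize, SECRET_KEY and EMAIL_RE are made up for illustration. The choices it makes (username becomes x, user_id becomes null, session is replaced by a truncated keyed hash, agent is dropped, email addresses in event are masked) are each just one of the options discussed, not decisions.

```python
import hashlib
import hmac
import json
import re

# Hypothetical secret; in a real deployment this would come from private config.
SECRET_KEY = b"change-me"

# Deliberately simple email pattern, good enough to mask addresses in the event string.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a truncated keyed SHA-256 digest of value.

    The same input always maps to the same output, so sessions stay
    distinguishable, but the raw value cannot be read back from the log.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def scrub_tracking_event(event: dict) -> dict:
    """Return a copy of a parsed tracking payload with identifying fields reduced."""
    cleaned = dict(event)

    # username: replace with a bogus value (assumption: nobody needs the real one).
    cleaned["username"] = "x"

    # context.user_id: blank it out, leave path / course_id / org_id untouched.
    if isinstance(cleaned.get("context"), dict):
        context = dict(cleaned["context"])
        if "user_id" in context:
            context["user_id"] = None
        cleaned["context"] = context

    # session: keep sessions distinguishable without storing the cookie value.
    if cleaned.get("session"):
        cleaned["session"] = pseudonymize(cleaned["session"])

    # ip: fully masked, same as after #101.
    cleaned["ip"] = "x.x.x.x"

    # agent: dropped entirely.
    cleaned.pop("agent", None)

    # event: a JSON string that may contain an email address.
    if isinstance(cleaned.get("event"), str):
        cleaned["event"] = EMAIL_RE.sub("[email removed]", cleaned["event"])

    return cleaned


if __name__ == "__main__":
    sample = {
        "name": "/api/user/v1/account/login_session/",
        "context": {"user_id": None, "path": "/api/user/v1/account/login_session/"},
        "username": "",
        "session": "28790d10905791760865f712b543bcaf",
        "ip": "x.x.x.x",
        "agent": "Mozilla/5.0",
        "event": '{"POST": {"email": ["admin@totem-project.org"]}}',
    }
    print(json.dumps(scrub_tracking_event(sample)))
```

Using a keyed hash (HMAC) rather than a plain hash for the session is a judgment call: with a plain SHA-256, anyone who has a session cookie could compute the same digest and find it in the log, whereas the keyed version needs the secret as well. Whether that is enough for usernames too is part of the research mentioned above.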