Scripting file retrieval behind a login?

I have a subscription to a service which provides a weekly report that can only be accessed through my account on the website. Is it possible to script the retrieval of this document with curl or wget? I tried curl in the past but I couldn’t find any examples online that got me close enough to figure it out.

Anyone have experience with something like this?

When you post your login credentials to the login page via curl, you need to tell curl to use a cookie file so that the SessionID can be stored upon a successful login. That same cookie file (and SessionID) can then be referenced in a subsequent curl command to retrieve your target document.

curl -s https://targetdomain.com/login.html -c cookiefile.txt -d "user=myself&password=secure"
curl -s https://targetdomain.com/data.html -b cookiefile.txt 

Additionally, the user-agent parameter may have to be sufficiently populated to get the HTTP server to respond with the correct page, i.e. tell the server you are a browser:

--user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"

HTH

Thanks Mark!

It seems like the first curl command isn’t working as I’d hoped. I’ve tried it the way you suggest and then I tried it with the --user <username:password> option. Either way, the second command doesn’t retrieve any file. I’ve also tried wget after the first curl command.

wget --load-cookies cookiefile.txt "https://canfax.ca/Report/PDFReport.aspx?catalogue=CurrentWeeklyReport&group=Current&report=Current"

The resulting output is just HTML for the login page so it seems like the authentication step isn’t right as the server just redirects to the login page.
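A script can catch this case instead of silently saving the login page. Below is a minimal sketch that checks whether a downloaded file starts with the PDF signature. It assumes the report is served as a PDF (the PDFReport.aspx URL suggests it, but that is a guess), and is_pdf and the sample files are made-up names for illustration.

```shell
# Hypothetical helper: succeed if the file in $1 starts with the PDF signature.
is_pdf() {
    [ "$(head -c 5 "$1")" = '%PDF-' ]
}

# Offline demo with two stand-in files (not real downloads).
tmp=$(mktemp -d)
printf '%%PDF-1.4 dummy\n' > "$tmp/report.pdf"
printf '<html><body>Login</body></html>\n' > "$tmp/login.html"

for f in "$tmp/report.pdf" "$tmp/login.html"; do
    if is_pdf "$f"; then
        echo "$f: looks like a PDF"
    else
        echo "$f: not a PDF (probably the login page)"
    fi
done
```

In a real run you would call is_pdf on the file written by wget or curl and bail out if it fails.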

As can be seen in that command the URL doesn’t contain an actual file. The website opens a new tab with that URL and the document is shown in the browser window from which you have to then manually choose to download it.

When I go to the login page I get a special URL that changes on every page request.

The form is pretty basic so it can be seen easily in the source code…

view-source:https://canfax.ca/(X(1)S(4jjhrfkrqaz45re5z10qngha))/Login.aspx?AspxAutoDetectCookieSupport=1

form tag:

<form name="aspnetForm" method="post" action="Login.aspx" onsubmit="javascript:return WebForm_OnSubmit();" id="aspnetForm">

That means it’s expecting the credentials to be sent via a POST request to Login.aspx. As the action has no URL path, it’s relative to the current path, making it: https://canfax.ca/(X(1)S(4jjhrfkrqaz45re5z10qngha))/Login.aspx

^ but with the unique URL that the page loaded with.
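As a rough sketch of that resolution step, the helper below swaps the last path segment of the page URL for the form's action. resolve_action is a made-up name, and it only handles a plain relative file name like the one in this form (no leading slash, no ../).

```shell
# Hypothetical helper: resolve a relative form action ($2) against the URL
# the login page was actually served from ($1).
resolve_action() {
    local base
    base=${1%%[?]*}                      # drop the query string, if any
    printf '%s/%s\n' "${base%/*}" "$2"   # replace the last path segment
}

page_url='https://canfax.ca/(X(1)S(4jjhrfkrqaz45re5z10qngha))/Login.aspx?AspxAutoDetectCookieSupport=1'
post_url=$(resolve_action "$page_url" 'Login.aspx')
echo "$post_url"
```

For the URL above this prints the same address without the query string, which is where the POST would go.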

Now you need the right value pairs which come from the form fields assuming JS isn’t changing anything. This is where things get awkward… there’s a bunch of hidden fields that get sent with the login/password.

<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE1NzM1NzQ1MjMPZBYCZg9kFgJmD2QWAmYPZBYCAgMPZBYKAgUPDxYEHgRUZXh0BQdNZW1iZXJzHgdWaXNpYmxlaGRkAhMPDxYCHwFoZGQCFw8PFgQfAAUGTG9nb3V0HwFoZGQCGw9kFgICAw9kFgICAQ9kFgICCw8PFgIfAWdkZAIdDxYCHwAFgQc8ZGl2IGlkPSJmb290ZXIiPg0KICA8ZGl2IGNsYXNzPSJ3cmFwcGVyIj4NCiAgPHVsPg0KICAgICAgPGxpPjxhIGhyZWY9Imh0dHBzOi8vd3d3LmNhbmZheC5jYS9TaXRlTWFwLmFzcHgiPlNpdGUgTWFwPC9hPjwvbGk+DQogICAgICA8bGk+PGEgaHJlZj0iaHR0cHM6Ly93d3cuY2FuZmF4LmNhL1ByaXZhY3kuYXNweCI+UHJpdmFjeTwvYT48L2xpPg0KICAgICAgPGxpPjxhIGhyZWY9Imh0dHBzOi8vd3d3LmNhbmZheC5jYS9MZWdhbC5hc3B4Ij5MZWdhbDwvYT48L2xpPg0KICA8L3VsPgkNCiAgPHAgaWQ9ImNvcHlyaWdodCI+Q29weXJpZ2h0ICZjb3B5OyAyMDA4IENhbmZheCBDYW5hZGE8L3A+CQ0KICA8ZGl2IGNsYXNzPSJ2Y2FyZCI+DQogICAgPGRpdj4NCiAgICAgICAgPGEgaHJlZj0iaHR0cHM6Ly93d3cuY2FuZmF4LmNhL01haW4uYXNweCI+Q2FuZmF4PC9hPg0KICAgICAgICA8ZGl2IGNsYXNzPSJhZHIiPg0KICAgICAgICAgICAgPHNwYW4gY2xhc3M9ImV4dGVuZGVkLWFkZHJlc3MiPiMxODA8L3NwYW4+LCA8c3BhbiBjbGFzcz0ic3RyZWV0LWFkZHJlc3MiPjY4MTUgODxzdXA+dGg8L3N1cD4gU3RyZWV0IE5FPC9zcGFuPi4gPHNwYW4gY2xhc3M9ImxvY2FsaXR5Ij5DYWxnYXJ5PC9zcGFuPiwgPHNwYW4gY2xhc3M9InJlZ2lvbiI+QWxiZXJ0YTwvc3Bhbj4gPHNwYW4gY2xhc3M9InBvc3RhbC1jb2RlIj5UMkUgN0g3PC9zcGFuPjwvZGl2Pg0KICAgIDxkaXY+PHNwYW4gY2xhc3M9InRlbCI+PHNwYW4gY2xhc3M9InR5cGUiPlRlbDwvc3Bhbj46ICg0MDMpIDI3NS01MTEwPC9zcGFuPiA8c3BhbiBjbGFzcz0idGVsIj48c3BhbiBjbGFzcz0idHlwZSI+RmF4PC9zcGFuPjogKDQwMykgMjc1LTY5NDM8L3NwYW4+PC9zcGFuPjwvZGl2PjwvZGl2PgkNCiAgPC9kaXY+DQo8L2Rpdj5kZEfoqcyUizZFMBT3y93hHW+3Fa1H" />
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C2EE9ABB" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWBALCsJqsDQLfworUBALDyorRAgKzzeeyCpz6KqfbIK6XwsA3z83FApfQk3yI" />
<input name="ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$userNameTextBox" type="text" id="ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_userNameTextBox" style="width:150px;" />
<input name="ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$passwordTextBox" type="password" id="ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_passwordTextBox" style="width:150px;" />
<input type="submit" name="ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$signInButton" value="Sign In" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$signInButton&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" id="ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_signInButton" class="add" />

__EVENTVALIDATION, for example, may be a one-time code, a nonce for the day, or something that can be used more than once. The best way to handle this would be to pull those values out of the login page programmatically and build your curl POST from them every time you want to log in. curl can also return the special URL.
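Here is a minimal sketch of pulling one of those values out of the page with sed. extract_field is a made-up helper name, and it assumes name= comes before value= on the same line, as in the inputs pasted above. It is not a real HTML parser.

```shell
# Hypothetical helper: read HTML on stdin and print the value attribute of
# the first <input> whose name attribute equals $1.
extract_field() {
    sed -n "s/.*name=\"$1\"[^>]*value=\"\([^\"]*\)\".*/\1/p" | head -n 1
}

# Offline demo using one of the hidden inputs from the login form.
sample='<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C2EE9ABB" />'
generator=$(printf '%s\n' "$sample" | extract_field __VIEWSTATEGENERATOR)
echo "$generator"

# Against the live page it would be something like (untested against this site):
# viewstate=$(curl -s -L -c cookie.txt "$login_url" | extract_field __VIEWSTATE)
```

The extracted values can then be fed into the POST in place of the hard-coded ones.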

Confirming this against Chromium’s Network tab with a test submit, the curl request in its most basic form would look something like this, where “USER” is your username, “PASS” is your password, and the other values are relevant to the URL and field values of the most recent visit to the login page.

# Form fields are sent as name=value pairs. --data-urlencode is used because
# the values contain characters such as + and / that must be URL-encoded.
# The URL is quoted because of the parentheses.
curl \
	--user-agent 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36' \
	-c cookie.txt \
	--data-urlencode '__EVENTTARGET=' \
	--data-urlencode '__EVENTARGUMENT=' \
	--data-urlencode '__VIEWSTATE=/wEPDwULLTE1NzM1NzQ1MjMPZBYCZg9kFgJmD2QWAmYPZBYCAgMPZBYKAgUPDxYEHgRUZXh0BQdNZW1iZXJzHgdWaXNpYmxlaGRkAhMPDxYCHwFoZGQCFw8PFgQfAAUGTG9nb3V0HwFoZGQCGw9kFgICAw9kFgICAQ9kFgICCw8PFgIfAWdkZAIdDxYCHwAFgQc8ZGl2IGlkPSJmb290ZXIiPg0KICA8ZGl2IGNsYXNzPSJ3cmFwcGVyIj4NCiAgPHVsPg0KICAgICAgPGxpPjxhIGhyZWY9Imh0dHBzOi8vd3d3LmNhbmZheC5jYS9TaXRlTWFwLmFzcHgiPlNpdGUgTWFwPC9hPjwvbGk+DQogICAgICA8bGk+PGEgaHJlZj0iaHR0cHM6Ly93d3cuY2FuZmF4LmNhL1ByaXZhY3kuYXNweCI+UHJpdmFjeTwvYT48L2xpPg0KICAgICAgPGxpPjxhIGhyZWY9Imh0dHBzOi8vd3d3LmNhbmZheC5jYS9MZWdhbC5hc3B4Ij5MZWdhbDwvYT48L2xpPg0KICA8L3VsPgkNCiAgPHAgaWQ9ImNvcHlyaWdodCI+Q29weXJpZ2h0ICZjb3B5OyAyMDA4IENhbmZheCBDYW5hZGE8L3A+CQ0KICA8ZGl2IGNsYXNzPSJ2Y2FyZCI+DQogICAgPGRpdj4NCiAgICAgICAgPGEgaHJlZj0iaHR0cHM6Ly93d3cuY2FuZmF4LmNhL01haW4uYXNweCI+Q2FuZmF4PC9hPg0KICAgICAgICA8ZGl2IGNsYXNzPSJhZHIiPg0KICAgICAgICAgICAgPHNwYW4gY2xhc3M9ImV4dGVuZGVkLWFkZHJlc3MiPiMxODA8L3NwYW4+LCA8c3BhbiBjbGFzcz0ic3RyZWV0LWFkZHJlc3MiPjY4MTUgODxzdXA+dGg8L3N1cD4gU3RyZWV0IE5FPC9zcGFuPi4gPHNwYW4gY2xhc3M9ImxvY2FsaXR5Ij5DYWxnYXJ5PC9zcGFuPiwgPHNwYW4gY2xhc3M9InJlZ2lvbiI+QWxiZXJ0YTwvc3Bhbj4gPHNwYW4gY2xhc3M9InBvc3RhbC1jb2RlIj5UMkUgN0g3PC9zcGFuPjwvZGl2Pg0KICAgIDxkaXY+PHNwYW4gY2xhc3M9InRlbCI+PHNwYW4gY2xhc3M9InR5cGUiPlRlbDwvc3Bhbj46ICg0MDMpIDI3NS01MTEwPC9zcGFuPiA8c3BhbiBjbGFzcz0idGVsIj48c3BhbiBjbGFzcz0idHlwZSI+RmF4PC9zcGFuPjogKDQwMykgMjc1LTY5NDM8L3NwYW4+PC9zcGFuPjwvZGl2PjwvZGl2PgkNCiAgPC9kaXY+DQo8L2Rpdj5kZEfoqcyUizZFMBT3y93hHW+3Fa1H' \
	--data-urlencode '__VIEWSTATEGENERATOR=C2EE9ABB' \
	--data-urlencode '__EVENTVALIDATION=/wEWBAKE1KiUCQLfworUBALDyorRAgKzzeeyCmJgAbNjaT4NY0KjcXEpwVaroYMV' \
	--data-urlencode 'ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$userNameTextBox=USER' \
	--data-urlencode 'ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$passwordTextBox=PASS' \
	--data-urlencode 'ctl00$ctl00$ctl00$MainContent$SubContent$SubContent$signInButton=Sign In' \
	'https://canfax.ca/(X(1)S(4jjhrfkrqaz45re5z10qngha))/Login.aspx'

I’m afraid this may take quite a bit of grind work to automate if the server cares about these values.

Your login attempt is failing and @Ulfnic is on the right track by including the hidden fields of the login page. Without those hidden fields, validation fails and no login credentials are recognized.

Yeah, the FF dev tools show these values. I’ve copy-pasted the raw POST request payload into my curl command but it’s a no go. There are a few values that change every time (I think), so using them in my curl command won’t work since they’ve been used already: __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION.

Sometimes it seems like those values are the same between two different logins but I think that could be because it’s reusing a valid cookie. I’ll check this out in more detail but I don’t know what I’m supposed to do with those values that change.

EDIT:
Ok, I found this blog post, “Using CURL for ASPX with VIEWSTATE”, and read some of the comments. I can possibly acquire the values with a GET request and then send them back in the POST request along with the login credentials etc.

If you visit the login page in your browser you’re given those values prior to the POST request and that doesn’t prevent login so I don’t think “used” is the right term. It’s information bound to something that’s not present in the curl request but is present in the browser’s request.

Taking a guess, one of those values probably contains a hash of the user-agent, probably combined with other information like IP. You’d want to match curl’s user-agent to the one used by your installation of Firefox or better yet… use curl to visit the login page to get the special URL it redirects you to and how to populate the POST. You can grep for <form and name= to make life easier.
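To sketch that first step: curl can follow the redirect with -L and print where it ended up via the %{url_effective} write-out variable. fetch_login_page is a made-up helper, the user-agent string is only a placeholder for whatever your own browser sends, and the plain Login.aspx address in the commented example is an assumption about where the redirect starts.

```shell
# Placeholder user agent; ideally match the browser you normally log in with.
USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'

# Hypothetical helper. $1 = login URL, $2 = cookie jar, $3 = file for the HTML.
# Follows redirects and prints the URL curl finally landed on.
fetch_login_page() {
    curl -s -L -A "$USER_AGENT" -c "$2" -o "$3" -w '%{url_effective}\n' "$1"
}

# Example call (needs network access, so it is left commented out):
# final_url=$(fetch_login_page 'https://canfax.ca/Login.aspx' cookie.txt login.html)
# grep -o '<form[^>]*>' login.html    # show the form tag and its action
# grep -o 'name="[^"]*"' login.html   # list the field names to send back
```

The printed URL is the one to resolve the form action against, and the saved HTML is where the hidden field values come from.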

A few alternatives…

  • Automating through something terminal-friendly like links, lynx, browsh, etc.
  • Launching a browser from the shell to the login page, where a browser extension can automatically perform the login and deliver the information you need back to the shell.
  • Using Puppeteer with Node.js, which launches a Chromium browser and controls it remotely. This could be made to work with Bash over named pipes or the stdout of one-time operations if you wanted to keep things as Bash as possible.

I am getting somewhere I think. I wasn’t using curl -L at first so that “special” URL wasn’t being shown to me.

I’m using the __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION values that curl finds when visiting the direct special URL but the cookie file is completely empty. I used to get something like this:

# Netscape HTTP Cookie File
# https://curl.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

canfax.ca	FALSE	/	FALSE	0	AspxAutoDetectCookieSupport	1

Now all I see is the boilerplate with no URL info or anything else.
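For what it's worth, an empty jar may not be a failure in itself. The S(...) segment in the URL looks like ASP.NET's cookieless session format, in which case the session ID travels in the URL instead of a cookie, though that is an assumption about this site. Either way, here is a small sketch for listing what a jar actually holds. list_cookie_names is a made-up helper, and the sample jar reuses the line quoted above.

```shell
# Hypothetical helper: print the cookie names stored in a curl cookie jar ($1).
list_cookie_names() {
    awk -F '\t' '
        { sub(/^#HttpOnly_/, "") }   # curl prefixes HttpOnly cookies this way
        /^#/ || NF < 7 { next }      # skip comment and blank lines
        { print $6 }                 # field 6 is the cookie name
    ' "$1"
}

# Offline demo with a stand-in jar.
jar=$(mktemp)
{
    printf '# Netscape HTTP Cookie File\n\n'
    printf 'canfax.ca\tFALSE\t/\tFALSE\t0\tAspxAutoDetectCookieSupport\t1\n'
} > "$jar"
names=$(list_cookie_names "$jar")
echo "$names"
```

If this prints nothing after the login POST, the server set no cookies for that request.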

I also tried Lynx and it highlighted a problem. When I select the link to the file I want, Lynx reports “invalid URL scheme!”. I think the most current document is generated on the fly from various other sources, or something like that, and there’s no actual file until it’s displayed in your browser and you save or print it. I think I found a workaround though.

I decided to just use Lynx’s built in scripting functionality and it works well. Would have been cool to get this working with curl just to say I did but I learned a lot along the way so that’s a win too.

Did you consider using the urllib module in a Python script? Python has pretty good API and URL scripting capabilities.

Thanks @Mr_McBride. This project has gone a bit stale but I will check that out. I need to spruce up my Python anyways.