Forcing Download of Historical Forum Data Through Web Connector

All,

OK, this is a bit of an oddball question, but I’m hoping someone here can help me figure this out.

What I am trying to do is download all the records from the forum to do some analysis of usage patterns. I am working with a data pipeline company that has developed a custom connector based on the Discourse API. It works beautifully, pulling all the fields I need, BUT it can't do a full historical data pull - it can only snapshot the current day's data, which means it will take me a year or more to build the dataset I need.

So, I traveled two years back in the Wayback Machine and found this post from @Melissa, which does something similar by pulling the data through the web connector in PQ. However, her pull was also limited to 30 rows. I suspect this is because the Latest Topics page only loads a limited batch of topics initially, and you have to scroll to get it to load more.

Does anyone have any ideas for how I could force a pull of posts beyond the initial 30-record load? I'm not committed to doing it in PQ with the web connector - R, Python, etc. would be fine (anything that gets me the historical data).

Thanks much.

  • Brian

Hi @BrianJ,

I browsed through the HTML and found this - now that looks helpful :wink:

You’ll still have to do some work but I hope this gets you started. Here’s my sample.
eDNA - Forum Topics.pbix (88.0 KB)
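
The gist: even though the Latest list infinite-scrolls in the browser, it accepts a page parameter in the URL, so each batch of ~30 topics can be requested directly. A minimal sketch of that idea (not the exact code in the file - the URL and selectors are illustrative):

```
// Sketch: request one page of the Latest topic list directly via the
// page parameter; URL and selectors are illustrative, check the HTML.
let
    Source = Web.BrowserContents("https://forum.enterprisedna.co/latest?page=5"),
    Topics = Html.Table(
        Source,
        {{"Title", ".title"}},
        [RowSelector = "tr.topic-list-item"]
    )
in
    Topics
```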

@BrianJ, yup, wrong file - just uploaded the right one (I hope…)


@Melissa ,

Thanks! But I think you might have attached the wrong PBIX.

However, based on your screenshot, I now think I know how to make this work: a one-column table with the estimated number of pages, plus a custom function of the form (pages as number) as table => … - brute-force invoke iterating over 10,000+ pages, here we come… :grin:
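
For anyone following along, the pattern I have in mind is roughly this (fnGetTopicsPage is a stand-in name for whatever function ends up doing the actual per-page scrape):

```
// Rough shape of the brute-force plan: a one-column table of page
// numbers, invoke the scraping function per page, combine the results.
// fnGetTopicsPage is a hypothetical stand-in for the real function.
let
    Pages = Table.FromList({0..10000}, Splitter.SplitByNothing(), {"Page"}),
    Invoked = Table.AddColumn(Pages, "Topics", each fnGetTopicsPage([Page])),
    Combined = Table.Combine(Invoked[Topics])
in
    Combined
```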

I’ll let you know how it goes. Thanks!!!

  • Brian

@Melissa ,

This is fantastic. You actually went a lot further toward the full solution than I expected.

Just one final question – in the 2019 version you were able to extract the names of the contributors to each post. However, in the current iteration I can't seem to extract that information. I'm thinking that perhaps Discourse has changed the structure of the HTML tables so that information can no longer be downloaded in the same way. Am I missing something with regard to how to scrape that information relative to how you did it before?

Thanks so much for your help.

– Brian

Hi @BrianJ,

I got that sorted, but it requires a separate call for each topic, so we basically went from ‘terribly slow’ (getting 30 topics in one call) to ‘unacceptably slow’ (making an additional call for each individual topic as well :pensive:)…
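
For reference, each per-topic call looks something like this - sketched here against Discourse's JSON route, which is one way to make that call, assuming the standard /t/{id}.json endpoint (base URL illustrative):

```
// Sketch of one per-topic call, assuming Discourse's standard
// /t/{id}.json route exposes a details.participants list.
(topicId as number) as table =>
let
    Source = Json.Document(
        Web.Contents(
            "https://forum.enterprisedna.co",
            [RelativePath = "t/" & Number.ToText(topicId) & ".json"]
        )
    ),
    // each participant record carries a username and post count
    Participants = Source[details][participants],
    ToTable = Table.FromRecords(Participants)
in
    ToTable
```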

I’ve added some other fields instead and will leave it here for you to compare the two…
eDNA - Forum Topics.pbix (13.0 KB)

I hope this is helpful


@Melissa ,

This is awesome - exactly what I need, and I’m not even bothered by “unacceptably slow”. I think if I just do one massive historical pull, I can then use the Dataddo Discourse API Connector for daily snapshots to update the base data pull incrementally.

However, I do have one more question. How can you tell what other fields are available to pull? When I initially connected via web, even your Users fields didn’t show up. If I’m going to do a huge, slow pull, I’d love to be able to include all, or almost all, of the other available fields.

Thanks so much for your help!

  • Brian

I used the “Add Table Using Examples” functionality to discover the identifiers for those fields.
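
Under the hood, that feature writes an Html.Table call with column-name/CSS-selector pairs, so the identifiers show up right in the generated code and you can add more columns by hand once you spot the right selectors. The generated query looks something like this (selectors illustrative):

```
// Illustrative shape of a query generated by "Add Table Using Examples";
// the actual CSS selectors depend on the page's HTML.
let
    Source = Web.BrowserContents("https://forum.enterprisedna.co/latest?page=1"),
    Extracted = Html.Table(
        Source,
        {
            {"Title", ".title"},
            {"Replies", ".posts .number"},
            {"Views", ".views .number"}
        },
        [RowSelector = "tr.topic-list-item"]
    )
in
    Extracted
```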
