Description
@RubyFri informed me that she was having an issue with the data for Colorado. I've looked into it, and I think the problem lies in how the analysis.json files are being loaded into the Colab. Essentially, certain fields that are very long are being truncated, which leaves them as invalid JSON.
The urlClassification object for some sites can be very long, sometimes over 5000 characters, as for eagletribune.com, fox5atlanta.com, wsbradio.com, and swimswam.com. When the file is opened and turned into a dataframe, the data appears to be truncated at exactly 5000 characters. This causes problems later on: when the field is parsed as JSON, the abruptly cut-off string raises parse errors.
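For concreteness, here is a minimal sketch of the failure mode. The 5000-character cut is from the behavior described above, while the payload shape, column layout, and site names are stand-ins rather than the real crawl schema:

```python
import json
import pandas as pd

# Stand-in payload; the real urlClassification objects for the affected
# sites run past 5000 characters on their own.
full = json.dumps({"fingerprinting": [{"url": "https://tracker.example/a"}] * 200})

df = pd.DataFrame({
    "site": ["ok.example", "truncated.example"],
    "urlClassification": [full, full[:5000]],  # second value cut at 5000 chars
})

# The affected rows show up as exactly 5000 characters long...
print(df["urlClassification"].str.len())

# ...and fail to parse, which is where the downstream errors come from.
for site, value in zip(df["site"], df["urlClassification"]):
    try:
        json.loads(value)
    except json.JSONDecodeError as err:
        print(f"{site}: {err}")
```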
I've looked at the files themselves in the drive, and there is no truncation there, so the source of the problem must be somewhere in the Colab. Additionally, I checked the data for a crawl from May (where we didn't have this issue), and the field for these sites was not longer than 5000 characters, which is likely why the issue wasn't readily apparent until now, when the field has grown.
I tried checking the field's length directly after loading the file, before constructing the pd.DataFrame, and it was already truncated. I also tried reading the analysis files as raw bytes and then decoding them manually, but that didn't help either; the data was still truncated.
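For reference, a sketch of those two checks; the local path is hypothetical, since the notebook actually pulls the analysis files out of Drive:

```python
# Hypothetical local path; the notebook reads the analysis files from Drive.
PATH = "analysis.json"

# 1) Inspect the content straight after loading, before building the DataFrame:
with open(PATH, encoding="utf-8") as f:
    text = f.read()
start = text.find('"urlClassification"')
print(text[start:start + 5100])  # the value already ends abruptly here

# 2) Read the raw bytes and decode manually, bypassing the text-mode layer:
with open(PATH, "rb") as f:
    decoded = f.read().decode("utf-8")
print(decoded == text)  # True: both reads agree, so the truncation
                        # happens upstream of pandas and of text decoding
```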