chaps 9 months ago

I do work with "open data" on a near-obsessive basis and -- friends, please do not trust "open data" portals to reflect reality accurately. The datasets are often curated, categories changed during the ETL processes, rows missing, and things like that. For example, Chicago's "crimes" dataset intentionally doesn't include all homicides. Can't remember the exact dataset, but I once had a conversation with Chicago's head of open data who told me that they intentionally removed many rows because they were concerned that the public was going to misinterpret the results... but didn't make it clear that rows were missing. So I guess everybody gets the opportunity to misinterpret the results!

FOIA is the better alternative because it gives you the original, pre-cleaned data. Open data is a lie.
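
For anyone who wants to check a portal dataset against a FOIA response, here is a minimal sketch of that comparison. Everything in it is an assumption for illustration: the `case_id` and `category` column names, the sample rows, and the helper names `load_by_id` and `compare` are made up and do not come from any real Chicago dataset. It only shows the idea of looking for rows that are missing from the portal copy and categories that were changed along the way.

```python
import csv
import io


def load_by_id(csv_text, id_field):
    """Index the rows of a CSV document by the value of their ID column."""
    return {row[id_field]: row for row in csv.DictReader(io.StringIO(csv_text))}


def compare(portal, foia, field):
    """Return (IDs present in FOIA but not the portal, IDs whose `field` differs)."""
    missing = sorted(set(foia) - set(portal))
    changed = sorted(
        rid
        for rid in set(foia) & set(portal)
        if foia[rid][field] != portal[rid][field]
    )
    return missing, changed


# Hypothetical sample data; real exports would be read from files instead.
FOIA_CSV = """case_id,category
A1,HOMICIDE
A2,BATTERY
A3,HOMICIDE
A4,THEFT
"""

PORTAL_CSV = """case_id,category
A1,HOMICIDE
A2,ASSAULT
A4,THEFT
"""

portal_rows = load_by_id(PORTAL_CSV, "case_id")
foia_rows = load_by_id(FOIA_CSV, "case_id")
missing_ids, changed_ids = compare(portal_rows, foia_rows, "category")

print("missing from portal:", missing_ids)
print("category changed:", changed_ids)
```

A check like this only works when both copies share a stable identifier. If they don't, matching has to fall back on combinations of fields such as date and location, which is much less reliable.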

  • amy-petrik-214 9 months ago

    Hah, that's classic politics: "Hello John Q. Public, here's all our data! It speaks for itself." John Q. Public: "Wow, you really improved the last few years homicide-wise." "And so you see, a third party unrelated to us has just confirmed what a great job we're doing with simple empirical, evidence-based governance!"

    So that means what you want to do is specialize in identifying bias in these datasets and finding the smoking gun. Such a task can be an ugly business but necessary for the public good, pushing data sharers to either share good data, or not share, but not share tricksy data in this unethical way.

  • stevage 9 months ago

    I worked in open data for quite a few years. This is a very weird take.

    Open data portals generally have data in a useful form. FOI probably gives you PDFs.

    • chaps 9 months ago

      "FOI probably gives you PDFs."

      Having submitted thousands of FOIA requests, I get the impression that you haven't, actually, submitted many FOIA requests. I've received many, many, many, many non-PDF FOIA responses.

      Share some of the open data you've worked with and I'd love to poke at it and tell you where it's wrong and where assumptions about its data are wrong.

      • 0cf8612b2e1e 9 months ago

        Thousands?! Do you have a public list on everything?

        Have you had to fight a lot of malicious compliance which balloons up your request count? Or do they typically require such an incredibly narrow request that you have to submit N entries per topic?

      • stevage 9 months ago

        What an unappealing offer. No thanks.

        • chaps 9 months ago

          Not any different from being red-team'd, but you do you. But thanks for your input -- your apparent reluctance to be challenged makes it clearer why you think my take is odd.

          Even still, I challenge you to challenge yourself to understand where your blind spots are. I've done it many times and have found significant problems with the open datasets I've worked with. If you think my take is weird, it's only because you're not looking or the data you're looking at is inconsequential.

          To me, this stuff is literal life and death. If we make mistakes in our analysis because of misinformation from the source, then the lives and deaths of people we're trying to understand become tarnished. We can treat our neighbors better than that.

          • stevage 9 months ago

            >your apparent reluctance to be challenged

            There are lots of reasons someone doesn't want to be "challenged" by some blowhard on the internet. One of them, true in my case, is I don't even work in this area anymore, as I said in my original post.

            I really hope you are nicer in person.

            • chaps 9 months ago

              Fair.

              Can I ask why exactly you think my take is "very weird"?

              Your original post was exceptionally dismissive, without explanation, and your comment on FOI was so confidently probabilistic that it struck me that you misunderstood what I was suggesting. Pardon my aggressive response. I get a lot of similar dismissiveness whenever I interact with government agencies, often where I'm told that something doesn't exist, or "Just look at the data portal", while the data portal is intentionally missing the information I'm looking for. I don't expect you to answer my question, but I hope you can try to understand where I'm coming from in my thoughts and opinions on open data. My intent was only to get you to share your thoughts further.

              • stevage 9 months ago

                Look, for starters, the stuff we're talking about covers a pretty broad spectrum. Your framing of the question about "intentionally missing" stuff suggests that you're interested in transparency-style data: data that gives you insight into the operations of the government body. And yes, if you are looking for data that might reflect poorly on the organisation, an open data portal is generally not the place to go for it.

                This HN item, for instance, is not about that kind of data. The datasets in question tell you about the transport network, the services, the patronage, the history, all kinds of interesting stuff.

                So I find it "weird" that you would respond to a good-faith effort of sharing tons of information about a public transport network with this hostile approach of disparaging open data portals, and advocating instead an approach which is extremely resource-intensive for government bodies, when it's completely uncalled for.

                Yeah, if you want to investigate a government cover-up, or shine light on some terrible mismanagement of resources, go for your life and submit FOI requests. Your mention of having filed thousands of FOI requests suggests you have consumed many tens of thousands of hours of public servants' time, and I really hope the results justify it.

                • chaps 9 months ago

                  Lemme tell you a story.

                  Years ago, during the early days of the pandemic, a Harvard epidemiology student asked me to proofread his paper, which argued that COVID-19 killed more white people than people of any other race. The dataset he used was the Cook County Medical Examiner dataset. There was a column in there for the race information. If you're curious how it's populated, I can share that information with you.

                  Previously, I'd FOIA'd the data and received many more columns of information including the names of the individuals who'd died which showed a very clear pattern that the race information on the open data portal was not always accurate for Hispanic-origin names. The details are complicated, and I'm happy to explain my fact checking methods, but the Harvard student's analysis was just flat wrong because it made assumptions that the race data was correct. It was not.

                  Their response was initially along the lines of, "even if it's 50% it's still going to be true". It ended up being more like 80%, showing that people with Hispanic-origin names were significantly more likely to die of COVID-19.

                  If you think your audience isn't academics at mega institutions who believe that open data is 100% accurate data, then you've made many incorrect assumptions and I encourage you to reconsider.

                  • stevage 9 months ago

                    >your audience

                    "my audience"?

                    What makes you think I have an audience?

                    • chaps 9 months ago

                      I hope you're a nicer person in-person, too.

  • bshep 9 months ago

    Where I grew up the data for murders is curated in such a way that anybody that dies 24 after being attacked is not considered a ‘murder’. They do this to reduce the statistical murder rate.

    • chaps 9 months ago

      Can you say more about this?

  • IanCal 9 months ago

    Although pre-cleaned data is often not reflective of reality and requires careful work to use, often requiring a lot more knowledge of the field.

  • gordon_freeman 9 months ago

    But even if a dataset is incomplete or inaccurate, do you think we could at least get directionally right insights from such datasets?

    • chaps 9 months ago

      Yes, of course we can. But I cannot ignore the harms in doing so, by misrepresenting the data in a way that prevents others from understanding what is or isn't there -- it happens regularly. These datasets are often used as a political tool and contracted with local universities to show that they're providing data... though not actually providing the accurate data. Simultaneously though, people who don't know data will champion the data as accurate because it comes from a university program.

      Sometimes what can happen is that somebody inexperienced will try to make some assessment of the data and come to the exact wrong conclusion because they didn't know what not to trust. But it gets on the news anyway and damage is done.

      We can do better than that.

whitej125 9 months ago

Would be neat if, instead of an open-ended challenge ("here's some data, do something cool"), the MTA shared a list of hypothetical or real problems to solve and provided data that could potentially be useful in exploring or solving each problem.

  • maxverse 9 months ago

    Also, considering they just got a 68 billion dollar budget approved [1] over the next 5 years, even a small monetary reward would be nice for this. It doesn't need to be a ton of money, but something other than "here's a piece of memorabilia and we'll write a blog post" would be a good incentive.

    [1] https://ny1.com/nyc/all-boroughs/news/2024/09/25/mta-board-a...

    • exegete 9 months ago

      I think you are misinterpreting that article. The MTA board approved the plan to spend $68B but they depend on the state to give them funds. That’s the amount of money they are asking for based on the projects they want to complete. The state government has to pass a budget to fund that plan (or do something else). Additionally several current, already started projects are on hold due to the “pause” of congestion pricing which was going to be a funding source.

  • doctorpangloss 9 months ago

    Why would a cost center political institution enumerate all its problems? It is kind of miraculous they can engage with the public this way at all.

slt2021 9 months ago

I could not find a dataset with payroll hours reported and overtime reimbursed for each MTA employee.

I wanted to investigate how well the MTA is managing its workforce and compensation (such that it requires an additional tax in the form of Congestion Pricing to fix its budget hole), but there seems to be no dataset for that.

Does anyone have links to an MTA payroll/hours/overtime related dataset?

Or alternatively, I need a dataset to study each and every subway improvement project, and the components of each project in materials, labor, etc.

  • WUMBOWUMBO 9 months ago

    perhaps this could be covered in a FOIA request

stevage 9 months ago

Interesting, these open data challenges were all the rage 10 years ago. Wonder why the sudden trip down memory lane.

nocman 9 months ago

I keep clicking on these 'MTA' articles expecting them to be about a "message transfer agent".

Then I think, oh, right, wrong MTA. Guess I've spent too much time dealing with email servers.

rayrrr 9 months ago

Hold my Metrocard.

asjfkdlf 9 months ago

The prize is very underwhelming. If they really want people to spend effort on it, they need to make the prize worth it.

  • noitpmeder 9 months ago

    Seems perfect actually! Attracts people that are interested in the subject matter, not just a proposed reward.

    • maxverse 9 months ago

      "we're hiring people that really love programming and aren't just in it for the money"

      • 0cf8612b2e1e 9 months ago

        It will look great in your portfolio.

  • xtiansimon 9 months ago

    > “The winner will receive a vintage New York City Transit item from our memorabilia collection.”

    Depends what it is. Long as it’s not something you could steal yourself. Ha!

  • nxobject 9 months ago

    Never underestimate the value of surplus NYC subway memorabilia to a transit enthusiast. Especially signage from retired rolling stock.

  • zeroxfe 9 months ago

    If you're doing it for the prize, then you're not the targeted audience :-)

  • afavour 9 months ago

    IMO it deliberately establishes a tone. This challenge is for rail fans, it’s not a generalised “use our API” hackathon type thing.

    Plus the MTA has a huge budget crunch. I really don’t think they could justify spending money on something with such an unclear outcome.

    • stevage 9 months ago

      Even still it probably cost tens of thousands of dollars of staff time.

  • IncreasePosts 9 months ago

    The prize is being able to say you won the prize on your resume. I assume a lot of college kids in data science are going to be going at this.

  • corytheboyd 9 months ago

    I think it actually sounds kinda cool, if it’s something unique that couldn’t just be purchased!

mcfedr 9 months ago

Why would you region block a webpage like this?

  • JumpCrisscross 9 months ago

    > Why would you region block a webpage like this

    As a part-time New York City taxpayer, I'd rather we not be paying EU lawyers to make sure the MTA's open data complies with European law.

    • pc86 9 months ago

      Good news, the EU doesn't have any jurisdiction in NYC (or anywhere else outside of the EU) so they don't have the ability to enforce anything outside of their borders, as much as they would like you to believe otherwise.

      You can enforce what people and companies do within your borders. You cannot enforce what companies or people outside of your borders do.

      • alwa 9 months ago

        That may come as news to sanctioned Russians and various motley crypto types…

        Isn’t the GDPR’s basic theory about jurisdiction that, if I’m sitting in New York City but routinely serving my web content to people in France, that service I’m providing relies on browsing intentions and tracking functions being executed by a user and on a machine in France, and therefore the meat of the “wrongdoing” is happening within their borders?

        You can choose to do that the European way or not at all. And the local contests division of the NYC local transit authority is choosing “not at all.”

        Isn’t this then a case of NYC complying with the EU’s express wishes for privacy by not “exporting” code they don’t want there?

        • pc86 9 months ago

          Aren't most sanctions due to e.g. the US making it illegal for banks with a US presence to do business with sanctioned states/people? I don't think the US is telling some Polish bank that only operates in Poland and Russia that they need to stop doing business in Russia, although they may sanction that bank as well if they don't.

          I have no problem with voluntarily complying with GDPR-style privacy regulations because it's the right thing to do. Where I am able to make the decision, we store basically no user data beyond what's required to do whatever the user is trying to do.

          My problem is the EU pretending that US companies must be fully GDPR-compliant because someone in France chooses to go to their website. At the end of the day, laws are only laws because you can enforce them. If I had a magic wand and could rob a bank but the police for some reason were unable to arrest me, the fact that bank robbery is illegal is merely semantic at that point. If I chose to flaunt GDPR non-compliance on a US-based website the EU would be impotent to do anything other than block the site, which wouldn't make me any more likely to suddenly become GDPR compliant.

          It's a fiction and I probably wouldn't care about it nearly as much except it has essentially ruined the public internet with cookie banners everywhere.

          Every time a cookie banner gets displayed on some non-EU resident's personal blog, a puppy dies.

    • remram 9 months ago

      In what circumstance do you imagine NYC tax money would go towards EU lawyers?

  • safeimp 9 months ago

    Reading their terms, I'm guessing it's due to:

    > 3. Eligibility: The Challenge is open to legal residents of the United States. Entrants must be 18 years of age or older as of their date of entry. The Challenge is subject to federal, state, and local laws and regulations and is void where prohibited by law. Employees and contractors of the MTA, its subsidiaries, affiliates, and directors (collectively the “Employees”), as well as members of an Employee’s immediate family and/or those living in the same household, are ineligible to participate in the Challenge.

    • n_plus_1_acc 9 months ago

      You can be a resident of the US and be on vacation for a couple weeks

    • ratedgene 9 months ago

      yeah but wouldn't you want to create enough buzz globally so word of mouth can spread to more US entrants?

      • safeimp 9 months ago

        I don't disagree with you at all, I'm just speculating over why they'd block it.

  • nemo44x 9 months ago

    Because the next thing you know the EU is suing you for billions of Euros.

    • deathanatos 9 months ago

      "Doctor it hurts…", IANAL.

      I mean … as I understand the Europeans' law, only if you're doing dumb things to begin with, like giving users' data away to random 3rd parties hellbent on shoving "ads" down one's throat. If you had just made this site a simple HTML page that just had the information the MTA wanted to convey on it, AIUI the EU doesn't have a problem.

      Which … the MTA does appear to be doing, sending requests to Google, LinkedIn, and some other CDNs.

      I also don't think the MTA has any EU presence, so what are they going to do?

      • JumpCrisscross 9 months ago

        > as I understand the Europeans' law, only if you're doing dumb things to begin with, like giving users' data away to random 3rd parties hellbent on shoving "ads" down one's throat

        There is a massive difference between complying with the law and proving you comply. (Think: IRS audit.)

        > don't think the MTA has any EU presence, so what are they going to do?

        Send letters. The MTA would be obligated to respond to them, which means legal bills.

        • deathanatos 9 months ago

          > There is a massive difference between complying with the law and proving you comply. (Think: IRS audit.)

          > The MTA would be obligated to respond to them, which means legal bills.

          …why would the MTA be obligated to respond to them? They've no jurisdiction/sovereignty over an American transit agency.

          Why would they audit themselves against laws that don't apply to them? (Again, jurisdiction?) I've never worked for a company that audited itself against every law from every nation on Earth; we complied with the laws where we had a presence and did business.

      • returningfory2 9 months ago

        > ...AIUI the EU doesn't have a problem

        We're talking about a US transit agency. Even thinking about whether the EU has a problem with the agency's website is sort of absurd to begin with.

        • warkdarrior 9 months ago

          Did this US transit agency, MTA, obtain permission from all EU citizens who traveled on the MTA to share their data with the whole world?

          • JumpCrisscross 9 months ago

            > Did this US transit agency, MTA, obtain permission from all EU citizens who traveled on the MTA

            Not how jurisdiction works.

          • returningfory2 9 months ago

            Eh this conversation has nothing to do with people traveling on MTA services. We’re talking about people accessing the MTA website. Two different things.

    • cddotdotslash 9 months ago

      Expect to see more of this, especially when the audience is local/US. IIRC, some newspapers are already doing region blocks. Why should website owners targeting US visitors spend _any_ amount of money making their content comply with asinine regulations (like cookie banners)?

      • cinntaile 9 months ago

        Cookie banners are not a regulation requirement.

        Contrary to what you seem to believe... There were more geoblocks when the EU law went into action a couple of years ago. There are less now.

        • cddotdotslash 9 months ago

          > There were more geoblocks when the EU law went into action a couple of years ago. There are less now.

          Source for that?

          • cinntaile 9 months ago

            My personal experience.

        • kevin_thibedeau 9 months ago

          The EU cookie directive predates GDPR. Notices have long been required by that directive for use of non-essential cookies.

sgtbr1 9 months ago

can someone share the data?

  • manvillej 9 months ago

    what a tragedy, this person never learned how to read.