How would you go about building something like this? Where would you host it?
I was thinking appfog for node.js to handle incoming HTTP requests, rabbitmq to queue asynchronous requests, and redis to store the data.
Goals:
1. Whenever receiving a request for processing, you should return a task identifier that can be used to check for progress (ex. you could simply store this in redis as 'TaskId', 'Url', 'FileId', 'Type', 'Status', 'OutputId' - with Type being the type of input: URL or FileId)
2. Once the client has received the TaskId, it can make a request to get the status, which could be 'Pending' or 'Complete' with a URL to the output.
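A minimal sketch of those two goals, assuming a plain dict as a stand-in for redis. The function names and the output base URL are made up for illustration; only the field names and the 'Pending'/'Complete' statuses come from the list above.

```python
import uuid

# In-memory stand-in for redis; a real deployment might use one redis hash per task.
TASKS = {}


def create_task(input_type, url=None, file_id=None):
    """Store a task record with the fields listed above and return its TaskId."""
    if input_type not in ("URL", "FileId"):
        raise ValueError("Type must be 'URL' or 'FileId'")
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {
        "TaskId": task_id,
        "Url": url,
        "FileId": file_id,
        "Type": input_type,
        "Status": "Pending",
        "OutputId": None,
    }
    return task_id


def get_status(task_id, output_base_url="http://someserviceio.com/output/"):
    """Return the status payload for a task, or None if the task is unknown.

    output_base_url is a hypothetical location for finished files.
    """
    task = TASKS.get(task_id)
    if task is None:
        return None
    if task["Status"] == "Complete":
        return {"Status": "Complete", "URL": output_base_url + task["OutputId"]}
    return {"Status": task["Status"]}
```

The client keeps polling get_status with the TaskId it was given until the payload contains a URL.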
Process:
===================
Request Processing:
-------------------
1. Receive the request with parameters (can users upload files or simply pass through a URL?)
2. Create a token based on the hash of the input file and the parameters supplied (ex. MD5 hash of the video, start time, end time of the video to capture).
3. Check if a task identifier has already been assigned to the token. If it has, return it. This means the client will then do a request to check for the status and would receive { 'Status': 'Complete', 'URL': '...' } and can immediately access the output via the URL.
4. If the video is uploaded, check whether it has been saved via hashing the file (ex. MD5), if it has then use the existing FileId, otherwise, save the file and generate a new FileId.
5. Generate a new task identifier and create a task entry to store in redis that contains metadata about the task.
6. Create a command containing the task identifier and publish it to the message queue for processing (which is asynchronous).
7. Return the task identifier to the client.
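The seven steps above could be sketched roughly like this. Everything here is an assumption for illustration: dicts stand in for redis, queue.Queue stands in for the message queue, and the function and variable names are invented. For URL input the token is built from the URL itself rather than from a hash of the downloaded video.

```python
import hashlib
import queue
import uuid

TOKEN_TO_TASK = {}          # token -> TaskId (stand-in for a redis key)
TASK_META = {}              # TaskId -> metadata (stand-in for a redis hash)
FILES = {}                  # MD5 of uploaded content -> FileId
TASK_QUEUE = queue.Queue()  # stand-in for the message queue


def make_token(input_key, start, end):
    """Step 2: hash the input identity together with the parameters."""
    raw = "{}|{}|{}".format(input_key, start, end).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def submit(start, end, url=None, file_bytes=None):
    """Steps 1-7: return a TaskId, reusing an existing one when possible."""
    if (url is None) == (file_bytes is None):
        raise ValueError("pass exactly one of url or file_bytes")

    # Step 2: the input is identified by its content hash (upload) or its URL.
    if file_bytes is not None:
        input_key = hashlib.md5(file_bytes).hexdigest()
    else:
        input_key = url
    token = make_token(input_key, start, end)

    # Step 3: same input and parameters seen before, so reuse the task.
    if token in TOKEN_TO_TASK:
        return TOKEN_TO_TASK[token]

    # Step 4: reuse the FileId of an already-saved upload, else assign a new one.
    file_id = None
    if file_bytes is not None:
        if input_key not in FILES:
            FILES[input_key] = uuid.uuid4().hex  # writing the bytes to disk is omitted
        file_id = FILES[input_key]

    # Step 5: create the task entry.
    task_id = uuid.uuid4().hex
    TASK_META[task_id] = {
        "TaskId": task_id,
        "Url": url,
        "FileId": file_id,
        "Type": "URL" if url is not None else "FileId",
        "Status": "Pending",
        "OutputId": None,
    }
    TOKEN_TO_TASK[token] = task_id

    # Step 6: publish a command for asynchronous processing.
    TASK_QUEUE.put({"TaskId": task_id})

    # Step 7: hand the identifier back to the client.
    return task_id
```

Calling submit twice with the same URL and times returns the same TaskId and only queues one command, which is the whole point of the token.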
Task Processing:
-------------------
To process the video, you would simply grab the metadata from redis and update the progress as you go along. If the task identifier doesn't exist (no metadata entry on redis), simply abort the task processing.
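As a rough sketch of the worker side, assuming the store is a dict-like stand-in for redis and that the actual conversion is passed in as a function. The 'Progress' field and the report callback are my own additions for illustration.

```python
def process_task(task_id, store, convert):
    """Process one task; return False if its metadata no longer exists.

    store:   dict-like stand-in for redis (TaskId -> metadata dict)
    convert: hypothetical callable(meta, report) that does the real work
             and returns the OutputId of the finished file
    """
    meta = store.get(task_id)
    if meta is None:
        # No metadata entry: abort without doing any work.
        return False

    def report(percent):
        # Update progress as the conversion goes along.
        meta["Progress"] = percent

    report(0)
    meta["OutputId"] = convert(meta, report)
    meta["Progress"] = 100
    meta["Status"] = "Complete"
    return True
```

The consumer of the queue would call process_task for each command it receives, so a task that was deleted while waiting in the queue is simply skipped.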
Although this is pretty simple, you should focus on optimizing file storage (e.g. don't store the same file twice, reference files by MD5 hash or the like, and perhaps have a cleanup routine to remove rarely used or one-off files) and processing time (if the task has already been processed, simply return the output). You should also decide on limitations to prevent abuse (DoS/DDoS attacks, large file uploads, what the service can be used for, etc.) and limit your liability.
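One way the storage side might look, as a sketch only: files are named by their MD5 hash so the same content is never written twice, and a cleanup pass removes files that were used fewer than a chosen number of times. The class name, the in-memory use counter and the min_uses threshold are all assumptions.

```python
import hashlib
import os


class FileStore:
    """Content-addressed file store: the FileId is the MD5 of the content."""

    def __init__(self, root):
        self.root = root
        self.use_count = {}  # FileId -> number of times the file was submitted
        os.makedirs(root, exist_ok=True)

    def save(self, data):
        """Write data once per distinct content and return its FileId."""
        file_id = hashlib.md5(data).hexdigest()
        path = os.path.join(self.root, file_id)
        if not os.path.exists(path):
            with open(path, "wb") as handle:
                handle.write(data)
        self.use_count[file_id] = self.use_count.get(file_id, 0) + 1
        return file_id

    def cleanup(self, min_uses=2):
        """Delete files used fewer than min_uses times; return their FileIds."""
        removed = []
        for file_id, uses in list(self.use_count.items()):
            if uses < min_uses:
                os.remove(os.path.join(self.root, file_id))
                del self.use_count[file_id]
                removed.append(file_id)
        return removed
```

In practice the use counts would live in redis next to the task metadata, and the cleanup would probably also look at how recently a file was used.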
Is this the answer you were looking for or are you asking about the specifics of implementation (technology wise)?
I see now the thinking flow I have to change, I have been window shopping so to speak with frameworks, stacks.
iron.io is what I think will be used for running the long asynchronous conversion tasks.
I wouldn't mind specific technology stacks. It's either node.js or python. I built something in Flask (god bless its heart) but I don't know if there are others who have made it scalable...
I keep hearing this idea from something like this http://www.slideshare.net/norbu09/rabbitmq-couchdb-awesome
Thanks again, but I need to hit the sack.
Here's what I think of specific 3rd party/FOSS solutions for each of the components.
Before I start, I wonder if anyone has seen that slideshare I posted. It talks about creating single functions and services (daemons), essentially acting as their own islands, all connected through a central RabbitMQ and hooked up to HTTP front end node.js servers. So this is a summary of my understanding of the subject so far.
Another thing to consider is DDoS attacks and exploiters. I really don't know; I was thinking rackspace or some kind of front end host that is able to handle it. I mean all in all, the REST web service just returns a simple 200 OK or a media file. No templating (but it might be good to have some templating or user sessions like Flask has, which I don't know how to scale).
So far, I love the design process you've laid out. And you are absolutely right, I should approach this with a design problem solving mindset.
I'm going to summarize what I've read, in order to get feedback and/or bounce ideas off you.
I have to tell it like a story because I feel like that's how I understand and remember things efficiently.
A user watching a video wants to convert this great YouTube meme video to an animated gif. He just appends it to the service URL: someserviceio.com/http://youtube.com/v?23uf7wg/convertedvideo.gif
The browser loads for the duration. It might be a good idea to put some UI notification here, a simple message about what's going on in the background. Once again, I need a way to notify the user in the browser as events happen in the backend. I need a stack suggestion here.
For short videos, when they type it in, it should just load the convertedvideo.gif animated picture in the browser. But for longer videos, they should receive a notification.
I don't know, what do you think?
Ultimately, I do need some sort of security. For example, only registered users would have access to more functions. They could do something like http://someserviceio.com/my@email.com/secretKey/batchConvert...
and then it will call iron.io ironmq or something like that, put a request in the queue, fire off some ffmpeg scripts in the ironworker asynchronously, receive a message that it was successful, and send the rendered animated gif back to the browser.
During this time, the user will just have to wait until the task completes. I'm not sure if I would want them to time out, because what if (edge case) someone decides to convert one of those "Nyan cat loop for 10 hours" videos and the server gets stuck in a long task?
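One possible way around the 10 hour edge case is to cap the clip length before ffmpeg ever runs. A sketch, assuming a made-up 30 second limit and made-up defaults for frame rate and width; it only builds the argument list and does not run anything.

```python
MAX_CLIP_SECONDS = 30  # hypothetical cap, pick whatever the service can afford


def build_gif_command(input_path, output_path, start, end, fps=10, width=320):
    """Build an ffmpeg argument list for a clip, clamped to MAX_CLIP_SECONDS."""
    if end <= start:
        raise ValueError("end must be greater than start")
    duration = min(end - start, MAX_CLIP_SECONDS)
    return [
        "ffmpeg",
        "-y",                   # overwrite the output file if it exists
        "-ss", str(start),      # seek to the start time in the input
        "-t", str(duration),    # read at most this many seconds
        "-i", input_path,
        "-vf", "fps={},scale={}:-1".format(fps, width),  # lower frame rate, resize
        output_path,
    ]
```

The worker could pass the list to subprocess.run with its timeout argument as a second safety net, so that even a capped job cannot hang forever.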
Then comes the question of whether the same long process will run again if some other user wants to test out this exploit. That's where I think persistent storage of the processed file would come in handy. The request URL should first be checked against our previous URL requests.
Redis could be used in place of rabbitmq or ironmq. But I'm thinking about spike cases, not in the sense that the app will become hugely popular overnight (although making it onto the first page of news ycombinator would be awesome), but in the sense of certain users either being passionate and making lots of requests, or being malicious individuals abusing it.
I really don't want to block the API behind the HTTP auth dialog or a web based login.
I envision it to just work like you would use google, except you can convert YouTube videos to animated gifs just by entering the URL.
Of course in the backend I'd need extensive monitoring, and realtime analytics would be cool.